[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862579#comment-15862579
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

I like the idea of a {{RateLimitedBatchWorker}}. 

I'm hoping the submitted patch addresses some of the efficiency issues here. 
However, it does not address the underlying problem: a heavy pruning workload 
should be rate limited so that it does not interfere with scheduling or cause 
bursty garbage collection pressure.

> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
>  Issue Type: Task
>Reporter: Reza Motamedi
>Assignee: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive tasks 
> for pruning upon terminal state change. 
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to 
> schedule the pruning of tasks. However, we have noticed that most 
> pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to 
> {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state 
> transitions, have it wake up at preconfigured intervals, find all terminal 
> state tasks that meet the pruning criteria and delete them.
> # Make the initial task history pruning delay configurable so that it does 
> not hamper the scheduler upon start.
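
The two steps proposed in the description can be sketched roughly as follows. This is a minimal illustration, not Aurora's actual {{TaskHistoryPruner}}: the class {{PeriodicTaskPruner}}, its in-memory map of terminal tasks, and the method names are all assumptions made for the example. It wakes up at a fixed interval, after a configurable initial delay, and deletes every terminal task older than a retention period.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical interval-based pruner; names and storage are illustrative only. */
class PeriodicTaskPruner {
  // Task id -> time (ms) at which the task entered a terminal state.
  private final Map<String, Long> terminalTasks = new ConcurrentHashMap<>();
  private final long retentionMs;

  PeriodicTaskPruner(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  void recordTerminal(String taskId, long terminalAtMs) {
    terminalTasks.put(taskId, terminalAtMs);
  }

  /** One pruning pass: removes every terminal task at least retentionMs old. */
  List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = terminalTasks.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() >= retentionMs) {
        pruned.add(entry.getKey());
        it.remove();
      }
    }
    return pruned;
  }

  /** Runs pruneOnce at a fixed interval, starting after a configurable initial delay. */
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Because no per-task delayed operation is held in memory, a restart loses nothing: the next pass simply re-evaluates whatever terminal tasks exist at that point.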



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862581#comment-15862581
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

https://reviews.apache.org/r/56575/



[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1837:
---

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the 
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on 
startup); this causes the delay to become zero. Due to the inefficiency in the 
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue. Although 
{{BatchWorker}} is designed to reduce lock contention, it does not provide any 
rate limiting and suffers under bursty workloads. Responsiveness to bursty 
workloads makes sense for scheduling work; however, the same cannot be said for 
house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning 
({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can be 
characterized as house-keeping work that is not in the critical scheduling 
path, it would make sense to rate-limit these ambient activities, so that the 
scheduler is protected from bursts of non-critical work (for example, job updates 
with a large number of instances, network partitions, or cleaning up after a scale test). 

One possible design would involve creating a new {{RateLimitedBatchWorker}} 
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To provide 
priority to critical (scheduling) work from {{JobUpdateController}}, 
{{TaskThrottler}} etc., {{BatchWorker}}'s queue should be changed to a 
{{PriorityQueue}} (with necessary changes to {{Work}}). {{TaskHistoryPruner}}, 
{{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into the 
new {{RateLimitedBatchWorker}}, which releases it into the underlying 
{{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]
 to implement this.
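
A rough sketch of the design described above, using only the JDK so that it stands alone. Everything here is hypothetical: {{RateLimitedWorkSketch}}, its {{Work}} and {{Priority}} types, and the {{tick()}} budget are illustrative names, not Aurora's {{BatchWorker}} API. The per-tick budget stands in for a real rate limiter; in practice a call to Guava's {{RateLimiter.acquire()}} before each release could serve the same purpose. The sketch is single-threaded and omits the locking a real worker would need.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Queue;

/** Hypothetical sketch: throttled house-keeping feeding a priority-ordered work queue. */
class RateLimitedWorkSketch {
  // Declaration order defines priority: CRITICAL sorts before HOUSEKEEPING.
  enum Priority { CRITICAL, HOUSEKEEPING }

  static final class Work {
    final Priority priority;
    final long seq; // preserves FIFO order within the same priority
    final String name;

    Work(Priority priority, long seq, String name) {
      this.priority = priority;
      this.seq = seq;
      this.name = name;
    }
  }

  // Stand-in for the worker's queue, ordered by priority and then arrival.
  private final PriorityQueue<Work> workQueue = new PriorityQueue<>(
      Comparator.comparing((Work w) -> w.priority).thenComparingLong(w -> w.seq));
  // House-keeping work held back until the rate limit allows its release.
  private final Queue<String> throttled = new ArrayDeque<>();
  private final int permitsPerTick;
  private long nextSeq;

  RateLimitedWorkSketch(int permitsPerTick) {
    this.permitsPerTick = permitsPerTick;
  }

  /** Critical work bypasses the rate limit entirely. */
  void submitCritical(String name) {
    workQueue.add(new Work(Priority.CRITICAL, nextSeq++, name));
  }

  void submitHousekeeping(String name) {
    throttled.add(name);
  }

  /** Called once per time unit; releases at most permitsPerTick house-keeping items. */
  void tick() {
    for (int i = 0; i < permitsPerTick && !throttled.isEmpty(); i++) {
      workQueue.add(new Work(Priority.HOUSEKEEPING, nextSeq++, throttled.poll()));
    }
  }

  /** Drains the work queue in the order a worker would process it. */
  List<String> drain() {
    List<String> processed = new ArrayList<>();
    Work work;
    while ((work = workQueue.poll()) != null) {
      processed.add(work.name);
    }
    return processed;
  }
}
```

With a budget of two per tick, three queued prune requests are released over two ticks, and any scheduling work submitted in between is still processed first.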




[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-10 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862230#comment-15862230
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

The current implementation is also inefficient in the sense that it tries to 
delete an expired task multiple times. The asynchronous nature of 
{{BatchWorker}}, which is used to process task deletions, introduces some delay 
between enqueueing a delete and executing it. As a result, tasks already marked 
for deletion in a previous evaluation round might get picked up, evaluated and 
enqueued for deletion multiple times (note that {{deleteTasks.deleteTasks()}} 
does not fail if a task is not found).

This is evident in the {{tasks_pruned}} metric, which reports numbers much higher 
than the actual number of expired tasks deleted.
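
One way to avoid the repeated enqueueing described above is to remember which tasks are already awaiting deletion. The sketch below is only an illustration of that idea and not the submitted patch; {{DedupingDeleteQueue}} and its methods are hypothetical names, and the queue is a stand-in for the asynchronous worker.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical sketch: skip tasks that are already queued for deletion. */
class DedupingDeleteQueue {
  private final Queue<String> queue = new ArrayDeque<>();
  // Ids enqueued for deletion whose delete has not executed yet.
  private final Set<String> pending = new HashSet<>();
  private int tasksPruned;

  /** Returns true only if the task was not already awaiting deletion. */
  synchronized boolean enqueueDelete(String taskId) {
    if (!pending.add(taskId)) {
      return false; // already queued by an earlier evaluation round
    }
    queue.add(taskId);
    return true;
  }

  /** Simulates the asynchronous worker draining the queue; counts each task once. */
  synchronized int executePending() {
    int executed = 0;
    String taskId;
    while ((taskId = queue.poll()) != null) {
      pending.remove(taskId);
      tasksPruned++;
      executed++;
    }
    return executed;
  }

  synchronized int getTasksPruned() {
    return tasksPruned;
  }
}
```

With this guard, a second evaluation round that sees the same expired task before the delete has executed is a no-op, so the counter matches the number of tasks actually deleted.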



[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-10 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861981#comment-15861981
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

As I mentioned above, we have noticed that most task history pruning happens 
after scheduler restarts and can severely hamper scheduler performance (or 
cause consecutive fail-overs on test clusters when we load test the 
scheduler).

The reason is that the scheduler loses its in-memory state of operations scheduled 
with {{DelayExecutor}} upon restart/failure. {{TaskHistoryPruner}} learns about 
these dead tasks upon restart when it replays the log, and they are 
picked up by the second call to {{executor.execute()}}, which performs job-level 
pruning immediately (i.e., without delay).
