[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862579#comment-15862579 ]

Mehrdad Nurolahzade commented on AURORA-1837:
---------------------------------------------

I like the idea of a {{RateLimitedBatchWorker}}. I'm hoping the submitted patch addresses some of the efficiency issues here. However, it does not address the problem of heavy workloads, which should be rate limited so that they do not interfere with scheduling or cause bursty garbage collection pressure.

> Improve task history pruning
> ----------------------------
>
>                 Key: AURORA-1837
>                 URL: https://issues.apache.org/jira/browse/AURORA-1837
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Reza Motamedi
>            Assignee: Mehrdad Nurolahzade
>            Priority: Minor
>              Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive tasks
> for pruning upon terminal _state_ change.
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to
> schedule the pruning of tasks. However, we have noticed that most
> pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to
> {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state
> transitions, have it wake up at preconfigured intervals, find all terminal
> state tasks that meet pruning criteria and delete them.
> # Make the initial task history pruning delay configurable so that it does
> not hamper the scheduler upon start.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
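The interval-based design proposed in the issue description could look roughly like the sketch below. This is only an illustration under stated assumptions, not Aurora's actual {{TaskHistoryPruner}}: the class name {{PeriodicPrunerSketch}}, the in-memory map standing in for the task store, and the parameters {{retentionMs}}, {{initialDelayMs}} and {{intervalMs}} are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical periodic pruner; not Aurora's actual TaskHistoryPruner. */
final class PeriodicPrunerSketch {
  // Task id -> time (ms) at which the task reached a terminal state.
  // Assumption: this map stands in for the scheduler's task store.
  private final Map<String, Long> terminalTasks = new ConcurrentHashMap<>();
  private final long retentionMs;

  PeriodicPrunerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  void recordTerminal(String taskId, long terminalAtMs) {
    terminalTasks.put(taskId, terminalAtMs);
  }

  /** One pruning pass: deletes every terminal task at least retentionMs old. */
  List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    for (Map.Entry<String, Long> e : terminalTasks.entrySet()) {
      if (nowMs - e.getValue() >= retentionMs
          && terminalTasks.remove(e.getKey(), e.getValue())) {
        pruned.add(e.getKey());
      }
    }
    return pruned;
  }

  int size() {
    return terminalTasks.size();
  }

  /** First pass runs after initialDelayMs; later passes run intervalMs after the previous one ends. */
  void start(ScheduledExecutorService executor, long initialDelayMs, long intervalMs) {
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
  }
}
```

Because nothing is registered per task, a restart loses no in-memory pruning state; the configurable initial delay only postpones the first pass so that it does not compete with scheduler startup.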
[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862581#comment-15862581 ]

Mehrdad Nurolahzade commented on AURORA-1837:
---------------------------------------------

https://reviews.apache.org/r/56575/
[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862566#comment-15862566 ]

Santhosh Kumar Shanmugham commented on AURORA-1837:
---------------------------------------------------

Looks like {{CallOrderEnforcingStorage}} [publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100] a {{TaskStateChange}} event for every known task on startup. Note how {{oldState}} is set to {{Optional.absent()}} (the previous state is not known on startup); this causes the delay to become zero. Due to the inefficiency in the implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.

Although {{BatchWorker}} is designed to reduce lock contention, it does not provide any rate limiting and suffers under bursty workloads. Responsiveness to bursty workloads makes sense for scheduling work, but the same cannot be said for housekeeping work. History pruning ({{TaskHistoryPruner}}), job update pruning ({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can all be characterized as housekeeping work that is not in the critical scheduling path, so it would make sense to rate-limit these ambient activities and protect the scheduler from bursts of non-critical work (for example job updates with a large number of instances, network partitions, or cleanup after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}} that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give priority to critical (scheduling) work from {{JobUpdateController}}, {{TaskThrottler}} etc., {{BatchWorker}}'s queue should be changed to a {{PriorityQueue}} (with the necessary changes to {{Work}}). {{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}} can then enqueue work into the new {{RateLimitedBatchWorker}}, which will release it into the underlying {{BatchWorker}} at a steady rate.
We can take advantage of Java's [PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html] and Guava's [RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html].
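A rough sketch of the {{RateLimitedBatchWorker}} idea follows. It is a plain-JDK stand-in under stated assumptions, not a proposed patch: {{PrioritizedWork}} and {{RateLimitedFeederSketch}} are hypothetical names, the priority values are invented for illustration, and a fixed per-tick budget takes the place of Guava's {{RateLimiter}} so that the example has no external dependencies.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.PriorityBlockingQueue;

/** Hypothetical work item; priority 0 is critical (scheduling), larger values are less urgent. */
final class PrioritizedWork implements Comparable<PrioritizedWork> {
  final int priority;
  final String name;

  PrioritizedWork(int priority, String name) {
    this.priority = priority;
    this.name = name;
  }

  @Override
  public int compareTo(PrioritizedWork other) {
    return Integer.compare(priority, other.priority);
  }
}

/** Hypothetical sketch of the proposed RateLimitedBatchWorker; not an Aurora class. */
final class RateLimitedFeederSketch {
  static final int HOUSEKEEPING = 10; // assumed priority for non-critical work

  private final Queue<String> backlog = new ArrayDeque<>();
  private final PriorityBlockingQueue<PrioritizedWork> workerQueue;
  private final int maxPerTick;

  RateLimitedFeederSketch(PriorityBlockingQueue<PrioritizedWork> workerQueue, int maxPerTick) {
    this.workerQueue = workerQueue;
    this.maxPerTick = maxPerTick;
  }

  /** Housekeeping callers enqueue here instead of writing to the worker queue directly. */
  synchronized void enqueue(String name) {
    backlog.add(name);
  }

  /** Called once per tick (e.g. by a scheduled executor); releases at most maxPerTick items. */
  synchronized int releaseTick() {
    int released = 0;
    while (released < maxPerTick && !backlog.isEmpty()) {
      workerQueue.put(new PrioritizedWork(HOUSEKEEPING, backlog.remove()));
      released++;
    }
    return released;
  }

  synchronized int backlogSize() {
    return backlog.size();
  }
}
```

With this shape, a burst of pruning requests only grows the feeder's backlog, while the worker queue receives a bounded number of housekeeping items per tick and still hands out scheduling work first. Note that a priority queue does not preserve insertion order among items of equal priority, so a sequence number would be needed if FIFO order within a priority level matters.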
[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862230#comment-15862230 ]

Mehrdad Nurolahzade commented on AURORA-1837:
---------------------------------------------

The current implementation is also inefficient in the sense that it tries to delete an expired task multiple times. The asynchronous nature of {{BatchWorker}}, which is used to process task deletions, introduces some delay between the enqueueing of a delete and its execution. As a result, tasks already enqueued for deletion in a previous evaluation round might get picked up, evaluated and enqueued for deletion again (note that {{deleteTasks.deleteTasks()}} does not fail if a task is not found). This is evident in the {{tasks_pruned}} metric, which reports numbers much higher than the actual number of expired tasks deleted.
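One way the repeated enqueueing described above might be avoided is to remember which task IDs already have a delete pending. The sketch below only illustrates that idea and is not taken from the submitted patch; {{PendingDeleteTrackerSketch}} and its methods are hypothetical, and an in-memory queue stands in for the asynchronous worker.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical illustration; the names do not correspond to Aurora classes. */
final class PendingDeleteTrackerSketch {
  // Task ids whose delete has been enqueued but not yet executed.
  private final Set<String> pending = new HashSet<>();
  // Assumption: stands in for the asynchronous worker's queue.
  private final Queue<String> deleteQueue = new ArrayDeque<>();
  private int enqueued;

  /** Enqueues a delete unless one is already pending for this task. Returns true if enqueued. */
  boolean requestDelete(String taskId) {
    if (!pending.add(taskId)) {
      return false; // a delete is already pending; skip the duplicate
    }
    deleteQueue.add(taskId);
    enqueued++;
    return true;
  }

  /** Simulates the worker draining its queue; returns the number of deletes executed. */
  int drain() {
    int executed = 0;
    String taskId;
    while ((taskId = deleteQueue.poll()) != null) {
      pending.remove(taskId);
      executed++;
    }
    return executed;
  }

  /** Counter that would back a metric such as tasks_pruned in this sketch. */
  int enqueuedCount() {
    return enqueued;
  }
}
```

In this sketch the counter increases once per pending delete, so repeated evaluation rounds that see the same expired task before the worker has run do not inflate it.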
[jira] [Commented] (AURORA-1837) Improve task history pruning
[ https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861981#comment-15861981 ]

Mehrdad Nurolahzade commented on AURORA-1837:
---------------------------------------------

As I mentioned above, we have noticed that most task history pruning happens after the scheduler restarts, and it can severely hamper scheduler performance (or cause consecutive fail-overs on test clusters when we load test the scheduler). The reason is that the scheduler loses its in-memory state of operations scheduled with {{DelayExecutor}} upon restart/failure. {{TaskHistoryPruner}} learns about these dead tasks upon restart when it replays the log, and they are picked up by the second call to {{executor.execute()}}, which performs job-level pruning immediately (i.e., without delay).