Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
---

Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
Stephan Erb.


Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
---

This patch addresses efficiency issues in the current implementation of 
`TaskHistoryPruner`. The new design is similar to that of 
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon 
terminal task state transitions, it runs at preconfigured intervals, finds all 
terminal-state tasks that meet the pruning criteria, and deletes them. (b) It 
makes the initial task history pruning delay configurable so that pruning does 
not hamper the scheduler at startup.
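
To illustrate the interval-based approach described above, here is a minimal sketch. It is not the code in this patch: the class `PeriodicPrunerSketch`, its methods, and the in-memory map standing in for task storage are all hypothetical, and only `ScheduledExecutorService` is a real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an interval-based pruner; not Aurora's actual class. */
public class PeriodicPrunerSketch {
  // Task id -> time (ms) at which the task entered a terminal state.
  // Assumption: a plain map stands in for the scheduler's task store.
  private final Map<String, Long> terminalTasks = new ConcurrentHashMap<>();
  private final long retentionMs;

  public PeriodicPrunerSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  public void recordTerminal(String taskId, long timestampMs) {
    terminalTasks.put(taskId, timestampMs);
  }

  public int size() {
    return terminalTasks.size();
  }

  /** One pruning pass: find every terminal task past retention and delete it. */
  public List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    for (Map.Entry<String, Long> entry : terminalTasks.entrySet()) {
      if (nowMs - entry.getValue() > retentionMs) {
        pruned.add(entry.getKey());
      }
    }
    pruned.forEach(terminalTasks::remove);
    return pruned;
  }

  /**
   * Runs pruneOnce repeatedly. The configurable initial delay keeps the first
   * pass from competing with scheduler startup work.
   */
  public ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs,
        intervalMs,
        TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Because each pass queries the current set of terminal tasks, no per-task timer has to survive a restart; the pruner only needs its interval and initial delay.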

The new design addresses the following two efficiency problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning 
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these 
dead tasks upon restart, when the log is replayed. These expired tasks are 
picked up by the second call to `executor.execute()`, which performs job-level 
pruning immediately (i.e., without delay). Hence, most task history pruning 
happens after the scheduler restarts and can severely hamper scheduler 
performance (or cause consecutive failovers on test clusters when we load-test 
the scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous 
nature of `BatchWorker`, which is used to process task deletions, introduces 
some delay between delete enqueue and delete execution. As a result, tasks 
already queued for deletion in a previous evaluation round might get picked up, 
evaluated, and enqueued for deletion again. This is evident in the 
`tasks_pruned` metric, which reports numbers much higher than the actual number 
of expired tasks deleted.
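
The second problem can be reproduced with a small simulation. This is only a sketch under stated assumptions: `DoubleEnqueueSketch` and its fields are hypothetical, a plain queue stands in for `BatchWorker`, the counter is assumed to be incremented at enqueue time, and the optional in-flight guard is one possible mitigation rather than what this patch does.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Hypothetical simulation of the double-enqueue problem; names are illustrative. */
public class DoubleEnqueueSketch {
  private final Set<String> store = new HashSet<>();             // expired tasks still in storage
  private final Queue<String> deleteQueue = new ArrayDeque<>();  // stands in for the async worker queue
  private final Set<String> inFlight = new HashSet<>();          // consulted only when dedup is on
  private final boolean dedup;
  private int tasksPruned = 0;      // assumption: bumped when a delete is enqueued
  private int actuallyDeleted = 0;  // bumped when a delete really removes a task

  public DoubleEnqueueSketch(boolean dedup) {
    this.dedup = dedup;
  }

  public void addExpired(String taskId) {
    store.add(taskId);
  }

  /** Evaluation round: enqueue every expired task still visible in storage. */
  public void evaluate() {
    for (String taskId : store) {
      if (dedup && !inFlight.add(taskId)) {
        continue; // already queued in an earlier round
      }
      deleteQueue.add(taskId);
      tasksPruned++;
    }
  }

  /** Delayed execution: the worker finally drains the queue. */
  public void drain() {
    String taskId;
    while ((taskId = deleteQueue.poll()) != null) {
      if (store.remove(taskId)) {
        actuallyDeleted++;
      }
      inFlight.remove(taskId);
    }
  }

  public int tasksPruned() {
    return tasksPruned;
  }

  public int actuallyDeleted() {
    return actuallyDeleted;
  }
}
```

With three expired tasks and two evaluation rounds before the queue drains, the unguarded variant counts six pruned tasks while only three are deleted; the guarded variant counts three.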


Diffs
---

  src/main/java/org/apache/aurora/scheduler/base/Query.java 
c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
735199ac1ab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f77849498ff23616f1d56d133eb218f837ac3413 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
14e4040e0b94e96f77068b41454311fa3bf53573 

Diff: https://reviews.apache.org/r/56575/diff/


Testing
---

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-11 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165257
---


Ship it!




Master (ad3377a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 11, 2017, 11:12 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 11, 2017, 11:12 p.m.)
> 