> On Feb. 12, 2017, 12:14 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > line 62
> > <https://reviews.apache.org/r/56575/diff/1/?file=1630791#file1630791line62>
> >
> >     It is worthwhile to note that we are moving from a workload that was 
> > spread over a duration to a bursty instantaneous workload (saw-tooth like), 
> > which can potentially make the situation worse by causing a thundering-herd 
> > at regular intervals.

That's a valid concern; testing can better clarify this.

I agree that the existing algorithm offers better best- and average-case 
behavior (due to its scheduled pruning strategy). However, I still think the 
worst-case behavior of this implementation is better, for two reasons: (1) every 
task/job is evaluated only once, and (2) the first prune after a restart is 
similar to other prunes and is no burstier. The burst can be tamed further by 
reducing the pruning interval (e.g., to 5 minutes).
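
To make the interval-based approach concrete, here is a minimal sketch, not the 
code in the patch. The class `PeriodicPruneSketch`, its map-based store, and the 
method names are hypothetical stand-ins for Aurora's storage and pruner; only 
the `java.util.concurrent` calls are real APIs. Each round examines every 
terminal task once, and the first round waits for a configurable initial delay.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicPruneSketch {
  // Hypothetical store: task id -> time (ms) at which the task became terminal.
  final Map<String, Long> terminalTasks = new ConcurrentHashMap<>();
  final long retentionMs;

  PeriodicPruneSketch(long retentionMs) {
    this.retentionMs = retentionMs;
  }

  // One pruning round: each task is examined once and expired ones are removed.
  // The returned count is exact only if nothing else mutates the map meanwhile.
  int pruneOnce(long nowMs) {
    int before = terminalTasks.size();
    terminalTasks.values().removeIf(terminatedAt -> nowMs - terminatedAt > retentionMs);
    return before - terminalTasks.size();
  }

  // Schedules rounds: the first waits initialDelayMs, later ones run
  // intervalMs after the previous round finishes.
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
    return executor;
  }

  public static void main(String[] args) {
    PeriodicPruneSketch pruner = new PeriodicPruneSketch(5_000L);
    pruner.terminalTasks.put("old", 1_000L);
    pruner.terminalTasks.put("recent", 6_000L);
    // Run a single round directly instead of starting the executor.
    System.out.println("pruned " + pruner.pruneOnce(10_000L));
  }
}
```

A shorter `intervalMs` means fewer tasks expire between rounds, so each burst is 
smaller, at the cost of scanning more often.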

I believe the key to getting this bursty workload under control is extending 
the `org.apache.aurora.scheduler.base.Query` abstraction. If we add something 
like `.limit(int)`, we can cap the number of tasks retrieved per round, which in 
turn bounds the load to be processed and the garbage to be collected.
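
As a rough illustration of the idea only: `Query` has no `.limit(int)` today, so 
the sketch below uses a hypothetical `Task` class and a plain Java stream 
instead of Aurora's storage API. It shows how a cap bounds the size of each 
pruning batch regardless of how many tasks have expired.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class LimitedPruneSketch {
  // Hypothetical stand-in for a terminal task; not Aurora's IScheduledTask.
  static final class Task {
    final String id;
    final long terminatedAtMs;

    Task(String id, long terminatedAtMs) {
      this.id = id;
      this.terminatedAtMs = terminatedAtMs;
    }
  }

  // Returns at most `limit` tasks that terminated before cutoffMs, oldest first.
  static List<Task> selectExpired(List<Task> tasks, long cutoffMs, int limit) {
    return tasks.stream()
        .filter(t -> t.terminatedAtMs < cutoffMs)
        .sorted(Comparator.comparingLong((Task t) -> t.terminatedAtMs))
        .limit(limit)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Task> tasks = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      tasks.add(new Task("task-" + i, i * 1000L));
    }
    // Eight tasks are older than the cutoff, but only three are selected.
    for (Task t : selectExpired(tasks, 8000L, 3)) {
      System.out.println(t.id);
    }
  }
}
```

Tasks left out of one batch stay expired and are picked up by later rounds, so 
a backlog drains over several intervals instead of in one burst.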


- Mehrdad


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165260
-----------------------------------------------------------


On Feb. 11, 2017, 3:12 p.m., Mehrdad Nurolahzade wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> -----------------------------------------------------------
> 
> (Updated Feb. 11, 2017, 3:12 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
>     https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch addresses efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs at preconfigured intervals, 
> finds all terminal-state tasks that meet the pruning criteria, and deletes 
> them. (b) It makes the initial task history pruning delay configurable so 
> that it does not hamper the scheduler at startup.
> 
> The new design addresses the following two efficiency problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart, when the log is replayed. These expired 
> tasks are picked up by the second call to `executor.execute()`, which performs 
> job-level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we 
> load-test the scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker`, which is used to process task deletions, 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated, and enqueued for deletion again. This is 
> evident in the `tasks_pruned` metric, which reports numbers much higher than 
> the actual number of expired tasks deleted.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ccccab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> -------
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>
