> On Feb. 14, 2017, 4:02 p.m., Mehrdad Nurolahzade wrote: > > Looking at it as is, I'm not sure if there is much value to be gained from > > pushing this down to `TaskStore`. > > Do you see any value in pursuing this idea any further? Or shall I restore > > it to previous state? > > Zameer Manji wrote: > I think there is value in moving operations like this down to storage. > First, it mirrors the update store for pruning old updates. > Second, I think we can benefit from more precise storage operations. It > allows us to leverage the power of DbTaskStore (ie we don't need to hydrate a > full object for deletion) and ensures that callers don't do excessive work. > > Mehrdad Nurolahzade wrote: > Right, that was the motivation. > > However the way task deletion is handled now, it needs to go through > `StateManager` to (a) verify transition to `DELETED` is allowrd and (b) > publish `TaskDeleted` events to other subscribers like `TaskGroups`, > `TaskVars`, and so on. For this to work at storage level, like it's done for > job updates, we need to address its rather inefficient dependency on > `TaskManager`. > > Santhosh Kumar Shanmugham wrote: > Looking how there might be more work needed here, why don't we use this > RB to fix the O(N^2) issue on startup and handle the other improvements in a > separate RB. The way the current `TaskHistoryPruner` works on startup is > going need a fix.
Fair point, I am going to restore this RB to previous state unblocking it from getting merged. We can address the general inefficiencies in task delete workflow separately. - Mehrdad ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/56575/#review165599 ----------------------------------------------------------- On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/56575/ > ----------------------------------------------------------- > > (Updated Feb. 14, 2017, 3:38 p.m.) > > > Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and > Stephan Erb. > > > Bugs: AURORA-1837 > https://issues.apache.org/jira/browse/AURORA-1837 > > > Repository: aurora > > > Description > ------- > > This patch addressed efficiency issues in the current implementation of > `TaskHistoryPruner`. The new design is similar to that of > `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run > upon terminal task state transitions, it runs on preconfigured intervals, > finds all terminal state tasks that meet pruning criteria and deletes them. > (b) Makes the initial task history pruning delay configurable so that it does > not hamper scheduler upon start. > > The new design addressed the following two efficiecy problems: > > 1. Upon scheduler restart/failure, the in-memory state of task history > pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns > about these dead tasks upon restart when log is replayed. These expired tasks > are picked up by the second call to `executor.execute()` that performs job > level pruning immediately (i.e., without delay). Hence, most task history > pruning happens after scheduler restarts and can severely hamper scheduler > performance (or cause consecutive fail-overs on test clusters when we put > load test on scheduler). > > 2. Expired tasks can be picked up for pruning multiple times. The > asynchronous nature of `BatchWorker` which used to process task deletions > introduces some delay between delete enqueue and delete execution. As a > result, tasks already queued for deletion in a previous evaluation round > might get picked up, evaluated and enqueued for deletion again. This is > evident in `tasks_pruned` metric which reflects numbers much higher than the > actual number of expired tasks deleted. > > > Diffs > ----- > > src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java > 735199ac1ccccab343c24471890aa330d6635c26 > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java > f77849498ff23616f1d56d133eb218f837ac3413 > src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java > 1094a122fe836e53d0481ee5c097447f1e91fa0a > src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java > a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e > src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java > d89e715b1b08faf95f8b5788c9c28cbbb33af093 > > src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java > 14e4040e0b94e96f77068b41454311fa3bf53573 > > src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java > 02719c312294b58525c1fddd3ed096a9b1cef601 > > Diff: https://reviews.apache.org/r/56575/diff/ > > > Testing > ------- > > Manual testing under Vagrant > > > Thanks, > > Mehrdad Nurolahzade > >