Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-03-13 Thread Mehrdad Nurolahzade


> On Feb. 15, 2017, 9:40 a.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java
> > Lines 134-135 (original), 134-135 (patched)
> > 
> >
> > Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?
> 
> Mehrdad Nurolahzade wrote:
> That's how I originally did it (see revision 1). The concern expressed by 
> reviewers (read above) was around the following areas:
> 1. We are trying to load/filter all terminal state tasks at once which 
> might cause heap pressure. Looking at one job at a time might be slow and 
> inefficient from a throughput standpoint (which is a secondary concern) but 
> can help with putting less pressure on the heap.
> 2. `MemStore` does not have a scheduling status index therefore it has to 
> sequentially scan all tasks to find matching tasks.
> 
> David McLaughlin wrote:
> Were those concerns backed by data? Do we have performance numbers from 
> before and after that change? It doesn't seem to me like a given that moving 
> from a single query to N+1 queries leads to less work and less objects being 
> created. Especially when the tasks are already on heap (DbTaskStore is still 
> considered experimental and not production ready. We should not optimize for 
> it). 
> 
> For point (2) - this does not prevent us scanning every single task in 
> the store. It just divides the full scan into multiple queries, and each of 
> those queries (and subsequent filtering) has the overhead of objects, 
> streams, sorted copies (yikes) and collections that are used for filtering 
> inside the task processing loop.
> 
> Mehrdad Nurolahzade wrote:
> No, I did not verify performance/heap profile of the original version. 
> I'm going to deploy it to a test cluster and see if there is a noticable 
> difference in run-time behavior.
> 
> I agree on both points regarding `MemTaskStore`. Now, I am seeing value 
> in separating prunable task selection implementations for `MemTaskStore` and 
> `DbTaskStore`. But, that can be addressed in a follow-up ticket.
> 
> Stephan Erb wrote:
> Were you able to run a version of this patch on your scale cluster? Any 
> new insights?

I did naive observations under a test cluster and both implementations (prune 
all vs prune by job) seem to work fine. However, for the final word on this, we 
are still waiting to have our production-like test cluster to be ready (soon) 
before we can verify which implementation produces the least unwanted 
side-effect.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---


On Feb. 15, 2017, 9:07 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-03-12 Thread Stephan Erb


> On Feb. 15, 2017, 6:40 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java
> > Lines 134-135 (original), 134-135 (patched)
> > 
> >
> > Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?
> 
> Mehrdad Nurolahzade wrote:
> That's how I originally did it (see revision 1). The concern expressed by 
> reviewers (read above) was around the following areas:
> 1. We are trying to load/filter all terminal state tasks at once which 
> might cause heap pressure. Looking at one job at a time might be slow and 
> inefficient from a throughput standpoint (which is a secondary concern) but 
> can help with putting less pressure on the heap.
> 2. `MemStore` does not have a scheduling status index therefore it has to 
> sequentially scan all tasks to find matching tasks.
> 
> David McLaughlin wrote:
> Were those concerns backed by data? Do we have performance numbers from 
> before and after that change? It doesn't seem to me like a given that moving 
> from a single query to N+1 queries leads to less work and less objects being 
> created. Especially when the tasks are already on heap (DbTaskStore is still 
> considered experimental and not production ready. We should not optimize for 
> it). 
> 
> For point (2) - this does not prevent us scanning every single task in 
> the store. It just divides the full scan into multiple queries, and each of 
> those queries (and subsequent filtering) has the overhead of objects, 
> streams, sorted copies (yikes) and collections that are used for filtering 
> inside the task processing loop.
> 
> Mehrdad Nurolahzade wrote:
> No, I did not verify performance/heap profile of the original version. 
> I'm going to deploy it to a test cluster and see if there is a noticable 
> difference in run-time behavior.
> 
> I agree on both points regarding `MemTaskStore`. Now, I am seeing value 
> in separating prunable task selection implementations for `MemTaskStore` and 
> `DbTaskStore`. But, that can be addressed in a follow-up ticket.

Were you able to run a version of this patch on your scale cluster? Any new 
insights?


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---


On Feb. 15, 2017, 6:07 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 6:07 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread Mehrdad Nurolahzade


> On Feb. 15, 2017, 9:40 a.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 150-151
> > 
> >
> > Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?
> 
> Mehrdad Nurolahzade wrote:
> That's how I originally did it (see revision 1). The concern expressed by 
> reviewers (read above) was around the following areas:
> 1. We are trying to load/filter all terminal state tasks at once which 
> might cause heap pressure. Looking at one job at a time might be slow and 
> inefficient from a throughput standpoint (which is a secondary concern) but 
> can help with putting less pressure on the heap.
> 2. `MemStore` does not have a scheduling status index therefore it has to 
> sequentially scan all tasks to find matching tasks.
> 
> David McLaughlin wrote:
> Were those concerns backed by data? Do we have performance numbers from 
> before and after that change? It doesn't seem to me like a given that moving 
> from a single query to N+1 queries leads to less work and less objects being 
> created. Especially when the tasks are already on heap (DbTaskStore is still 
> considered experimental and not production ready. We should not optimize for 
> it). 
> 
> For point (2) - this does not prevent us scanning every single task in 
> the store. It just divides the full scan into multiple queries, and each of 
> those queries (and subsequent filtering) has the overhead of objects, 
> streams, sorted copies (yikes) and collections that are used for filtering 
> inside the task processing loop.

No, I did not verify performance/heap profile of the original version. I'm 
going to deploy it to a test cluster and see if there is a noticable difference 
in run-time behavior.

I agree on both points regarding `MemTaskStore`. Now, I am seeing value in 
separating prunable task selection implementations for `MemTaskStore` and 
`DbTaskStore`. But, that can be addressed in a follow-up ticket.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---


On Feb. 15, 2017, 9:07 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread David McLaughlin


> On Feb. 15, 2017, 5:40 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 150-151
> > 
> >
> > Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?
> 
> Mehrdad Nurolahzade wrote:
> That's how I originally did it (see revision 1). The concern expressed by 
> reviewers (read above) was around the following areas:
> 1. We are trying to load/filter all terminal state tasks at once which 
> might cause heap pressure. Looking at one job at a time might be slow and 
> inefficient from a throughput standpoint (which is a secondary concern) but 
> can help with putting less pressure on the heap.
> 2. `MemStore` does not have a scheduling status index therefore it has to 
> sequentially scan all tasks to find matching tasks.

Were those concerns backed by data? Do we have performance numbers from before 
and after that change? It doesn't seem to me like a given that moving from a 
single query to N+1 queries leads to less work and less objects being created. 
Especially when the tasks are already on heap (DbTaskStore is still considered 
experimental and not production ready. We should not optimize for it). 

For point (2) - this does not prevent us scanning every single task in the 
store. It just divides the full scan into multiple queries, and each of those 
queries (and subsequent filtering) has the overhead of objects, streams, sorted 
copies (yikes) and collections that are used for filtering inside the task 
processing loop.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---


On Feb. 15, 2017, 5:07 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 5:07 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread Mehrdad Nurolahzade


> On Feb. 15, 2017, 9:40 a.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 150-151
> > 
> >
> > Can you explain the reasoning behind doing this in a loop per job 
> > rather than a single loop that queries all tasks?

That's how I originally did it (see revision 1). The concern expressed by 
reviewers (read above) was around the following areas:
1. We are trying to load/filter all terminal state tasks at once which might 
cause heap pressure. Looking at one job at a time might be slow and inefficient 
from a throughput standpoint (which is a secondary concern) but can help with 
putting less pressure on the heap.
2. `MemStore` does not have a scheduling status index therefore it has to 
sequentially scan all tasks to find matching tasks.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---


On Feb. 15, 2017, 9:07 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165733
---




src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (lines 
134 - 135)


Can you explain the reasoning behind doing this in a loop per job rather 
than a single loop that queries all tasks?


- David McLaughlin


On Feb. 15, 2017, 5:07 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 5:07 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165726
---


Ship it!




Master (9ea8979) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 15, 2017, 5:07 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 15, 2017, 5:07 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
---

(Updated Feb. 15, 2017, 9:07 a.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
Stephan Erb.


Changes
---

- Restored to previous changeset (pruning takes place at `TaskHistoryPruner`)
- Removed `Thread.sleep()` calls in pruning loop


Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
---

This patch addressed efficiency issues in the current implementation of 
`TaskHistoryPruner`. The new design is similar to that of 
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon 
terminal task state transitions, it runs on preconfigured intervals, finds all 
terminal state tasks that meet pruning criteria and deletes them. (b) Makes the 
initial task history pruning delay configurable so that it does not hamper 
scheduler upon start.

The new design addressed the following two efficiecy problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning 
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these 
dead tasks upon restart when log is replayed. These expired tasks are picked up 
by the second call to `executor.execute()` that performs job level pruning 
immediately (i.e., without delay). Hence, most task history pruning happens 
after scheduler restarts and can severely hamper scheduler performance (or 
cause consecutive fail-overs on test clusters when we put load test on 
scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous 
nature of `BatchWorker` which used to process task deletions introduces some 
delay between delete enqueue and delete execution. As a result, tasks already 
queued for deletion in a previous evaluation round might get picked up, 
evaluated and enqueued for deletion again. This is evident in `tasks_pruned` 
metric which reflects numbers much higher than the actual number of expired 
tasks deleted.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
735199ac1ab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f77849498ff23616f1d56d133eb218f837ac3413 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
14e4040e0b94e96f77068b41454311fa3bf53573 

Diff: https://reviews.apache.org/r/56575/diff/


Testing
---

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-15 Thread Mehrdad Nurolahzade


> On Feb. 14, 2017, 4:02 p.m., Mehrdad Nurolahzade wrote:
> > Looking at it as is, I'm not sure if there is much value to be gained from 
> > pushing this down to `TaskStore`.
> > Do you see any value in pursuing this idea any further? Or shall I restore 
> > it to previous state?
> 
> Zameer Manji wrote:
> I think there is value in moving operations like this down to storage.
> First, it mirrors the update store for pruning old updates.
> Second, I think we can benefit from more precise storage operations. It 
> allows us to leverage the power of DbTaskStore (ie we don't need to hydrate a 
> full object for deletion) and ensures that callers don't do excessive work.
> 
> Mehrdad Nurolahzade wrote:
> Right, that was the motivation. 
> 
> However the way task deletion is handled now, it needs to go through 
> `StateManager` to (a) verify transition to `DELETED` is allowrd and (b) 
> publish `TaskDeleted` events to other subscribers like `TaskGroups`, 
> `TaskVars`, and so on. For this to work at storage level, like it's done for 
> job updates, we need to address its rather inefficient dependency on 
> `TaskManager`.
> 
> Santhosh Kumar Shanmugham wrote:
> Looking how there might be more work needed here, why don't we use this 
> RB to fix the O(N^2) issue on startup and handle the other improvements in a 
> separate RB. The way the current `TaskHistoryPruner` works on startup is 
> going need a fix.

Fair point, I am going to restore this RB to previous state unblocking it from 
getting merged.
We can address the general inefficiencies in task delete workflow separately.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165599
---


On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Santhosh Kumar Shanmugham


> On Feb. 14, 2017, 4:02 p.m., Mehrdad Nurolahzade wrote:
> > Looking at it as is, I'm not sure if there is much value to be gained from 
> > pushing this down to `TaskStore`.
> > Do you see any value in pursuing this idea any further? Or shall I restore 
> > it to previous state?
> 
> Zameer Manji wrote:
> I think there is value in moving operations like this down to storage.
> First, it mirrors the update store for pruning old updates.
> Second, I think we can benefit from more precise storage operations. It 
> allows us to leverage the power of DbTaskStore (ie we don't need to hydrate a 
> full object for deletion) and ensures that callers don't do excessive work.
> 
> Mehrdad Nurolahzade wrote:
> Right, that was the motivation. 
> 
> However the way task deletion is handled now, it needs to go through 
> `StateManager` to (a) verify transition to `DELETED` is allowrd and (b) 
> publish `TaskDeleted` events to other subscribers like `TaskGroups`, 
> `TaskVars`, and so on. For this to work at storage level, like it's done for 
> job updates, we need to address its rather inefficient dependency on 
> `TaskManager`.

Looking how there might be more work needed here, why don't we use this RB to 
fix the O(N^2) issue on startup and handle the other improvements in a separate 
RB. The way the current `TaskHistoryPruner` works on startup is going need a 
fix.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165599
---


On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Mehrdad Nurolahzade


> On Feb. 14, 2017, 4:02 p.m., Mehrdad Nurolahzade wrote:
> > Looking at it as is, I'm not sure if there is much value to be gained from 
> > pushing this down to `TaskStore`.
> > Do you see any value in pursuing this idea any further? Or shall I restore 
> > it to previous state?
> 
> Zameer Manji wrote:
> I think there is value in moving operations like this down to storage.
> First, it mirrors the update store for pruning old updates.
> Second, I think we can benefit from more precise storage operations. It 
> allows us to leverage the power of DbTaskStore (ie we don't need to hydrate a 
> full object for deletion) and ensures that callers don't do excessive work.

Right, that was the motivation. 

However the way task deletion is handled now, it needs to go through 
`StateManager` to (a) verify transition to `DELETED` is allowrd and (b) publish 
`TaskDeleted` events to other subscribers like `TaskGroups`, `TaskVars`, and so 
on. For this to work at storage level, like it's done for job updates, we need 
to address its rather inefficient dependency on `TaskManager`.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165599
---


On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165606
---


Ship it!




Master (0e9c086) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 14, 2017, 11:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 11:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165605
---



Master (0e9c086) is red with this patch.
  ./build-support/jenkins/build.sh

Execution failed for task ':test'.
> Process 'Gradle Test Executor 6' finished with non-zero exit value 137

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.
==

2: Task failed with an exception.
---
* What went wrong:
Execution failed for task ':analyzeReport'.
> Test coverage missing for org/apache/aurora/scheduler/storage/db/views/DbImage
  Test coverage missing for org/apache/aurora/scheduler/http/Mname
  Test coverage missing for org/apache/aurora/scheduler/http/Services
  Test coverage missing for org/apache/aurora/scheduler/http/QuitCallback
  Test coverage missing for org/apache/aurora/scheduler/http/Cron
  Test coverage missing for org/apache/aurora/scheduler/app/VolumeParser
  Test coverage missing for 
org/apache/aurora/scheduler/stats/AsyncStatsModule$OfferAdapter
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/IniShiroRealmModule
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/ShiroUtils
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/HttpSecurityModule$3
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/HttpSecurityModule$2
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/KerberosPrincipalParser
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/Kerberos5ShiroRealmModule
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/Kerberos5ShiroRealmModule$1
  Test coverage missing for 
org/apache/aurora/scheduler/discovery/CommonsServiceDiscoveryModule
  Test coverage missing for 
org/apache/aurora/scheduler/reconciliation/KillRetry$KillAttempt
  Test coverage missing for 
org/apache/aurora/scheduler/cron/quartz/CronPredictorImpl
  Test coverage missing for 
org/apache/aurora/scheduler/preemptor/Preemptor$PreemptorImpl
  Test coverage missing for 
org/apache/aurora/scheduler/events/PubsubEvent$DriverDisconnected
  Test coverage missing for 
org/apache/aurora/scheduler/events/PubsubEvent$DriverRegistered
  Test coverage missing for 
org/apache/aurora/scheduler/storage/db/typehandlers/VolumeModeTypeHandler

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.
==

BUILD FAILED

Total time: 29 mins 57.137 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 14, 2017, 11:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 11:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Zameer Manji


> On Feb. 14, 2017, 4:02 p.m., Mehrdad Nurolahzade wrote:
> > Looking at it as is, I'm not sure if there is much value to be gained from 
> > pushing this down to `TaskStore`.
> > Do you see any value in pursuing this idea any further? Or shall I restore 
> > it to previous state?

I think there is value in moving operations like this down to storage.
First, it mirrors the update store for pruning old updates.
Second, I think we can benefit from more precise storage operations. It allows 
us to leverage the power of DbTaskStore (ie we don't need to hydrate a full 
object for deletion) and ensures that callers don't do excessive work.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165599
---


On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165599
---



Looking at it as is, I'm not sure if there is much value to be gained from 
pushing this down to `TaskStore`.
Do you see any value in pursuing this idea any further? Or shall I restore it 
to previous state?

- Mehrdad Nurolahzade


On Feb. 14, 2017, 3:38 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 14, 2017, 3:38 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
> 1094a122fe836e53d0481ee5c097447f1e91fa0a 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
>   
> src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
> 02719c312294b58525c1fddd3ed096a9b1cef601 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-14 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
---

(Updated Feb. 14, 2017, 3:38 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
Stephan Erb.


Changes
---

WIP: pushing prunable history selection logic down to `TaskStore`.
- Done: `MemTaskStore`
- ToDo: `DbTaskStore`


Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
---

This patch addressed efficiency issues in the current implementation of 
`TaskHistoryPruner`. The new design is similar to that of 
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon 
terminal task state transitions, it runs on preconfigured intervals, finds all 
terminal state tasks that meet pruning criteria and deletes them. (b) Makes the 
initial task history pruning delay configurable so that it does not hamper 
scheduler upon start.

The new design addressed the following two efficiecy problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning 
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these 
dead tasks upon restart when log is replayed. These expired tasks are picked up 
by the second call to `executor.execute()` that performs job level pruning 
immediately (i.e., without delay). Hence, most task history pruning happens 
after scheduler restarts and can severely hamper scheduler performance (or 
cause consecutive fail-overs on test clusters when we put load test on 
scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous 
nature of `BatchWorker` which used to process task deletions introduces some 
delay between delete enqueue and delete execution. As a result, tasks already 
queued for deletion in a previous evaluation round might get picked up, 
evaluated and enqueued for deletion again. This is evident in `tasks_pruned` 
metric which reflects numbers much higher than the actual number of expired 
tasks deleted.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
735199ac1ab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f77849498ff23616f1d56d133eb218f837ac3413 
  src/main/java/org/apache/aurora/scheduler/storage/TaskStore.java 
1094a122fe836e53d0481ee5c097447f1e91fa0a 
  src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
  src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
d89e715b1b08faf95f8b5788c9c28cbbb33af093 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
14e4040e0b94e96f77068b41454311fa3bf53573 
  src/test/java/org/apache/aurora/scheduler/storage/mem/InMemTaskStoreTest.java 
02719c312294b58525c1fddd3ed096a9b1cef601 

Diff: https://reviews.apache.org/r/56575/diff/


Testing
---

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165419
---



After testing this patch on our test clusters, I can see that this 
implementation is generating lots of garbage that causes heap pressure. This is 
happening because filtering takes place in memory. Hence, even if no tasks are 
to be pruned, still lots of garbage is generated due to filtering.

To avoid in-memory filtering, I am going to experiment with moving the pruning 
logic into `TaskStore`; similar to how it is currently done in 
`JobUpdateStore.pruneHistory()`.

- Mehrdad Nurolahzade


On Feb. 13, 2017, 9:30 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 9:30 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread David McLaughlin


> On Feb. 13, 2017, 8:24 p.m., David McLaughlin wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > line 188
> > 
> >
> > If the goal is to reduce GC pressure, then what we want is to put an 
> > upper bound on object allocation. To do that, I'd be +1 to your earlier 
> > proposal to have an argument that limits the total number of tasks (across 
> > all jobs) that can be pruned. 
> > 
> > The other way to (possibly) reduce GC pressure is to do most of the 
> > filtering in H2 and only fetch task ids rather than fully hydrated task 
> > objects. Since H2 is on-heap, it might end up generating a lot of gargbage 
> > anyway.. but it would be hard to imagine that being more than the saving 
> > through the MyBatis -> Immutable Thrift translation.

Although the latter point only applies to those using DbTaskStore...


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165388
---


On Feb. 13, 2017, 5:30 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 5:30 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165388
---




src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
172)


If the goal is to reduce GC pressure, then what we want is to put an upper 
bound on object allocation. To do that, I'd be +1 to your earlier proposal to 
have an argument that limits the total number of tasks (across all jobs) that 
can be pruned. 

The other way to (possibly) reduce GC pressure is to do most of the 
filtering in H2 and only fetch task ids rather than fully hydrated task 
objects. Since H2 is on-heap, it might end up generating a lot of gargbage 
anyway.. but it would be hard to imagine that being more than the saving 
through the MyBatis -> Immutable Thrift translation.


- David McLaughlin


On Feb. 13, 2017, 5:30 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 5:30 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Santhosh Kumar Shanmugham


> On Feb. 13, 2017, 9:49 a.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 187-188
> > 
> >
> > I am still not convinced this is really needed. For example, 
> > `MetricCalculator` doesn't have such a throttle either. 
> > 
> > Have you tested if those 10ms are sufficient and/or needed at all? How 
> > did you came up with this number?

I am afraid that this will start making less sense in the future, since there 
are no comments on how this value was chosen. Why don't we add this only if we 
need it, after performing a test?


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165356
---


On Feb. 13, 2017, 9:30 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 9:30 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Santhosh Kumar Shanmugham


> On Feb. 12, 2017, 2:59 p.m., Mehrdad Nurolahzade wrote:
> > To further manage task history pruning heap/workload pressure, we can 
> > introduce a new configuation parameter in `PruningModule` for specifying 
> > the max number of tasks to be pruned per round. It can default to -1 
> > (unlimited).
> 
> Mehrdad Nurolahzade wrote:
> We can also consider adding sleep cycles (e.g., `Thread.sleep(10)`) in 
> the job iteration loop to further slow down cpu/heap pressure.
> 
> Stephan Erb wrote:
> We should only add those if you observe a need for them in your clusters.

+1 for adding them only if need be.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165282
---


On Feb. 13, 2017, 9:30 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 9:30 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165356
---


Fix it, then Ship it!




LGTM in general, +- the sleep below.


src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (lines 
171 - 172)


I am still not convinced this is really needed. For example, 
`MetricCalculator` doesn't have such a throttle either. 

Have you tested if those 10ms are sufficient and/or needed at all? How did 
you came up with this number?


- Stephan Erb


On Feb. 13, 2017, 6:30 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 6:30 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165357
---



Master (ad3377a) is red with this patch.
  ./build-support/jenkins/build.sh

Execution failed for task ':test'.
> Process 'Gradle Test Executor 6' finished with non-zero exit value 137

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.
==

2: Task failed with an exception.
---
* What went wrong:
Execution failed for task ':analyzeReport'.
> Test coverage missing for org/apache/aurora/scheduler/storage/db/views/DbImage
  Test coverage missing for org/apache/aurora/scheduler/http/Mname
  Test coverage missing for org/apache/aurora/scheduler/http/Services
  Test coverage missing for org/apache/aurora/scheduler/http/QuitCallback
  Test coverage missing for org/apache/aurora/scheduler/http/Cron
  Test coverage missing for org/apache/aurora/scheduler/app/VolumeParser
  Test coverage missing for 
org/apache/aurora/scheduler/stats/AsyncStatsModule$OfferAdapter
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/IniShiroRealmModule
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/ShiroUtils
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/HttpSecurityModule$3
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/HttpSecurityModule$2
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/KerberosPrincipalParser
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/Kerberos5ShiroRealmModule
  Test coverage missing for 
org/apache/aurora/scheduler/http/api/security/Kerberos5ShiroRealmModule$1
  Test coverage missing for 
org/apache/aurora/scheduler/discovery/CommonsServiceDiscoveryModule
  Test coverage missing for 
org/apache/aurora/scheduler/reconciliation/KillRetry$KillAttempt
  Test coverage missing for 
org/apache/aurora/scheduler/cron/quartz/CronPredictorImpl
  Test coverage missing for 
org/apache/aurora/scheduler/preemptor/Preemptor$PreemptorImpl
  Test coverage missing for 
org/apache/aurora/scheduler/events/PubsubEvent$DriverDisconnected
  Test coverage missing for 
org/apache/aurora/scheduler/events/PubsubEvent$DriverRegistered
  Test coverage missing for 
org/apache/aurora/scheduler/storage/db/typehandlers/VolumeModeTypeHandler

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.
==

BUILD FAILED

Total time: 5 mins 47.367 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 13, 2017, 5:30 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 13, 2017, 5:30 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. 

Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
---

(Updated Feb. 13, 2017, 9:30 a.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
Stephan Erb.


Changes
---

1. Review feedback
2. Added missing test-case
3. Added sleep cycles between processing jobs to soften the workload/heap blow


Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
---

This patch addressed efficiency issues in the current implementation of 
`TaskHistoryPruner`. The new design is similar to that of 
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon 
terminal task state transitions, it runs on preconfigured intervals, finds all 
terminal state tasks that meet pruning criteria and deletes them. (b) Makes the 
initial task history pruning delay configurable so that it does not hamper 
scheduler upon start.

The new design addressed the following two efficiecy problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning 
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these 
dead tasks upon restart when log is replayed. These expired tasks are picked up 
by the second call to `executor.execute()` that performs job level pruning 
immediately (i.e., without delay). Hence, most task history pruning happens 
after scheduler restarts and can severely hamper scheduler performance (or 
cause consecutive fail-overs on test clusters when we put load test on 
scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous 
nature of `BatchWorker` which used to process task deletions introduces some 
delay between delete enqueue and delete execution. As a result, tasks already 
queued for deletion in a previous evaluation round might get picked up, 
evaluated and enqueued for deletion again. This is evident in `tasks_pruned` 
metric which reflects numbers much higher than the actual number of expired 
tasks deleted.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
735199ac1ab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f77849498ff23616f1d56d133eb218f837ac3413 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
14e4040e0b94e96f77068b41454311fa3bf53573 

Diff: https://reviews.apache.org/r/56575/diff/


Testing
---

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Mehrdad Nurolahzade


> On Feb. 12, 2017, 11:54 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > lines 174-182
> > 
> >
> > Shouldn't all this be done on the `expiredTasks` rather than the 
> > `inactiveTasks`? Otherwise we risk deleting tasks which have not been 
> > around for the minimal retention period.

You are right, that's a mistake.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165297
---


On Feb. 12, 2017, 2:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 2:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-13 Thread Stephan Erb


> On Feb. 12, 2017, 11:59 p.m., Mehrdad Nurolahzade wrote:
> > To further manage task history pruning heap/workload pressure, we can 
> > introduce a new configuation parameter in `PruningModule` for specifying 
> > the max number of tasks to be pruned per round. It can default to -1 
> > (unlimited).
> 
> Mehrdad Nurolahzade wrote:
> We can also consider adding sleep cycles (e.g., `Thread.sleep(10)`) in 
> the job iteration loop to further slow down cpu/heap pressure.

We should only add those if you observe a need for them in your clusters.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165282
---


On Feb. 12, 2017, 11:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 11:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165297
---




src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
136)


Have you considered something like `Sets.newHashSet(...)`. This requires 
less code for the same effect.



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (lines 
157 - 165)


Shouldn't all this be done on the `expiredTasks` rather than the 
`inactiveTasks`? Otherwise we risk deleting tasks which have not been around 
for the minimal retention period.


- Stephan Erb


On Feb. 12, 2017, 11:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 11:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Mehrdad Nurolahzade


> On Feb. 12, 2017, 2:59 p.m., Mehrdad Nurolahzade wrote:
> > To further manage task history pruning heap/workload pressure, we can 
> > introduce a new configuation parameter in `PruningModule` for specifying 
> > the max number of tasks to be pruned per round. It can default to -1 
> > (unlimited).

We can also consider adding sleep cycles (e.g., `Thread.sleep(10)`) in the job 
iteration loop to further slow down cpu/heap pressure.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165282
---


On Feb. 12, 2017, 2:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 2:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165283
---


Ship it!




Master (ad3377a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 12, 2017, 10:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 10:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165282
---



To further manage task history pruning heap/workload pressure, we can introduce 
a new configuation parameter in `PruningModule` for specifying the max number 
of tasks to be pruned per round. It can default to -1 (unlimited).

- Mehrdad Nurolahzade


On Feb. 12, 2017, 2:49 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 2:49 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Mehrdad Nurolahzade

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/
---

(Updated Feb. 12, 2017, 2:49 p.m.)


Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
Stephan Erb.


Changes
---

Now limiting retrieved terminated tasks to a job scope to
(1) decrease heap pressure, and 
(2) eliminate full-scan of `MemTaskStore`


Bugs: AURORA-1837
https://issues.apache.org/jira/browse/AURORA-1837


Repository: aurora


Description
---

This patch addressed efficiency issues in the current implementation of 
`TaskHistoryPruner`. The new design is similar to that of 
`JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run upon 
terminal task state transitions, it runs on preconfigured intervals, finds all 
terminal state tasks that meet pruning criteria and deletes them. (b) Makes the 
initial task history pruning delay configurable so that it does not hamper 
scheduler upon start.

The new design addressed the following two efficiecy problems:

1. Upon scheduler restart/failure, the in-memory state of task history pruning 
scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns about these 
dead tasks upon restart when log is replayed. These expired tasks are picked up 
by the second call to `executor.execute()` that performs job level pruning 
immediately (i.e., without delay). Hence, most task history pruning happens 
after scheduler restarts and can severely hamper scheduler performance (or 
cause consecutive fail-overs on test clusters when we put load test on 
scheduler).

2. Expired tasks can be picked up for pruning multiple times. The asynchronous 
nature of `BatchWorker` which used to process task deletions introduces some 
delay between delete enqueue and delete execution. As a result, tasks already 
queued for deletion in a previous evaluation round might get picked up, 
evaluated and enqueued for deletion again. This is evident in `tasks_pruned` 
metric which reflects numbers much higher than the actual number of expired 
tasks deleted.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
735199ac1ab343c24471890aa330d6635c26 
  src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
f77849498ff23616f1d56d133eb218f837ac3413 
  src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
14e4040e0b94e96f77068b41454311fa3bf53573 

Diff: https://reviews.apache.org/r/56575/diff/


Testing
---

Manual testing under Vagrant


Thanks,

Mehrdad Nurolahzade



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Stephan Erb


> On Feb. 12, 2017, 9:14 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > line 62
> > 
> >
> > It is worthwhile to note that we are moving from a workload that was 
> > spread over a duration to a bursty instanteous workload (saw-tooth like), 
> > which can potentially make the situation worse by causing a thundering-herd 
> > at regular intervals.
> 
> Mehrdad Nurolahzade wrote:
> That's a valid concern; testing can better clarify this.
> 
> I agree that the existing algorithm offers a better best/average case 
> behavior (due to its scheduled pruning strategy). However, I still think the 
> worst case behavior of this implementation is better for two reasons (1) 
> every task/job is evaluated only once and (2) first prune after restart is 
> similar to other prunes and is not burstier. The burst can better be tamed by 
> reducing the pruning interval (e.g., 5 minutes).
> 
> I believe the key to get this bursty workload under control is extending 
> `org.apache.aurora.scheduler.base.Query` abstraction. If we add something 
> like `.limit(int)` then we can control the max volume of tasks retrieved == 
> load to be processed == garbage to be collected.

Have you considered to use a control flow in the form of:

for job j in all jobs:
  retrive terminal tasks of j
  do pruning for retrieved tasks  
   
This would result in less peak memory consumption as only a small portion of 
terminal tasks will be worked on simultaneously. If you are concerned about 
heap pressure this may be a favorable setup.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165260
---


On Feb. 12, 2017, 12:12 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 12:12 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Stephan Erb


> On Feb. 12, 2017, 9:14 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > line 153
> > 
> >
> > We need to verify this change at scale. We will be performing 
> > full-table scans since we cannot leverage the `jobIndex` secondary index. 
> > Should we create another index?

Given that since couple of weeks ago every rendering of the scheduler landing 
page resulted in a full table scale, I believe you will be good without an 
index.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165260
---


On Feb. 12, 2017, 12:12 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 12, 2017, 12:12 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Mehrdad Nurolahzade


> On Feb. 12, 2017, 12:14 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java, 
> > line 62
> > 
> >
> > It is worthwhile to note that we are moving from a workload that was 
> > spread over a duration to a bursty instanteous workload (saw-tooth like), 
> > which can potentially make the situation worse by causing a thundering-herd 
> > at regular intervals.

That's a valid concern; testing can better clarify this.

I agree that the existing algorithm offers a better best/average case behavior 
(due to its scheduled pruning strategy). However, I still think the worst case 
behavior of this implementation is better for two reasons (1) every task/job is 
evaluated only once and (2) first prune after restart is similar to other 
prunes and is not burstier. The burst can better be tamed by reducing the 
pruning interval (e.g., 5 minutes).

I believe the key to get this bursty workload under control is extending 
`org.apache.aurora.scheduler.base.Query` abstraction. If we add something like 
`.limit(int)` then we can control the max volume of tasks retrieved == load to 
be processed == garbage to be collected.


- Mehrdad


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165260
---


On Feb. 11, 2017, 3:12 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 11, 2017, 3:12 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-12 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165260
---



This change warrants a test at scale (in our test cluster) before committing, 
to make sure we don't regress.


src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
57)


It is worthwhile to note that we are moving from a workload that was spread 
over a duration to a bursty instanteous workload (saw-tooth like), which can 
potentially make the situation worse by causing a thundering-herd at regular 
intervals.



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
138)


We need to verify this change at scale. We will be performing full-table 
scans since we cannot leverage the `jobIndex` secondary index. Should we create 
another index?



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
140)


s/jobWithExpiredTasks/jobKeyToExpiredTasks/



src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java (line 
147)


s/jobTasks/expiredTasks/


- Santhosh Kumar Shanmugham


On Feb. 11, 2017, 3:12 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 11, 2017, 3:12 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 56575: AURORA-1837 Improve task history pruning

2017-02-11 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56575/#review165257
---


Ship it!




Master (ad3377a) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On Feb. 11, 2017, 11:12 p.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56575/
> ---
> 
> (Updated Feb. 11, 2017, 11:12 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, and 
> Stephan Erb.
> 
> 
> Bugs: AURORA-1837
> https://issues.apache.org/jira/browse/AURORA-1837
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> This patch addressed efficiency issues in the current implementation of 
> `TaskHistoryPruner`. The new design is similar to that of 
> `JobUpdateHistoryPruner`: (a) Instead of registering a `DelayExecutor` run 
> upon terminal task state transitions, it runs on preconfigured intervals, 
> finds all terminal state tasks that meet pruning criteria and deletes them. 
> (b) Makes the initial task history pruning delay configurable so that it does 
> not hamper scheduler upon start.
> 
> The new design addressed the following two efficiecy problems:
> 
> 1. Upon scheduler restart/failure, the in-memory state of task history 
> pruning scheduled with `DelayExecutor` is lost. `TaskHistoryPruner` learns 
> about these dead tasks upon restart when log is replayed. These expired tasks 
> are picked up by the second call to `executor.execute()` that performs job 
> level pruning immediately (i.e., without delay). Hence, most task history 
> pruning happens after scheduler restarts and can severely hamper scheduler 
> performance (or cause consecutive fail-overs on test clusters when we put 
> load test on scheduler).
> 
> 2. Expired tasks can be picked up for pruning multiple times. The 
> asynchronous nature of `BatchWorker` which used to process task deletions 
> introduces some delay between delete enqueue and delete execution. As a 
> result, tasks already queued for deletion in a previous evaluation round 
> might get picked up, evaluated and enqueued for deletion again. This is 
> evident in `tasks_pruned` metric which reflects numbers much higher than the 
> actual number of expired tasks deleted.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/Query.java 
> c76b365f43eb6a3b9b0b63a879b43eb04dcd8fac 
>   src/main/java/org/apache/aurora/scheduler/pruning/PruningModule.java 
> 735199ac1ab343c24471890aa330d6635c26 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> f77849498ff23616f1d56d133eb218f837ac3413 
>   
> src/test/java/org/apache/aurora/scheduler/pruning/TaskHistoryPrunerTest.java 
> 14e4040e0b94e96f77068b41454311fa3bf53573 
> 
> Diff: https://reviews.apache.org/r/56575/diff/
> 
> 
> Testing
> ---
> 
> Manual testing under Vagrant
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>