[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862579#comment-15862579
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

I like the idea of a {{RateLimitedBatchWorker}}. 

I'm hoping the submitted patch addresses some of the efficiency issues here.
However, it does not address the underlying problem: heavy pruning workloads
should be rate limited so that they do not interfere with scheduling or cause
bursty garbage collection pressure.

> Improve task history pruning
> 
>
> Key: AURORA-1837
> URL: https://issues.apache.org/jira/browse/AURORA-1837
> Project: Aurora
> Issue Type: Task
> Reporter: Reza Motamedi
> Assignee: Mehrdad Nurolahzade
> Priority: Minor
> Labels: scheduler
>
> The current implementation of {{TaskHistoryPruner}} registers all inactive
> tasks for pruning upon their terminal _state_ change.
> {{TaskHistoryPruner::registerInactiveTask()}} uses a delay executor to
> schedule the pruning of each task. However, we have noticed that most of the
> pruning takes place after the scheduler recovers from a fail-over.
> Modify {{TaskHistoryPruner}} to a design similar to
> {{JobUpdateHistoryPruner}}:
> # Instead of registering delay executors upon terminal task state
> transitions, have it wake up at preconfigured intervals, find all terminal
> state tasks that meet the pruning criteria, and delete them.
> # Make the initial task history pruning delay configurable so that it does
> not hamper the scheduler upon start.
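
The periodic design proposed in the description could look roughly like the
sketch below. It is only an illustration, not the submitted patch:
{{TaskRecord}}, {{PeriodicPrunerSketch}}, the in-memory list standing in for
task storage, and all parameter names are hypothetical, and only JDK classes
are used.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a stored task; not Aurora's real task type. */
final class TaskRecord {
  final String id;
  final boolean terminal;
  final long finishedAtMs;

  TaskRecord(String id, boolean terminal, long finishedAtMs) {
    this.id = id;
    this.terminal = terminal;
    this.finishedAtMs = finishedAtMs;
  }
}

/** Wakes up periodically instead of scheduling one delayed action per task. */
final class PeriodicPrunerSketch {
  private final List<TaskRecord> store; // hypothetical in-memory task store
  private final long retentionMs;

  PeriodicPrunerSketch(List<TaskRecord> store, long retentionMs) {
    this.store = store;
    this.retentionMs = retentionMs;
  }

  /** One pass: delete terminal tasks older than the retention window. */
  synchronized List<String> pruneOnce(long nowMs) {
    List<String> pruned = new ArrayList<>();
    for (Iterator<TaskRecord> it = store.iterator(); it.hasNext();) {
      TaskRecord task = it.next();
      if (task.terminal && nowMs - task.finishedAtMs >= retentionMs) {
        pruned.add(task.id);
        it.remove();
      }
    }
    return pruned;
  }

  /** The configurable initial delay keeps pruning off the startup path. */
  ScheduledExecutorService start(long initialDelayMs, long intervalMs) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    executor.scheduleWithFixedDelay(
        () -> pruneOnce(System.currentTimeMillis()),
        initialDelayMs, intervalMs, TimeUnit.MILLISECONDS);
    return executor;
  }
}
```

Calling {{start(initialDelayMs, intervalMs)}} schedules the passes; the caller
is expected to shut down the returned executor when the scheduler stops.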



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862581#comment-15862581
 ] 

Mehrdad Nurolahzade commented on AURORA-1837:
-

https://reviews.apache.org/r/56575/



[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:01 PM:
-

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on
startup); this causes the delay to become zero. Due to the inefficiency in the
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.
Although {{BatchWorker}} is designed to reduce lock contention, it does not
provide any rate limiting and suffers under bursty workloads. Responsiveness to
bursty workloads makes sense for scheduling work; however, the same cannot be
said for house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning
({{JobUpdateHistoryPruner}}) and database garbage collection
({{RowGarbageCollector}}) can be characterized as house-keeping work that is
not in the critical scheduling path, it would make sense to rate-limit these
ambient activities, so that the scheduler is protected from bursts of
non-critical work (such as job updates with a large number of instances,
network partitions, or cleaning up after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}}
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give
priority to critical (scheduling) work from {{JobUpdateController}},
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a
{{PriorityQueue}} (with the necessary changes to {{Work}}).
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}}
can then enqueue work into the new {{RateLimitedBatchWorker}}, which in turn
will release the work into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]
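
A minimal sketch of the idea above, assuming nothing about the real
{{BatchWorker}} API: {{WorkItem}}, {{PriorityWorkQueue}} and
{{RateLimitedFeederSketch}} are hypothetical names, and a simple fixed-interval
pacer driven by an explicit clock stands in for Guava's {{RateLimiter}} so that
the example needs only the JDK.

```java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Queue;

/** Hypothetical unit of work; a lower priority value is served first. */
final class WorkItem {
  static final int CRITICAL = 0;
  static final int HOUSEKEEPING = 1;

  final String name;
  final int priority;
  final long sequence; // preserves arrival order within a priority level

  WorkItem(String name, int priority, long sequence) {
    this.name = name;
    this.priority = priority;
    this.sequence = sequence;
  }
}

/** Stand-in for the worker's queue, ordered by priority and then arrival. */
final class PriorityWorkQueue {
  private final PriorityQueue<WorkItem> queue = new PriorityQueue<>(
      Comparator.<WorkItem>comparingInt(w -> w.priority)
          .thenComparingLong(w -> w.sequence));
  private long nextSequence = 0;

  synchronized void enqueue(String name, int priority) {
    queue.add(new WorkItem(name, priority, nextSequence++));
  }

  /** Returns the most urgent item, or null when the queue is empty. */
  synchronized WorkItem poll() {
    return queue.poll();
  }
}

/** Buffers house-keeping work and releases it downstream at a bounded rate. */
final class RateLimitedFeederSketch {
  private final Queue<String> pending = new ArrayDeque<>();
  private final PriorityWorkQueue downstream;
  private final long minIntervalMs;
  private long nextReleaseAtMs = 0;

  RateLimitedFeederSketch(PriorityWorkQueue downstream, long minIntervalMs) {
    this.downstream = downstream;
    this.minIntervalMs = minIntervalMs;
  }

  synchronized void submit(String name) {
    pending.add(name);
  }

  /** Releases at most one pending item once the pacing interval has elapsed. */
  synchronized boolean releaseIfDue(long nowMs) {
    if (pending.isEmpty() || nowMs < nextReleaseAtMs) {
      return false;
    }
    downstream.enqueue(pending.remove(), WorkItem.HOUSEKEEPING);
    nextReleaseAtMs = nowMs + minIntervalMs;
    return true;
  }
}
```

In this sketch a burst of pruning requests only grows the feeder's own buffer,
while scheduling work enqueued directly with {{WorkItem.CRITICAL}} is always
polled ahead of any house-keeping item that has already been released.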




[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 11:00 PM:
-

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on
startup); this causes the delay to become zero. Due to the inefficiency in the
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.
Although {{BatchWorker}} is designed to reduce lock contention, it does not
provide any rate limiting and suffers under bursty workloads. Responsiveness to
bursty workloads makes sense for scheduling work; however, the same cannot be
said for house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning
({{JobUpdateHistoryPruner}}) and database garbage collection
({{RowGarbageCollector}}) can be characterized as house-keeping work that is
not in the critical scheduling path, it would make sense to rate-limit these
ambient activities, so that the scheduler is protected from bursts of
non-critical work (such as job updates with a large number of instances,
network partitions, or cleaning up after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}}
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give
priority to critical (scheduling) work from {{JobUpdateController}},
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a
{{PriorityQueue}} (with the necessary changes to {{Work}}).
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}}
can then enqueue work into the new {{RateLimitedBatchWorker}}, which in turn
will release the work into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]




[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 10:58 PM:
-

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on
startup); this causes the delay to become zero. Due to the inefficiency in the
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.
Although {{BatchWorker}} is designed to reduce lock contention, it does not
provide any rate limiting and suffers under bursty workloads. Responsiveness to
bursty workloads makes sense for scheduling work; however, the same cannot be
said for house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning
({{JobUpdateHistoryPruner}}) and database garbage collection
({{RowGarbageCollector}}) can be characterized as house-keeping work that is
not in the critical scheduling path, it would make sense to rate-limit these
ambient activities, so that the scheduler is protected from bursts of
non-critical work (such as job updates with a large number of instances,
network partitions, or cleaning up after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}}
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give
priority to critical (scheduling) work from {{JobUpdateController}},
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a
{{PriorityQueue}} (with the necessary changes to {{Work}}).
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}}
can then enqueue work into the new {{RateLimitedBatchWorker}}; that work will
be released into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]




[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham edited comment on AURORA-1837 at 2/11/17 10:59 PM:
-

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on
startup); this causes the delay to become zero. Due to the inefficiency in the
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.
Although {{BatchWorker}} is designed to reduce lock contention, it does not
provide any rate limiting and suffers under bursty workloads. Responsiveness to
bursty workloads makes sense for scheduling work; however, the same cannot be
said for house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning
({{JobUpdateHistoryPruner}}) and database garbage collection
({{RowGarbageCollector}}) can be characterized as house-keeping work that is
not in the critical scheduling path, it would make sense to rate-limit these
ambient activities, so that the scheduler is protected from bursts of
non-critical work (such as job updates with a large number of instances,
network partitions, or cleaning up after a scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}}
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give
priority to critical (scheduling) work from {{JobUpdateController}},
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a
{{PriorityQueue}} (with the necessary changes to {{Work}}).
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}}
can then enqueue work into the new {{RateLimitedBatchWorker}}; that work will
be released into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]




[jira] [Commented] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Santhosh Kumar Shanmugham (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862566#comment-15862566
 ] 

Santhosh Kumar Shanmugham commented on AURORA-1837:
---

Looks like the {{CallOrderEnforcingStorage}} 
[publishes|https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/CallOrderEnforcingStorage.java#L95-L100]
 a {{TaskStateChange}} event for every known task on startup. Note how the
{{oldState}} is set to {{Optional.absent()}} (due to lack of knowledge on
startup); this causes the delay to become zero. Due to the inefficiency in the
implementation, we enqueue ~O(N^2) items into the {{BatchWorker}} queue.
Although {{BatchWorker}} is designed to reduce lock contention, it does not
provide any rate limiting and suffers under bursty workloads. Responsiveness to
bursty workloads makes sense for scheduling work; however, the same cannot be
said for house-keeping work.

Seeing how history-pruning ({{TaskHistoryPruner}}), job-update-pruning
({{JobUpdateHistoryPruner}}) and DB GC ({{RowGarbageCollector}}) can be
characterized as house-keeping work that is not in the critical scheduling
path, it would make sense to rate-limit these ambient activities, so that the
scheduler is protected from bursts of non-critical work (such as job updates
with a large number of instances, network partitions, or cleaning up after a
scale test).

One possible design would involve creating a new {{RateLimitedBatchWorker}}
that feeds into the {{BatchWorker}}'s queue at a controlled rate. To give
priority to critical (scheduling) work from {{JobUpdateController}},
{{TaskThrottler}}, etc., {{BatchWorker}}'s queue should be changed to a
{{PriorityQueue}} (with the necessary changes to {{Work}}).
{{TaskHistoryPruner}}, {{JobUpdateHistoryPruner}} and {{RowGarbageCollector}}
can then enqueue work into the new {{RateLimitedBatchWorker}}; that work will
be released into the underlying {{BatchWorker}} at a steady rate.

We can take advantage of Java's 
[PriorityQueue|https://docs.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html]
 and Guava's 
[RateLimiter|https://google.github.io/guava/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html]




[jira] [Comment Edited] (AURORA-1837) Improve task history pruning

2017-02-11 Thread Mehrdad Nurolahzade (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15746961#comment-15746961
 ] 

Mehrdad Nurolahzade edited comment on AURORA-1837 at 2/11/17 7:59 PM:
--

-This becomes a problem on a scheduler that has been down longer than 
{{HISTORY_PRUNE_THRESHOLD}} and an attempt is made to relaunch it. If the 
scheduler was used for load testing, for example, all/many tasks will be dead 
at relaunch time.-

@StephanErb's suggestion above should improve performance. However, we still
need to come up with a design for separating instance-based and job-based
pruning.

