[jira] [Commented] (AURORA-137) Save host attributes only when a task is being scheduled

2016-11-19 Thread Stephan Erb (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15678964#comment-15678964
 ] 

Stephan Erb commented on AURORA-137:


Thinking about it, this might not even be necessary:

* {{offerManager.addOffer}} is thread-safe. It is not necessary to place it in 
a write section. We just have to make sure it is only called once the 
attributes have been handled.
* The call to 
{{storeProvider.getAttributeStore().saveHostAttributes(attributes)}} will most 
certainly only update attributes in very few cases as attributes are not 
dynamic. We could therefore adopt some form of double-checked locking as you 
have proposed in the other ticket.
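
A minimal sketch of the double-checked idea, assuming attributes rarely change. {{AttributeCache}}, {{saveIfChanged}} and the string-valued attributes are hypothetical stand-ins for illustration, not Aurora's storage API: the first comparison takes only the shared read lock, and the exclusive write lock is acquired only when the attributes appear to differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for an attribute store guarded by a read/write lock. */
class AttributeCache {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, String> attributesByHost = new HashMap<>();

  // Counts how often the exclusive lock was taken (only mutated while holding it).
  int writeLockAcquisitions = 0;

  /** Returns true only if the stored attributes were actually changed. */
  boolean saveIfChanged(String host, String attributes) {
    // First check: shared read lock only, so concurrent offers do not serialize.
    lock.readLock().lock();
    try {
      if (attributes.equals(attributesByHost.get(host))) {
        return false;
      }
    } finally {
      lock.readLock().unlock();
    }

    // Slow path: the attributes look new or different, so take the write lock.
    lock.writeLock().lock();
    try {
      writeLockAcquisitions++;
      // Second check: another thread may have stored the same value in between.
      if (attributes.equals(attributesByHost.get(host))) {
        return false;
      }
      attributesByHost.put(host, attributes);
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

In this sketch the offer would be handed to the (thread-safe) offer manager only after {{saveIfChanged}} returns, outside of any write section.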


> Save host attributes only when a task is being scheduled
> 
>
> Key: AURORA-137
> URL: https://issues.apache.org/jira/browse/AURORA-137
> Project: Aurora
> Issue Type: Story
> Components: Scheduler
> Reporter: Bill Farner
> Priority: Minor
>
> The scheduler currently aggressively saves host attributes when handling 
> {{resourceOffers}}; however, it seems tractable for this to only happen when a 
> task is actually scheduled.  Context: the scheduler stores host attributes to 
> satisfy scheduling constraints (like host/rack diversity).  Doing this would 
> allow us to avoid waiting for the storage write lock, and handle 
> {{resourceOffers}} in a more deterministic time frame.
> One caveat with this approach is that the Offer would need to be plumbed into 
> {{SchedulingFilterImpl}} in a way so as to ensure that the attributes are 
> available for the offer being inspected.  In other words, we need to avoid 
> the chicken and egg of trying to read the attributes for a host when this is 
> the first offer ever received for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-137) Save host attributes only when a task is being scheduled

2016-11-19 Thread Bill Farner (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679560#comment-15679560
 ] 

Bill Farner commented on AURORA-137:


The actual amount of write activity for attributes would be useful data to 
decide whether {{BatchWorker}} is useful here.  Echoing one of Stephan's points: 
{{DbAttributeStore.saveHostAttributes()}} _should_ only trigger a write 
when attributes have actually changed, so I'd expect attribute log entries to 
rapidly quiesce.

Relevant snippet from {{DbAttributeStore.saveHostAttributes()}}:
{code}
Optional<IHostAttributes> existing = getHostAttributes(hostAttributes.getHost());
if (existing.equals(Optional.of(hostAttributes))) {
  return false;
...
{code}

I think my original goal with this ticket was to eliminate the call to 
{{storage.write()}} when handling {{resourceOffers()}}, and instead store host 
attributes when a scheduling match is found (and we're already in a write 
transaction).  This would satisfy the current purpose of host attributes 
(satisfying diversity constraints) while avoiding holding a highly-contended 
lock unnecessarily.

{quote}
 have noticed contention on storage write lock
{quote}
[~mnurolahzade] is there any data you can share on this?  Presumably the 
existing {{scheduler_resource_offers_*}} stats no longer give insight, as the 
method body is asynchronous.  Perhaps another low-effort action item is to 
extract the synchronous stage of that method so that the {{@Timed}} stats are 
meaningful.






[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-19 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679664#comment-15679664
 ] 

Zameer Manji commented on AURORA-1823:

Upon some further analysis, {{BatchWorker}} might not help us here. After some 
JMH benchmarking and profiling, the biggest problem with {{insertPendingTasks}} 
is that it doesn't use the bulk storage API {{saveTasks}}. Instead it calls 
{{mutateTask}} for every task that is moving to {{PENDING}}. I can get a 10x+ 
improvement in throughput by simply queueing up mutations and side effects that 
are a result of the state machine and then calling {{saveTasks}} once all of 
the mutations have been computed.

I'm going to look into refactoring {{StateManagerImpl}} to support evaluating 
multiple task state machines concurrently and then merging all of the side 
effects from those state machines into a single operation.
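
A rough sketch of the queue-then-bulk-save idea, under simplifying assumptions: {{CountingStore}} is a hypothetical in-memory stand-in whose {{mutateTask}} and {{saveTasks}} only borrow the names of the real storage methods (the signatures here are invented), and the state machine is reduced to assigning a status string. It only illustrates the change in the number of storage calls, not the measured throughput.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BatchedPendingInsert {
  /** Hypothetical in-memory store that counts how many write calls it receives. */
  static class CountingStore {
    final Map<String, String> statusByTaskId = new LinkedHashMap<>();
    int writeCalls = 0;

    // One storage call per task.
    void mutateTask(String taskId, String status) {
      writeCalls++;
      statusByTaskId.put(taskId, status);
    }

    // One storage call for the whole batch.
    void saveTasks(Map<String, String> tasks) {
      writeCalls++;
      statusByTaskId.putAll(tasks);
    }
  }

  /** Per-task approach: every transition to PENDING hits the store separately. */
  static void insertOneByOne(CountingStore store, List<String> taskIds) {
    for (String taskId : taskIds) {
      store.mutateTask(taskId, "PENDING");
    }
  }

  /** Batched approach: compute all mutations first, then save them in one call. */
  static void insertBatched(CountingStore store, List<String> taskIds) {
    Map<String, String> mutations = new LinkedHashMap<>();
    for (String taskId : taskIds) {
      mutations.put(taskId, "PENDING"); // queue the mutation, no storage access yet
    }
    store.saveTasks(mutations);
  }
}
```

Both paths leave the store in the same final state; only the number of storage calls differs (one per task versus one per batch).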


> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
> Issue Type: Bug
> Reporter: Zameer Manji
> Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.


