Re: Review Request 58259: Add update affinity to Scheduler

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 19, 2017, 1 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java
> > Lines 98-106 (original), 100-108 (patched)
> > 
> >
> > In order to reduce the duration the offer is help in the 
> > `UpdateAgentReserver`, wonder if this might be a better place to add to the 
> > reserver? Killing a task can take arbitrarily long and we risk expiring the 
> > offer.
> 
> David McLaughlin wrote:
> There are race conditions that appear when moving the logic to ADD_TASK. 
> Some other pending task could be assigned into the offer in the gap between 
> the task being killed by Mesos and the JobUpdateController processing it. We 
> use batch workers to batch up work, so there's definitely gaps where Mesos 
> can send the resources freed up by the kill to Aurora and they are used for 
> some other pending task during this window. 
> 
> It might seem like a rare race condition, but we've seen at Twitter 
> issues where a bad host causes chronic issues because there are no slots for 
> some large tasks. So in low capacity situations, we'd see things like this 
> happen a lot.  
> 
> Also note: killing a task doesn't have unbounded execution. There is an 
> upper bound of 5s at the executor level. One other issue here is the 
> turnaround from a task being killed to Mesos sending Aurora a new offer with 
> the freed up resources. Obviously there are best-effort network calls 
> involved in both killing and the offer management, so there are definitely 
> issues where offers could expire unnecessarily due to latency. 
> 
> At least by reserving the agent upfront when you learn that you're going 
> to be sending a future ADD_TASK, you have the control to extend the timeout 
> to increase cache hits (although I'd be careful extending it too long). 
> 
> We will have run several simulations in our scale test environment to 
> confirm some of this before it ships.

Let us scale test to confirm the assumptions. Otherwise I am satisfied with the 
patch.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172324
---


On April 12, 2017, 12:51 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> ---
> 
> (Updated April 12, 2017, 12:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java 
> PRE-CREATION 
>   

Re: Review Request 57487: Implementation of Dynamic Reservations Proposal

2017-04-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57487/#review172435
---



Master (b847db8) is green with this patch.
  ./build-support/jenkins/build.sh

However, it appears that it might lack test coverage.

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On April 20, 2017, 12:14 a.m., Dmitriy Shirchenko wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/57487/
> ---
> 
> (Updated April 20, 2017, 12:14 a.m.)
> 
> 
> Review request for Aurora, Mehrdad Nurolahzade, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Esteemed reviewers, here is the latest iteration on the implementation of 
> dynamic reservations. Some changes are merging of the patches into a single 
> one, updated design document with a more high level overview and user stories 
> told from an operator’s point of view. Unit TESTS are going to be done as 
> soon as we agree on the approach, as I have tested this patch on local 
> vagrant and a multi-node dev cluster. Jenkins build is expected to fail as 
> tested are incomplete.
> 
> For reference, here are previous two patches which feedback I addressed in 
> this new single patch. 
> Previous 2 patches:
> https://reviews.apache.org/r/56690/
> https://reviews.apache.org/r/56691/
> 
> RFC document: 
> https://docs.google.com/document/d/15n29HSQPXuFrnxZAgfVINTRP1Iv47_jfcstJNuMwr5A
> Design Doc [UPDATED]: 
> https://docs.google.com/document/d/1L2EKEcKKBPmuxRviSUebyuqiNwaO-2hsITBjt3SgWvE
> 
> 
> Diffs
> -
> 
>   examples/vagrant/mesos_config/etc_mesos-slave/resources 
> aa0e97e1c4a6c1a76cc712549159db9336d051eb 
>   examples/vagrant/upstart/aurora-scheduler.conf 
> 63fcc87be653835cb3c3f25dae4164f1d7c8d4da 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
> f2296a9d7a88be7e43124370edecfe64415df00f 
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 
> 6f2ca35c5d83dde29c24865b4826d4932e96da80 
>   src/main/java/org/apache/aurora/scheduler/HostOffer.java 
> bc40d0798f40003cab5bf6efe607217e4d5de9f1 
>   src/main/java/org/apache/aurora/scheduler/TaskVars.java 
> 676dfd9f9d7ee0633c05424f788fd0ab116976bb 
>   src/main/java/org/apache/aurora/scheduler/TierInfo.java 
> c45b949ae7946fc92d7e62f94696ddc4f0790cfa 
>   src/main/java/org/apache/aurora/scheduler/TierManager.java 
> c6ad2b1c48673ca2c14ddd308684d81ce536beca 
>   src/main/java/org/apache/aurora/scheduler/base/InstanceKeys.java 
> b12ac83168401c15fb1d30179ea8e4816f09cd3d 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   
> src/main/java/org/apache/aurora/scheduler/configuration/ConfigurationManager.java
>  754fde0fdc976b673d78ae15d8ccd8c85b792373 
>   
> src/main/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilter.java
>  f6c759f03c4152ae93317692fc9db202fe251122 
>   src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilter.java 
> 36608a9f027c95723c31f9915852112beb367223 
>   src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilterImpl.java 
> df51d4cf4893899613683603ab4aa9aefa88faa6 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java 
> 0d639f66db456858278b0485c91c40975c3b45ac 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
> 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java 
> adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 
> 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
>   
> src/main/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java
>  5ed578cc4c11b49f607db5f7e516d9e6022a926c 
>   src/main/java/org/apache/aurora/scheduler/resources/AcceptedOffer.java 
> 291d5c95916915afc48a7143759e523fccd52feb 
>   
> src/main/java/org/apache/aurora/scheduler/resources/MesosResourceConverter.java
>  7040004ae48d3a9d0985cb9b231f914ebf6ff5a4 
>   src/main/java/org/apache/aurora/scheduler/resources/ResourceManager.java 
> 9aa263a9cfae03a9a0c5bc7fe3a1405397d3009c 
>   src/main/java/org/apache/aurora/scheduler/resources/ResourceMapper.java 
> 375f93c5277a78666fc4823382c82ac4d179f39d 
>   
> src/main/java/org/apache/aurora/scheduler/scheduling/ReservationTimeoutCalculator.java
>  PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
> 03a0e8485d1a392f107fda5b4af05b7f8f6067c6 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> 

Re: Review Request 57487: Implementation of Dynamic Reservations Proposal

2017-04-19 Thread Dmitriy Shirchenko

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/57487/
---

(Updated April 20, 2017, 12:14 a.m.)


Review request for Aurora, Mehrdad Nurolahzade, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

Esteemed reviewers, here is the latest iteration on the implementation of 
dynamic reservations. Some changes are merging of the patches into a single 
one, updated design document with a more high level overview and user stories 
told from an operator’s point of view. Unit TESTS are going to be done as soon 
as we agree on the approach, as I have tested this patch on local vagrant and a 
multi-node dev cluster. Jenkins build is expected to fail as tested are 
incomplete.

For reference, here are previous two patches which feedback I addressed in this 
new single patch. 
Previous 2 patches:
https://reviews.apache.org/r/56690/
https://reviews.apache.org/r/56691/

RFC document: 
https://docs.google.com/document/d/15n29HSQPXuFrnxZAgfVINTRP1Iv47_jfcstJNuMwr5A
Design Doc [UPDATED]: 
https://docs.google.com/document/d/1L2EKEcKKBPmuxRviSUebyuqiNwaO-2hsITBjt3SgWvE


Diffs (updated)
-

  examples/vagrant/mesos_config/etc_mesos-slave/resources 
aa0e97e1c4a6c1a76cc712549159db9336d051eb 
  examples/vagrant/upstart/aurora-scheduler.conf 
63fcc87be653835cb3c3f25dae4164f1d7c8d4da 
  src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 
f2296a9d7a88be7e43124370edecfe64415df00f 
  src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 
6f2ca35c5d83dde29c24865b4826d4932e96da80 
  src/main/java/org/apache/aurora/scheduler/HostOffer.java 
bc40d0798f40003cab5bf6efe607217e4d5de9f1 
  src/main/java/org/apache/aurora/scheduler/TaskVars.java 
676dfd9f9d7ee0633c05424f788fd0ab116976bb 
  src/main/java/org/apache/aurora/scheduler/TierInfo.java 
c45b949ae7946fc92d7e62f94696ddc4f0790cfa 
  src/main/java/org/apache/aurora/scheduler/TierManager.java 
c6ad2b1c48673ca2c14ddd308684d81ce536beca 
  src/main/java/org/apache/aurora/scheduler/base/InstanceKeys.java 
b12ac83168401c15fb1d30179ea8e4816f09cd3d 
  src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
  
src/main/java/org/apache/aurora/scheduler/configuration/ConfigurationManager.java
 754fde0fdc976b673d78ae15d8ccd8c85b792373 
  
src/main/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilter.java 
f6c759f03c4152ae93317692fc9db202fe251122 
  src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilter.java 
36608a9f027c95723c31f9915852112beb367223 
  src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilterImpl.java 
df51d4cf4893899613683603ab4aa9aefa88faa6 
  src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java 
0d639f66db456858278b0485c91c40975c3b45ac 
  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 
78255e6dfa31c4920afc0221ee60ec4f8c2a12c4 
  src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java 
adf7f33e4a72d87c3624f84dfe4998e20dc75fdc 
  src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 
317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2 
  
src/main/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilter.java 
5ed578cc4c11b49f607db5f7e516d9e6022a926c 
  src/main/java/org/apache/aurora/scheduler/resources/AcceptedOffer.java 
291d5c95916915afc48a7143759e523fccd52feb 
  
src/main/java/org/apache/aurora/scheduler/resources/MesosResourceConverter.java 
7040004ae48d3a9d0985cb9b231f914ebf6ff5a4 
  src/main/java/org/apache/aurora/scheduler/resources/ResourceManager.java 
9aa263a9cfae03a9a0c5bc7fe3a1405397d3009c 
  src/main/java/org/apache/aurora/scheduler/resources/ResourceMapper.java 
375f93c5277a78666fc4823382c82ac4d179f39d 
  
src/main/java/org/apache/aurora/scheduler/scheduling/ReservationTimeoutCalculator.java
 PRE-CREATION 
  src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 
03a0e8485d1a392f107fda5b4af05b7f8f6067c6 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
203f62bacc47470545d095e4d25f7e0f25990ed9 
  src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
a177b301203143539b052524d14043ec8a85a46d 
  src/main/java/org/apache/aurora/scheduler/stats/AsyncStatsModule.java 
40451e91aed45866c2030d901160cc4e084834df 
  src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
c129896d8cd54abd2634e2a339c27921042b0162 
  src/main/resources/org/apache/aurora/scheduler/tiers.json 
34ddb1dc769a73115c209c9b2ee158cd364392d8 
  src/test/java/org/apache/aurora/scheduler/TierManagerTest.java 
82e40d509d84c37a19b6a9ef942283d908833840 
  
src/test/java/org/apache/aurora/scheduler/configuration/ConfigurationManagerTest.java
 50d7499f4332a3feb0e2301cb707f2cea6bb2e98 
  
src/test/java/org/apache/aurora/scheduler/filter/SchedulingFilterImplTest.java 

Re: Review Request 58259: Add update affinity to Scheduler

2017-04-19 Thread David McLaughlin


> On April 19, 2017, 8 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java
> > Lines 228-229 (patched)
> > 
> >
> > Should we check if the offer is already reserved for preemption before 
> > proceeding? It is possible for a preemption to fail an update, if the 
> > premption steals the instance that was just updated.

Currently that is not possible, since we also check if the agentReserver has 
any reservations for a host inside the maybeAssign default path.


> On April 19, 2017, 8 a.m., Santhosh Kumar Shanmugham wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java
> > Lines 98-106 (original), 100-108 (patched)
> > 
> >
> > In order to reduce the duration the offer is help in the 
> > `UpdateAgentReserver`, wonder if this might be a better place to add to the 
> > reserver? Killing a task can take arbitrarily long and we risk expiring the 
> > offer.

There are race conditions that appear when moving the logic to ADD_TASK. Some 
other pending task could be assigned into the offer in the gap between the task 
being killed by Mesos and the JobUpdateController processing it. We use batch 
workers to batch up work, so there's definitely gaps where Mesos can send the 
resources freed up by the kill to Aurora and they are used for some other 
pending task during this window. 

It might seem like a rare race condition, but we've seen at Twitter issues 
where a bad host causes chronic issues because there are no slots for some 
large tasks. So in low capacity situations, we'd see things like this happen a 
lot.  

Also note: killing a task doesn't have unbounded execution. There is an upper 
bound of 5s at the executor level. One other issue here is the turnaround from 
a task being killed to Mesos sending Aurora a new offer with the freed up 
resources. Obviously there are best-effort network calls involved in both 
killing and the offer management, so there are definitely issues where offers 
could expire unnecessarily due to latency. 

At least by reserving the agent upfront when you learn that you're going to be 
sending a future ADD_TASK, you have the control to extend the timeout to 
increase cache hits (although I'd be careful extending it too long). 

We will have run several simulations in our scale test environment to confirm 
some of this before it ships.


- David


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172324
---


On April 12, 2017, 7:51 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> ---
> 
> (Updated April 12, 2017, 7:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> 

Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Stephan Erb


> On April 19, 2017, 6:32 p.m., Stephan Erb wrote:
> > I am not sure I understand your description. If you want faster updates, 
> > why don't you just reduce `watch_secs`? 
> > 
> > I understand `watch_secs` as the the time that we always wait _after_ we 
> > have entered `RUNNING`. So no task should be in `STARTING` state when we 
> > are watching. The documentation seems to agree with that: `Minimum number 
> > of seconds a shard must remain in ```RUNNING``` state before considered a 
> > success (Default: 45)`.
> > 
> > (I haven't looked into the code yet, but maybe you can clarify before I do).
> 
> Santhosh Kumar Shanmugham wrote:
> That was the intent initially. However we risk delaying all updates 
> without any changes to `watch_secs`. That is the main concern here. Let me 
> know if this will be a breaking for you.
> 
> Zameer Manji wrote:
> It's breaking for me. Instead can we just change the default value of 
> `watch_secs` to be 0?
> 
> Santhosh Kumar Shanmugham wrote:
> That is possible. It just means its not an opt-in change anymore.
> 
> Zameer Manji wrote:
> I'm ok to change the defaults and have our upgrades wildly fast by 
> default. Users can tweak health checks to have slower upgrades and only use 
> this parameter to have a fixed amount of time after an instance. This value 
> is the most brittle and imprecise.

Changing the default sounds good for me as well.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---


On April 19, 2017, 6:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 6:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Zameer Manji


> On April 19, 2017, 9:32 a.m., Stephan Erb wrote:
> > I am not sure I understand your description. If you want faster updates, 
> > why don't you just reduce `watch_secs`? 
> > 
> > I understand `watch_secs` as the the time that we always wait _after_ we 
> > have entered `RUNNING`. So no task should be in `STARTING` state when we 
> > are watching. The documentation seems to agree with that: `Minimum number 
> > of seconds a shard must remain in ```RUNNING``` state before considered a 
> > success (Default: 45)`.
> > 
> > (I haven't looked into the code yet, but maybe you can clarify before I do).
> 
> Santhosh Kumar Shanmugham wrote:
> That was the intent initially. However we risk delaying all updates 
> without any changes to `watch_secs`. That is the main concern here. Let me 
> know if this will be a breaking for you.
> 
> Zameer Manji wrote:
> It's breaking for me. Instead can we just change the default value of 
> `watch_secs` to be 0?
> 
> Santhosh Kumar Shanmugham wrote:
> That is possible. It just means its not an opt-in change anymore.

I'm ok to change the defaults and have our upgrades wildly fast by default. 
Users can tweak health checks to have slower upgrades and only use this 
parameter to have a fixed amount of time after an instance. This value is the 
most brittle and imprecise.


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---


On April 19, 2017, 9:18 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:18 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 19, 2017, 9:32 a.m., Stephan Erb wrote:
> > I am not sure I understand your description. If you want faster updates, 
> > why don't you just reduce `watch_secs`? 
> > 
> > I understand `watch_secs` as the the time that we always wait _after_ we 
> > have entered `RUNNING`. So no task should be in `STARTING` state when we 
> > are watching. The documentation seems to agree with that: `Minimum number 
> > of seconds a shard must remain in ```RUNNING``` state before considered a 
> > success (Default: 45)`.
> > 
> > (I haven't looked into the code yet, but maybe you can clarify before I do).
> 
> Santhosh Kumar Shanmugham wrote:
> That was the intent initially. However we risk delaying all updates 
> without any changes to `watch_secs`. That is the main concern here. Let me 
> know if this will be a breaking for you.
> 
> Zameer Manji wrote:
> It's breaking for me. Instead can we just change the default value of 
> `watch_secs` to be 0?

That is possible. It just means its not an opt-in change anymore.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---


On April 19, 2017, 9:18 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:18 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Zameer Manji


> On April 19, 2017, 9:32 a.m., Stephan Erb wrote:
> > I am not sure I understand your description. If you want faster updates, 
> > why don't you just reduce `watch_secs`? 
> > 
> > I understand `watch_secs` as the the time that we always wait _after_ we 
> > have entered `RUNNING`. So no task should be in `STARTING` state when we 
> > are watching. The documentation seems to agree with that: `Minimum number 
> > of seconds a shard must remain in ```RUNNING``` state before considered a 
> > success (Default: 45)`.
> > 
> > (I haven't looked into the code yet, but maybe you can clarify before I do).
> 
> Santhosh Kumar Shanmugham wrote:
> That was the intent initially. However we risk delaying all updates 
> without any changes to `watch_secs`. That is the main concern here. Let me 
> know if this will be a breaking for you.

It's breaking for me. Instead can we just change the default value of 
`watch_secs` to be 0?


- Zameer


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---


On April 19, 2017, 9:18 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:18 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 19, 2017, 9:32 a.m., Stephan Erb wrote:
> > I am not sure I understand your description. If you want faster updates, 
> > why don't you just reduce `watch_secs`? 
> > 
> > I understand `watch_secs` as the the time that we always wait _after_ we 
> > have entered `RUNNING`. So no task should be in `STARTING` state when we 
> > are watching. The documentation seems to agree with that: `Minimum number 
> > of seconds a shard must remain in ```RUNNING``` state before considered a 
> > success (Default: 45)`.
> > 
> > (I haven't looked into the code yet, but maybe you can clarify before I do).

That was the intent initially. However we risk delaying all updates without any 
changes to `watch_secs`. That is the main concern here. Let me know if this 
will be a breaking for you.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---


On April 19, 2017, 9:18 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:18 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Stephan Erb


> On April 19, 2017, 6:31 p.m., Zameer Manji wrote:
> > "Without this tasks will wait for watch_secs after the task moves to 
> > RUNNING extending the update time."
> > 
> > I thought this was intentional. One can use this parameter to naturally 
> > slow down update speed. I was under the impression that `watch_secs` should 
> > be set to `0` if you want purely health check driven updates.

Race contion :). I have the same understanding as you.


- Stephan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172357
---


On April 19, 2017, 6:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 6:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172358
---



Master (b847db8) is red with this patch.
  ./build-support/jenkins/build.sh

  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$8
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$7
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$9
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$4
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$3
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$6
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$5
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$2
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$1
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/LogStorage$Settings
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/LogStorage$ScheduledExecutorSchedulingService
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/LogStorageModule
  Test coverage missing for 
org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl$SnapshotField
  Test coverage missing for 
org/apache/aurora/scheduler/storage/backup/TemporaryStorage$TemporaryStorageFactory$1
  Test coverage missing for 
org/apache/aurora/scheduler/storage/backup/BackupModule
  Test coverage missing for 
org/apache/aurora/scheduler/storage/backup/Recovery$RecoveryImpl
  Test coverage missing for 
org/apache/aurora/scheduler/storage/backup/TemporaryStorage$TemporaryStorageFactory
  Test coverage missing for 
org/apache/aurora/scheduler/storage/backup/Recovery$RecoveryImpl$PendingRecovery
  Test coverage missing for org/apache/aurora/scheduler/TaskVars
  Test coverage missing for 
org/apache/aurora/scheduler/SchedulerLifecycle$DefaultDelayedActions
  Test coverage missing for 
org/apache/aurora/scheduler/TierManager$TierManagerImpl$TierConfig
  Test coverage missing for org/apache/aurora/scheduler/TaskVars$Counter
  Test coverage missing for org/apache/aurora/scheduler/TaskVars$1
  Test coverage missing for 
org/apache/aurora/scheduler/SchedulerModule$TaskEventBatchWorker
  Test coverage missing for org/apache/aurora/scheduler/HostOffer$1
  Test coverage missing for org/apache/aurora/scheduler/SchedulerModule
  Test coverage missing for 
org/apache/aurora/scheduler/TaskIdGenerator$TaskIdGeneratorImpl
  Test coverage missing for org/apache/aurora/scheduler/SchedulerModule$1
  Test coverage missing for org/apache/aurora/scheduler/TaskStatusHandlerImpl
  Test coverage missing for org/apache/aurora/scheduler/TaskStatusHandlerImpl$1
  Test coverage missing for org/apache/aurora/scheduler/TierModule
  Test coverage missing for 
org/apache/aurora/scheduler/storage/db/typehandlers/VolumeModeTypeHandler

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug 
option to get more log output.
==

BUILD FAILED

Total time: 12 mins 14.443 secs


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On April 19, 2017, 4:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 4:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> 

Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Stephan Erb

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172354
---



I am not sure I understand your description. If you want faster updates, why 
don't you just reduce `watch_secs`? 

I understand `watch_secs` as the the time that we always wait _after_ we have 
entered `RUNNING`. So no task should be in `STARTING` state when we are 
watching. The documentation seems to agree with that: `Minimum number of 
seconds a shard must remain in ```RUNNING``` state before considered a success 
(Default: 45)`.

(I haven't looked into the code yet, but maybe you can clarify before I do).

- Stephan Erb


On April 19, 2017, 6:18 p.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 6:18 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Zameer Manji

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172357
---



"Without this tasks will wait for watch_secs after the task moves to RUNNING 
extending the update time."

I thought this was intentional. One can use this parameter to naturally slow 
down update speed. I was under the impression that `watch_secs` should be set 
to `0` if you want purely health check driven updates.

- Zameer Manji


On April 19, 2017, 9:18 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:18 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/2/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 56935: Fix for unnecessary object serializations

2017-04-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56935/#review172352
---


Ship it!




Ship It!

- Santhosh Kumar Shanmugham


On Feb. 22, 2017, 11:34 a.m., Mehrdad Nurolahzade wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56935/
> ---
> 
> (Updated Feb. 22, 2017, 11:34 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Methodology: I attached a profiler (YourKit) to Aurora in one of test 
> clusters (where very minimal workload was being applied to the scheduler). 
> Looking at object allocation rates, `Webhook` and `ResourceType` stood out. 
> Based on the two cases above, I looked itno `requireNonNull()` and 
> `LOG.xxx()` methods in the code base to find potentially misbehaving 
> statements.
> 
> This patch provides a fix for some unnecessary object serilizations that 
> happen on high frequency execution paths and contribute to scheduler's high 
> object creation rate. Due to low scheduler workload at the time observation, 
> this list is not exhaustive.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/events/Webhook.java 
> 3e8e38abe29766f6fcf08707fba5df402c96a257 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosSchedulerImpl.java 
> 17301036e28d95ba90b3e4d8840d8a5641e49c46 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosTaskFactory.java 
> 0d639f66db456858278b0485c91c40975c3b45ac 
>   src/main/java/org/apache/aurora/scheduler/resources/ResourceType.java 
> c88428412d69e4202e7cceb1b608dc1809a9ccc0 
>   src/main/java/org/apache/aurora/scheduler/storage/db/DbTaskStore.java 
> a649a6e3d2f2d0aeaf6d7ac704ed24911c310a1e 
>   src/main/java/org/apache/aurora/scheduler/storage/mem/MemTaskStore.java 
> d89e715b1b08faf95f8b5788c9c28cbbb33af093 
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosSchedulerImplTest.java 
> 9bb319bb04bb386d9792c3cc0017b039e8f25044 
> 
> 
> Diff: https://reviews.apache.org/r/56935/diff/1/
> 
> 
> Testing
> ---
> 
> ```./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh```
> 
> 
> Thanks,
> 
> Mehrdad Nurolahzade
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/
---

(Updated April 19, 2017, 9:18 a.m.)


Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Changes
---

Feedback.


Repository: aurora


Description
---

Tasks are gated on health-checks before they transistion into `RUNNING`
and hence it is possible for a task to stay in `STARTING` during the
watch duration of an instance's update. So include `STARTING` into the
possible states for a task when watching it after an update. Without
this tasks will wait for watch_secs after the task moves to RUNNING
extending the update time.


Diffs (updated)
-

  src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
c129896d8cd54abd2634e2a339c27921042b0162 
  src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
c95943d242dc2f539778bdc9e071f342005e8de3 
  src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
  src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
32a8c89d3e71395696fc613da96b871330891c42 


Diff: https://reviews.apache.org/r/58524/diff/2/

Changes: https://reviews.apache.org/r/58524/diff/1-2/


Testing
---

./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Santhosh Kumar Shanmugham



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread David McLaughlin

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172348
---


Ship it!




Ship It!

- David McLaughlin


On April 19, 2017, 9:07 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Joshua Cohen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172340
---


Ship it!




lgtm


src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java
Lines 51 (patched)


0L


- Joshua Cohen


On April 19, 2017, 9:07 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172332
---



@ReviewBot retry

- Santhosh Kumar Shanmugham


On April 19, 2017, 2:07 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 2:07 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 19, 2017, 2:12 a.m., Aurora ReviewBot wrote:
> > Master (b847db8) is red with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> > :commons:classesThriftNote: Some input files use unchecked or unsafe 
> > operations.
> > Note: Recompile with -Xlint:unchecked for details.
> > 
> > :commons-args:compileJava
> > :commons-args:processResources
> > :commons-args:classes
> > :commons-args:jar
> > :commons:compileJavaNote: Writing 
> > file:/home/jenkins/jenkins-slave/workspace/AuroraBot/commons/dist/classes/main/org/apache/aurora/common/args/apt/cmdline.arg.info.txt.1
> > Note: Writing 
> > file:/home/jenkins/jenkins-slave/workspace/AuroraBot/commons/dist/classes/main/META-INF/compiler/resource-mappings/org.apache.aurora.common.args.apt.CmdLineProcessor
> > Note: Some input files use or override a deprecated API.
> > Note: Recompile with -Xlint:deprecation for details.
> > Note: Some input files use unchecked or unsafe operations.
> > Note: Recompile with -Xlint:unchecked for details.
> > 
> > :commons:generateThriftResources
> > :commons:processResources
> > :commons:classes
> > :commons:jar
> > :compileJava
> > Download 
> > https://repo1.maven.org/maven2/org/apache/mesos/mesos/1.2.0/mesos-1.2.0.pom
> > Download 
> > https://repo1.maven.org/maven2/org/apache/mesos/mesos/1.2.0/mesos-1.2.0.jar
> > /home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74:
> >  Note: Wrote forwarder 
> > org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
> > @Forward({
> > ^
> > Note: Writing 
> > file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/org/apache/aurora/common/args/apt/cmdline.arg.info.txt.2
> > Note: Writing 
> > file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/META-INF/compiler/resource-mappings/org.apache.aurora.common.args.apt.CmdLineProcessor
> > :generateBuildProperties
> > :processResources
> > :classes
> > :jar
> > :startScripts
> > :distTar
> > :distZip
> > :assemble
> > :compileJmhJavaJava HotSpot(TM) 64-Bit Server VM warning: INFO: 
> > os::commit_memory(0x00079fc0, 243269632, 0) failed; error='Cannot 
> > allocate memory' (errno=12)
> > #
> > # There is insufficient memory for the Java Runtime Environment to continue.
> > # Native memory allocation (mmap) failed to map 243269632 bytes for 
> > committing reserved memory.
> > # An error report file with more information is saved as:
> > # /home/jenkins/jenkins-slave/workspace/AuroraBot/hs_err_pid11304.log
> > 
> > 
> > I will refresh this build result if you post a review containing 
> > "@ReviewBot retry"

@ReviewBot retry


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172329
---


On April 19, 2017, 2:07 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 2:07 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Re: Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Aurora ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/#review172329
---



Master (b847db8) is red with this patch.
  ./build-support/jenkins/build.sh

:commons:classesThriftNote: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:commons-args:compileJava
:commons-args:processResources
:commons-args:classes
:commons-args:jar
:commons:compileJavaNote: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/commons/dist/classes/main/org/apache/aurora/common/args/apt/cmdline.arg.info.txt.1
Note: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/commons/dist/classes/main/META-INF/compiler/resource-mappings/org.apache.aurora.common.args.apt.CmdLineProcessor
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

:commons:generateThriftResources
:commons:processResources
:commons:classes
:commons:jar
:compileJava
Download 
https://repo1.maven.org/maven2/org/apache/mesos/mesos/1.2.0/mesos-1.2.0.pom
Download 
https://repo1.maven.org/maven2/org/apache/mesos/mesos/1.2.0/mesos-1.2.0.jar
/home/jenkins/jenkins-slave/workspace/AuroraBot/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74:
 Note: Wrote forwarder 
org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
@Forward({
^
Note: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/org/apache/aurora/common/args/apt/cmdline.arg.info.txt.2
Note: Writing 
file:/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/classes/main/META-INF/compiler/resource-mappings/org.apache.aurora.common.args.apt.CmdLineProcessor
:generateBuildProperties
:processResources
:classes
:jar
:startScripts
:distTar
:distZip
:assemble
:compileJmhJavaJava HotSpot(TM) 64-Bit Server VM warning: INFO: 
os::commit_memory(0x00079fc0, 243269632, 0) failed; error='Cannot 
allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 243269632 bytes for committing 
reserved memory.
# An error report file with more information is saved as:
# /home/jenkins/jenkins-slave/workspace/AuroraBot/hs_err_pid11304.log


I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On April 19, 2017, 9:07 a.m., Santhosh Kumar Shanmugham wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58524/
> ---
> 
> (Updated April 19, 2017, 9:07 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> Tasks are gated on health-checks before they transistion into `RUNNING`
> and hence it is possible for a task to stay in `STARTING` during the
> watch duration of an instance's update. So include `STARTING` into the
> possible states for a task when watching it after an update. Without
> this tasks will wait for watch_secs after the task moves to RUNNING
> extending the update time.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
>   src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
> 32a8c89d3e71395696fc613da96b871330891c42 
> 
> 
> Diff: https://reviews.apache.org/r/58524/diff/1/
> 
> 
> Testing
> ---
> 
> ./build-support/jenkins/build.sh
> ./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>



Review Request 58524: Allow task to be in STARTING state during watch_secs.

2017-04-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58524/
---

Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.


Repository: aurora


Description
---

Tasks are gated on health-checks before they transistion into `RUNNING`
and hence it is possible for a task to stay in `STARTING` during the
watch duration of an instance's update. So include `STARTING` into the
possible states for a task when watching it after an update. Without
this tasks will wait for watch_secs after the task moves to RUNNING
extending the update time.


Diffs
-

  src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
c129896d8cd54abd2634e2a339c27921042b0162 
  src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
c95943d242dc2f539778bdc9e071f342005e8de3 
  src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
df1f8394b824dbb7b2745fcccdab5adaafdf6e6c 
  src/test/java/org/apache/aurora/scheduler/updater/OneWayJobUpdaterTest.java 
32a8c89d3e71395696fc613da96b871330891c42 


Diff: https://reviews.apache.org/r/58524/diff/1/


Testing
---

./build-support/jenkins/build.sh
./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh


Thanks,

Santhosh Kumar Shanmugham



Re: Review Request 58462: Fix bug. Do not increase current_consecutive_successes if .healthchecksnooze present

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 14, 2017, 2:26 p.m., Santhosh Kumar Shanmugham wrote:
> > src/main/python/apache/aurora/executor/common/health_checker.py
> > Lines 163-166 (patched)
> > 
> >
> > This will cause a task to get stuck in `STARTING` since `self.running` 
> > will never be set to `True`.
> > 
> > Can you explain the particular usecase here? Also add a test case to 
> > exercise this branch.
> 
> Vladimir Khalatyan wrote:
> The idea is to make HealthCheck process to start after some of the setup 
> processes are finished. With the current approach it's possible to addjust 
> the "starting" point of the HealthCheck process by changing 
> initial_interval_secs. But it means that we rely on the timing which doesn't 
> guarantee anything.  
> The idea of HealthCheck "snoozing" is ignore any status of the 
> healthcheck unless some process tells HealthCheck to start checking the 
> health of the service.
> 
> Example (simplified one):
>  Let's assume we start two processes on the machine: the LB registration 
> and the UWSGI process. Let's say the uwsgi process requires some time to warm 
> up. The LB registration depends on the load on LB, how soon uwsgi warms up, 
> etc. So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs. So we create a .healthchecksnooze file and ignore all 
> results of the healthcheck unless this file is there. In a meanwhile the LB 
> registration process will try to register service some number of times ( < 
> max_failures) and delete the .healthchecksnooze after it succeeds. Since this 
> particular moment the healthcheck will start incrementing the concecutive 
> successes or failures and we can determine whether the deployment is 
> successfull or not. 
>  So with this approach we can specify the "starting" point of health 
> checking more accurately and dependent on other processes. 
>  
>  Here by "starting" point of the health check I mean the checking of the 
> application health and changing the consecutive successes or failures, not 
> the actual system process.
> 
> Santhosh Kumar Shanmugham wrote:
> > "So the actual moment when the application becomes available can vary 
> from couple of seconds to minutes and we can not rely on 
> initial_interval_secs."
> 
> The current implementation addresses this problem of 
> `initial_interval_secs` not responding faster with varying startup times. It 
> achieves this by performing `health checks` during the startup time 
> (`initial_interval_secs`) but ignores all failures during this period, 
> however successful health checks now count towards transitioning the task to 
> a healthy (RUNNING) state. Thereby it can accomodate both slow startup as 
> well as fast startup without making the faster startup instances from waiting 
> until the entire `initial_interval_secs` has expired.
> 
> However for your change in particular, you might also need to account for 
> `_should_enforce_deadline` - which will treat a task as unhealthy if it runs 
> out of attempts.
> 
> Santhosh Kumar Shanmugham wrote:
> I was just looking at the docs and your usecase of health-check snoozing 
> is vastly different from the usecase the documentation - 
> 
> > You can pause health checking by touching a file inside of your 
> sandbox, named .healthchecksnooze. As long as that file is present, health 
> checks will be disabled, enabling users to gather core dumps or other 
> performance measurements without worrying about Aurora’s health check killing 
> their process.
> 
> Using health check snoozing to achieve synchronization of health checks, 
> is indicative of how inflexible the health-checking mechanism is in reality. 
> :(
> 
> Stephan Erb wrote:
> Personally speaking, I would not want to run without a sane 
> `initial_interval_secs` (e.g. 5-10 minutes). In distributed systems, things 
> tend to stall from time to time. Not having a timeout to unstuck a deployment 
> seems very problematic to me.
> 
> There is a recent discussion on the mailing list discovering a related 
> issue: 
> https://lists.apache.org/thread.html/a0b01ba8928dc637cff223d2ff74a8d00436850954d61150fc709ca4@%3Cuser.aurora.apache.org%3E
> 
> Vladimir Khalatyan wrote:
> Santhosh, so what can you suggest in my case? Let's assume the LB 
> registration process depends on the latency, readiness of the service, etc. 
> Similar dependencied for uwsgi process. What is the initial_interval_secs 
> should I set? Or how can I determine what is the interval supposed to be? 1 
> minute? 10 minutes? 30 minutes? I can only guess. But if there are any delays 
> on the LB register or uwsgi warm up, everything could screw up. So it's not 
> flexible anymore. 
> With the approach of disabling the health check process, it's flexible. 
> 

Re: Review Request 58259: Add update affinity to Scheduler

2017-04-19 Thread Santhosh Kumar Shanmugham


> On April 17, 2017, 1:27 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java
> > Lines 52-55 (patched)
> > 
> >
> > I am trying to understand if this is a good default for this 
> > best-effort feature.
> > 
> > What is your cluster-wide MTTA? It should give us a decent hint for a 
> > suitable default.
> 
> David McLaughlin wrote:
> Our MTTA can range from a couple milliseconds to several minutes. Depends 
> how many tasks are pending and how full the cluster is.
> 
> Stephan Erb wrote:
> If I understand this correctly, this patch will help the "good case" but 
> could fall down quickly during overload: If the cluster is getting overloaded 
> with pending tasks, the 1 min timeout might not be sufficient to place a job 
> in its reserved spot. This will then lead to preemptions that further 
> aggregate the overload situation. 
> 
> We will need a counter to track those expired reservations.
> 
> David McLaughlin wrote:
> Yup, definitely need more metrics here. If the community gives the 
> overall approach a +1, I will move forward with making this production ready.

A hard-limit of 1 minute may not be enough and raising the limit might causing 
starvation.


- Santhosh Kumar


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172122
---


On April 12, 2017, 12:51 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> ---
> 
> (Updated April 12, 2017, 12:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 
> 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> 

Re: Review Request 58259: Add update affinity to Scheduler

2017-04-19 Thread Santhosh Kumar Shanmugham

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/58259/#review172324
---



I like the simplicity of the overall approach.

We should take care to pay close attention to the interactions between 
`Preemption` and `Update` reservations.


src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java
Lines 228-229 (patched)


Should we check if the offer is already reserved for preemption before 
proceeding? It is possible for a preemption to fail an update, if the premption 
steals the instance that was just updated.



src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java
Lines 24-27 (original), 25-28 (patched)


Looks like Bill was of the same opinion. One more vote to grab the offer 
during the `AddTask` action.



src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java
Lines 98-106 (original), 100-108 (patched)


In order to reduce the duration the offer is help in the 
`UpdateAgentReserver`, wonder if this might be a better place to add to the 
reserver? Killing a task can take arbitrarily long and we risk expiring the 
offer.


- Santhosh Kumar Shanmugham


On April 12, 2017, 12:51 a.m., David McLaughlin wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/58259/
> ---
> 
> (Updated April 12, 2017, 12:51 a.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer 
> Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> ---
> 
> In the Dynamic Reservations review (and on the mailing list), I mentioned 
> that we could implement update affinity with less complexity using the same 
> technique as preemption. Here is how that would work. 
> 
> This just adds a simple wrapper around the preemptor's BiCache structure and 
> then optimistically tries to keep an agent free for a task during the update 
> process. 
> 
> 
> Note: I don't bother even checking the resources before reserving the agent. 
> I figure there is a chance the agent has enough room, and if not we'll catch 
> it when we attempt to veto the offer. We need to always check the offer like 
> this anyway in case constraints change. In the worst case it adds some delay 
> in the rare cases you increase resources. 
> 
> We also don't persist the reservations, so if the Scheduler fails over during 
> an update, the worst case is that any instances between the KILLED and 
> ASSIGNED in-flight batch need to fall back to the current first-fit 
> scheduling algorithm.
> 
> 
> Diffs
> -
> 
>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> f0b148cd158d61cd89cc51dca9f3fa4c6feb1b49 
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java 
> 203f62bacc47470545d095e4d25f7e0f25990ed9 
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 
> a177b301203143539b052524d14043ec8a85a46d 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceAction.java 
> b4cd01b3e03029157d5ca5d1d8e79f01296b57c2 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> f25dc0c6d9c05833b9938b023669c9c36a489f68 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> c129896d8cd54abd2634e2a339c27921042b0162 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  e14112479807b4477b82554caf84fe733f62cf58 
>   src/main/java/org/apache/aurora/scheduler/updater/StateEvaluator.java 
> c95943d242dc2f539778bdc9e071f342005e8de3 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateAgentReserver.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 
> 13cbdadad606d9acaadc541320b22b0ae538cc5e 
>   
> src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java
>  fa1a81785802b82542030e1aae786fe9570d9827 
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java 
> cf2d25ec2e407df7159e0021ddb44adf937e1777 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> b2c4c66850dd8f35e06a631809530faa3b776252 
>   src/test/java/org/apache/aurora/scheduler/updater/InstanceUpdaterTest.java 
> c78c7fbd7d600586136863c99ce3d7387895efee 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 30b44f88a5b8477e917da21d92361aea1a39ceeb 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> 833fd62c870f96b96343ee5e0eed0d439536381f 
>   
> src/test/java/org/apache/aurora/scheduler/updater/NullAgentReserverTest.java 
> PRE-CREATION 
>   
>