> On July 17, 2018, 10:52 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
> > Lines 413-417 (patched)
> > <https://reviews.apache.org/r/67696/diff/9/?file=2059744#file2059744line414>
> >
> >     I am a bit confused about this. Can you please elaborate a bit more 
> > when/how you observed this problem?
> >     
> >     I am mostly concerned about race conditions here. 
> >     
> >     For example, say I trigger a job update from config A to B. Shortly 
> > before I update an instance the instance fails or transitions to LOST. 
> > Concurrently we might end up hopping into the `instanceChanged` method 
> > here, but bail due to the new guard without ever handling the switch from 
> > config A to B.
> >     
> >     Could be that I am missing something entirely here. Would be great if 
> > you could elaborate.
> 
> Jordan Ly wrote:
>     Good question.
>     
>     When we determine what action we should perform on an instance, we use 
> the config/state we get from these task events. If we are using stale 
> information, we may be producing inaccurate actions to perform. This is a 
> rare scenario but it relies on tons of async events happening at the same 
> time so there is a large delay in event processing. The scenario that I 
> encountered that lead me to introduce this guard was the following:
>     
>     1. Task "ABC-old-1" is being scheduled and goes from ASSIGNED -> RUNNING 
> (RUNNING is now on queue to be evaluated as an update)
>     2. A different previous task finishes updating so instance 1 is added to 
> the working group to be updated. The initial evaluation causes "ABC-old-1" to 
> be moved to KILLING.
>     3. "ABC-old-1" is KILLED and "ABC-new-1" is created and running.
>     4. RUNNING from the async queue is finally evaluated so we see that this 
> instance number needs to be updated (since it is "ABC-old-1") so we kill the 
> task at 1 (which is now "ABC-new-1").
>     5. Before this patch, this would mean the task needs to be killed and 
> added again. Now, since there is an async aspect to killing, this could mean 
> an instance # is permanently killed post-update.
>     
>     I am ensuring this condition is met within the evaluation of state 
> changes. From `StateEvaluator`:
>     
>     ```
>        * It is the responsibility of the caller to ensure that the {@code 
> actualState} is the latest
>        * value.  Note: the caller should avoid calling this when a terminal 
> task is moving to another
>        * terminal state.  It should also suppress deletion events for tasks 
> that have been replaced by
>        * an active task.
>     ```
>     
>     In the scenario you provide, the condition will work as intended and the 
> update should still move to completion successfully. A write lock governs all 
> actions dealing with updates so we do not have race conditions.
>     
>     1. Instance A needs to be updated.
>     2. Instance A fails before being evaluated for an update.
>     3. Since A failed, it will be rescheduled into A'.
>       - If A is evaluated for the update before being rescheduled (e.g. we 
> evaluate the state change into PENDING or FAILED before rescheduling), then 
> we kill that task and update the configuration
>       - If A is rescheduled before being evaluated for the update, then we 
> will kill A' and update the configuration
>     
>     Lots of moving parts for updates so hopefully what I said makes sense :/

Thanks for the detailed writeup. I can follow your argument and it makes sense 
to me. However, I also have to admit that I still haven't wrapped my head 
around all the details of the update procedure 100%. I trust your judgment and 
analysis though :)


- Stephan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67696/#review206149
-----------------------------------------------------------


On July 18, 2018, 12:15 a.m., Jordan Ly wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67696/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 12:15 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Daniel Knightly, Renan DelValle, 
> Santhosh Kumar Shanmugham, and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables SLA-aware updates.
> 
> Following https://reviews.apache.org/r/66716/, tasks may now specify custom 
> SLA policies that will be respected by the scheduler during maintenance. This 
> patch integrates into the same system to allow users to specify if they want 
> their updates to also respect SLA. Please see 
> https://docs.google.com/document/d/1lCoDyoX26qrGrptrgO7vJHqYR_L2CBRGFIywsAd8uQo/edit?usp=sharing
>  for a more detailed description.
> 
> This patch adds two optional Thrift fields, `slaAware` to `JobUpdateSettings` 
> and `message` to `JobInstanceUpdateEvent`. These should be forward and 
> backwards compatible.
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 
> 7265b11103aa12743c42355163ae64e98e965d7f 
>   docs/features/job-updates.md b52eb35a1de9da40f8a00ef0b905df30069029d3 
>   docs/reference/configuration.md acab4c58d9ab3c04d156fed3636e77aed6d1faf4 
>   docs/reference/scheduler-configuration.md 
> 805e516689be019101f7c220c89fd9c391bb93b3 
>   src/main/java/org/apache/aurora/scheduler/base/Tasks.java 
> 2e13aacf576e648d9fffe989e4fc05c8954e72d8 
>   src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java 
> 9aa51c3637df74cca088bd65c5539e1ebb8e5f0d 
>   
> src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java
>  6a28bc274acdd6d3ac239166771ef2d45648d60f 
>   
> src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java 
> 9fa68b2dd55b4e4f5436356c1b94af1393967679 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java 
> a002d955c3bc7b7c39da5e130e8c10c536bdcebd 
>   
> src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java
>  ec577cccb86914ebd679ca235103f79dd7e7b79d 
>   src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java 
> f2d33fb9ab6bd2c3ff199ab03dc75b1d6d618f3a 
>   src/main/java/org/apache/aurora/scheduler/updater/SlaKillController.java 
> PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdateFactory.java 
> 3992aa77fc305adc390a4aaeb1d3939d6241ddbd 
>   src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java 
> 74ee1745e1fd7c308e2bdfa46aeb18a7ecfe14d2 
>   src/main/java/org/apache/aurora/scheduler/updater/Updates.java 
> f949fd54f524780672167e12fcadf268da08e679 
>   src/main/python/apache/aurora/client/api/updater_util.py 
> 4e3986220aaa4c9b138394b962120b176185af12 
>   src/main/python/apache/aurora/config/schema/base.py 
> 7baded79acdf863670afc183d740dcad602490c2 
>   src/test/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 
> abee0951ca998894b29ee32c5362ef30da6421c7 
>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java 
> dcf58896f1e866c0369ba1b78060236e98d9d46b 
>   
> src/test/java/org/apache/aurora/scheduler/storage/AbstractJobUpdateStoreTest.java
>  3a06a451da4ef3acccb33b5495b9fae141557148 
>   
> src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java 
> 0aea369d8a8f75291de9691b6d61f3d48895507c 
>   
> src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java
>  aa1cb2b287642e87d787e160e04a17ad0e4690d9 
>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 
> 43f857d893a54e19e71b36f2f06fef3a3ef6e874 
>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java 
> 5667a1b59681a6de87149d7161d760bff5da3818 
>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java 
> 2c27ec7136ff22a3570a4ec278c73f7ee310f628 
>   
> src/test/java/org/apache/aurora/scheduler/updater/SlaKillControllerTest.java 
> PRE-CREATION 
>   src/test/python/apache/aurora/client/cli/test_inspect.py 
> 2baba2aa55865ec298a4c9e5af3952b56cb9a910 
>   
> src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobInstanceUpdateEvent
>  48902d39b9d2cbeae1a52180669aba8349e4dd65 
>   
> src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobUpdate
>  08dfa5b9a67989083a0d405ce8100698a4d096ae 
>   ui/src/main/js/components/UpdateInstanceEvents.js 
> 8351f2c4c256e27625d94b70842be0e91065a551 
>   ui/src/main/js/components/UpdateSettings.js 
> d756f5916fd8c39dcbb5578ee5eda198f807f458 
> 
> 
> Diff: https://reviews.apache.org/r/67696/diff/11/
> 
> 
> Testing
> -------
> 
> Added unit tests, `./gradlew test`.
> 
> Tested at scale with over 10,000 SLA-aware instance update events occuring 
> concurrently. Scheduler stability did not seem to be affected.
> 
> 
> Thanks,
> 
> Jordan Ly
> 
>

Reply via email to