> On July 17, 2018, 8:52 p.m., Stephan Erb wrote: > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java > > Lines 413-417 (patched) > > <https://reviews.apache.org/r/67696/diff/9/?file=2059744#file2059744line414> > > > > I am a bit confused about this. Can you please elaborate a bit more > > when/how you observed this problem? > > > > I am mostly concerned about race conditions here. > > > > For example, say I trigger a job update from config A to B. Shortly > > before I update an instance the instance fails or transitions to LOST. > > Concurrently we might end up hopping into the `instanceChanged` method > > here, but bail due to the new guard without ever handling the switch from > > config A to B. > > > > Could be that I am missing something entirely here. Would be great if > > you could elaborate.
Good question. When we determine what action we should perform on an instance, we use the config/state we get from these task events. If we are using stale information, we may be producing inaccurate actions to perform. This is a rare scenario but it relies on tons of async events happening at the same time so there is a large delay in event processing. The scenario that I encountered that lead me to introduce this guard was the following: 1. Task "ABC-old-1" is being scheduled and goes from ASSIGNED -> RUNNING (RUNNING is now on queue to be evaluated as an update) 2. A different previous task finishes updating so instance 1 is added to the working group to be updated. The initial evaluation causes "ABC-old-1" to be moved to KILLING. 3. "ABC-old-1" is KILLED and "ABC-new-1" is created and running. 4. RUNNING from the async queue is finally evaluated so we see that this instance number needs to be updated (since it is "ABC-old-1") so we kill the task at 1 (which is now "ABC-new-1"). 5. Before this patch, this would mean the task needs to be killed and added again. Now, since there is an async aspect to killing, this could mean an instance # is permanently killed post-update. I am ensuring this condition is met within the evaluation of state changes. From `StateEvaluator`: ``` * It is the responsibility of the caller to ensure that the {@code actualState} is the latest * value. Note: the caller should avoid calling this when a terminal task is moving to another * terminal state. It should also suppress deletion events for tasks that have been replaced by * an active task. ``` In the scenario you provide, the condition will work as intended and the update should still move to completion successfully. A write lock governs all actions dealing with updates so we do not have race conditions. 1. Instance A needs to be updated. 2. Instance A fails before being evaluated for an update. 3. Since A failed, it will be rescheduled into A'. - If A is evaluated for the update before being rescheduled (e.g. we evaluate the state change into PENDING or FAILED before rescheduling), then we kill that task and update the configuration - If A is rescheduled before being evaluated for the update, then we will kill A' and update the configuration Lots of moving parts for updates so hopefully what I said makes sense :/ - Jordan ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67696/#review206149 ----------------------------------------------------------- On July 17, 2018, 6:21 p.m., Jordan Ly wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67696/ > ----------------------------------------------------------- > > (Updated July 17, 2018, 6:21 p.m.) > > > Review request for Aurora, David McLaughlin, Daniel Knightly, Renan DelValle, > Santhosh Kumar Shanmugham, and Stephan Erb. > > > Repository: aurora > > > Description > ------- > > This patch enables SLA-aware updates. > > Following https://reviews.apache.org/r/66716/, tasks may now specify custom > SLA policies that will be respected by the scheduler during maintenance. This > patch integrates into the same system to allow users to specify if they want > their updates to also respect SLA. Please see > https://docs.google.com/document/d/1lCoDyoX26qrGrptrgO7vJHqYR_L2CBRGFIywsAd8uQo/edit?usp=sharing > for a more detailed description. > > This patch adds two optional Thrift fields, `slaAware` to `JobUpdateSettings` > and `message` to `JobInstanceUpdateEvent`. These should be forward and > backwards compatible. > > > Diffs > ----- > > RELEASE-NOTES.md edc081f502370190597ad028f3275cdfd572f5ca > api/src/main/thrift/org/apache/aurora/gen/api.thrift > 7265b11103aa12743c42355163ae64e98e965d7f > docs/features/job-updates.md b52eb35a1de9da40f8a00ef0b905df30069029d3 > docs/reference/configuration.md acab4c58d9ab3c04d156fed3636e77aed6d1faf4 > docs/reference/scheduler-configuration.md > 805e516689be019101f7c220c89fd9c391bb93b3 > src/main/java/org/apache/aurora/scheduler/base/Tasks.java > 2e13aacf576e648d9fffe989e4fc05c8954e72d8 > src/main/java/org/apache/aurora/scheduler/pruning/TaskHistoryPruner.java > 9aa51c3637df74cca088bd65c5539e1ebb8e5f0d > > src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java > 6a28bc274acdd6d3ac239166771ef2d45648d60f > > src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java > 9fa68b2dd55b4e4f5436356c1b94af1393967679 > src/main/java/org/apache/aurora/scheduler/updater/InstanceUpdater.java > a002d955c3bc7b7c39da5e130e8c10c536bdcebd > > src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java > ec577cccb86914ebd679ca235103f79dd7e7b79d > src/main/java/org/apache/aurora/scheduler/updater/OneWayJobUpdater.java > f2d33fb9ab6bd2c3ff199ab03dc75b1d6d618f3a > src/main/java/org/apache/aurora/scheduler/updater/SlaKillController.java > PRE-CREATION > src/main/java/org/apache/aurora/scheduler/updater/UpdateFactory.java > 3992aa77fc305adc390a4aaeb1d3939d6241ddbd > src/main/java/org/apache/aurora/scheduler/updater/UpdaterModule.java > 74ee1745e1fd7c308e2bdfa46aeb18a7ecfe14d2 > src/main/java/org/apache/aurora/scheduler/updater/Updates.java > f949fd54f524780672167e12fcadf268da08e679 > src/main/python/apache/aurora/client/api/updater_util.py > 4e3986220aaa4c9b138394b962120b176185af12 > src/main/python/apache/aurora/config/schema/base.py > 7baded79acdf863670afc183d740dcad602490c2 > src/test/java/org/apache/aurora/scheduler/base/TaskTestUtil.java > abee0951ca998894b29ee32c5362ef30da6421c7 > src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java > dcf58896f1e866c0369ba1b78060236e98d9d46b > > src/test/java/org/apache/aurora/scheduler/storage/AbstractJobUpdateStoreTest.java > 3a06a451da4ef3acccb33b5495b9fae141557148 > > src/test/java/org/apache/aurora/scheduler/testing/FakeScheduledExecutor.java > 0aea369d8a8f75291de9691b6d61f3d48895507c > > src/test/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterfaceTest.java > aa1cb2b287642e87d787e160e04a17ad0e4690d9 > src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java > 43f857d893a54e19e71b36f2f06fef3a3ef6e874 > src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java > 5667a1b59681a6de87149d7161d760bff5da3818 > src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java > 2c27ec7136ff22a3570a4ec278c73f7ee310f628 > > src/test/java/org/apache/aurora/scheduler/updater/SlaKillControllerTest.java > PRE-CREATION > src/test/python/apache/aurora/client/cli/test_inspect.py > 2baba2aa55865ec298a4c9e5af3952b56cb9a910 > > src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobInstanceUpdateEvent > 48902d39b9d2cbeae1a52180669aba8349e4dd65 > > src/test/resources/org/apache/aurora/scheduler/storage/durability/goldens/current/saveJobUpdate > 08dfa5b9a67989083a0d405ce8100698a4d096ae > ui/src/main/js/components/UpdateInstanceEvents.js > 8351f2c4c256e27625d94b70842be0e91065a551 > ui/src/main/js/components/UpdateSettings.js > d756f5916fd8c39dcbb5578ee5eda198f807f458 > > > Diff: https://reviews.apache.org/r/67696/diff/10/ > > > Testing > ------- > > Added unit tests, `./gradlew test`. > > Tested at scale with over 10,000 SLA-aware instance update events occuring > concurrently. Scheduler stability did not seem to be affected. > > > Thanks, > > Jordan Ly > >