Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

Maxim Khutornenko Wed, 21 Sep 2016 22:39:13 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/
-----------------------------------------------------------


Review request for Aurora, Stephan Erb and Zameer Manji.


Bugs: AURORA-1779
    https://issues.apache.org/jira/browse/AURORA-1779


Repository: aurora


Description
-------

######Problem

The current `BatchWorker` error handling assumes graceful recovery if any of 
the individual batch items fail. This may not hold true if the storage is 
modified _before_ the error is thrown and the `LogStorage` transaction is not 
rolled back. Consider the following fragment from the `StateManagerImpl`:
```
    storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);

    for (IScheduledTask scheduledTask : scheduledTasks) {
      updateTaskAndExternalState(
          storeProvider.getUnsafeTaskStore(),
          Tasks.id(scheduledTask),
          Optional.of(scheduledTask),
          Optional.of(PENDING),
          Optional.absent());
    }
```
If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND 
`BatchWorker` catches and logs that exception, the native store transaction 
will be committed and as a result the storage will be left in a logically 
inconsistent state.

######Solution

I can see at least 3 ways to solve this problem:
1. Fail hard and shutdown scheduler when any of the batch items throw;
2. Fail only the last batch (drop all its items) and let storage transaction 
roll back;
3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
failed item would be retried when the storage transaction rolls back;

The #3 is the most involved and would be very hard to get right (assuming batch 
items are idempotent, which may not be the case). The #2 may result in a very 
hard to troubleshoot scenario where _some_ items would be dropped on the floor 
and never completed. 

I suggest we go with #1 as the most straightforward and transparent approach. 
It's also the only one that ensures storage state consistency, which is the 
most vital invariant to preserve (as AURORA-1779 proves). The only downside of 
this approach is that scheduler will go down hard any time an unhandled error 
is thrown but arguably this is the easiest way to improve our codebase and 
certainly better than leaving the scheduler in a crippled state.


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
e05d4b48749b0691902a505bea1b4490fdd08f29 
  src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 
2ec3967ddb1d470cf681de062a6400f647978185 
  src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 
7c8047a7235a937c29fe96517242923ff84a080c 
  src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java 
d390c07522d22e43d79ce4370985f3643ef021ca 
  src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
  src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 
8556253fc11f6027316871eb9dc66d8627a77fe6 

Diff: https://reviews.apache.org/r/52141/diff/


Testing
-------

unit and e2e tests


Thanks,

Maxim Khutornenko

Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

Reply via email to