Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

Joshua Cohen Thu, 22 Sep 2016 09:53:35 -0700


> On Sept. 22, 2016, 2:24 p.m., Joshua Cohen wrote:
> > What's the behavior pre-BatchWorker in this case? Would we fail hard, or is 
> > there something inherent to the batch worker that makes this necessary?
> > 
> > Also, this fixes the error handling that led to corruption in the event of 
> > a failure, but should we also determine the root cause that led to the 
> > failure to begin with (as part of another review no doubt)?
> 
> Maxim Khutornenko wrote:
>     The pre-BatchWorker behavior depended on where/how the error was thrown 
> and could have resulted in a hard shutdown or a mysterious silent async 
> failure killing one of the executor threads. With the BatchWorker, the 
> scheduler will always fail on unhandled exception. This is needed to preserve 
> the transaction boundaries and let the error rollback the changes.
>     
>     > ...but should we also determine the root cause that led to the failure 
> to begin with  
>     
>     I thought it should have been clear from the RB description but then I 
> realized I explained it in Slack and not here. We purposefully induce 
> scheduler failovers many times during the e2e run to verify things like job 
> updates still work. If the shutdown happens during the transaction, the DB 
> gets closed underneath and the subsequent queries fail any way they like. 
> This is exactly what happened in this case. A task is saved, driver exits, 
> StateManager attempts to save task events but fails due to "DB closed" error. 
> That error is caught by the BatchWorker and the native log transaction 
> successfully completes before the scheduler fully terminates.


Thanks, I missed that in slack!


- Joshua


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review149999
-----------------------------------------------------------


On Sept. 22, 2016, 3:20 p.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 3:20 p.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of 
> the individual batch items fail. This may not hold true if the storage is 
> modified _before_ the error is thrown and the `LogStorage` transaction is not 
> rolled back. Consider the following fragment from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws 
> AND `BatchWorker` catches and logs that exception, the native store 
> transaction will be committed and as a result the storage will be left in a 
> logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction 
> roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the 
> failed item would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming 
> batch items are idempotent, which may not be the case). The #2 may result in 
> a very hard to troubleshoot scenario where _some_ items would be dropped on 
> the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. 
> It's also the only one that ensures storage state consistency, which is the 
> most vital invariant to preserve (as AURORA-1779 proves). The only downside 
> of this approach is that scheduler will go down hard any time an unhandled 
> error is thrown but arguably this is the easiest way to improve our codebase 
> and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java 
> e05d4b48749b0691902a505bea1b4490fdd08f29 
>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java 
> a86dc82cafa7f5436d2b8d00c6db575ff8311eea 
> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.

Reply via email to