I think we need to split out some of the issues needed for checkpoint recovery. In my opinion we have two types of recovery:

- single task recovery
- global recovery of all tasks

I think we can make a simple rule: if a task fails inside our barrier sync method (we have a double barrier, so that means after enterBarrier() and before leaveBarrier()), we have to do a global recovery. Otherwise we can just do a single task rollback.

For those asking why we can't simply always do a global rollback: it is too costly, and we do not need it in every case. We do need it when a task fails inside the barrier (between enter and leave), because a single rolled-back task can't trip the enterBarrier() barrier on its own.

Anything I have forgotten?

--
Thomas Jungblut
Berlin <[email protected]>
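P.S. In case it helps the discussion, here is a rough sketch of the rule in Java. All names in it (RecoveryRule, TaskPhase, RecoveryMode, decide) are made up for illustration and are not existing classes in our code base. It also assumes we can somehow tell whether the failed task was between enterBarrier() and leaveBarrier() at the time of the failure; how we detect that is still open.

```java
// Hypothetical sketch of the recovery rule; none of these types exist yet.
class RecoveryRule {

    /** Where the task was in the superstep when it failed (assumed to be known). */
    enum TaskPhase {
        COMPUTING,       // outside the sync method
        INSIDE_BARRIER   // after enterBarrier() and before leaveBarrier()
    }

    /** The two types of recovery from the mail above. */
    enum RecoveryMode {
        SINGLE_TASK,     // roll back only the failed task
        GLOBAL           // roll back all tasks to the last checkpoint
    }

    /**
     * A failure inside the double barrier forces a global recovery,
     * any other failure only needs a single task rollback.
     */
    static RecoveryMode decide(TaskPhase phaseAtFailure) {
        return phaseAtFailure == TaskPhase.INSIDE_BARRIER
                ? RecoveryMode.GLOBAL
                : RecoveryMode.SINGLE_TASK;
    }

    public static void main(String[] args) {
        System.out.println(decide(TaskPhase.COMPUTING));       // prints SINGLE_TASK
        System.out.println(decide(TaskPhase.INSIDE_BARRIER));  // prints GLOBAL
    }
}
```

Keeping the decision in one small function like this would let us change the rule later (for example, if we find more cases that need a global recovery) without touching the recovery code itself.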
