I guess we have to split out some of the issues that checkpoint recovery depends on.

In my opinion we have two types of recovery:
- single task recovery
- global recovery of all tasks

And I guess we can simply make a rule:
If a task fails inside our barrier sync method (it is a double barrier, so
that means after enterBarrier() and before leaveBarrier()), we have to do a
global recovery.
Otherwise we can just roll back the single task.
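
To make the rule concrete, here is a rough sketch of how the decision could
look in Java. This is only an illustration: RecoveryPolicy, FailurePoint,
RecoveryMode and tasksToRollBack are names I made up, not existing classes,
and I am assuming we can tell whether the failed task was between
enterBarrier() and leaveBarrier() when it died.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: none of these names exist in the code base.
final class RecoveryPolicy {

    /** Where the task was in the superstep when it failed (assumed to be known). */
    enum FailurePoint {
        OUTSIDE_BARRIER, // before enterBarrier() or after leaveBarrier()
        INSIDE_BARRIER   // after enterBarrier() and before leaveBarrier()
    }

    enum RecoveryMode { SINGLE_TASK, GLOBAL }

    /** The proposed rule: a failure inside the barrier forces a global recovery. */
    static RecoveryMode decide(FailurePoint point) {
        return point == FailurePoint.INSIDE_BARRIER
                ? RecoveryMode.GLOBAL
                : RecoveryMode.SINGLE_TASK;
    }

    /** Ids of the tasks that have to be rolled back to their last checkpoint. */
    static List<Integer> tasksToRollBack(int failedTask, int numTasks, FailurePoint point) {
        if (decide(point) == RecoveryMode.GLOBAL) {
            // Global recovery: every task goes back, not just the failed one.
            return IntStream.range(0, numTasks).boxed().collect(Collectors.toList());
        }
        // Single task recovery: only the failed task goes back.
        return List.of(failedTask);
    }

    public static void main(String[] args) {
        System.out.println(tasksToRollBack(2, 4, FailurePoint.OUTSIDE_BARRIER)); // [2]
        System.out.println(tasksToRollBack(2, 4, FailurePoint.INSIDE_BARRIER));  // [0, 1, 2, 3]
    }
}
```

How we actually detect the failure point is left open here; the sketch only
covers the decision once that is known.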

For those asking why we can't just always do a global rollback: it is too
costly, and we do not need it in every case.
But we do need it when a task fails inside the barrier (between enter and
leave), because a single rolled-back task can't trip the enterBarrier()
barrier on its own.

Anything I have forgotten?


-- 
Thomas Jungblut
Berlin <[email protected]>
