Ah yes, good points. If we don't have a checkpoint from the current superstep, we have to do a global rollback to the last known messages. So we shouldn't offer this configurability through the BSPJob API; it is for specialized users only.
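
To make the decision concrete, here is a minimal sketch in Java of how the choice between the two recovery types could look, combining the barrier rule from my earlier mail (quoted further down) with the checkpoint condition above. Every name in it (RecoveryPlanner, FailurePoint, RecoveryMode, choose) is made up for illustration and is not part of the Hama API; it also assumes we know the superstep of the last checkpoint and whether the task died between enterBarrier() and leaveBarrier().

```java
public class RecoveryPlanner {

    /** Where the task was when it failed. Hypothetical type, not a Hama class. */
    public enum FailurePoint { COMPUTE, INSIDE_BARRIER }

    /** The two recovery types discussed in this thread. */
    public enum RecoveryMode { SINGLE_TASK, GLOBAL }

    /**
     * Picks a recovery mode for one failed task.
     *
     * @param point                   where the failure happened
     * @param failedSuperstep         superstep in which the task failed
     * @param lastCheckpointSuperstep superstep of the most recent checkpoint
     */
    public static RecoveryMode choose(FailurePoint point,
                                      long failedSuperstep,
                                      long lastCheckpointSuperstep) {
        // A task that died between enterBarrier() and leaveBarrier() cannot
        // rejoin the barrier on its own, so all tasks have to roll back.
        if (point == FailurePoint.INSIDE_BARRIER) {
            return RecoveryMode.GLOBAL;
        }
        // Without a checkpoint from the current superstep, the messages of the
        // supersteps in between are gone, so all tasks have to roll back too.
        if (lastCheckpointSuperstep != failedSuperstep) {
            return RecoveryMode.GLOBAL;
        }
        // Otherwise only the failed task needs to be restored.
        return RecoveryMode.SINGLE_TASK;
    }

    public static void main(String[] args) {
        // Suraj's scenario: checkpoint at superstep 5, failure at superstep 7.
        System.out.println(choose(FailurePoint.COMPUTE, 7, 5));        // GLOBAL
        // Failure inside the barrier, even with a fresh checkpoint.
        System.out.println(choose(FailurePoint.INSIDE_BARRIER, 7, 7)); // GLOBAL
        // Failure during compute with a checkpoint from the same superstep.
        System.out.println(choose(FailurePoint.COMPUTE, 7, 7));        // SINGLE_TASK
    }
}
```

This is only a sketch of the rule, not a proposal for where the check should live; how the checkpoint superstep and the failure point would actually be detected is still open.
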

> One more issue that I have in mind is how we would be able to recover the
> values of static variables that someone would be holding in each bsp job.
> This scenario is a problem if a user is maintaining some static variable
> state whose lifecycle spans across multiple supersteps.

Ideally you would transfer your shared state through the messages. I thought of making a backup function available in the BSP class where users can back up their internal state, but I guess this is not how BSP should be written. That does not mean we won't provide this in later releases.

On 12 March 2012 at 09:01, Suraj Menon <[email protected]> wrote:

> Hello,
>
> I want to understand single task rollback. So consider a scenario, where
> all tasks checkpoint every 5 supersteps. Now when one of the tasks failed
> at superstep 7, it would have to recover from the checkpointed data at
> superstep 5. How would it get messages from the peer BSPs at superstep 6
> and 7?
>
> One more issue that I have in mind is how we would be able to recover the
> values of static variables that someone would be holding in each bsp job.
> This scenario is a problem if a user is maintaining some static variable
> state whose lifecycle spans across multiple supersteps.
>
> Thanks,
> Suraj
>
> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <[email protected]> wrote:
>
> > I guess we have to slice some issues needed for checkpoint recovery.
> >
> > In my opinion we have two types of recovery:
> > - single task recovery
> > - global recovery of all tasks
> >
> > And I guess we can simply make a rule:
> > If a task fails inside our barrier sync method (since we have a double
> > barrier, after enterBarrier() and before leaveBarrier()), we have to do a
> > global recovery.
> > Else we can just do a single task rollback.
> >
> > For those asking why we can't do just always a global rollback: it is too
> > costly and we really do not need it in any case.
> > But we need it in the case where a task fails inside the barrier (between
> > enter and leave) just because a single rollbacked task can't trip the
> > enterBarrier-Barrier.
> >
> > Anything I have forgotten?
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
> >
>

--
Thomas Jungblut
Berlin <[email protected]>
