Re: Recovery Issues

Thomas Jungblut Mon, 12 Mar 2012 01:34:42 -0700

Ah yes, good points.

If we don't have a checkpoint from the current superstep, we have to do a
global rollback to the last known messages.
So we shouldn't offer this configurability through the BSPJob API; it is
for specialized users only.
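
To make the rule concrete, here is a rough sketch of the decision as I
understand it so far: a failure inside the barrier, or a missing checkpoint
for the current superstep, forces a global rollback; otherwise a single task
rollback is enough. All names below (RecoverySketch, Mode, decide,
lastCheckpoint) are made up for illustration and are not part of the Hama
API. It also assumes checkpoints are taken at multiples of the interval, as
in Suraj's example.

```java
// Hypothetical sketch only; none of these names exist in Hama.
final class RecoverySketch {

    enum Mode { SINGLE_TASK_ROLLBACK, GLOBAL_ROLLBACK }

    /**
     * @param failedInsideBarrier  task died between enterBarrier() and leaveBarrier()
     * @param hasCurrentCheckpoint a checkpoint exists for the superstep that failed
     */
    static Mode decide(boolean failedInsideBarrier, boolean hasCurrentCheckpoint) {
        if (failedInsideBarrier || !hasCurrentCheckpoint) {
            return Mode.GLOBAL_ROLLBACK;
        }
        return Mode.SINGLE_TASK_ROLLBACK;
    }

    /**
     * Latest checkpointed superstep at or before the failed one, assuming
     * checkpoints are written at supersteps that are multiples of the interval.
     */
    static long lastCheckpoint(long failedSuperstep, long interval) {
        return (failedSuperstep / interval) * interval;
    }

    public static void main(String[] args) {
        // Suraj's scenario: checkpoint every 5 supersteps, failure at superstep 7.
        long failed = 7;
        long interval = 5;
        long checkpoint = lastCheckpoint(failed, interval);
        boolean hasCurrentCheckpoint = (checkpoint == failed);

        System.out.println("restart from superstep " + checkpoint); // restart from superstep 5
        System.out.println(decide(false, hasCurrentCheckpoint));    // GLOBAL_ROLLBACK
    }
}
```

In that scenario the failed task has no checkpoint for superstep 7, so the
sketch falls back to a global rollback to superstep 5.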


> One more issue that I have in mind is how we would be able to recover the
> values of static variables that someone would be holding in each bsp job.
> This scenario is a problem if a user is maintaining some static variable
> state whose lifecycle spans across multiple supersteps.
>

Ideally you would transfer your shared state through the messages. I
thought of making a backup function available in the BSP class where
someone can back up their internal state, but I guess this is not how BSP
should be written.

That does not mean we don't want to provide this in future releases.
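
As a minimal illustration of what I mean by keeping state in the messages,
here is a small simulation that does not use the Hama API at all. It assumes
the task sends its running total to itself at the end of every superstep, so
the state lives in the inbox that a checkpoint would capture rather than in a
static variable. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation; it does not use or mirror the Hama API.
final class StateInMessagesSketch {

    /** One superstep: rebuild state from the inbox, update it, emit it as a message to self. */
    static List<Long> superstep(List<Long> inbox, long input) {
        long total = 0;
        for (long message : inbox) {
            total += message; // state arrives as messages, not from a static field
        }
        total += input;
        List<Long> outbox = new ArrayList<>();
        outbox.add(total); // becomes the inbox of the next superstep
        return outbox;
    }

    /** Runs supersteps from..inputs.length-1 starting with the given inbox; returns the final total. */
    static long run(List<Long> inbox, long[] inputs, int from) {
        for (int s = from; s < inputs.length; s++) {
            inbox = superstep(inbox, inputs[s]);
        }
        return inbox.get(0);
    }

    public static void main(String[] args) {
        long[] inputs = {1, 2, 3, 4, 5, 6, 7, 8};

        // Run without any failure.
        long uninterrupted = run(new ArrayList<>(), inputs, 0);

        // Run the first five supersteps, then "checkpoint" the pending messages.
        List<Long> inbox = new ArrayList<>();
        for (int s = 0; s < 5; s++) {
            inbox = superstep(inbox, inputs[s]);
        }
        List<Long> checkpoint = new ArrayList<>(inbox);

        // Simulate a crash and restart from the checkpointed messages only.
        long recovered = run(checkpoint, inputs, 5);

        System.out.println(uninterrupted == recovered); // true
    }
}
```

Because the restarted run needs nothing but the checkpointed messages, there
is no separate internal state to back up.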

On 12 March 2012 09:01, Suraj Menon <[email protected]> wrote:

> Hello,
>
> I want to understand single task rollback. Consider a scenario where
> all tasks checkpoint every 5 supersteps. Now when one of the tasks fails
> at superstep 7, it would have to recover from the checkpointed data at
> superstep 5. How would it get messages from the peer BSPs at superstep 6
> and 7?
>
> One more issue that I have in mind is how we would be able to recover the
> values of static variables that someone would be holding in each bsp job.
> This scenario is a problem if a user is maintaining some static variable
> state whose lifecycle spans across multiple supersteps.
>
> Thanks,
> Suraj
>
> On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <
> [email protected]> wrote:
>
> > I guess we have to slice some issues needed for checkpoint recovery.
> >
> > In my opinion we have two types of recovery:
> > - single task recovery
> > - global recovery of all tasks
> >
> > And I guess we can simply make a rule:
> > If a task fails inside our barrier sync method (since we have a double
> > barrier, after enterBarrier() and before leaveBarrier()), we have to do a
> > global recovery.
> > Else we can just do a single task rollback.
> >
> > For those asking why we can't just always do a global rollback: it is too
> > costly and we really do not need it in every case.
> > But we need it in the case where a task fails inside the barrier (between
> > enter and leave) just because a single rolled-back task can't trip the
> > enterBarrier-Barrier.
> >
> > Anything I have forgotten?
> >
> >
> > --
> > Thomas Jungblut
> > Berlin <[email protected]>
> >
>



-- 
Thomas Jungblut
Berlin <[email protected]>
