That is totally okay. Do you want to open them? BTW I gave you committer rights in JIRA, so have fun ;)
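One idea that recurs in the thread below is Thomas's suggestion to put a retry layer on top of the MessageService instead of inventing new abstractions. A minimal sketch of what such a layer could look like — the `MessageSender` interface and all names here are hypothetical stand-ins, not Hama's actual API:

```java
import java.io.IOException;

// Hedged sketch of a "retry layer": a decorator over a message-sending
// service that hides transient failures from the BSP task. MessageSender
// is an illustrative placeholder for Hama's MessageService.
public class RetryLayer {

  interface MessageSender {
    void send(String peer, String msg) throws IOException;
  }

  static class RetryingSender implements MessageSender {
    private final MessageSender delegate;
    private final int maxAttempts;

    RetryingSender(MessageSender delegate, int maxAttempts) {
      this.delegate = delegate;
      this.maxAttempts = maxAttempts;
    }

    @Override
    public void send(String peer, String msg) throws IOException {
      IOException last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          delegate.send(peer, msg);
          return;           // success: the failure stays hidden
        } catch (IOException e) {
          last = e;         // transient failure: try again
        }
      }
      throw last;           // attempts exhausted: propagate
    }
  }

  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    // A sender that fails twice, then succeeds -- the layer hides that.
    MessageSender flaky = (peer, msg) -> {
      if (++calls[0] < 3) throw new IOException("transient");
    };
    new RetryingSender(flaky, 5).send("peer1", "hello");
    System.out.println("delivered after " + calls[0] + " attempts");
  }
}
```

The point of the decorator is exactly the "hiding is the key" argument made below: user code never sees the transient fault, so no new user-facing methods are needed.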
On 16 March 2012 13:33, Suraj Menon <[email protected]> wrote:

> I am planning to add the following subtasks under HAMA-505:
>
> 1. BSP Peer should have the ability to start with a non-zero superstep
> from a partition of checkpointed messages for that task ID and attempt
> ID.
> 2. For a configurable number of attempts, BSPMaster should direct the
> GroomServer to run the recovery task on failure. Implement a recovery
> directive from BSPMaster to GroomServer. The task should be started in
> state RECOVERING.
> 3. Checkpointing can be configured to be done asynchronously to the
> sync. (This is along the lines of the last comment left by Chiahung.)
> 4. Maintain global state on the last superstep for which checkpointing
> was completed successfully.
> 5. Should we keep stats and logs for the failed task that was
> re-attempted?
>
> Please share your opinions.
>
> Thanks,
> Suraj
>
> On Wed, Mar 14, 2012 at 5:19 PM, Edward J. Yoon <[email protected]> wrote:
>
> > Nice plan.
> >
> > On Thu, Mar 15, 2012 at 3:58 AM, Thomas Jungblut
> > <[email protected]> wrote:
> > > Great, we have started a nice discussion about it.
> > >
> > > > If users do something with communication APIs in a while
> > > > read.next() loop or in memory, recovery is not simple. So in my
> > > > opinion, first of all, we should separate (or hide) the data
> > > > handlers away from the bsp() function, as in M/R or Pregel. For
> > > > example,
> > >
> > > Yes, but you know what? We can simply add another layer (a retry
> > > layer) onto the MessageService instead of inventing fancy new
> > > methods and classes. Hiding is the key.
> > >
> > > Otherwise I totally agree, Chia-Hung. I will take a deeper look
> > > into incremental checkpointing over the next few days.
> > >
> > > > a) If a user's bsp job fails or throws a runtime exception while
> > > > running, the odds are that if we recover his task state again,
> > > > the task would fail again, because we would be recovering his
> > > > error state.
> > > > Hadoop has a feature where we can tolerate failure of a certain
> > > > configurable percentage of failed tasks during computation. I
> > > > think we should be open to the idea of users wanting to be
> > > > oblivious to certain data.
> > >
> > > Not for certain. There are cases where user code may fail due to
> > > unlucky timing, e.g. a downed service behind an API. That is no
> > > runtime exception like ClassNotFound or other things that are very
> > > unlikely to be fixed in several attempts.
> > > I'm +1 for making it smarter than Hadoop and deciding on the
> > > exception type whether something can be recovered or not. However,
> > > Hadoop is doing it right.
> > >
> > > > 2. BSPPeerImpl should have a new flavor of initialization for a
> > > > recovery task, where:
> > > > - it has a non-negative superstep to start with
> > >
> > > Yes, this must trigger a read of the last checkpointed messages.
> > >
> > > > We should provide a Map to the users to store their global state
> > > > that should be recovered on failure.
> > >
> > > This is really, really tricky, and I would put my effort into
> > > doing the simplest thing first (sorry for the users). We can add
> > > recovery for user internals later on.
> > > In my opinion, internal state is not how BSP should be coded:
> > > everything can be stateless, really!
> > > I know it is difficult at first, but I have observed that the less
> > > state you store, the simpler the code gets, and that's what
> > > frameworks are for.
> > > So let's add this bad-style edge case in one of the upcoming
> > > releases if there is really a demand for it.
> > >
> > > I will take a bit of time on the weekend to write the JIRA issues
> > > for the things we discussed here.
> > >
> > > I think we can really start implementing it right away; the
> > > simplest case is very straightforward, and we can improve later on.
> > >
> > > On 14 March 2012 19:20, Suraj Menon <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > +1 on Chiahung's comment on getting things right back to the
> > > > beginning of a superstep on failure.
> > > > I think we are all over the place (or at least I am) in our
> > > > thinking on fault tolerance. I want to take a step backward.
> > > > First we should decide on the nature of the faults that we have
> > > > to recover from. I am putting these in points below:
> > > >
> > > > 1. Hardware failure. In my opinion, this is the most important
> > > > failure cause we should be working on.
> > > > Why? Most other kinds of faults would happen because of errors
> > > > in user program logic, or Hama framework bugs if any. I will get
> > > > to these scenarios in the next point.
> > > > Picking up from Chiahung's mail, say we have 15 bsp tasks
> > > > running in parallel in the 12th superstep. Of those, the nodes
> > > > running task numbers 5, 6 and 7 failed during execution. We
> > > > would have to roll back those bsp tasks from the checkpointed
> > > > data of superstep 11, based on the task ID, attempt ID and job
> > > > number. This would mean that the BSPChild executor in
> > > > GroomServer has to be told that the BSP Peer should be started
> > > > in recovery mode and not assume the superstep is -1. It should
> > > > then load the input message queue with the checkpointed
> > > > messages. It would be provided with the partition that holds the
> > > > checkpointed data for the task ID. All other tasks would wait in
> > > > the sync barrier until the recovered tasks 5, 6 and 7 enter the
> > > > sync of the 12th superstep to continue.
> > > >
> > > > 2. Software failure.
> > > >
> > > > a) If a user's bsp job fails or throws a runtime exception while
> > > > running, the odds are that if we recover his task state again,
> > > > the task would fail again, because we would be recovering his
> > > > error state.
> > > > Hadoop has a feature where we can tolerate failure of a certain
> > > > configurable percentage of failed tasks during computation. I
> > > > think we should be open to the idea of users wanting to be
> > > > oblivious to certain data.
> > > >
> > > > b) If we have a JVM error on the GroomServer or in the BSPPeer
> > > > process, we can retry the execution of tasks as in the case of
> > > > hardware failure.
> > > >
> > > > I agree with Chiahung to focus on the recovery logic and how to
> > > > start it. With the current design, it could be handled with the
> > > > following changes:
> > > >
> > > > 1. GroomServer should get a new directive for a recovery task
> > > > (different from a new task).
> > > > 2. BSPPeerImpl should have a new flavor of initialization for a
> > > > recovery task, where:
> > > > - it has a non-negative superstep to start with
> > > > - the partition and input file are from the checkpointed data
> > > > - the messages are read from the partition, and the input queue
> > > > is filled with the messages if superstep > 0
> > > >
> > > > We should provide a Map to the users to store their global state
> > > > that should be recovered on failure.
> > > > I will get some time from tomorrow evening. I shall get things
> > > > more organized.
> > > >
> > > > Thanks,
> > > > Suraj
> > > >
> > > > On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin
> > > > <[email protected]> wrote:
> > > >
> > > > > It would be simpler for the first version of fault tolerance
> > > > > to focus on rolling back to the beginning of a superstep. For
> > > > > example, when a job executes to the middle of the 12th
> > > > > superstep and then a task fails, the system should restart
> > > > > that task (assuming a checkpoint on every superstep and,
> > > > > fortunately, that the system has the latest snapshot of e.g.
> > > > > the 11th superstep for that task) and put the necessary
> > > > > messages (checkpointed at the 11th superstep, to be
> > > > > transferred to the 12th superstep) back to the task, so that
> > > > > the failed task can restart from the 12th superstep.
> > > > >
> > > > > If later on the community wants checkpointing at a finer level
> > > > > of detail within a task, we can probably exploit something
> > > > > like incremental checkpointing [1] or the Memento pattern,
> > > > > which requires the users' assistance, to restart a task from a
> > > > > local checkpoint.
> > > > >
> > > > > [1] Efficient Incremental Checkpointing of Java Programs.
> > > > > hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
> > > > >
> > > > > On 14 March 2012 15:29, Edward J. Yoon <[email protected]> wrote:
> > > > >
> > > > > > If users do something with communication APIs in a while
> > > > > > read.next() loop or in memory, recovery is not simple. So in
> > > > > > my opinion, first of all, we should separate (or hide) the
> > > > > > data handlers away from the bsp() function, as in M/R or
> > > > > > Pregel. For example:
> > > > > >
> > > > > > ftbsp(Communicator comm);
> > > > > > setup(DataInput in);
> > > > > > close(DataOutput out);
> > > > > >
> > > > > > And then, maybe we can design the flow of checkpoint-based
> > > > > > (task failure) recovery like this:
> > > > > >
> > > > > > 1. If some task fails to execute the setup() or close()
> > > > > > functions, just re-attempt; finally return a "job failed"
> > > > > > message.
> > > > > >
> > > > > > 2. If some task fails in the middle of processing and has to
> > > > > > be re-launched, the statuses of JobInProgress and
> > > > > > TaskInProgress should be changed.
> > > > > >
> > > > > > 3. In every step, all tasks should check whether to roll
> > > > > > back to an earlier checkpoint or to keep running (or keep
> > > > > > waiting to leave the barrier).
> > > > > >
> > > > > > 4. Re-launch the failed task.
> > > > > >
> > > > > > 5. Change the status from ROLLBACK or RECOVERY to RUNNING.
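Edward's proposed separation of data handling from bsp() could be captured in an interface along these lines. This is only a sketch: the three signatures are taken from his pseudocode above, while `Communicator`, the driver, and everything else are hypothetical, not Hama's real classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of Edward's proposal: hide the data handlers behind the
// framework (as in M/R or Pregel) so user code only touches the
// communicator. Communicator is a placeholder type, not a Hama class.
public class LifecycleSketch {

  interface Communicator {
    void send(String peer, String msg);
  }

  interface FaultTolerantBSP {
    void setup(DataInput in) throws IOException;
    void ftbsp(Communicator comm) throws IOException;
    void close(DataOutput out) throws IOException;
  }

  // Framework-side driver: because setup/compute/close are separated,
  // the framework decides what to re-run on failure -- e.g. re-attempt
  // only ftbsp() from a checkpoint, without redoing setup().
  static List<String> run(FaultTolerantBSP job, int supersteps) throws IOException {
    List<String> trace = new ArrayList<>();
    job.setup(new DataInputStream(new ByteArrayInputStream(new byte[0])));
    trace.add("setup");
    for (int s = 0; s < supersteps; s++) {
      job.ftbsp((peer, msg) -> { });  // all messaging passes through here
      trace.add("ftbsp");
    }
    job.close(new DataOutputStream(new ByteArrayOutputStream()));
    trace.add("close");
    return trace;
  }

  public static void main(String[] args) throws IOException {
    FaultTolerantBSP job = new FaultTolerantBSP() {
      public void setup(DataInput in) { }
      public void ftbsp(Communicator comm) { comm.send("peer0", "hi"); }
      public void close(DataOutput out) { }
    };
    System.out.println(run(job, 2)); // [setup, ftbsp, ftbsp, close]
  }
}
```

Since every message crosses the framework-owned `Communicator`, the framework has a natural place to checkpoint outgoing messages and to replay checkpointed ones into a recovering task, which is the whole point of hiding the handlers.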
> > > > > > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Ah yes, good points.
> > > > > > >
> > > > > > > If we don't have a checkpoint from the current superstep,
> > > > > > > we have to do a global rollback to the last known
> > > > > > > messages.
> > > > > > > So we shouldn't offer this configurability through the
> > > > > > > BSPJob API; this is for specialized users only.
> > > > > > >
> > > > > > > > One more issue that I have in mind is how we would be
> > > > > > > > able to recover the values of static variables that
> > > > > > > > someone would be holding in each bsp job. This scenario
> > > > > > > > is a problem if a user is maintaining some static
> > > > > > > > variable state whose lifecycle spans multiple
> > > > > > > > supersteps.
> > > > > > >
> > > > > > > Ideally you would transfer your shared state through the
> > > > > > > messages. I thought of making a backup function available
> > > > > > > in the BSP class where someone can back up their internal
> > > > > > > state, but I guess this is not how BSP should be written.
> > > > > > >
> > > > > > > Which does not mean that we don't want to provide this in
> > > > > > > later releases.
> > > > > > >
> > > > > > > On 12 March 2012 09:01, Suraj Menon <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I want to understand single-task rollback. So consider
> > > > > > > > a scenario where all tasks checkpoint every 5
> > > > > > > > supersteps. Now when one of the tasks fails at superstep
> > > > > > > > 7, it would have to recover from the checkpointed data
> > > > > > > > at superstep 5. How would it get the messages from the
> > > > > > > > peer BSPs of supersteps 6 and 7?
> > > > > > > >
> > > > > > > > One more issue that I have in mind is how we would be
> > > > > > > > able to recover the values of static variables that
> > > > > > > > someone would be holding in each bsp job.
> > > > > > > > This scenario is a problem if a user is maintaining
> > > > > > > > some static variable state whose lifecycle spans
> > > > > > > > multiple supersteps.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Suraj
> > > > > > > >
> > > > > > > > On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I guess we have to slice out some issues needed for
> > > > > > > > > checkpoint recovery.
> > > > > > > > >
> > > > > > > > > In my opinion we have two types of recovery:
> > > > > > > > > - single-task recovery
> > > > > > > > > - global recovery of all tasks
> > > > > > > > >
> > > > > > > > > And I guess we can simply make a rule:
> > > > > > > > > If a task fails inside our barrier sync method (since
> > > > > > > > > we have a double barrier, after enterBarrier() and
> > > > > > > > > before leaveBarrier()), we have to do a global
> > > > > > > > > recovery.
> > > > > > > > > Otherwise we can just do a single-task rollback.
> > > > > > > > >
> > > > > > > > > For those asking why we can't always just do a global
> > > > > > > > > rollback: it is too costly, and we really do not need
> > > > > > > > > it in every case.
> > > > > > > > > But we do need it in the case where a task fails
> > > > > > > > > inside the barrier (between enter and leave), simply
> > > > > > > > > because a single rolled-back task can't trip the
> > > > > > > > > enterBarrier barrier.
> > > > > > > > >
> > > > > > > > > Anything I have forgotten?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thomas Jungblut
> > > > > > > > > Berlin <[email protected]>
> > > > > > >
> > > > > > > --
> > > > > > > Thomas Jungblut
> > > > > > > Berlin <[email protected]>
> > > > > >
> > > > > > --
> > > > > > Best Regards, Edward J. Yoon
> > > > > > @eddieyoon
> > >
> > > --
> > > Thomas Jungblut
> > > Berlin <[email protected]>
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon

--
Thomas Jungblut
Berlin <[email protected]>
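Putting Chia-Hung's and Suraj's rollback examples from the thread together, recovery start-up amounts to two steps: find the last checkpointed superstep, then refill the incoming queue from the checkpoint partition before resuming. A rough sketch under the assumption that a checkpoint completes at the end of every interval-th superstep; all class and method names are illustrative, not Hama's actual API:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hedged sketch of recovery-mode start-up: a recovering peer resumes
// from the last checkpointed superstep instead of superstep -1, and
// refills its input queue with the messages checkpointed for its
// task ID / attempt ID. Names here are hypothetical.
public class RecoveryStartup {

  // Assuming a checkpoint completes at the end of every `interval`-th
  // superstep, the superstep whose snapshot we roll back to after a
  // failure during superstep `failedAt` is the last completed multiple
  // of the interval.
  static int lastCheckpointedSuperstep(int failedAt, int interval) {
    return ((failedAt - 1) / interval) * interval;
  }

  // Recovery-mode queue initialization: only a task restarting at a
  // positive superstep loads the checkpointed messages.
  static Queue<String> initQueue(int startSuperstep, List<String> checkpointed) {
    Queue<String> queue = new ArrayDeque<>();
    if (startSuperstep > 0) {
      queue.addAll(checkpointed);
    }
    return queue;
  }

  public static void main(String[] args) {
    // Chia-Hung's example: checkpoint every superstep, failure in the
    // middle of the 12th -> roll back to the snapshot of the 11th.
    System.out.println(lastCheckpointedSuperstep(12, 1)); // 11
    // Suraj's example: checkpoint every 5 supersteps, failure at the
    // 7th -> roll back to superstep 5; the messages of supersteps 6
    // and 7 are exactly the open question raised in the thread.
    System.out.println(lastCheckpointedSuperstep(7, 5)); // 5
  }
}
```

The second case illustrates Thomas's earlier point: with an interval greater than one, a single-task rollback alone cannot reconstruct the messages of the uncheckpointed supersteps, which is why he suggested keeping that configurability out of the BSPJob API.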
