+1

On Wed, Mar 14, 2012 at 2:58 PM, Thomas Jungblut <[email protected]> wrote:
> Great we started a nice discussion about it.
>
> > If users do something with the communication APIs in a while
> > (read.next()) loop, or in memory, recovery is not simple. So in my
> > opinion, first of all, we should separate (or hide) the data handlers
> > somewhere away from the bsp() function, like M/R or Pregel. For
> > example,
>
> Yes, but do you know what? We can simply add another layer (a retry
> layer) onto the MessageService instead of inventing new fancy methods
> and classes; a rough sketch is below. Hiding is the key.
>
> Otherwise I totally agree, Chia-Hung. I will take a deeper look into
> incremental checkpointing over the next few days.
>
> > a) If a user's bsp job fails or throws a Runtime exception while
> > running, the odds are that if we recover his task state again, the
> > task would fail again because we would be recovering his error state.
> > Hadoop has a feature where we can tolerate failure of a certain
> > configurable percentage of failed tasks during computation. I think
> > we should be open to the idea of users wanting to be oblivious to
> > certain data.
>
> Not for certain. There are cases where user code may fail due to
> unlucky timing, e.g. a downed service behind an API. That is not like a
> ClassNotFound or similar exception, which is very unlikely to be fixed
> by several attempts.
> I'm +1 for making it smarter than Hadoop, deciding from the exception
> type whether something can be recovered or not. Hadoop is doing it
> right, though.
>
> > 2. BSPPeerImpl should have a new flavor of initialization for a
> > recovery task, where:
> >   - it has a non-negative superstep to start with
>
> Yes, this must trigger a read of the last checkpointed messages.
>
> > We should provide a Map to the users to store their global state that
> > should be recovered on failure.
>
> This is really, really tricky, and I would put my effort into doing the
> simplest thing first (sorry for the users). We can add recovery for
> user internals later on.
> In my opinion, internal state is not how BSP should be coded:
> everything can be stateless, really!
>
> I know it is difficult the first time, but I have observed that the
> less state you store, the simpler the code gets, and that is what
> frameworks are for.
> So let's add this bad-style edge case in one of the upcoming releases
> if there is real demand for it.
>
> I will take a bit of time on the weekend to write up the JIRA issues
> for the things we discussed here.
>
> I think we can really start implementing right away; the simplest case
> is very straightforward and we can improve later on.
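>
> Something like this, maybe — only a rough sketch, where MessageService,
> transfer() and the retry policy are assumptions for illustration, not
> our current API:
>
> import java.io.IOException;
>
> // Sketch only: MessageService is a stand-in for whatever our message
> // transfer interface ends up being.
> interface MessageService {
>   void transfer(String peerName, String msg) throws IOException;
> }
>
> // The retry layer is a plain decorator: the peer and the user code
> // keep talking to a MessageService and never see transient failures.
> class RetryingMessageService implements MessageService {
>   private final MessageService delegate;
>   private final int maxRetries;
>
>   RetryingMessageService(MessageService delegate, int maxRetries) {
>     this.delegate = delegate;
>     this.maxRetries = maxRetries;
>   }
>
>   public void transfer(String peerName, String msg) throws IOException {
>     IOException last = null;
>     for (int attempt = 0; attempt <= maxRetries; attempt++) {
>       try {
>         delegate.transfer(peerName, msg);   // happy path: done
>         return;
>       } catch (IOException e) {             // assumed transient: retry
>         last = e;
>       }
>     }
>     throw last;   // still failing after all retries: hand over to recovery
>   }
> }
>
> The same place is where we could decide on the exception type: retry on
> something transient like an IOException, fail fast on something like a
> ClassNotFoundException.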
>
> On 14 March 2012 19:20, Suraj Menon <[email protected]> wrote:
>
> > Hello,
> >
> > +1 on Chia-Hung's comment on getting things right up to the beginning
> > of the superstep on failure.
> > I think we are all over the place (or at least I am) in our thinking
> > on fault tolerance. I want to take a step backward.
> > First we should decide on the nature of the faults that we have to
> > recover from. I am putting these in points below:
> >
> > 1. Hardware failure. In my opinion, this is the most important
> > failure cause we should be working on.
> > Why? Most other kinds of faults would happen because of errors in the
> > user's program logic, or Hama framework bugs if any. I will get to
> > those scenarios in the next point.
> > Picking up from Chia-Hung's mail, say we have 15 bsp tasks running in
> > parallel in the 12th superstep. Of those, the node running tasks 5, 6
> > and 7 failed during execution. We would have to roll back those bsp
> > tasks from the data checkpointed in superstep 11, based on the task
> > id, attempt id and job number. This means the BSPChild executor in
> > the GroomServer has to be told that the BSP peer should be started in
> > recovery mode and not assume the superstep is -1. It should then load
> > the input message queue with the checkpointed messages. It would be
> > provided with the partition that holds the checkpointed data for the
> > task id. All other tasks would be waiting in the sync barrier until
> > the recovered tasks 5, 6 and 7 enter the sync of the 12th superstep
> > to continue.
> >
> > 2. Software failure.
> >
> > a) If a user's bsp job fails or throws a Runtime exception while
> > running, the odds are that if we recover his task state again, the
> > task would fail again because we would be recovering his error state.
> > Hadoop has a feature where we can tolerate failure of a certain
> > configurable percentage of failed tasks during computation. I think
> > we should be open to the idea of users wanting to be oblivious to
> > certain data.
> >
> > b) If we have a JVM error on the GroomServer or in the BSPPeer
> > process, we can retry the execution of the tasks as in the case of
> > hardware failure.
> >
> > I agree with Chia-Hung to focus on the recovery logic and how to
> > start it. With the current design, it could be handled with the
> > following changes (rough sketch below):
> >
> > 1. GroomServer should get a new directive for a recovery task
> > (different from a new task).
> > 2. BSPPeerImpl should have a new flavor of initialization for a
> > recovery task, where:
> >   - it has a non-negative superstep to start with
> >   - the partition and input file come from the checkpointed data
> >   - the messages are read from the partition and the input queue is
> >     filled with them if superstep > 0
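> >
> > Roughly like this — purely a sketch: the class, the checkpoint path
> > layout and the one-message-per-line format are assumptions for
> > illustration (the real thing would read the checkpoint from HDFS):
> >
> > import java.io.BufferedReader;
> > import java.io.FileReader;
> > import java.io.IOException;
> > import java.util.ArrayDeque;
> > import java.util.Queue;
> >
> > // Sketch only, not existing Hama code: a recovery flavor of peer
> > // initialization that starts from a checkpointed superstep.
> > class RecoverableBSPPeer {
> >
> >   final Queue<String> localQueue = new ArrayDeque<String>();
> >   int superstep = -1;   // normal start: -1
> >
> >   void initialize(String checkpointRoot, String jobId, String taskId,
> >                   int checkpointedSuperstep) throws IOException {
> >     if (checkpointedSuperstep < 0) {
> >       return;   // fresh task: nothing to restore
> >     }
> >     superstep = checkpointedSuperstep;   // non-negative: recovery mode
> >     // assumed layout: <root>/<jobId>/<superstep>/<taskId>
> >     String path = checkpointRoot + "/" + jobId + "/"
> >         + checkpointedSuperstep + "/" + taskId;
> >     BufferedReader in = new BufferedReader(new FileReader(path));
> >     try {
> >       String msg;
> >       while ((msg = in.readLine()) != null) {
> >         localQueue.add(msg);   // refill the queue for this superstep
> >       }
> >     } finally {
> >       in.close();
> >     }
> >   }
> > }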
> >
> > We should provide a Map to the users to store their global state that
> > should be recovered on failure.
> > I will get some time from tomorrow evening. I shall get things more
> > organized.
> >
> > Thanks,
> > Suraj
> >
> > On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin <[email protected]> wrote:
> >
> > > It would be simpler for the first version of fault tolerance to
> > > focus on rolling back to the beginning of a superstep. For example,
> > > a job executes to the middle of the 12th superstep, and then a task
> > > fails. The system should restart that task (assuming a checkpoint
> > > on every superstep, and that the system fortunately has the latest
> > > snapshot of, e.g., the 11th superstep for that task) and put the
> > > necessary messages (checkpointed at the 11th superstep, to be
> > > transferred to the 12th superstep) back to the task, so that the
> > > failed task can restart from the 12th superstep.
> > >
> > > If later on the community wants checkpointing at a finer level of
> > > detail within a task, we can probably exploit something like
> > > incremental checkpointing [1], or the Memento pattern, which
> > > requires the users' assistance, to restart a task from a local
> > > checkpoint.
> > >
> > > [1] Efficient Incremental Checkpointing of Java Programs.
> > > hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
> > >
> > > On 14 March 2012 15:29, Edward J. Yoon <[email protected]> wrote:
> > >
> > > > If users do something with the communication APIs in a while
> > > > (read.next()) loop, or in memory, recovery is not simple. So in
> > > > my opinion, first of all, we should separate (or hide) the data
> > > > handlers somewhere away from the bsp() function, like M/R or
> > > > Pregel. For example:
> > > >
> > > >   ftbsp(Communicator comm);
> > > >   setup(DataInput in);
> > > >   close(DataOutput out);
> > > >
> > > > And then maybe we can design the flow of checkpoint-based (task
> > > > failure) recovery like this:
> > > >
> > > > 1. If some task fails to execute the setup() or close() function,
> > > > just re-attempt it, and finally return a "job failed" message.
> > > >
> > > > 2. If some task fails in the middle of processing and has to be
> > > > re-launched, the statuses of JobInProgress and TaskInProgress
> > > > should be changed.
> > > >
> > > > 3. In every step, all tasks should check whether their status
> > > > says to roll back to an earlier checkpoint or to keep running (or
> > > > to keep waiting to leave the barrier).
> > > >
> > > > 4. Re-launch the failed task.
> > > >
> > > > 5. Change the status from ROLLBACK or RECOVERY to RUNNING.
> > > >
> > > > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut <[email protected]> wrote:
> > > >
> > > > > Ah yes, good points.
> > > > >
> > > > > If we don't have a checkpoint from the current superstep, we
> > > > > have to do a global rollback to the last known messages.
> > > > > So we shouldn't offer this configurability through the BSPJob
> > > > > API; this is for specialized users only.
> > > > >
> > > > > > One more issue that I have in mind is how we would be able to
> > > > > > recover the values of static variables that someone would be
> > > > > > holding in each bsp job. This scenario is a problem if a user
> > > > > > is maintaining some static variable state whose lifecycle
> > > > > > spans multiple supersteps.
> > > > >
> > > > > Ideally you would transfer your shared state through the
> > > > > messages. I thought of making a backup function available in
> > > > > the BSP class where someone can back up their internal state,
> > > > > but I guess this is not how BSP should be written.
> > > > >
> > > > > Which does not mean that we don't want to provide this in
> > > > > upcoming releases.
> > > > >
> > > > > On 12 March 2012 09:01, Suraj Menon <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I want to understand single task rollback. Consider a
> > > > > > scenario where all tasks checkpoint every 5 supersteps. Now
> > > > > > when one of the tasks fails at superstep 7, it would have to
> > > > > > recover from the data checkpointed at superstep 5. How would
> > > > > > it get the messages from the peer BSPs of supersteps 6 and 7?
> > > > > >
> > > > > > One more issue that I have in mind is how we would be able to
> > > > > > recover the values of static variables that someone would be
> > > > > > holding in each bsp job. This scenario is a problem if a user
> > > > > > is maintaining some static variable state whose lifecycle
> > > > > > spans multiple supersteps.
> > > > > >
> > > > > > Thanks,
> > > > > > Suraj
> > > > > >
> > > > > > On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut <[email protected]> wrote:
> > > > > >
> > > > > > > I guess we have to slice out the issues needed for
> > > > > > > checkpoint recovery.
> > > > > > >
> > > > > > > In my opinion we have two types of recovery:
> > > > > > > - single task recovery
> > > > > > > - global recovery of all tasks
> > > > > > >
> > > > > > > And I guess we can simply make a rule:
> > > > > > > If a task fails inside our barrier sync method (since we
> > > > > > > have a double barrier: after enterBarrier() and before
> > > > > > > leaveBarrier()), we have to do a global recovery.
> > > > > > > Else we can just do a single task rollback.
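> > > > > > >
> > > > > > > As a sketch in code — BarrierPhase and both enums are made
> > > > > > > up for illustration, nothing of this exists yet:
> > > > > > >
> > > > > > > // Where the task was when it died.
> > > > > > > enum BarrierPhase { COMPUTING, INSIDE_BARRIER }
> > > > > > >
> > > > > > > enum Recovery { SINGLE_TASK_ROLLBACK, GLOBAL_ROLLBACK }
> > > > > > >
> > > > > > > class RecoveryDecision {
> > > > > > >   // Called when a task is reported as failed. A task that
> > > > > > >   // dies between enterBarrier() and leaveBarrier() cannot
> > > > > > >   // trip the barrier again on its own, so everyone rolls
> > > > > > >   // back; otherwise a single task rollback is enough.
> > > > > > >   static Recovery decide(BarrierPhase phaseAtFailure) {
> > > > > > >     return phaseAtFailure == BarrierPhase.INSIDE_BARRIER
> > > > > > >         ? Recovery.GLOBAL_ROLLBACK
> > > > > > >         : Recovery.SINGLE_TASK_ROLLBACK;
> > > > > > >   }
> > > > > > > }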
> > > >>> > > > > >>> > For those asking why we can't do just always a global rollback: > it > > is too > > > >>> > costly and we really do not need it in any case. > > > >>> > But we need it in the case where a task fails inside the barrier > > (between > > > >>> > enter and leave) just because a single rollbacked task can't trip > > the > > > >>> > enterBarrier-Barrier. > > > >>> > > > > >>> > Anything I have forgotten? > > > >>> > > > > >>> > > > > >>> > -- > > > >>> > Thomas Jungblut > > > >>> > Berlin <[email protected]> > > > >>> > > > > >>> > > > >> > > > >> > > > >> > > > >> -- > > > >> Thomas Jungblut > > > >> Berlin <[email protected]> > > > > > > > > > > > > > > > > -- > > > > Best Regards, Edward J. Yoon > > > > @eddieyoon > > > > > > -- > Thomas Jungblut > Berlin <[email protected]> >
