That is totally okay. Do you want to open them? BTW I gave you committer rights in JIRA, so have fun ;)
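One idea that recurs in the thread below is Thomas's suggestion to put a retry layer on top of the MessageService instead of inventing new abstractions. A minimal sketch of what such a layer could look like — the `MessageSender` interface and all names here are hypothetical stand-ins, not Hama's actual API:

```java
import java.io.IOException;

// Hedged sketch of a "retry layer": a decorator over a message-sending
// service that hides transient failures from the BSP task. MessageSender
// is an illustrative placeholder for Hama's MessageService.
public class RetryLayer {

  interface MessageSender {
    void send(String peer, String msg) throws IOException;
  }

  static class RetryingSender implements MessageSender {
    private final MessageSender delegate;
    private final int maxAttempts;

    RetryingSender(MessageSender delegate, int maxAttempts) {
      this.delegate = delegate;
      this.maxAttempts = maxAttempts;
    }

    @Override
    public void send(String peer, String msg) throws IOException {
      IOException last = null;
      for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          delegate.send(peer, msg);
          return;           // success: the failure stays hidden
        } catch (IOException e) {
          last = e;         // transient failure: try again
        }
      }
      throw last;           // attempts exhausted: propagate
    }
  }

  public static void main(String[] args) throws IOException {
    int[] calls = {0};
    // A sender that fails twice, then succeeds -- the layer hides that.
    MessageSender flaky = (peer, msg) -> {
      if (++calls[0] < 3) throw new IOException("transient");
    };
    new RetryingSender(flaky, 5).send("peer1", "hello");
    System.out.println("delivered after " + calls[0] + " attempts");
  }
}
```

The point of the decorator is exactly the "hiding is the key" argument made below: user code never sees the transient fault, so no new user-facing methods are needed.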
On 16 March 2012 13:33, Suraj Menon <[email protected]> wrote:

> I am planning to add the following subtasks under HAMA-505:
>
> 1. BSP Peer should have the ability to start with a non-zero superstep
> from a partition of checkpointed messages for that task ID and attempt
> ID.
> 2. For a configurable number of attempts, BSPMaster should direct the
> GroomServer to run the recovery task on failure. Implement a recovery
> directive from BSPMaster to GroomServer. The task should be started in
> state RECOVERING.
> 3. Checkpointing can be configured to be done asynchronously to the
> sync. (This is along the lines of the last comment left by Chiahung.)
> 4. Maintain global state on the last superstep for which checkpointing
> was completed successfully.
> 5. Should we keep stats and logs for the failed task that was
> re-attempted?
>
> Please share your opinions.
>
> Thanks,
> Suraj
>
> On Wed, Mar 14, 2012 at 5:19 PM, Edward J. Yoon <[email protected]> wrote:
>
> > Nice plan.
> >
> > On Thu, Mar 15, 2012 at 3:58 AM, Thomas Jungblut
> > <[email protected]> wrote:
> > > Great, we have started a nice discussion about it.
> > >
> > > > If users do something with communication APIs in a while
> > > > read.next() loop or in memory, recovery is not simple. So in my
> > > > opinion, first of all, we should separate (or hide) the data
> > > > handlers away from the bsp() function, as in M/R or Pregel. For
> > > > example,
> > >
> > > Yes, but you know what? We can simply add another layer (a retry
> > > layer) onto the MessageService instead of inventing fancy new
> > > methods and classes. Hiding is the key.
> > >
> > > Otherwise I totally agree, Chia-Hung. I will take a deeper look
> > > into incremental checkpointing over the next few days.
> > >
> > > > a) If a user's bsp job fails or throws a runtime exception while
> > > > running, the odds are that if we recover his task state again,
> > > > the task would fail again, because we would be recovering his
> > > > error state.
> > > > Hadoop has a feature where we can tolerate failure of a certain
> > > > configurable percentage of failed tasks during computation. I
> > > > think we should be open to the idea of users wanting to be
> > > > oblivious to certain data.
> > >
> > > Not for certain. There are cases where user code may fail due to
> > > unlucky timing, e.g. a downed service behind an API. That is no
> > > runtime exception like ClassNotFound or other things that are very
> > > unlikely to be fixed in several attempts.
> > > I'm +1 for making it smarter than Hadoop and deciding on the
> > > exception type whether something can be recovered or not. However,
> > > Hadoop is doing it right.
> > >
> > > > 2. BSPPeerImpl should have a new flavor of initialization for a
> > > > recovery task, where:
> > > > - it has a non-negative superstep to start with
> > >
> > > Yes, this must trigger a read of the last checkpointed messages.
> > >
> > > > We should provide a Map to the users to store their global state
> > > > that should be recovered on failure.
> > >
> > > This is really, really tricky, and I would put my effort into
> > > doing the simplest thing first (sorry for the users). We can add
> > > recovery for user internals later on.
> > > In my opinion, internal state is not how BSP should be coded:
> > > everything can be stateless, really!
> > > I know it is difficult at first, but I have observed that the less
> > > state you store, the simpler the code gets, and that's what
> > > frameworks are for.
> > > So let's add this bad-style edge case in one of the upcoming
> > > releases if there is really a demand for it.
> > >
> > > I will take a bit of time on the weekend to write the JIRA issues
> > > for the things we discussed here.
> > >
> > > I think we can really start implementing it right away; the
> > > simplest case is very straightforward, and we can improve later on.
> > >
> > > On 14 March 2012 19:20, Suraj Menon <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > +1 on Chiahung's comment on getting things right back to the
> > > > beginning of a superstep on failure.
> > > > I think we are all over the place (or at least I am) in our
> > > > thinking on fault tolerance. I want to take a step backward.
> > > > First we should decide on the nature of the faults that we have
> > > > to recover from. I am putting these in points below:
> > > >
> > > > 1. Hardware failure. In my opinion, this is the most important
> > > > failure cause we should be working on.
> > > > Why? Most other kinds of faults would happen because of errors
> > > > in user program logic, or Hama framework bugs if any. I will get
> > > > to these scenarios in the next point.
> > > > Picking up from Chiahung's mail, say we have 15 bsp tasks
> > > > running in parallel in the 12th superstep. Of those, the nodes
> > > > running task numbers 5, 6 and 7 failed during execution. We
> > > > would have to roll back those bsp tasks from the checkpointed
> > > > data of superstep 11, based on the task ID, attempt ID and job
> > > > number. This would mean that the BSPChild executor in
> > > > GroomServer has to be told that the BSP Peer should be started
> > > > in recovery mode and not assume the superstep is -1. It should
> > > > then load the input message queue with the checkpointed
> > > > messages. It would be provided with the partition that holds the
> > > > checkpointed data for the task ID. All other tasks would wait in
> > > > the sync barrier until the recovered tasks 5, 6 and 7 enter the
> > > > sync of the 12th superstep to continue.
> > > >
> > > > 2. Software failure.
> > > >
> > > > a) If a user's bsp job fails or throws a runtime exception while
> > > > running, the odds are that if we recover his task state again,
> > > > the task would fail again, because we would be recovering his
> > > > error state.
> > > > Hadoop has a feature where we can tolerate failure of a certain
> > > > configurable percentage of failed tasks during computation. I
> > > > think we should be open to the idea of users wanting to be
> > > > oblivious to certain data.
> > > >
> > > > b) If we have a JVM error on the GroomServer or in the BSPPeer
> > > > process, we can retry the execution of tasks as in the case of
> > > > hardware failure.
> > > >
> > > > I agree with Chiahung to focus on the recovery logic and how to
> > > > start it. With the current design, it could be handled with the
> > > > following changes:
> > > >
> > > > 1. GroomServer should get a new directive for a recovery task
> > > > (different from a new task).
> > > > 2. BSPPeerImpl should have a new flavor of initialization for a
> > > > recovery task, where:
> > > > - it has a non-negative superstep to start with
> > > > - the partition and input file are from the checkpointed data
> > > > - the messages are read from the partition, and the input queue
> > > > is filled with the messages if superstep > 0
> > > >
> > > > We should provide a Map to the users to store their global state
> > > > that should be recovered on failure.
> > > > I will get some time from tomorrow evening. I shall get things
> > > > more organized.
> > > >
> > > > Thanks,
> > > > Suraj
> > > >
> > > > On Wed, Mar 14, 2012 at 9:00 AM, Chia-Hung Lin
> > > > <[email protected]> wrote:
> > > >
> > > > > It would be simpler for the first version of fault tolerance
> > > > > to focus on rolling back to the beginning of a superstep. For
> > > > > example, when a job executes to the middle of the 12th
> > > > > superstep and then a task fails, the system should restart
> > > > > that task (assuming a checkpoint on every superstep and,
> > > > > fortunately, that the system has the latest snapshot of e.g.
> > > > > the 11th superstep for that task) and put the necessary
> > > > > messages (checkpointed at the 11th superstep, to be
> > > > > transferred to the 12th superstep) back to the task, so that
> > > > > the failed task can restart from the 12th superstep.
> > > > >
> > > > > If later on the community wants checkpointing at a finer level
> > > > > of detail within a task, we can probably exploit something
> > > > > like incremental checkpointing [1] or the Memento pattern,
> > > > > which requires the users' assistance, to restart a task from a
> > > > > local checkpoint.
> > > > >
> > > > > [1] Efficient Incremental Checkpointing of Java Programs.
> > > > > hal.inria.fr/inria-00072848/PDF/RR-3810.pdf
> > > > >
> > > > > On 14 March 2012 15:29, Edward J. Yoon <[email protected]> wrote:
> > > > >
> > > > > > If users do something with communication APIs in a while
> > > > > > read.next() loop or in memory, recovery is not simple. So in
> > > > > > my opinion, first of all, we should separate (or hide) the
> > > > > > data handlers away from the bsp() function, as in M/R or
> > > > > > Pregel. For example:
> > > > > >
> > > > > > ftbsp(Communicator comm);
> > > > > > setup(DataInput in);
> > > > > > close(DataOutput out);
> > > > > >
> > > > > > And then, maybe we can design the flow of checkpoint-based
> > > > > > (task failure) recovery like this:
> > > > > >
> > > > > > 1. If some task fails to execute the setup() or close()
> > > > > > functions, just re-attempt; finally return a "job failed"
> > > > > > message.
> > > > > >
> > > > > > 2. If some task fails in the middle of processing and has to
> > > > > > be re-launched, the statuses of JobInProgress and
> > > > > > TaskInProgress should be changed.
> > > > > >
> > > > > > 3. In every step, all tasks should check whether to roll
> > > > > > back to an earlier checkpoint or to keep running (or keep
> > > > > > waiting to leave the barrier).
> > > > > >
> > > > > > 4. Re-launch the failed task.
> > > > > >
> > > > > > 5. Change the status from ROLLBACK or RECOVERY to RUNNING.
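Edward's proposed separation of data handling from bsp() could be captured in an interface along these lines. This is only a sketch: the three signatures are taken from his pseudocode above, while `Communicator`, the driver, and everything else are hypothetical, not Hama's real classes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of Edward's proposal: hide the data handlers behind the
// framework (as in M/R or Pregel) so user code only touches the
// communicator. Communicator is a placeholder type, not a Hama class.
public class LifecycleSketch {

  interface Communicator {
    void send(String peer, String msg);
  }

  interface FaultTolerantBSP {
    void setup(DataInput in) throws IOException;
    void ftbsp(Communicator comm) throws IOException;
    void close(DataOutput out) throws IOException;
  }

  // Framework-side driver: because setup/compute/close are separated,
  // the framework decides what to re-run on failure -- e.g. re-attempt
  // only ftbsp() from a checkpoint, without redoing setup().
  static List<String> run(FaultTolerantBSP job, int supersteps) throws IOException {
    List<String> trace = new ArrayList<>();
    job.setup(new DataInputStream(new ByteArrayInputStream(new byte[0])));
    trace.add("setup");
    for (int s = 0; s < supersteps; s++) {
      job.ftbsp((peer, msg) -> { });  // all messaging passes through here
      trace.add("ftbsp");
    }
    job.close(new DataOutputStream(new ByteArrayOutputStream()));
    trace.add("close");
    return trace;
  }

  public static void main(String[] args) throws IOException {
    FaultTolerantBSP job = new FaultTolerantBSP() {
      public void setup(DataInput in) { }
      public void ftbsp(Communicator comm) { comm.send("peer0", "hi"); }
      public void close(DataOutput out) { }
    };
    System.out.println(run(job, 2)); // [setup, ftbsp, ftbsp, close]
  }
}
```

Since every message crosses the framework-owned `Communicator`, the framework has a natural place to checkpoint outgoing messages and to replay checkpointed ones into a recovering task, which is the whole point of hiding the handlers.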
> > > > > > On Mon, Mar 12, 2012 at 5:34 PM, Thomas Jungblut
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Ah yes, good points.
> > > > > > >
> > > > > > > If we don't have a checkpoint from the current superstep,
> > > > > > > we have to do a global rollback to the last known
> > > > > > > messages.
> > > > > > > So we shouldn't offer this configurability through the
> > > > > > > BSPJob API; this is for specialized users only.
> > > > > > >
> > > > > > > > One more issue that I have in mind is how we would be
> > > > > > > > able to recover the values of static variables that
> > > > > > > > someone would be holding in each bsp job. This scenario
> > > > > > > > is a problem if a user is maintaining some static
> > > > > > > > variable state whose lifecycle spans multiple
> > > > > > > > supersteps.
> > > > > > >
> > > > > > > Ideally you would transfer your shared state through the
> > > > > > > messages. I thought of making a backup function available
> > > > > > > in the BSP class where someone can back up their internal
> > > > > > > state, but I guess this is not how BSP should be written.
> > > > > > >
> > > > > > > Which does not mean that we don't want to provide this in
> > > > > > > later releases.
> > > > > > >
> > > > > > > On 12 March 2012 09:01, Suraj Menon <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I want to understand single-task rollback. So consider
> > > > > > > > a scenario where all tasks checkpoint every 5
> > > > > > > > supersteps. Now when one of the tasks fails at superstep
> > > > > > > > 7, it would have to recover from the checkpointed data
> > > > > > > > at superstep 5. How would it get the messages from the
> > > > > > > > peer BSPs of supersteps 6 and 7?
> > > > > > > >
> > > > > > > > One more issue that I have in mind is how we would be
> > > > > > > > able to recover the values of static variables that
> > > > > > > > someone would be holding in each bsp job.
> > > > > > > > This scenario is a problem if a user is maintaining
> > > > > > > > some static variable state whose lifecycle spans
> > > > > > > > multiple supersteps.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Suraj
> > > > > > > >
> > > > > > > > On Sat, Mar 10, 2012 at 4:11 AM, Thomas Jungblut
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I guess we have to slice out some issues needed for
> > > > > > > > > checkpoint recovery.
> > > > > > > > >
> > > > > > > > > In my opinion we have two types of recovery:
> > > > > > > > > - single-task recovery
> > > > > > > > > - global recovery of all tasks
> > > > > > > > >
> > > > > > > > > And I guess we can simply make a rule:
> > > > > > > > > If a task fails inside our barrier sync method (since
> > > > > > > > > we have a double barrier, after enterBarrier() and
> > > > > > > > > before leaveBarrier()), we have to do a global
> > > > > > > > > recovery.
> > > > > > > > > Otherwise we can just do a single-task rollback.
> > > > > > > > >
> > > > > > > > > For those asking why we can't always just do a global
> > > > > > > > > rollback: it is too costly, and we really do not need
> > > > > > > > > it in every case.
> > > > > > > > > But we do need it in the case where a task fails
> > > > > > > > > inside the barrier (between enter and leave), simply
> > > > > > > > > because a single rolled-back task can't trip the
> > > > > > > > > enterBarrier barrier.
> > > > > > > > >
> > > > > > > > > Anything I have forgotten?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thomas Jungblut
> > > > > > > > > Berlin <[email protected]>
> > > > > > >
> > > > > > > --
> > > > > > > Thomas Jungblut
> > > > > > > Berlin <[email protected]>
> > > > > >
> > > > > > --
> > > > > > Best Regards, Edward J. Yoon
> > > > > > @eddieyoon
> > >
> > > --
> > > Thomas Jungblut
> > > Berlin <[email protected]>
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon

--
Thomas Jungblut
Berlin <[email protected]>
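Putting Chia-Hung's and Suraj's rollback examples from the thread together, recovery start-up amounts to two steps: find the last checkpointed superstep, then refill the incoming queue from the checkpoint partition before resuming. A rough sketch under the assumption that a checkpoint completes at the end of every interval-th superstep; all class and method names are illustrative, not Hama's actual API:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hedged sketch of recovery-mode start-up: a recovering peer resumes
// from the last checkpointed superstep instead of superstep -1, and
// refills its input queue with the messages checkpointed for its
// task ID / attempt ID. Names here are hypothetical.
public class RecoveryStartup {

  // Assuming a checkpoint completes at the end of every `interval`-th
  // superstep, the superstep whose snapshot we roll back to after a
  // failure during superstep `failedAt` is the last completed multiple
  // of the interval.
  static int lastCheckpointedSuperstep(int failedAt, int interval) {
    return ((failedAt - 1) / interval) * interval;
  }

  // Recovery-mode queue initialization: only a task restarting at a
  // positive superstep loads the checkpointed messages.
  static Queue<String> initQueue(int startSuperstep, List<String> checkpointed) {
    Queue<String> queue = new ArrayDeque<>();
    if (startSuperstep > 0) {
      queue.addAll(checkpointed);
    }
    return queue;
  }

  public static void main(String[] args) {
    // Chia-Hung's example: checkpoint every superstep, failure in the
    // middle of the 12th -> roll back to the snapshot of the 11th.
    System.out.println(lastCheckpointedSuperstep(12, 1)); // 11
    // Suraj's example: checkpoint every 5 supersteps, failure at the
    // 7th -> roll back to superstep 5; the messages of supersteps 6
    // and 7 are exactly the open question raised in the thread.
    System.out.println(lastCheckpointedSuperstep(7, 5)); // 5
  }
}
```

The second case illustrates Thomas's earlier point: with an interval greater than one, a single-task rollback alone cannot reconstruct the messages of the uncheckpointed supersteps, which is why he suggested keeping that configurability out of the BSPJob API.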
