Ah - thanks! That really helped clarify things. Much appreciated. Will look at the patch in this light...
On Tue, Jun 7, 2011 at 1:00 PM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
>
>> Perhaps it would help if you folks could provide a little explanation about
>> how you use epoch? While the value sounds similar, your explanations are
>> beginning to sound very different from what we are doing and/or had
>> envisioned.
>>
>> I'm not sure how you can talk about an epoch being too high or too low,
>> unless you are envisioning an overall system where procs try to maintain
>> some global notion of the value - which sounds like a race condition begging
>> to cause problems.
>
> When we say epoch we mean a value that is stored locally. When a failure is
> detected, the detector notifies the HNP, who notifies everyone else. Thus
> everyone will _eventually_ receive the notification that the process has
> failed. It may take a while for you to receive the notification, but in the
> meantime you will behave normally. When you do receive the notification that
> the failure occurred, you update your local copy of the epoch.
>
> This is similar to the definition of the "perfect" failure detector that
> Josh references. It doesn't matter if you don't find out about the failure
> immediately, as long as you find out about it eventually. If you aren't
> actually in the same jobid as the failed process, you might never find out
> about the failure because it does not apply to you.
>
>> Are you then thinking that MPI processes are going to detect failure
>> instead of local orteds?? Right now, no MPI process would ever report
>> failure of a peer - the orted detects failure using the sigchild and
>> reports it. What mechanism would the MPI procs use, and how would that be
>> more reliable than sigchild??
>
> Definitely not. ORTEDs are the processes that detect and report the
> failures. They can detect the failure of other ORTEDs or of applications -
> basically anything to which they have a connection.
>
>> So right now the HNP can -never- receive more than one failure report at a
>> time for a process. The only issue we've been working on is that there are
>> several pathways for reporting that error - e.g., if the orted detects that
>> the process fails and reports it, and then the orted itself fails, we can
>> get multiple failure events back at the HNP before we respond to the first
>> one.
>>
>> Not the same issue as having MPI procs reporting failures...
>
> This is where the epoch becomes necessary. When reporting a failure, you
> tell the HNP which process failed by name, including the epoch. Thus the
> HNP will not mark a process as having failed twice (thus incrementing the
> epoch twice and notifying everyone about the failure twice). The HNP might
> receive multiple notifications because more than one ORTED could (and often
> will) detect the failure. It is easier to have the HNP decide what is a
> failure and what is a duplicate rather than have the ORTEDs reach some
> consensus about the fact that a process has failed. Much less overhead this
> way.
>
> I'm not sure what ORCM does in this respect, but I don't know of anything
> in ORTE that would track this data other than the process state, and that
> doesn't keep track of anything beyond one failure (which admittedly isn't
> an issue until we implement process recovery).
>
>> We aren't having any problems with process recovery and process state -
>> without tracking epochs. We only track "incarnations" so that we can pass
>> it down to the apps, which use that info to guide their restart.
>>
>> Could you clarify why you are having a problem in this regard? Might help
>> to better understand your proposed changes.
>
> I think we're talking about the same thing here. The only difference is
> that I'm not looking at the ORCM code, so I don't have the "incarnations".
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel