Re: [OMPI devel] RFC: revised ORTE error handling

George Bosilca Mon, 15 Jul 2013 13:21:19 -0400

Thanks for adding the capability to stop processing the callbacks. For the rest
I have no preferences; let's move forward with what's in there and adapt if new
needs appear.


  Thanks,
    George.


On Jul 15, 2013, at 16:05 , Ralph Castain <[email protected]> wrote:

> 
> On Jul 15, 2013, at 6:45 AM, George Bosilca <[email protected]> wrote:
> 
>> Ralph,
>> 
>> Sorry for the late answer, we have quite a few things on our todo list right
>> now. Here are a few concerns I have about the proposed approach.
>> 
>> 1. We would have preferred to have a list of processes for the
>> ompi_errhandler_runtime_callback function. We don't necessarily care about
>> the error code, but having a list would let us deliver the notifications in
>> bulk instead of one by one.
> 
> No problem - I can easily make that change
> 
>> 
>> 2. You made the registration of the callbacks ordered, and added special
>> arguments to append or prepend callbacks to the list. Right now I can't
>> figure out a good use for it, especially since the order might be imposed
>> by the order in which the frameworks load the modules, which is not
>> something we can easily control.
>> 
>> 3. The callback list. The concept is useful, but I'm not sure about the
>> implementation. The current version doesn't support stopping the propagation
>> of the error signal, which might be an issue in some cases. I can picture
>> a case where one level knows about the issue and knows how to fix it, so the
>> error does not need to propagate to other levels. This can be implemented
>> the way interrupts were managed in DOS, with basically a simple _get /
>> _set type of interface. If a callback wants to propagate the error, it first
>> has to retrieve its ancestor at the moment it registers the callback, and
>> then explicitly call it upon error.
>> 
> 
> Yeah, these things bothered me too. I did it for only one reason. The current
> implementation does as you describe in terms of the caller maintaining 
> ancestry. However, what if the first thing registered is the "abort" 
> callback? Then how do you avoid having "abort" called early in the process, 
> not giving other callbacks a chance to attempt to continue?
> 
> So I started with two registration calls - one for a default, and the other 
> for anything else. Then it occurred to me that someone might want a 
> "prologue" handler - e.g., start the error handling by blocking the injection 
> of any more messages until we know what the problem is. So I added a 
> registration for a prologue.
> 
> I now had registrations for a prologue, an epilogue, and a regular callback. 
> So I just generalized it, figuring that someone could ignore the ordering and 
> just add callbacks if they wanted to, but leaving the ability to specify "go 
> first" and "go last".
> 
> I don't honestly have anything specific in mind for it, but that was the 
> reasoning. I added the ability to stop processing callbacks (a return of 
> OMPI_SUCCESS will stop it), so that is there.
> 
> Any preferences?
> 
>> Again, nothing major in the short term, as it will take a significant amount
>> of work to move the only user of such error handling capability (the FT
>> prototype) back onto the current version of ORTE.
>> 
>> Regards,
>>   George.
>> 
>> 
>> 
>> On Jul 3, 2013, at 06:45 , Ralph Castain <[email protected]> wrote:
>> 
>>> **** NOTICE: This RFC modifies the MPI-RTE interface ****
>>> 
>>> WHAT: revise the RTE error handling to allow registration of callbacks upon 
>>> RTE-detected errors
>>> 
>>> WHY: currently, the RTE aborts the process if an RTE-detected error occurs. 
>>> This allows the upper layers (e.g., MPI) no chance to implement their own 
>>> error response strategy, and it precludes allowing user-defined error 
>>> handling.
>>> 
>>> TIMEOUT:  let's go for July 19th, pending further discussion
>>> 
>>> George and I were talking about ORTE's error handling the other day in 
>>> regards to the right way to deal with errors in the updated OOB. 
>>> Specifically, it seemed a bad idea for a library such as ORTE to be 
>>> aborting the job on its own prerogative. If we lose a connection or cannot 
>>> send a message, then we really should just report it upwards and let the 
>>> application and/or upper layers decide what to do about it.
>>> 
>>> The current code base only allows a single error callback to exist, which 
>>> seemed unduly limiting. So, based on the conversation, I've modified the 
>>> errmgr interface to provide a mechanism for registering any number of error 
>>> handlers (this replaces the current "set_fault_callback" API). When an 
>>> error occurs, these handlers will be called in order until one responds 
>>> that the error has been "resolved" - i.e., no further action is required. 
>>> The default MPI layer error handler is specified to go "last" and calls 
>>> mpi_abort, so the current "abort" behavior is preserved unless other error 
>>> handlers are registered.
>>> 
>>> In the register_callback function, I provide an "order" param so you can 
>>> specify "this callback must come first" or "this callback must come last". 
>>> Seemed to me that we will probably have different code areas registering 
>>> callbacks, and one might require it go first (the default "abort" will 
>>> always require it go last). So you can append and prepend, or go first/last.
>>> 
>>> The errhandler callback function passes the name of the proc involved 
>>> (which can be yourself for internal errors) and the error code. This is a 
>>> change from the current fault callback which returned an opal_pointer_array 
>>> of process names.
>>> 
>>> The work is available for review in my bitbucket:
>>> 
>>> https://bitbucket.org/rhc/ompi-errmgr
>>> 
>>> I've attached the svn diff as well.
>>> 
>>> Appreciate your comments - nothing is set in concrete.
>>> Ralph
>>> 
>>> <err.diff>
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
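For readers following the thread, the ordered callback chain discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the RFC's diff: every name here (sketch_register, sketch_notify, the SKETCH_* constants, the three sample handlers) is hypothetical, process names are reduced to plain ints, and the callback takes a list of processes as George requested in point 1. A handler returning success stops the chain, which mirrors the "a return of OMPI_SUCCESS will stop it" behavior Ralph describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the actual ORTE errmgr API. */
#define SKETCH_SUCCESS 0     /* handler resolved the error: stop the chain */
#define SKETCH_ERROR   (-1)  /* handler did not resolve it: keep going */

typedef enum {
    SKETCH_FIRST,    /* "this callback must come first" (prologue) */
    SKETCH_PREPEND,  /* front of the regular callbacks */
    SKETCH_APPEND,   /* back of the regular callbacks */
    SKETCH_LAST      /* "this callback must come last" (e.g. default abort) */
} sketch_order_t;

/* Assumption: affected processes arrive as a list of int ids. */
typedef int (*sketch_cb_t)(const int *procs, size_t nprocs, int errcode);

#define SKETCH_MAX_CBS 16
static sketch_cb_t first_cb, last_cb;
static sketch_cb_t regular[SKETCH_MAX_CBS];
static size_t nregular;

/* Register a callback; only one FIRST and one LAST slot exist. */
int sketch_register(sketch_cb_t cb, sketch_order_t order)
{
    assert(cb != NULL);
    switch (order) {
    case SKETCH_FIRST:
        if (first_cb != NULL) return SKETCH_ERROR;
        first_cb = cb;
        return SKETCH_SUCCESS;
    case SKETCH_LAST:
        if (last_cb != NULL) return SKETCH_ERROR;
        last_cb = cb;
        return SKETCH_SUCCESS;
    case SKETCH_PREPEND:
    case SKETCH_APPEND:
        if (nregular == SKETCH_MAX_CBS) return SKETCH_ERROR;
        if (order == SKETCH_PREPEND) {
            /* Shift existing entries up by one to free slot 0. */
            memmove(&regular[1], &regular[0], nregular * sizeof regular[0]);
            regular[0] = cb;
        } else {
            regular[nregular] = cb;
        }
        nregular++;
        return SKETCH_SUCCESS;
    }
    return SKETCH_ERROR;
}

/* Call handlers in order until one reports the error as resolved. */
int sketch_notify(const int *procs, size_t nprocs, int errcode)
{
    size_t i;
    if (first_cb != NULL && first_cb(procs, nprocs, errcode) == SKETCH_SUCCESS)
        return SKETCH_SUCCESS;
    for (i = 0; i < nregular; i++) {
        if (regular[i](procs, nprocs, errcode) == SKETCH_SUCCESS)
            return SKETCH_SUCCESS;
    }
    return last_cb != NULL ? last_cb(procs, nprocs, errcode) : SKETCH_ERROR;
}

/* Sample handlers that only record the order in which they ran. */
static char sketch_trace[32];
static size_t ntrace;
static void record(char tag)
{
    if (ntrace < sizeof sketch_trace - 1) sketch_trace[ntrace++] = tag;
}

/* Prologue: e.g. block message injection, then let the chain continue. */
int prologue_handler(const int *procs, size_t nprocs, int errcode)
{
    (void)procs; (void)nprocs; (void)errcode;
    record('P');
    return SKETCH_ERROR;
}

/* A layer that knows how to fix the problem, so propagation stops here. */
int recover_handler(const int *procs, size_t nprocs, int errcode)
{
    (void)procs; (void)nprocs; (void)errcode;
    record('R');
    return SKETCH_SUCCESS;
}

/* Stand-in for the default abort handler; a real one would abort the job. */
int abort_handler(const int *procs, size_t nprocs, int errcode)
{
    (void)procs; (void)nprocs; (void)errcode;
    record('A');
    return SKETCH_ERROR;
}
```

With abort_handler registered as SKETCH_LAST, recover_handler as SKETCH_APPEND and prologue_handler as SKETCH_FIRST, a call to sketch_notify runs the prologue and the recovery handler and never reaches the abort handler, whatever order the registrations were made in. That is the scenario Ralph raises: an "abort" callback registered early must not preempt handlers that could let the job continue.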
