Thanks for adding the capability to stop processing the callbacks. For the rest
I have no preference; let's move forward with what's in there and adapt if new
needs appear.
Thanks,
George.
On Jul 15, 2013, at 16:05 , Ralph Castain <[email protected]> wrote:
>
> On Jul 15, 2013, at 6:45 AM, George Bosilca <[email protected]> wrote:
>
>> Ralph,
>>
>> Sorry for the late answer, we have quite a few things on our todo list right
>> now. Here are a few concerns I have about the proposed approach.
>>
>> 1. We would have preferred to have a list of processes for the
>> ompi_errhandler_runtime_callback function. We don't necessarily care about the
>> error code, but having a list will allow us to deliver the notifications in
>> bulk instead of one by one.
>
> No problem - I can easily make that change
>
>>
>> 2. You made the registration of the callbacks ordered, and added special
>> arguments to append or prepend callbacks to the list. Right now I can't
>> figure out a good way to use it, especially since the order might be
>> imposed by the order in which the frameworks load the modules, which is not
>> something we can easily control.
>>
>> 3. The callback list. The concept is useful, but I'm not sure about the
>> implementation. The current version doesn't support stopping the propagation
>> of the error signal, which might be an issue in some cases. I can picture
>> a case where one level knows about the issue and knows how to fix it, so the
>> error does not need to propagate to other levels. This can be implemented
>> the way interrupts were managed in DOS, with basically a simple _get /
>> _set type of interface. If a callback wants to propagate the error, it first
>> has to retrieve the ancestor at the moment it registers the callback
>> and then explicitly call it upon error.
>>
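The _get / _set chaining described in point 3 can be sketched roughly as follows. This is only an illustration under stated assumptions: every name here (err_handler_get, err_handler_set, recovery_register, raise_error, and the handlers) is hypothetical and is not an ORTE or OMPI symbol, and the "abort" handler merely counts its invocations instead of aborting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: none of these are ORTE or OMPI symbols. */
typedef int (*err_handler_fn)(int errcode);

static err_handler_fn current_handler = NULL;  /* the single handler slot */

/* _get / _set pair, in the spirit of the DOS interrupt vector calls. */
err_handler_fn err_handler_get(void) { return current_handler; }
void err_handler_set(err_handler_fn fn) { current_handler = fn; }

static int abort_calls = 0;

/* Stand-in for a default "abort" handler; it only counts invocations. */
int abort_handler(int errcode)
{
    (void)errcode;
    abort_calls++;
    return 0;
}

/* Ancestor saved by the recovery layer at registration time. */
static err_handler_fn recovery_ancestor = NULL;

/* Handles error code 1 locally; anything else goes to the ancestor. */
int recovery_handler(int errcode)
{
    if (errcode == 1) {
        return 0;                            /* fixed here, do not propagate */
    }
    if (recovery_ancestor != NULL) {
        return recovery_ancestor(errcode);   /* explicit propagation */
    }
    return -1;                               /* nobody left to handle it */
}

void recovery_register(void)
{
    recovery_ancestor = err_handler_get();   /* remember the ancestor */
    err_handler_set(recovery_handler);       /* install ourselves */
}

/* Entry point a runtime would call on a detected error. */
int raise_error(int errcode)
{
    err_handler_fn fn = err_handler_get();
    assert(fn != NULL);  /* the sketch assumes a handler is installed */
    return fn(errcode);
}
```

With abort_handler installed first and recovery_register() called afterwards, raise_error(1) is absorbed by the recovery layer, while raise_error(2) reaches the abort stand-in only because the recovery handler chooses to forward it.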
>
> Yeah, these things bothered me too. I did it for only one reason. The current
> implementation does as you describe in terms of the caller maintaining
> ancestry. However, what if the first thing registered is the "abort"
> callback? Then how do you avoid having "abort" called early in the process,
> not giving other callbacks a chance to attempt to continue?
>
> So I started with two registration calls - one for a default, and the other
> for anything else. Then it occurred to me that someone might want a
> "prologue" handler - e.g., start the error handling by blocking the injection
> of any more messages until we know what the problem is. So I added a
> registration for a prologue.
>
> I now had registrations for a prologue, an epilogue, and a regular callback.
> So I just generalized it, figuring that someone could ignore the ordering and
> just add callbacks if they wanted to, but leaving the ability to specify "go
> first" and "go last".
>
> I don't honestly have anything specific in mind for it, but that was the
> reasoning. I added the ability to stop processing callbacks (a return of
> OMPI_SUCCESS will stop it), so that is there.
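The ordered registration and the stop-on-success rule described above could look roughly like the sketch below. It is not the code from the RFC branch: the order enum, register_callback's signature, the fixed-size table, and the three example handlers are all assumptions made for illustration, and SKETCH_SUCCESS stands in for OMPI_SUCCESS.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_SUCCESS 0      /* stands in for OMPI_SUCCESS */
#define SKETCH_ERROR   (-1)
#define MAX_CALLBACKS  16

/* Hypothetical ordering argument: go first, prepend, append, go last. */
typedef enum { ORDER_FIRST, ORDER_PREPEND, ORDER_APPEND, ORDER_LAST } cb_order_t;
typedef int (*err_cb_fn)(const char *proc, int errcode);

static err_cb_fn callbacks[MAX_CALLBACKS];
static int num_callbacks = 0;
static int num_first = 0;   /* entries pinned to the front */
static int num_last = 0;    /* entries pinned to the back */

int register_callback(err_cb_fn fn, cb_order_t order)
{
    int pos;
    assert(fn != NULL);
    if (num_callbacks == MAX_CALLBACKS) {
        return SKETCH_ERROR;
    }
    switch (order) {
    case ORDER_FIRST:   pos = 0; num_first++; break;
    case ORDER_PREPEND: pos = num_first; break;                /* after "first" */
    case ORDER_APPEND:  pos = num_callbacks - num_last; break; /* before "last" */
    default:            pos = num_callbacks; num_last++; break; /* ORDER_LAST */
    }
    /* Shift the tail up by one slot to open a hole at pos. */
    memmove(&callbacks[pos + 1], &callbacks[pos],
            (size_t)(num_callbacks - pos) * sizeof(callbacks[0]));
    callbacks[pos] = fn;
    num_callbacks++;
    return SKETCH_SUCCESS;
}

/* Call handlers in order; the first one returning success stops the walk. */
int dispatch_error(const char *proc, int errcode)
{
    for (int i = 0; i < num_callbacks; i++) {
        if (callbacks[i](proc, errcode) == SKETCH_SUCCESS) {
            return SKETCH_SUCCESS;
        }
    }
    return SKETCH_ERROR;
}

/* Example handlers that record the order in which they ran. */
static char trace[64];
static int trace_len = 0;
static void record(char c)
{
    if (trace_len < (int)sizeof(trace) - 1) {
        trace[trace_len++] = c;
    }
}

int prologue(const char *proc, int errcode)
{
    (void)proc; (void)errcode;
    record('P');
    return SKETCH_ERROR;      /* not resolved, keep going */
}

int try_recover(const char *proc, int errcode)
{
    (void)proc;
    record('R');
    return errcode == 1 ? SKETCH_SUCCESS : SKETCH_ERROR;
}

int default_abort(const char *proc, int errcode)
{
    (void)proc; (void)errcode;
    record('A');              /* stand-in for calling abort */
    return SKETCH_SUCCESS;
}
```

Registering default_abort with ORDER_LAST, try_recover with ORDER_APPEND, and prologue with ORDER_FIRST yields the call order prologue, try_recover, default_abort regardless of registration order, and an error that try_recover resolves never reaches the abort stand-in.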
>
> Any preferences?
>
>> Again, nothing major in the short term as it will take a significant amount
>> of work to move the only user of such error handling capability (the FT
>> prototype) back over the current version of the ORTE.
>>
>> Regards,
>> George.
>>
>>
>>
>> On Jul 3, 2013, at 06:45 , Ralph Castain <[email protected]> wrote:
>>
>>> **** NOTICE: This RFC modifies the MPI-RTE interface ****
>>>
>>> WHAT: revise the RTE error handling to allow registration of callbacks upon
>>> RTE-detected errors
>>>
>>> WHY: currently, the RTE aborts the process if an RTE-detected error occurs.
>>> This allows the upper layers (e.g., MPI) no chance to implement their own
>>> error response strategy, and it precludes allowing user-defined error
>>> handling.
>>>
>>> TIMEOUT: let's go for July 19th, pending further discussion
>>>
>>> George and I were talking about ORTE's error handling the other day in
>>> regards to the right way to deal with errors in the updated OOB.
>>> Specifically, it seemed a bad idea for a library such as ORTE to be
>>> aborting the job on its own prerogative. If we lose a connection or cannot
>>> send a message, then we really should just report it upwards and let the
>>> application and/or upper layers decide what to do about it.
>>>
>>> The current code base only allows a single error callback to exist, which
>>> seemed unduly limiting. So, based on the conversation, I've modified the
>>> errmgr interface to provide a mechanism for registering any number of error
>>> handlers (this replaces the current "set_fault_callback" API). When an
>>> error occurs, these handlers will be called in order until one responds
>>> that the error has been "resolved" - i.e., no further action is required.
>>> The default MPI layer error handler is specified to go "last" and calls
>>> mpi_abort, so the current "abort" behavior is preserved unless other error
>>> handlers are registered.
>>>
>>> In the register_callback function, I provide an "order" param so you can
>>> specify "this callback must come first" or "this callback must come last".
>>> Seemed to me that we will probably have different code areas registering
>>> callbacks, and one might require it go first (the default "abort" will
>>> always require it go last). So you can append and prepend, or go first/last.
>>>
>>> The errhandler callback function passes the name of the proc involved
>>> (which can be yourself for internal errors) and the error code. This is a
>>> change from the current fault callback which returned an opal_pointer_array
>>> of process names.
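To make the signature question concrete, here is a small sketch contrasting a per-process callback (one proc name plus an error code, as described above) with a variant that takes a list of processes, as requested elsewhere in this thread. The type proc_name_t and both function names are hypothetical stand-ins, not the actual ORTE or OMPI declarations, and the handlers only count what they are told.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a process name; not the ORTE type. */
typedef struct {
    unsigned jobid;
    unsigned vpid;
} proc_name_t;

static int upcalls = 0;        /* times the upper layer was entered */
static size_t procs_seen = 0;  /* failed processes it was told about */

/* Per-process shape: one process and one error code per call. */
int notify_single(const proc_name_t *proc, int errcode)
{
    assert(proc != NULL);
    (void)errcode;
    upcalls++;
    procs_seen++;
    return 0;
}

/* List shape: the whole batch of failed processes in one call. */
int notify_bulk(const proc_name_t *procs, size_t nprocs, int errcode)
{
    assert(procs != NULL || nprocs == 0);
    (void)errcode;
    upcalls++;
    procs_seen += nprocs;
    return 0;
}
```

For three failed processes, the per-process shape enters the upper layer three times, while the list shape enters it once with the same information, which is the "in bulk" behavior the list is meant to allow.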
>>>
>>> The work is available for review in my bitbucket:
>>>
>>> https://bitbucket.org/rhc/ompi-errmgr
>>>
>>> I've attached the svn diff as well.
>>>
>>> Appreciate your comments - nothing is set in concrete.
>>> Ralph
>>>
>>> <err.diff>
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel