No issue - just trying to get ahead of the game instead of running into an 
issue later.

We can leave it for now.

On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:

> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just call it
> directly instead of relying on the errmgr to do the right thing. So
> why complicate the errmgr with additional complexity for something
> that we don't need at the moment?
> 
> On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> So why not have the callback return an int, and your callback returns "go no 
>> further"?
>> 
>> 
>> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>> 
>>> Yeah I do not want the default fatal callback in OMPI. I want to
>>> replace it with something that allows OMPI to continue running when
>>> there are process failures (if the error handlers associated with the
>>> communicators permit such an action). So having the default fatal
>>> callback called after mine would not be useful, since I do not want
>>> the fatal action.
>>> 
>>> As long as I can replace that callback, or selectively get rid of it
>>> then I'm ok.
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>>>> 
>>>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>>>>> 
>>>>>>>> Well, you're way too trusting. ;)
>>>>>>> 
>>>>>>> It's the midwestern boy in me :)
>>>>>> 
>>>>>> Still need to shake that corn out of your head... :-)
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> This only works if all components play the game, and even then it is 
>>>>>>>> difficult if you want to allow components to deregister themselves 
>>>>>>>> in the middle of the execution. The problem is that a callback will be 
>>>>>>>> the "previous" one for some other component, so when you want to remove 
>>>>>>>> a callback you have to inform the "next" component on the callback 
>>>>>>>> chain to change its previous.
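The deregistration problem described above can be shown with a small sketch. It assumes a hypothetical swap_fault_callback() that installs a callback and returns the one it replaced; the component names and counters are invented for illustration and none of this is ORTE code.

```c
#include <stddef.h>

/* Hypothetical callback type and single active slot. */
typedef void (*fault_cb_t)(void);
static fault_cb_t active_cb = NULL;

/* Install new_cb and return the callback that was active before. */
static fault_cb_t swap_fault_callback(fault_cb_t new_cb)
{
    fault_cb_t old = active_cb;
    active_cb = new_cb;
    return old;
}

/* Distinct bodies so the three callbacks have distinct addresses. */
static int default_hits = 0, a_hits = 0, b_hits = 0;
static void default_cb(void) { default_hits++; }
static void a_cb(void) { a_hits++; }
static void b_cb(void) { b_hits++; }

/* Each component remembers what it replaced when it registered. */
static fault_cb_t a_prev = NULL;
static fault_cb_t b_prev = NULL;

void install_default(void)      { (void)swap_fault_callback(default_cb); }
void component_a_register(void) { a_prev = swap_fault_callback(a_cb); }
void component_b_register(void) { b_prev = swap_fault_callback(b_cb); }

/* Naive deregistration: put back whatever this component replaced. */
void component_a_deregister(void) { (void)swap_fault_callback(a_prev); }

fault_cb_t current_fault_callback(void) { return active_cb; }
```

If A registers, then B registers, and A later deregisters, the active callback goes back to the default: B's callback is silently dropped, and B's saved "previous" still points at A's callback. Avoiding that would require A to tell B to update its saved pointer, which is the coordination burden raised above.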
>>>>>>> 
>>>>>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>>>>>> errmgr could be dangerous since it takes control from the upper layers, 
>>>>>>> but, conversely, trusting the upper layers to 'do the right thing' with 
>>>>>>> the previous callback is probably too optimistic, esp. for layers that 
>>>>>>> are not designed together.
>>>>>>> 
>>>>>>> To that I would suggest that you leave the code as is - registering a 
>>>>>>> callback overwrites the existing callback. That will allow me to 
>>>>>>> replace the default OMPI callback when I am able to in MPI_Init, and, 
>>>>>>> if I need to, swap back in the default version at MPI_Finalize.
>>>>>>> 
>>>>>>> Does that sound like a reasonable way forward on this design point?
>>>>>> 
>>>>>> It doesn't solve the problem that George alluded to - just because you 
>>>>>> overwrite the callback, it doesn't mean that someone else won't 
>>>>>> overwrite you when their component initializes. Only the last one wins - 
>>>>>> the rest of you lose.
>>>>>> 
>>>>>> I'm not sure how you guarantee that you win, which is why I'm unclear 
>>>>>> how this callback can really work unless everyone agrees that only one 
>>>>>> place gets it. Put that callback in a base function of a new error 
>>>>>> handling framework, and then let everyone create components within that 
>>>>>> for handling desired error responses?
>>>>> 
>>>>> Yep, that is a problem, but one that we can deal with in the immediate
>>>>> case. Since OMPI is the only layer registering the callback, when I
>>>>> replace it in OMPI I will have to make sure that no other place in
>>>>> OMPI replaces the callback.
>>>>> 
>>>>> If at some point we need more than one callback above ORTE then we may
>>>>> want to revisit this point. But since we only have one layer on top of
>>>>> ORTE, it is the responsibility of that layer to be internally
>>>>> consistent with regard to which callback it wants to be triggered.
>>>>> 
>>>>> If the layers above ORTE want more than one callback I would suggest
>>>>> that that layer design some mechanism for coordinating these multiple
>>>>> - possibly conflicting - callbacks (by the way this is policy
>>>>> management, which can get complex fast as you add more interested
>>>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>>>> at the same time, then OMPI would create a mechanism for managing
>>>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>>>> to the upper layer, and keep it -simple-. If the upper layer wants to
>>>>> toy around with something more complex it must manage the complexity
>>>>> instead of artificially pushing it down to the ORTE layer.
>>>> 
>>>> I was thinking some more about this, and wonder if we aren't 
>>>> over-complicating the question.
>>>> 
>>>> Do you need to actually control the sequence of callbacks, or just ensure 
>>>> that your callback gets called prior to the default one that calls abort?
>>>> 
>>>> Meeting the latter requirement is trivial - subsequent calls to 
>>>> register_callback get pushed onto the top of the callback list. Since the 
>>>> default one always gets registered first (which we can ensure since it 
>>>> occurs in MPI_Init), it will always be at the bottom of the callback list 
>>>> and hence called last.
>>>> 
>>>> Keeping that list in ORTE is simple and probably the right place to do it.
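A minimal sketch of the "push onto the top of the list" idea, under the assumption of a small fixed-size stack. The names (register_callback, dispatch_fault, fault_cb_t) are hypothetical and not the ORTE interface, an int rank stands in for orte_process_name_t, and the stand-in default handler only records that it ran instead of calling abort.

```c
#include <stddef.h>

/* Hypothetical callback type; an int rank stands in for orte_process_name_t. */
typedef void (*fault_cb_t)(int failed_rank);

#define MAX_CALLBACKS 8
static fault_cb_t cb_stack[MAX_CALLBACKS];
static size_t cb_count = 0;

/* Later registrations sit above earlier ones. Returns 0 on success, -1 on error. */
int register_callback(fault_cb_t cb)
{
    if (cb == NULL || cb_count == MAX_CALLBACKS) {
        return -1;
    }
    cb_stack[cb_count++] = cb;
    return 0;
}

/* Call every callback, most recently registered first. */
void dispatch_fault(int failed_rank)
{
    for (size_t i = cb_count; i > 0; i--) {
        cb_stack[i - 1](failed_rank);
    }
}

/* Record the call order so it can be inspected. */
static char call_log[MAX_CALLBACKS + 1];
static size_t log_len = 0;
static void log_call(char tag)
{
    if (log_len < MAX_CALLBACKS) {
        call_log[log_len++] = tag;
    }
}

/* Stand-in for the default handler registered first (e.g. in MPI_Init). */
static void default_abort_cb(int failed_rank) { (void)failed_rank; log_call('D'); }

/* Stand-in for a handler registered later by another part of the code. */
static void recovery_cb(int failed_rank) { (void)failed_rank; log_call('R'); }
```

Registering default_abort_cb first and recovery_cb second makes dispatch_fault() call recovery_cb before default_abort_cb. Note that in this form the default handler still runs, only last.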
>>>> 
>>>> However, if you truly want to control the callback order in detail - then 
>>>> yeah, that should go up in OMPI. I sure don't want to write all that code 
>>>> :-)
>>>> 
>>>> 
>>>>> 
>>>>> -- Josh
>>>>> 
>>>>>>> 
>>>>>>> -- Josh
>>>>>>> 
>>>>>>>> 
>>>>>>>> george.
>>>>>>>> 
>>>>>>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>>>>>> 
>>>>>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> Which is a callback that just calls abort (which is what we want to do
>>>>>>>>> by default):
>>>>>>>>> -------------
>>>>>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>>>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>>>>>> }
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> This is what I want to replace. I do -not- want ompi to abort just
>>>>>>>>> because a process failed. So I need a way to replace or remove this
>>>>>>>>> callback, and put in my own callback that 'does the right thing'.
>>>>>>>>> 
>>>>>>>>> The current patch allows me to overwrite the callback when I call:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>>>>>>> -------------
>>>>>>>>> Which is fine with me.
>>>>>>>>> 
>>>>>>>>> At the point I do not want my_callback to be active any more (say in
>>>>>>>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>>>>>>>> so, with the patch's interface, I would have to know what the previous
>>>>>>>>> callback was and do:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> This comes at a slight maintenance burden since now there will be two
>>>>>>>>> places in the code that must explicitly reference
>>>>>>>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>>>>>>>> sites would have to be updated.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If you use the 'sigaction-like' interface then upon registration I
>>>>>>>>> would get the previous handler back (which would point to
>>>>>>>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&my_callback, &prev_callback);
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> And when it comes time to deregister my callback all I need to do is
>>>>>>>>> replace it with the previous callback - which I have a reference to,
>>>>>>>>> but do not need the explicit name of (passing NULL as the second
>>>>>>>>> argument tells the registration function that I don't care about the
>>>>>>>>> current callback):
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(prev_callback, NULL);
>>>>>>>>> -------------
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> So the API in the patch is fine, and I can work with it. I just
>>>>>>>>> suggested that it might be slightly better to return the previous
>>>>>>>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>>>>>>>> in case we wanted to do something with it later.
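A minimal sketch of what such a sigaction-like registration function could look like. This is an illustration under assumptions, not the code in the patch: proc_name_t is a hypothetical stand-in for orte_process_name_t, and set_fault_callback / report_fault are free functions here instead of members of orte_errmgr.

```c
#include <stddef.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct {
    int jobid;
    int vpid;
} proc_name_t;

typedef void (*fault_cb_t)(proc_name_t *proc);

/* The single active callback, as in the overwrite-only design. */
static fault_cb_t active_cb = NULL;

/* Install new_cb. If prev_cb is non-NULL, the callback being replaced is
 * stored there, in the same way sigaction() hands back the old action. */
void set_fault_callback(fault_cb_t new_cb, fault_cb_t *prev_cb)
{
    if (prev_cb != NULL) {
        *prev_cb = active_cb;
    }
    active_cb = new_cb;
}

/* What the runtime would call when it detects a process failure. */
void report_fault(proc_name_t *proc)
{
    if (active_cb != NULL) {
        active_cb(proc);
    }
}

/* Example handlers that count calls so the behaviour is observable. */
static int default_calls = 0;
static int custom_calls = 0;
static void default_cb(proc_name_t *proc) { (void)proc; default_calls++; }
static void custom_cb(proc_name_t *proc) { (void)proc; custom_calls++; }
```

The caller installs custom_cb while passing the address of a local fault_cb_t variable, and later passes that saved value back with NULL as the second argument. The name of the default callback then appears at only one site in the code.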
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> What seems to be proposed now is making the errmgr keep a list of all
>>>>>>>>> registered callbacks and call them in some order. This seems odd, and
>>>>>>>>> definitely more complex. Maybe it was just not well explained.
>>>>>>>>> 
>>>>>>>>> Maybe that is just the "computer scientist" in me :)
>>>>>>>>> 
>>>>>>>>> -- Josh
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> You mean you want the abort API to point somewhere else, without using
>>>>>>>>>> a new component?
>>>>>>>>>> Perhaps a telecon would help resolve this quicker? I'm available
>>>>>>>>>> tomorrow or anytime next week, if that helps.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> As long as there is the ability to remove and replace a callback I'm
>>>>>>>>>>> fine. I personally think that forcing the errmgr to track ordering of
>>>>>>>>>>> callback registration makes it a more complex solution, but as long as
>>>>>>>>>>> it works.
>>>>>>>>>>> 
>>>>>>>>>>> In particular I need to replace the default 'abort' errmgr call in
>>>>>>>>>>> OMPI with something else. If both are called, then this does not help
>>>>>>>>>>> me at all - since the abort behavior will be activated either before
>>>>>>>>>>> or after my callback. So can you explain how I would do that with the
>>>>>>>>>>> current or the proposed interface?
>>>>>>>>>>> 
>>>>>>>>>>> -- Josh
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>> I agree - let's not get overly complex unless we can clearly articulate
>>>>>>>>>>>> a requirement to do so.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This will require exactly opposite registration and de-registration
>>>>>>>>>>>>> order, or no de-registration at all (aka no way to unload a component).
>>>>>>>>>>>>> Or some even more complex code to deal with internally.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If the error manager handles the callbacks it can use the registration
>>>>>>>>>>>>> ordering (which will be what the approach can do), and can enforce that
>>>>>>>>>>>>> all callbacks will be called. I would prefer this approach.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> george.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would prefer returning the previous callback instead of relying on
>>>>>>>>>>>>>> the errmgr to get the ordering right. Additionally, when I want to
>>>>>>>>>>>>>> unregister (or replace) a callback it is easier to do that with a
>>>>>>>>>>>>>> single interface than to introduce a new one to remove a particular
>>>>>>>>>>>>>> callback.
>>>>>>>>>>>>>> Register:
>>>>>>>>>>>>>> ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>>>>>>>>>>>>>> Deregister:
>>>>>>>>>>>>>> ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>>>>>>>>>>>>>> or to eliminate all callbacks (if you needed that for some reason):
>>>>>>>>>>>>>> ompi_errmgr.set_fault_callback(NULL, old_callback);
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Joshua Hursey
>>>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>>>> http://users.nccs.gov/~jjhursey