No issue - just trying to get ahead of the game instead of running into an issue later.
We can leave it for now.

On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:

> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just call it
> directly instead of relying on the errmgr to do the right thing. So
> why complicate the errmgr with additional complexity for something
> that we don't need at the moment?
>
> On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> So why not have the callback return an int, and your callback returns "go no
>> further"?
>>
>> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>>
>>> Yeah I do not want the default fatal callback in OMPI. I want to
>>> replace it with something that allows OMPI to continue running when
>>> there are process failures (if the error handlers associated with the
>>> communicators permit such an action). So having the default fatal
>>> callback called after mine would not be useful, since I do not want
>>> the fatal action.
>>>
>>> As long as I can replace that callback, or selectively get rid of it,
>>> then I'm ok.
>>>
>>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>>>>
>>>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>>>>>
>>>>>>>> Well, you're way too trusting. ;)
>>>>>>>
>>>>>>> It's the midwestern boy in me :)
>>>>>>
>>>>>> Still need to shake that corn out of your head... :-)
>>>>>>
>>>>>>>> This only works if all components play the game, and even then it
>>>>>>>> is difficult if you want to allow components to deregister themselves
>>>>>>>> in the middle of the execution.
>>>>>>>> The problem is that a callback will be
>>>>>>>> the "previous" for some component, and that when you want to remove a
>>>>>>>> callback you have to inform the "next" component on the callback
>>>>>>>> chain to change its previous.
>>>>>>>
>>>>>>> This is a fair point. I think hiding the ordering of callbacks in the
>>>>>>> errmgr could be dangerous since it takes control from the upper layers,
>>>>>>> but, conversely, trusting the upper layers to 'do the right thing' with
>>>>>>> the previous callback is probably too optimistic, esp. for layers that
>>>>>>> are not designed together.
>>>>>>>
>>>>>>> To that end I would suggest that you leave the code as is - registering a
>>>>>>> callback overwrites the existing callback. That will allow me to
>>>>>>> replace the default OMPI callback when I am able to in MPI_Init, and,
>>>>>>> if I need to, swap back in the default version at MPI_Finalize.
>>>>>>>
>>>>>>> Does that sound like a reasonable way forward on this design point?
>>>>>>
>>>>>> It doesn't solve the problem that George alluded to - just because you
>>>>>> overwrite the callback, it doesn't mean that someone else won't
>>>>>> overwrite you when their component initializes. Only the last one wins -
>>>>>> the rest of you lose.
>>>>>>
>>>>>> I'm not sure how you guarantee that you win, which is why I'm unclear
>>>>>> how this callback can really work unless everyone agrees that only one
>>>>>> place gets it. Put that callback in a base function of a new error
>>>>>> handling framework, and then let everyone create components within that
>>>>>> for handling desired error responses?
>>>>>
>>>>> Yep, that is a problem, but one that we can deal with in the immediate
>>>>> case. Since OMPI is the only layer registering the callback, when I
>>>>> replace it in OMPI I will have to make sure that no other place in
>>>>> OMPI replaces the callback.
>>>>>
>>>>> If at some point we need more than one callback above ORTE then we may
>>>>> want to revisit this point.
>>>>> But since we only have one layer on top of
>>>>> ORTE, it is the responsibility of that layer to be internally
>>>>> consistent with regard to which callback it wants to be triggered.
>>>>>
>>>>> If the layer above ORTE wants more than one callback I would suggest
>>>>> that that layer design some mechanism for coordinating these multiple
>>>>> - possibly conflicting - callbacks (by the way this is policy
>>>>> management, which can get complex fast as you add more interested
>>>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>>>> at the same time, then OMPI would create a mechanism for managing
>>>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>>>> to the upper layer, and keep it -simple-. If the upper layer wants to
>>>>> toy around with something more complex it must manage the complexity
>>>>> instead of artificially pushing it down to the ORTE layer.
>>>>
>>>> I was thinking some more about this, and wonder if we aren't
>>>> over-complicating the question.
>>>>
>>>> Do you need to actually control the sequence of callbacks, or just ensure
>>>> that your callback gets called prior to the default one that calls abort?
>>>>
>>>> Meeting the latter requirement is trivial - subsequent calls to
>>>> register_callback get pushed onto the top of the callback list. Since the
>>>> default one always gets registered first (which we can ensure since it
>>>> occurs in MPI_Init), it will always be at the bottom of the callback list
>>>> and hence called last.
>>>>
>>>> Keeping that list in ORTE is simple and probably the right place to do it.
>>>>
>>>> However, if you truly want to control the callback order in detail - then
>>>> yeah, that should go up in OMPI. I sure don't want to write all that code
>>>> :-)
>>>>
>>>>>
>>>>> -- Josh
>>>>>
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>>>
>>>>>>>> george.
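
For what it's worth, the two ideas floated above (a list where later registrations are pushed on top, and a callback that returns an int meaning "go no further") could be combined roughly as sketched below. This is only a sketch under assumptions: proc_name_t, fault_cb_t, register_fault_callback, dispatch_fault and the FAULT_CB_* codes are hypothetical names made up for illustration, not the ORTE errmgr API, and the default callback only sets a flag where the real one would call ompi_mpi_abort().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct { int jobid; int vpid; } proc_name_t;

/* Hypothetical return codes: keep walking the list, or "go no further". */
enum { FAULT_CB_CONTINUE = 0, FAULT_CB_STOP = 1 };

typedef int (*fault_cb_t)(const proc_name_t *proc);

#define MAX_FAULT_CBS 8
fault_cb_t cb_stack[MAX_FAULT_CBS];
size_t cb_count = 0;

/* Push a callback on top of the list.
 * Returns 0 on success, -1 if cb is NULL or the list is full. */
int register_fault_callback(fault_cb_t cb)
{
    if (cb == NULL || cb_count == MAX_FAULT_CBS) {
        return -1;
    }
    cb_stack[cb_count++] = cb;
    return 0;
}

/* Call callbacks newest-first and stop as soon as one returns
 * FAULT_CB_STOP. Returns the number of callbacks invoked. */
int dispatch_fault(const proc_name_t *proc)
{
    int invoked = 0;
    for (size_t i = cb_count; i > 0; --i) {
        ++invoked;
        if (cb_stack[i - 1](proc) == FAULT_CB_STOP) {
            break;
        }
    }
    return invoked;
}

int abort_requested = 0;   /* set by the default callback */
int failures_handled = 0;  /* counted by the tolerant callback */

/* Stand-in for the default fatal callback (registered first, so it
 * sits at the bottom of the list). */
int default_fatal_cb(const proc_name_t *proc)
{
    (void)proc;
    abort_requested = 1;
    return FAULT_CB_CONTINUE;
}

/* A fault-tolerant callback that handles the failure and tells the
 * dispatcher not to fall through to the default one. */
int tolerant_cb(const proc_name_t *proc)
{
    (void)proc;
    ++failures_handled;
    return FAULT_CB_STOP;
}
```

Registering default_fatal_cb first and tolerant_cb second means a dispatch reaches tolerant_cb only, so the abort path is never taken. Whether this bookkeeping belongs in ORTE or in OMPI is exactly the open question in the thread; the sketch does not settle it.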
>>>>>>>>
>>>>>>>> On Jun 9, 2011, at 13:21, Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> Which is a callback that just calls abort (which is what we want to do
>>>>>>>>> by default):
>>>>>>>>> -------------
>>>>>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>>>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>>>>>> }
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> This is what I want to replace. I do -not- want ompi to abort just
>>>>>>>>> because a process failed. So I need a way to replace or remove this
>>>>>>>>> callback, and put in my own callback that 'does the right thing'.
>>>>>>>>>
>>>>>>>>> The current patch allows me to overwrite the callback when I call:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>>>>>>> -------------
>>>>>>>>> Which is fine with me.
>>>>>>>>>
>>>>>>>>> At the point I do not want my_callback to be active any more (say in
>>>>>>>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>>>>>>>> so, with the patch's interface, I would have to know what the previous
>>>>>>>>> callback was and do:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> This comes at a slight maintenance burden since now there will be two
>>>>>>>>> places in the code that must explicitly reference
>>>>>>>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>>>>>>>> sites would have to be updated.
>>>>>>>>>
>>>>>>>>> If you use the 'sigaction-like' interface then upon registration I
>>>>>>>>> would get the previous handler back (which would point to
>>>>>>>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(&my_callback, &prev_callback);
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> And when it comes time to deregister my callback all I need to do is
>>>>>>>>> replace it with the previous callback - which I have a reference to,
>>>>>>>>> but do not need the explicit name of (passing NULL as the second
>>>>>>>>> argument tells the registration function that I don't care about the
>>>>>>>>> current callback):
>>>>>>>>> -------------
>>>>>>>>> orte_errmgr.set_fault_callback(prev_callback, NULL);
>>>>>>>>> -------------
>>>>>>>>>
>>>>>>>>> So the API in the patch is fine, and I can work with it. I just
>>>>>>>>> suggested that it might be slightly better to return the previous
>>>>>>>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>>>>>>>> in case we wanted to do something with it later.
>>>>>>>>>
>>>>>>>>> What seems to be proposed now is making the errmgr keep a list of all
>>>>>>>>> registered callbacks and call them in some order. This seems odd, and
>>>>>>>>> definitely more complex. Maybe it was just not well explained.
>>>>>>>>>
>>>>>>>>> Maybe that is just the "computer scientist" in me :)
>>>>>>>>>
>>>>>>>>> -- Josh
>>>>>>>>>
>>>>>>>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> You mean you want the abort API to point somewhere else, without
>>>>>>>>>> using a new component?
>>>>>>>>>> Perhaps a telecon would help resolve this quicker? I'm available
>>>>>>>>>> tomorrow or anytime next week, if that helps.
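
To make the 'sigaction-like' registration above concrete, here is a minimal sketch of a single-slot callback with a "give me the previous one" out-parameter. Everything in it is an assumption for illustration: proc_name_t stands in for orte_process_name_t, set_fault_callback and report_fault are hypothetical free functions rather than the real orte_errmgr entry points, and the callbacks only bump counters where real ones would abort or recover.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct { int jobid; int vpid; } proc_name_t;

typedef void (*fault_cb_t)(const proc_name_t *proc);

/* The single callback slot kept by the (hypothetical) error manager. */
fault_cb_t current_cb = NULL;

/* Install new_cb. If prev is non-NULL, the callback that was installed
 * before this call is stored in *prev; passing NULL means the caller
 * does not care about it. */
void set_fault_callback(fault_cb_t new_cb, fault_cb_t *prev)
{
    if (prev != NULL) {
        *prev = current_cb;
    }
    current_cb = new_cb;
}

/* Invoke the installed callback, if there is one. */
void report_fault(const proc_name_t *proc)
{
    if (current_cb != NULL) {
        current_cb(proc);
    }
}

int aborts = 0;      /* times the default callback ran */
int recoveries = 0;  /* times the replacement callback ran */

/* Stand-in for the default callback that would abort the job. */
void default_abort_cb(const proc_name_t *proc) { (void)proc; ++aborts; }

/* Replacement callback that tolerates the failure. */
void my_callback(const proc_name_t *proc) { (void)proc; ++recoveries; }

/* Saved at registration so the restore site never names the default. */
fault_cb_t saved_cb = NULL;

/* What an init path might do: default first, then the replacement. */
void init_side(void)
{
    set_fault_callback(default_abort_cb, NULL);
    set_fault_callback(my_callback, &saved_cb);
}

/* What a finalize path might do: put back whatever was there before. */
void finalize_side(void)
{
    set_fault_callback(saved_cb, NULL);
}
```

The point of the design is visible in finalize_side(): it restores the earlier callback without referencing default_abort_cb by name, so only one site in the code needs to know which callback is the default.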
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> As long as there is the ability to remove and replace a callback I'm
>>>>>>>>>>> fine. I personally think that forcing the errmgr to track ordering of
>>>>>>>>>>> callback registration makes it a more complex solution, but as long as
>>>>>>>>>>> it works.
>>>>>>>>>>>
>>>>>>>>>>> In particular I need to replace the default 'abort' errmgr call in
>>>>>>>>>>> OMPI with something else. If both are called, then this does not help
>>>>>>>>>>> me at all - since the abort behavior will be activated either before
>>>>>>>>>>> or after my callback. So can you explain how I would do that with the
>>>>>>>>>>> current or the proposed interface?
>>>>>>>>>>>
>>>>>>>>>>> -- Josh
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>> I agree - let's not get overly complex unless we can clearly articulate
>>>>>>>>>>>> a requirement to do so.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> This will require exactly opposite registration and de-registration
>>>>>>>>>>>>> order, or no de-registration at all (aka no way to unload a component).
>>>>>>>>>>>>> Or some even more complex code to deal with internally.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the error manager handles the callbacks it can use the registration
>>>>>>>>>>>>> ordering (which will be what the approach can do), and can enforce that
>>>>>>>>>>>>> all callbacks will be called. I would rather prefer this approach.
>>>>>>>>>>>>>
>>>>>>>>>>>>> george.
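
George's ordering concern above can be reproduced with a few lines of C. The sketch below assumes a hypothetical swap_fault_callback() that installs a callback and returns the one it displaced, plus two made-up components A and B that each restore "their" previous callback on unload; none of these names come from the ORTE code base.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*fault_cb_t)(void);

/* The single callback slot (hypothetical). */
fault_cb_t current_cb = NULL;

/* Install new_cb and return the callback it displaced. */
fault_cb_t swap_fault_callback(fault_cb_t new_cb)
{
    fault_cb_t old = current_cb;
    current_cb = new_cb;
    return old;
}

/* Distinct bodies so each function has its own identity. */
int default_hits = 0, a_hits = 0, b_hits = 0;
void default_cb(void) { ++default_hits; }
void comp_a_cb(void)  { ++a_hits; }
void comp_b_cb(void)  { ++b_hits; }

/* Each component remembers what it displaced and restores it on unload. */
fault_cb_t a_prev = NULL;
fault_cb_t b_prev = NULL;

void a_load(void)   { a_prev = swap_fault_callback(comp_a_cb); }
void a_unload(void) { (void)swap_fault_callback(a_prev); }
void b_load(void)   { b_prev = swap_fault_callback(comp_b_cb); }
void b_unload(void) { (void)swap_fault_callback(b_prev); }
```

If A and B load in that order and unload in the reverse order (B, then A), the slot ends up back at default_cb. If A unloads first, it reinstalls default_cb and silently displaces B's callback while B is still loaded; when B later unloads, it reinstalls comp_a_cb, which belongs to a component that is already gone. That is the "exactly opposite registration and de-registration order" requirement in practice.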
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 9, 2011, at 08:36, Josh Hursey wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would prefer returning the previous callback instead of relying on
>>>>>>>>>>>>>> the errmgr to get the ordering right. Additionally, when I want to
>>>>>>>>>>>>>> unregister (or replace) a callback it is easier to do that with a
>>>>>>>>>>>>>> single interface than to introduce a new one to remove a particular
>>>>>>>>>>>>>> callback.
>>>>>>>>>>>>>> Register:
>>>>>>>>>>>>>>   ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>>>>>>>>>>>>>> Deregister:
>>>>>>>>>>>>>>   ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>>>>>>>>>>>>>> or to eliminate all callbacks (if you needed that for some reason):
>>>>>>>>>>>>>>   ompi_errmgr.set_fault_callback(NULL, old_callback);
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Joshua Hursey
>>>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>>>> http://users.nccs.gov/~jjhursey