Re: [OMPI devel] RFC: Resilient ORTE
On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>
>> Well, you're way too trusty. ;)
>
> It's the midwestern boy in me :)

Still need to shake that corn out of your head... :-)

>> This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be the "previous" for some component, and when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous.
>
> This is a fair point. I think hiding the ordering of callbacks in the errmgr could be dangerous since it takes control from the upper layers, but, conversely, trusting the upper layers to 'do the right thing' with the previous callback is probably too optimistic, esp. for layers that are not designed together.
>
> To that I would suggest that you leave the code as is - registering a callback overwrites the existing callback. That will allow me to replace the default OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in the default version at MPI_Finalize.
>
> Does that sound like a reasonable way forward on this design point?

It doesn't solve the problem that George alluded to - just because you overwrite the callback, it doesn't mean that someone else won't overwrite you when their component initializes. Only the last one wins - the rest of you lose.

I'm not sure how you guarantee that you win, which is why I'm unclear how this callback can really work unless everyone agrees that only one place gets it. Put that callback in a base function of a new error handling framework, and then let everyone create components within that for handling desired error responses?

> -- Josh
>
>> george.
>>
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>>
>>> Which is a callback that just calls abort (which is what we want to do by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>>
>>> This is what I want to replace. I do -not- want ompi to abort just because a process failed. So I need a way to replace or remove this callback, and put in my own callback that 'does the right thing'.
>>>
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
>>>
>>> At the point where I do not want my_callback to be active any more (say in MPI_Finalize) I would like to replace it with the old callback. To do so, with the patch's interface, I would have to know what the previous callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>>
>>> This comes at a slight maintenance burden since now there will be two places in the code that must explicitly reference 'ompi_errhandler_runtime_callback' - if it ever changed then both sites would have to be updated.
>>>
>>> If you use the 'sigaction-like' interface then upon registration I would get the previous handler back (which would point to 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>>
>>> And when it comes time to deregister my callback all I need to do is replace it with the previous callback - which I have a reference to, but do not need the explicit name of (passing NULL as the second argument tells the registration function that I don't care about the current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>> -
>>>
>>> So the API in the patch is fine, and I can work with it. I just suggested that it might be slightly better to return the previous callback (as is done in other standard interfaces - e.g., sigaction) in case we wanted to do something with it later.
>>>
>>> What seems to be proposed now is making the errmgr keep a list of all registered callbacks and call them in some order. This seems odd, and definitely more complex. Maybe it was just not well explained.
>>>
>>> Maybe that is just the "computer scientist" in me :)
>>>
>>> -- Josh
>>>
>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain wrote:
>>>> You mean you want the abort API to point somewhere else, without using a new component? Perhaps a telecon would help resolve this quicker? I'm available tomorrow or anytime next week, if th
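For illustration, a minimal sketch of the 'sigaction-like' registration described above. This is hypothetical code: the two-argument signature is Josh's proposal, not the patch's current interface, and the include paths are the usual ones in the tree.

-
#include "orte/constants.h"   /* ORTE_SUCCESS (assumed path) */
#include "orte/types.h"       /* orte_process_name_t (assumed path) */

/* Sketch of a sigaction-style fault-callback registration.  The patch's
 * set_fault_callback takes only the new callback; this variant also
 * hands back the callback it replaces, as sigaction does. */
typedef void (*fault_cb_t)(orte_process_name_t *proc);

static fault_cb_t current_cb = NULL;

int set_fault_callback(fault_cb_t new_cb, fault_cb_t *prev_cb)
{
    if (NULL != prev_cb) {
        *prev_cb = current_cb;   /* caller stashes this for deregistration */
    }
    current_cb = new_cb;
    return ORTE_SUCCESS;
}
-

With this shape, MPI_Init registers via set_fault_callback(my_callback, &prev) and MPI_Finalize restores via set_fault_callback(prev, NULL), without either site having to name ompi_errhandler_runtime_callback explicitly.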
Re: [OMPI devel] RFC: Resilient ORTE
Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon, telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the app itself, as opposed to something like a segfault or termination by signal.

However, the problem is that the app may be calling abort from within an event handler. Hence, the RML send (which is currently blocking) will never complete once we no longer allow event lib recursion (coming soon). If we use a non-blocking send, then we can't know for sure that the message has been sent before we terminate.

What we need is a non-messaging way of communicating that this was an ordered abort as opposed to a segfault or other failure. Prior to the current method, we had the app drop a file that the daemon looked for as an "abort marker", but that was ugly as it sometimes caused us to not properly clean up the session directory tree.

I'm open to suggestion - perhaps it isn't actually all that critical for us to distinguish "aborted by call to abort" from "aborted by signal", and we can just have the app commit suicide via self-imposed SIGKILL? It is only the message output to the user at the end of the job that differs - and since MPI_Abort already provides a message indicating "we called abort", is it really necessary that orte be aware of that distinction?

On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>
>> Well, you're way too trusty. ;)
>
> It's the midwestern boy in me :)
>
>> This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be the "previous" for some component, and when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous.
>
> This is a fair point. I think hiding the ordering of callbacks in the errmgr could be dangerous since it takes control from the upper layers, but, conversely, trusting the upper layers to 'do the right thing' with the previous callback is probably too optimistic, esp. for layers that are not designed together.
>
> To that I would suggest that you leave the code as is - registering a callback overwrites the existing callback. That will allow me to replace the default OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in the default version at MPI_Finalize.
>
> Does that sound like a reasonable way forward on this design point?
>
> -- Josh
>
>> george.
>>
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>>
>>> Which is a callback that just calls abort (which is what we want to do by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>>
>>> This is what I want to replace. I do -not- want ompi to abort just because a process failed. So I need a way to replace or remove this callback, and put in my own callback that 'does the right thing'.
>>>
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
>>>
>>> At the point where I do not want my_callback to be active any more (say in MPI_Finalize) I would like to replace it with the old callback. To do so, with the patch's interface, I would have to know what the previous callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>>
>>> This comes at a slight maintenance burden since now there will be two places in the code that must explicitly reference 'ompi_errhandler_runtime_callback' - if it ever changed then both sites would have to be updated.
>>>
>>> If you use the 'sigaction-like' interface then upon registration I would get the previous handler back (which would point to 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>>
>>> And when it comes time to deregister my callback all I need to do is replace it with the previous callback - which I have a reference to, but do not need the explicit name of (passing NULL as the second argument tells the registration function that I don't care about the current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
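As a concrete (if blunt) example of the SIGKILL alternative Ralph floats above - a hedged sketch, not committed code; it assumes MPI_Abort has already printed its user-facing message before this runs:

-
#include <signal.h>
#include <unistd.h>

/* Sketch: terminate without any RML round-trip to the daemon.  SIGKILL
 * cannot be caught or blocked, so the process dies immediately and the
 * daemon simply observes a killed-by-signal exit status - which is
 * precisely why the "aborted by cmd" vs. "aborted by signal"
 * distinction discussed above would be lost. */
static void abort_by_suicide(void)
{
    kill(getpid(), SIGKILL);   /* never returns */
}
-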
Re: [OMPI devel] VT support for 1.5
It's a Libtool issue (once again) which occurs if a previous build is re-configured without a subsequent "make clean" and the LIBC developer library "libutil" is added to LIBS.

The error is simple to reproduce with the following steps:

1. configure
2. make -C ompi/contrib/vt/vt/util
3. configure
or
3. touch ompi/contrib/vt/vt/util/installdirs_conf.h # created by configure
4. make -C ompi/contrib/vt/vt/util

ar: /home/jurenz/devel/ompi/v1.5/BUILD_gnu/ompi/contrib/vt/vt/util/.libs/libutil.a: No such file or directory
make: *** [libutil.la] Error 9

When re-building VT's libutil, Libtool detects the system's libutil as a dependency and tries to find a corresponding Libtool library (*.la). And here is the problem: Libtool finds ompi/contrib/vt/vt/util/libutil.la, which is still present from the previous build and has nothing to do with the system's libutil. Afterwards, Libtool fails on extracting the archive ompi/contrib/vt/vt/util/.libs/libutil.a, which, for whatever reason, isn't present.

There are different ways to fix the problem:

1. Apply the attached patch to ltmain.sh.
This patch excludes the target library name when searching for *.la libraries.

2. Rename VT's libutil.
This would prevent name conflicts with dependency libraries.

3. Clear the list of dependency libraries when building VT's libutil.
This could be done by adding LIBS= to the Makefile.am in ompi/contrib/vt/vt/util/. VT's libutil has no dependencies on other libraries except libc.

4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after re-configure.
Nonsense - it cannot be required from the user.

My favorite suggestion is 1. It would be just another patch in addition to the set of Libtool patches invoked by autogen.

What do you think?

Matthias

On Tuesday 07 June 2011 16:56:39 Jeff Squyres wrote:
> You might want to try a new checkout, just in case there's something in there that is svn:ignored...?
>
> (yes, I'm grasping at straws here, but I'm able to build ok with a clean checkout...?)
>
> On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
>> My 'svn status' indicates no differences. I always build using a VPATH, and in this case I did remove the build directory. However, the issue persisted.
>>
>> george.
>>
>> On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:
>>> I've seen VT builds get confused sometimes. I'm not sure of the exact cause, but if I get a new checkout, all the problems seem to go away. I've never had the time to track it down.
>>>
>>> Can you get a clean / new checkout and see if that fixes the problem?
>>>
>>> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
>>>> I can't compile the 1.5 if I do not disable VT. Using the following configure line:
>>>>
>>>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
>>>>
>>>> I get:
>>>>
>>>> ar: /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: No such file or directory
>>>>
>>>> Any ideas?
>>>>
>>>> george.

--- config/ltmain.sh.orig	2011-06-09 12:50:08.911201988 +0200
+++ config/ltmain.sh	2011-06-09 12:51:20.530015482 +0200
@@ -5099,7 +5099,7 @@
           # Search the libtool library
           lib="$searchdir/lib${name}${search_ext}"
           if test -f "$lib"; then
-            if test "$search_ext" = ".la"; then
+            if test "$search_ext" = ".la" -a "$lib" != "`pwd`/$outputname"; then
               found=yes
             else
               found=no
Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again.

On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:

> Ah, I see what you are getting at now.
>
> The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator known to the calling process at the OMPI level.
>
> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the affected communicator.
>
> So this has the possibility for a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From which (depending upon the error handlers at the user level) the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>
> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, each can then determine if it is transitively affected by the failure based upon the local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime - but since ORTE (today) will kill the job for OMPI, it is not a big deal for end users since the job will terminate anyway. Once we have the callback, then we can finish tightening up the OMPI layer code.
>
> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>
> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected-groups code paths.
>
> -- Josh
>
> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>
>> What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes.
>>
>> With this RFC the processes on the communicator on MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid) will be notified (if we suppose that ORTE will not make a difference between aborted and faulty). As a result the entire MPI_COMM_WORLD will be aborted, if we consider a sane application where everyone uses the same type of error handler. However, this is not enough. We have to distribute the abort signal to every other "connected" process, and I don't see how we can compute this list of connected processes in Open MPI today. It is not that I don't see it in your patch, it is that the definition of connectivity in the MPI standard is transitive and relies heavily on a correct implementation of MPI_Comm_disconnect.
>>
>> george.
>>
>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>
>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca wrote:
>>>> If this changes the behavior of MPI_Abort to only abort processes on the specified communicator, how does this not affect the default user experience (when today it aborts everything)?
>>>
>>> Open MPI does abort everything by default - decided by the runtime at the moment (but addressed in your RFC). So it does not matter if one process aborts or if many do. So the behavior of MPI_Abort experienced by the user will not change. Effectively the only change is an extra message in the runtime before the process actually calls
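For readers following along, a rough sketch of the "locally known connected set" Josh describes. This is hedged: collect_abort_procs is a hypothetical helper, but the group accessors are the OMPI ones of that era; caller is assumed to size the procs array appropriately.

-
#include "ompi/communicator/communicator.h"
#include "ompi/group/group.h"
#include "ompi/proc/proc.h"

/* Sketch: gather the ORTE names of every process in the communicator's
 * local group, plus the remote group for an intercommunicator.  This is
 * the "directly connected set known to the calling process" - it cannot
 * see transitively connected processes, which is George's objection. */
static int collect_abort_procs(ompi_communicator_t *comm,
                               orte_process_name_t *procs, int *nprocs)
{
    int i, n = 0;
    ompi_group_t *grp = comm->c_local_group;

    for (i = 0; i < ompi_group_size(grp); ++i) {
        procs[n++] = ompi_group_peer_lookup(grp, i)->proc_name;
    }
    if (OMPI_COMM_IS_INTER(comm)) {   /* intercommunicator: remote side too */
        grp = comm->c_remote_group;
        for (i = 0; i < ompi_group_size(grp); ++i) {
            procs[n++] = ompi_group_peer_lookup(grp, i)->proc_name;
        }
    }
    *nprocs = n;
    return OMPI_SUCCESS;
}
-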
Re: [OMPI devel] VT support for 1.5
+ attachment

On Friday 10 June 2011 12:00:49 you wrote:
> It's a Libtool issue (once again) which occurs if a previous build is re-configured without a subsequent "make clean" and the LIBC developer library "libutil" is added to LIBS.
>
> The error is simple to reproduce with the following steps:
>
> 1. configure
> 2. make -C ompi/contrib/vt/vt/util
> 3. configure
> or
> 3. touch ompi/contrib/vt/vt/util/installdirs_conf.h # created by configure
> 4. make -C ompi/contrib/vt/vt/util
> ar: /home/jurenz/devel/ompi/v1.5/BUILD_gnu/ompi/contrib/vt/vt/util/.libs/libutil.a: No such file or directory
> make: *** [libutil.la] Error 9
>
> When re-building VT's libutil, Libtool detects the system's libutil as a dependency and tries to find a corresponding Libtool library (*.la). And here is the problem: Libtool finds ompi/contrib/vt/vt/util/libutil.la, which is still present from the previous build and has nothing to do with the system's libutil. Afterwards, Libtool fails on extracting the archive ompi/contrib/vt/vt/util/.libs/libutil.a, which, for whatever reason, isn't present.
>
> There are different ways to fix the problem:
>
> 1. Apply the attached patch to ltmain.sh.
> This patch excludes the target library name when searching for *.la libraries.
>
> 2. Rename VT's libutil.
> This would prevent name conflicts with dependency libraries.
>
> 3. Clear the list of dependency libraries when building VT's libutil.
> This could be done by adding LIBS= to the Makefile.am in ompi/contrib/vt/vt/util/. VT's libutil has no dependencies on other libraries except libc.
>
> 4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after re-configure.
> Nonsense - it cannot be required from the user.
>
> My favorite suggestion is 1. It would be just another patch in addition to the set of Libtool patches invoked by autogen.
>
> What do you think?
>
> Matthias
>
> On Tuesday 07 June 2011 16:56:39 Jeff Squyres wrote:
>> You might want to try a new checkout, just in case there's something in there that is svn:ignored...?
>>
>> (yes, I'm grasping at straws here, but I'm able to build ok with a clean checkout...?)
>>
>> On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
>>> My 'svn status' indicates no differences. I always build using a VPATH, and in this case I did remove the build directory. However, the issue persisted.
>>>
>>> george.
>>>
>>> On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:
>>>> I've seen VT builds get confused sometimes. I'm not sure of the exact cause, but if I get a new checkout, all the problems seem to go away. I've never had the time to track it down.
>>>>
>>>> Can you get a clean / new checkout and see if that fixes the problem?
>>>>
>>>> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
>>>>> I can't compile the 1.5 if I do not disable VT. Using the following configure line:
>>>>>
>>>>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
>>>>>
>>>>> I get:
>>>>>
>>>>> ar: /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil.a: No such file or directory
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> george.

--- config/ltmain.sh.orig	2011-06-09 12:50:08.911201988 +0200
+++ config/ltmain.sh	2011-06-09 12:51:20.530015482 +0200
@@ -5099,7 +5099,7 @@
           # Search the libtool library
           lib="$searchdir/lib${name}${search_ext}"
           if test -f "$lib"; then
-            if test "$search_ext" = ".la"; then
+            if test "$search_ext" = ".la" -a "$lib" != "`pwd`/$outputname"; then
               found=yes
             else
               found=no
Re: [OMPI devel] RFC: Resilient ORTE
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusty. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-)
>
>>> This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be the "previous" for some component, and when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous.
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr could be dangerous since it takes control from the upper layers, but, conversely, trusting the upper layers to 'do the right thing' with the previous callback is probably too optimistic, esp. for layers that are not designed together.
>>
>> To that I would suggest that you leave the code as is - registering a callback overwrites the existing callback. That will allow me to replace the default OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>
> It doesn't solve the problem that George alluded to - just because you overwrite the callback, it doesn't mean that someone else won't overwrite you when their component initializes. Only the last one wins - the rest of you lose.
>
> I'm not sure how you guarantee that you win, which is why I'm unclear how this callback can really work unless everyone agrees that only one place gets it. Put that callback in a base function of a new error handling framework, and then let everyone create components within that for handling desired error responses?

Yep, that is a problem, but one that we can deal with in the immediate case. Since OMPI is the only layer registering the callback, when I replace it in OMPI I will have to make sure that no other place in OMPI replaces the callback.

If at some point we need more than one callback above ORTE then we may want to revisit this point. But since we only have one layer on top of ORTE, it is the responsibility of that layer to be internally consistent with regard to which callback it wants to be triggered.

If the layers above ORTE want more than one callback I would suggest that that layer design some mechanism for coordinating these multiple - possibly conflicting - callbacks (by the way, this is policy management, which can get complex fast as you add more interested parties). Meaning that if OMPI wanted multiple callbacks to be active at the same time, then OMPI would create a mechanism for managing these callbacks, not ORTE. ORTE should just have one callback provided to the upper layer, and keep it -simple-. If the upper layer wants to toy around with something more complex, it must manage the complexity instead of artificially pushing it down to the ORTE layer.

-- Josh

>>
>> -- Josh
>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>> -
>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>> -
>>>>
>>>> Which is a callback that just calls abort (which is what we want to do by default):
>>>> -
>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>> }
>>>> -
>>>>
>>>> This is what I want to replace. I do -not- want ompi to abort just because a process failed. So I need a way to replace or remove this callback, and put in my own callback that 'does the right thing'.
>>>>
>>>> The current patch allows me to overwrite the callback when I call:
>>>> -
>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>> -
>>>> Which is fine with me.
>>>>
>>>> At the point where I do not want my_callback to be active any more (say in MPI_Finalize) I would like to replace it with the old callback. To do so, with the patch's interface, I would have to know what the previous callback was and do:
>>>> -
>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>> -
>>>>
>>>> This comes at a slight maintenance burden since now there will be two places in the code that must explicitly reference 'ompi_errhandler_runtime_callback' - if it ever changed then both sites would have to be updated.
>>>>
>>>> If you use the 'sigaction-like' interface then upon registration I would get the previous handler back (which would point to 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>> -
>>>> orte_errmgr.set_fault_callba
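A hedged sketch of the kind of OMPI-side mechanism Josh is describing - all names here are hypothetical, the FIFO ordering is one arbitrary policy choice, and ORTE still sees exactly one registered callback:

-
#include "ompi/constants.h"   /* OMPI_SUCCESS, OMPI_ERR_OUT_OF_RESOURCE */
#include "orte/types.h"       /* orte_process_name_t (assumed path) */

#define MAX_FAULT_CBS 8

typedef void (*ompi_fault_cb_t)(orte_process_name_t *proc);

static ompi_fault_cb_t fault_cbs[MAX_FAULT_CBS];
static int num_fault_cbs = 0;

/* The one callback OMPI would hand to orte_errmgr.set_fault_callback().
 * The ordering/policy decisions live entirely in OMPI, not ORTE. */
static void ompi_fault_mux(orte_process_name_t *proc)
{
    int i;
    for (i = 0; i < num_fault_cbs; ++i) {
        fault_cbs[i](proc);
    }
}

/* OMPI-internal registration; never touches the ORTE interface. */
int ompi_fault_cb_register(ompi_fault_cb_t cb)
{
    if (num_fault_cbs >= MAX_FAULT_CBS) {
        return OMPI_ERR_OUT_OF_RESOURCE;
    }
    fault_cbs[num_fault_cbs++] = cb;
    return OMPI_SUCCESS;
}
-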
Re: [OMPI devel] RFC: Resilient ORTE
Okay, finally have time to sit down and review this. It looks pretty much identical to what was done in ORCM - we just kept "epoch" separate from the process name, and used multicast to notify all procs that someone failed. I do have a few questions/comments about your proposed patch:

1. I note that in some places you just set peer_name.epoch = proc_name.epoch, and in others you make the assignment by calling a new API, orte_ess.proc_get_epoch(&proc_name). Ditto for proc_set_epoch. What are the rules for when each method should be used? Which leads to...

2. I'm puzzled as to why you are storing process state and epoch number in the modex as well as in the process name and the orte_proc_t struct. This creates a bit of a race condition, as the two will be out-of-sync for some (probably small) period of time, and looks like unnecessary duplication. Is there some reason for doing this? We are trying to eliminate duplicate storage because of the data confusion and memory issues, hence my question.

3. As a follow-on to #2, I am bothered that we now have the ESS storing proc state. That isn't the functional purpose of the ESS - that's a PLM function. Is there some reason for doing this in the ESS? Why aren't we just looking at the orte_proc_t for that proc and using its state field? I guess I can understand if you want to get that via an API (instead of having code to look up the proc_t in multiple places), but then let's put it in the PLM please. I note that it is only used in the binomial routing code, so why not just put a static function in there to get the state of a proc rather than creating another API?

4. ess_base_open.c: the default orte_ess module appears to be missing an entry for proc_set_epoch.

5. I really don't think that notification of proc failure belongs in orted_comm - messages notifying of proc failure should be received in the errmgr. This allows people who want to handle things differently (e.g., orcm) the ability to create their own errmgr component(s) for daemons and the HNP that send the messages over their desired messaging system, decide how they want to respond, etc. Putting it in orted_comm forces everyone to use only this one method, which conflicts with allowing freedom for others to explore alternative methods, and frankly, I don't see any strong reason that outweighs that limitation.

6. I don't think this errmgr_fault_callback registration is going to work, per my response to Josh's RFC. I'll leave the discussion in that thread.

On Jun 6, 2011, at 1:00 PM, George Bosilca wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemon) and application-level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it died so far), and adds an application failure notification callback function to be registered in the runtime.
>
> WHY: Necessary to correctly implement the error handling in the MPI 2.2 standard. In addition, such a resilient runtime is a cornerstone for any level of fault tolerance support we want to provide in the future (such as the MPI-3 Run-Through Stabilization or FT-MPI).
>
> WHEN:
>
> WHERE: Patch attached to this email, based on trunk r24747.
>
> TIMEOUT: 2 weeks from now, on Monday 20 June.
>
> --
>
> MORE DETAILS:
>
> Currently the infrastructure required to enable any kind of fault tolerance development in Open MPI (with the exception of checkpoint/restart) is missing. However, before developing any fault tolerance support at the application (MPI) level, we need to have a resilient runtime. The changes in this patch address this lack of support and would allow anyone to implement a fault tolerance protocol at the MPI layer without having to worry about ORTE stabilization.
>
> This patch will allow the runtime to drop any dead daemons, and re-route all communications around the holes in order to __ALWAYS__ deliver a message as long as the destination process is alive. The application is informed (via a callback) about the loss of processes with the same jobid. In this patch we do not address the MPI_ERRORS_RETURN type of failures; we focused on the MPI_ERRORS_ARE_FATAL ones. Moreover, we empowered the application level with the decision, instead of taking it down in the runtime.
>
> NEW STUFF:
>
> Epoch - A counter that tracks the number of times a process has been detected to have terminated, either from a failure or an expected termination. After the termination is detected, the HNP coordinates all other processes' knowledge of the new epoch. Each ORTED will know the epoch of the other processes in the job, but it will not actually store anything until the epochs change.
>
> Run-Through Stabilization - When an ORTED (or HNP) detects that another process has terminated, it repairs the routing layer and informs the HNP. The HNP tells all other proc
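For orientation, the shape of the name extension George describes - a hedged sketch using stand-in type names, not the patch's literal code:

-
#include <stdint.h>

/* Stand-ins for the real ORTE typedefs; the actual definitions live in
 * orte/include/orte/types.h. */
typedef uint32_t orte_jobid_t;
typedef uint32_t orte_vpid_t;
typedef uint32_t orte_epoch_t;   /* illustrative name for the new field */

/* Sketch of the extended process name: jobid/vpid as before, plus an
 * epoch counting how many times this process has been declared dead.
 * A message carrying a stale epoch refers to a previous incarnation of
 * the process and can be recognized (and, e.g., dropped) accordingly. */
typedef struct {
    orte_jobid_t jobid;
    orte_vpid_t  vpid;
    orte_epoch_t epoch;
} orte_process_name_t;
-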
Re: [OMPI devel] RFC: Resilient ORTE
Another problem with this patch, that I mentioned to Wesley and George off-list, is that it does not handle the case where mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes, unexpected behavior occurred when a process failed. So for those just listening to this thread, Wesley is working on a revised patch to address this problem that he will post when it is ready.

As far as the RML issue, doesn't the ORTE state machine branch handle that case? If it does, then let's push the solution to that problem until that branch comes around instead of solving it twice.

-- Josh

On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain wrote:
> Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the app itself as opposed to something like a segfault or termination by signal.
>
> However, the problem is that the app may be calling abort from within an event handler. Hence, the RML send (which is currently blocking) will never complete once we no longer allow event lib recursion (coming soon). If we use a non-blocking send, then we can't know for sure that the message has been sent before we terminate.
>
> What we need is a non-messaging way of communicating that this was an ordered abort as opposed to a segfault or other failure. Prior to the current method, we had the app drop a file that the daemon looked for as an "abort marker", but that was ugly as it sometimes caused us to not properly clean up the session directory tree.
>
> I'm open to suggestion - perhaps it isn't actually all that critical for us to distinguish "aborted by call to abort" from "aborted by signal", and we can just have the app commit suicide via self-imposed SIGKILL? It is only the message output to the user at the end of the job that differs - and since MPI_Abort already provides a message indicating "we called abort", is it really necessary that we have orte aware of that distinction?
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusty. ;)
>>
>> It's the midwestern boy in me :)
>>
>>> This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be the "previous" for some component, and when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous.
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr could be dangerous since it takes control from the upper layers, but, conversely, trusting the upper layers to 'do the right thing' with the previous callback is probably too optimistic, esp. for layers that are not designed together.
>>
>> To that I would suggest that you leave the code as is - registering a callback overwrites the existing callback. That will allow me to replace the default OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>>
>> -- Josh
>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>> -
>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>> -
>>>>
>>>> Which is a callback that just calls abort (which is what we want to do by default):
>>>> -
>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>> }
>>>> -
>>>>
>>>> This is what I want to replace. I do -not- want ompi to abort just because a process failed. So I need a way to replace or remove this callback, and put in my own callback that 'does the right thing'.
>>>>
>>>> The current patch allows me to overwrite the callback when I call:
>>>> -
>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>> -
>>>> Which is fine with me.
>>>>
>>>> At the point where I do not want my_callback to be active any more (say in MPI_Finalize) I would like to replace it with the old callback. To do so, with the patch's interface, I would have to know what the previous callback was and do:
>>>> -
>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>> -
>>>>
>>>> This comes at a slight maintenance burden since now there will be two places i
Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
Why would this patch result in zombied processes and poor cleanup? When ORTE receives notification of a process terminating/aborting, it triggers the termination of the job (without UTK's RFC), which should ensure a clean shutdown. This patch just tells ORTE that a few other processes should be the first to die, which will trigger the same response in ORTE.

I guess I'm unclear about this concern, since it should then be a concern in the current ORTE as well. I agree that it will be a concern once we have the OMPI layer handling error management triggered off of a callback, but that is a different RFC.

Something that might help those listening to this thread. The current behavior of MPI_Abort in OMPI results in the semantics of:
--
internal_MPI_Abort(MPI_COMM_SELF, exit_code)
--
regardless of the communicator actually passed to MPI_Abort at the application level. It should be:
--
internal_MPI_Abort(comm_provided, exit_code)
--

Semantically, this patch just makes the group actually being aborted match the communicator provided. In practicality, the job will terminate when any process in the job calls abort - so the result (in today's codebase) will be the same.

-- Josh

On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain wrote:
> I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again.
>
> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>
>> Ah, I see what you are getting at now.
>>
>> The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator known to the calling process at the OMPI level.
>>
>> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the affected communicator.
>>
>> So this has the possibility for a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From which (depending upon the error handlers at the user level) the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>>
>> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, each can then determine if it is transitively affected by the failure based upon the local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime - but since ORTE (today) will kill the job for OMPI, it is not a big deal for end users since the job will terminate anyway. Once we have the callback, then we can finish tightening up the OMPI layer code.
>>
>> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>>
>> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected-groups code paths.
>>
>> -- Josh
>>
>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>
>>> What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes.
>>>
>>> With this RFC the processes on the communicator on MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid) w
Re: [OMPI devel] RFC: Resilient ORTE
On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:

> Another problem with this patch, that I mentioned to Wesley and George off-list, is that it does not handle the case where mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes, unexpected behavior occurred when a process failed. So for those just listening to this thread, Wesley is working on a revised patch to address this problem that he will post when it is ready.

See my other response to the patch - I think we need to understand why we are storing state in multiple places, as it can create unexpected behavior when things are out-of-sync.

> As far as the RML issue, doesn't the ORTE state machine branch handle that case? If it does, then let's push the solution to that problem until that branch comes around instead of solving it twice.

No, it doesn't - in fact, it's what breaks the current method. Because we no longer allow event recursion, the RML message never gets out of the app. Hence my question.

I honestly don't think we need to have orte be aware of the distinction between "aborted by cmd" and "aborted by signal", as the only diff is in the error message. There ought to be some other way of resolving this?

> -- Josh
>
> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain wrote:
>> Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the app itself as opposed to something like a segfault or termination by signal.
>>
>> However, the problem is that the app may be calling abort from within an event handler. Hence, the RML send (which is currently blocking) will never complete once we no longer allow event lib recursion (coming soon). If we use a non-blocking send, then we can't know for sure that the message has been sent before we terminate.
>>
>> What we need is a non-messaging way of communicating that this was an ordered abort as opposed to a segfault or other failure. Prior to the current method, we had the app drop a file that the daemon looked for as an "abort marker", but that was ugly as it sometimes caused us to not properly clean up the session directory tree.
>>
>> I'm open to suggestion - perhaps it isn't actually all that critical for us to distinguish "aborted by call to abort" from "aborted by signal", and we can just have the app commit suicide via self-imposed SIGKILL? It is only the message output to the user at the end of the job that differs - and since MPI_Abort already provides a message indicating "we called abort", is it really necessary that we have orte aware of that distinction?
>>
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>
>>>> Well, you're way too trusty. ;)
>>>
>>> It's the midwestern boy in me :)
>>>
>>>> This only works if all components play the game, and even then it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be the "previous" for some component, and when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous.
>>>
>>> This is a fair point. I think hiding the ordering of callbacks in the errmgr could be dangerous since it takes control from the upper layers, but, conversely, trusting the upper layers to 'do the right thing' with the previous callback is probably too optimistic, esp. for layers that are not designed together.
>>>
>>> To that I would suggest that you leave the code as is - registering a callback overwrites the existing callback. That will allow me to replace the default OMPI callback when I am able to in MPI_Init, and, if I need to, swap back in the default version at MPI_Finalize.
>>>
>>> Does that sound like a reasonable way forward on this design point?
>>>
>>> -- Josh
>>>
>>>> george.
>>>>
>>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>>
>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>> -
>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>> -
>>>>>
>>>>> Which is a callback that just calls abort (which is what we want to do by default):
>>>>> -
>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>> }
>>>>> -
>>>>>
>>>>> This is what I want to replace. I do -not- want ompi to abort just because a process failed. So
Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality
On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:

> Why would this patch result in zombied processes and poor cleanup? When ORTE receives notification of a process terminating/aborting, it triggers the termination of the job (without UTK's RFC), which should ensure a clean shutdown. This patch just tells ORTE that a few other processes should be the first to die, which will trigger the same response in ORTE.
>
> I guess I'm unclear about this concern, since it should then be a concern in the current ORTE as well. I agree that it will be a concern once we have the OMPI layer handling error management triggered off of a callback, but that is a different RFC.

My comment was to "the future" - i.e., looking to the point where we get layered, rolling aborts. I agree that this specific RFC won't change the current behavior, and as I said, I have no issue with it.

> Something that might help those listening to this thread. The current behavior of MPI_Abort in OMPI results in the semantics of:
> --
> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
> --
> regardless of the communicator actually passed to MPI_Abort at the application level. It should be:
> --
> internal_MPI_Abort(comm_provided, exit_code)
> --
>
> Semantically, this patch just makes the group actually being aborted match the communicator provided. In practicality, the job will terminate when any process in the job calls abort - so the result (in today's codebase) will be the same.
>
> -- Josh
>
> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain wrote:
>> I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again.
>>
>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>
>>> Ah, I see what you are getting at now.
>>>
>>> The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator known to the calling process at the OMPI level.
>>>
>>> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the affected communicator.
>>>
>>> So this has the possibility for a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From which (depending upon the error handlers at the user level) the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>>>
>>> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, each can then determine if it is transitively affected by the failure based upon the local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime - but since ORTE (today) will kill the job for OMPI, it is not a big deal for end users since the job will terminate anyway. Once we have the callback, then we can finish tightening up the OMPI layer code.
>>>
>>> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>>>
>>> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected-groups co
Re: [OMPI devel] RFC: Resilient ORTE
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way to trusty. ;) >>> >>> It's the midwestern boy in me :) >> >> Still need to shake that corn out of your head... :-) >> >>> This only works if all component play the game, and even then there it is difficult if you want to allow components to deregister themselves in the middle of the execution. The problem is that a callback will be previous for some component, and that when you want to remove a callback you have to inform the "next" component on the callback chain to change its previous. >>> >>> This is a fair point. I think hiding the ordering of callbacks in the >>> errmgr could be dangerous since it takes control from the upper layers, >>> but, conversely, trusting the upper layers to 'do the right thing' with the >>> previous callback is probably too optimistic, esp. for layers that are not >>> designed together. >>> >>> To that I would suggest that you leave the code as is - registering a >>> callback overwrites the existing callback. That will allow me to replace >>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, >>> swap back in the default version at MPI_Finalize. >>> >>> Does that sound like a reasonable way forward on this design point? >> >> It doesn't solve the problem that George alluded to - just because you >> overwrite the callback, it doesn't mean that someone else won't overwrite >> you when their component initializes. Only the last one wins - the rest of >> you lose. >> >> I'm not sure how you guarantee that you win, which is why I'm unclear how >> this callback can really work unless everyone agrees that only one place >> gets it. Put that callback in a base function of a new error handling >> framework, and then let everyone create components within that for handling >> desired error responses? > > Yep, that is a problem, but one that we can deal with in the immediate > case. Since OMPI is the only layer registering the callback, when I > replace it in OMPI I will have to make sure that no other place in > OMPI replaces the callback. > > If at some point we need more than one callback above ORTE then we may > want to revisit this point. But since we only have one layer on top of > ORTE, it is the responsibility of that layer to be internally > consistent with regard to which callback it wants to be triggered. > > If the layers above ORTE want more than one callback I would suggest > that that layer design some mechanism for coordinating these multiple > - possibly conflicting - callbacks (by the way this is policy > management, which can get complex fast as you add more interested > parties). Meaning that if OMPI wanted multiple callbacks to be active > at the same time, then OMPI would create a mechanism for managing > these callbacks, not ORTE. ORTE should just have one callback provided > to the upper layer, and keep it -simple-. If the upper layer wants to > toy around with something more complex it must manage the complexity > instead of artificially pushing it down to the ORTE layer. I agree - I was just proposing one way of doing that in the MPI layer so you wouldn't have to play policeman on the rest of the code base to ensure nobody else inserts a callback without realizing they overwrote yours. 
I can envision, for example, UTK wanting to do something different from you, and perhaps committing a callback that unintentionally overrode you. Up to you...just making a suggestion.
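For reference, a sketch of the overwrite-with-restore pattern being discussed here (the sigaction-like variant from earlier in the thread); set_fault_callback below is a stand-in, not the actual ORTE interface:
-
#include <stddef.h>

/* Stand-in registration interface, not the real ORTE one. */
typedef void (*fault_cb_t)(void *proc);
static fault_cb_t current_cb = NULL;

/* Install cb; if prev is non-NULL, hand back the old callback so the
 * caller can restore it later without knowing its name. */
void set_fault_callback(fault_cb_t cb, fault_cb_t *prev)
{
    if (NULL != prev) {
        *prev = current_cb;
    }
    current_cb = cb;
}

/* Hypothetical usage:
 *   in MPI_Init:      set_fault_callback(my_cb, &saved);
 *   in MPI_Finalize:  set_fault_callback(saved, NULL);   */
-
The point of returning the previous callback is exactly the restore step: only one callback is ever active, but the replacer can put things back without hard-coding the default's name.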
Re: [OMPI devel] RFC: Resilient ORTE
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote:
>
> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>
>> Another problem with this patch, that I mentioned to Wesley and George off list, is that it does not handle the case when mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes then unexpected behavior occurred when a process failed. So for those just listening to this thread, Wesley is working on a revised patch to address this problem that he will post when it is ready.
>
> See my other response to the patch - I think we need to understand why we are storing state in multiple places, as it can create unexpected behavior when things are out-of-sync.
>
>> As far as the RML issue, doesn't the ORTE state machine branch handle that case? If it does, then let's push the solution to that problem until that branch comes around instead of solving it twice.
>
> No, it doesn't - in fact, it's what breaks the current method. Because we no longer allow event recursion, the RML message never gets out of the app. Hence my question.
>
> I honestly don't think we need to have orte be aware of the distinction between "aborted by cmd" and "aborted by signal", as the only diff is in the error message. There ought to be some other way of resolving this?

MPI_Abort will need to tell ORTE which processes should be 'aborted by signal' along with the calling process. So there needs to be a mechanism for that as well. Not sure if I have a good solution to this in mind just yet.

A thought though: in the state machine version, the process calling MPI_Abort could post a message to the processing thread and return from the callback. The callback would have a check at the bottom to determine if MPI_Abort was triggered within the callback, and just sleep. The processing thread would progress the RML message and once finished call exit(). This implies that the application process has a separate processing thread. But I think we might be able to post the RML message in the callback, then wait for it to complete outside of the callback before returning control to the user. :/ Interesting.

-- Josh

>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain wrote:
>>> Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the app itself, as opposed to something like a segfault or termination by signal.
>>>
>>> However, the problem is that the app may be calling abort from within an event handler. Hence, the RML send (which is currently blocking) will never complete once we no longer allow event lib recursion (coming soon). If we use a non-blocking send, then we can't know for sure that the message has been sent before we terminate.
>>>
>>> What we need is a non-messaging way of communicating that this was an ordered abort as opposed to a segfault or other failure. Prior to the current method, we had the app drop a file that the daemon looked for as an "abort marker", but that was ugly as it sometimes caused us to not properly clean up the session directory tree.
>>> I'm open to suggestion - perhaps it isn't actually all that critical for us to distinguish "aborted by call to abort" from "aborted by signal", and we can just have the app commit suicide via self-imposed SIGKILL? It is only the message output to the user at the end of the job that differs - and since MPI_Abort already provides a message indicating "we called abort", is it really necessary that we have orte aware of that distinction?
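The "post in the callback, complete outside" idea floated above might look roughly like this; the message-posting and progress functions are hypothetical stand-ins for the RML/opal_progress machinery, not real symbols:
-
#include <stdbool.h>
#include <unistd.h>

extern void post_abort_msg_nb(void (*done_fn)(void)); /* assumed: non-blocking send */
extern void progress_one_pass(void);                  /* assumed: opal_progress-like tick */

static volatile bool abort_msg_done = false;
static void on_msg_done(void) { abort_msg_done = true; }

void app_abort(void)
{
    /* Inside an event handler it is only safe to *post* the message. */
    post_abort_msg_nb(on_msg_done);

    /* Then, once control is back outside the handler, spin progress until
     * the send completes; only then is it safe to terminate. (As noted in
     * the thread, this only works if app_abort itself is not running
     * inside the event loop.) */
    while (!abort_msg_done) {
        progress_one_pass();
    }
    _exit(1);
}
-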
Re: [OMPI devel] VT support for 1.5
On Jun 10, 2011, at 5:16 AM, Matthias Jurenz wrote:
> There are different ways to fix the problem:
>
> 1. Apply the attached patch on ltmain.sh.
>
> This patch excludes the target library name from searching *.la libraries.

Does your patch work for vpath builds, too? If so, isn't this something that should be submitted upstream?

> 2. Rename the VT's libutil.
>
> This would prevent name conflicts with dependency libraries.

This is my preference; can't it just be renamed to libvtutil or something?

> 3. Clear the list of dependency libraries when building VT's libutil.
>
> This could be done by adding LIBS= to the Makefile.am in ompi/contrib/vt/vt/util/. The VT's libutil has no dependencies on other libraries except libc.

That seems like it would work, but feels a bit hack-ish.

> 4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after re-configure.
>
> Nonsense - it cannot be required from the user.

Agreed.

> My favorite suggestion is 1. It would be just another patch in addition to the set of Libtool patches invoked by autogen.

Keep in mind that most (all?) of those are for handling older versions of the GNU Autotools, and/or for patches that have been submitted upstream but are not part of an official release yet. Or, they are for v1.5.x, where we have "locked in" the versions of the GNU Autotools for the entire series and won't upgrade, even if newer versions fix the things we've put in patches for.

-- Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
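If option 3 were taken, the change would presumably be as small as the following (an untested sketch of the Makefile.am addition Matthias describes, not a verified fix):
-
# In ompi/contrib/vt/vt/util/Makefile.am: clear the inherited list of
# dependency libraries, since this libutil needs nothing beyond libc.
LIBS =
-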
Re: [OMPI devel] RFC: Resilient ORTE
On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:
> MPI_Abort will need to tell ORTE which processes should be 'aborted by signal' along with the calling process. So there needs to be a mechanism for that as well. Not sure if I have a good solution to this in mind just yet.

Ah yes - that would require a communication anyway.

> A thought though: in the state machine version, the process calling MPI_Abort could post a message to the processing thread and return from the callback. The callback would have a check at the bottom to determine if MPI_Abort was triggered within the callback, and just sleep. The processing thread would progress the RML message and once finished call exit(). This implies that the application process has a separate processing thread. But I think we might be able to post the RML message in the callback, then wait for it to complete outside of the callback before returning control to the user. :/ Interesting.

Could work, though it does require a thread. You would have to be tricky about it, though, as it is possible the call to "abort" could occur in an event handler. If you block in that handler waiting for the message to have been sent, it never will leave, as the RML uses the event lib to trigger the actual send.

I may have a solution to the latter problem. For similar reasons, I've had to change the errmgr so it doesn't immediately process errors - otherwise, its actions become constrained by the question of "am I in an event handler or not". To remove the uncertainty, I'm rigging it so that all errmgr processing is done in an event - basically, reporting an error causes the errmgr to push the error into a pipe, which triggers an event that actually processes it. Only way I could deal with the uncertainty.

So if that mechanism is in place, the only thing you would have to do is (a) call abort, and then (b) cycle opal_progress until the errmgr.abort function callback occurred.
Of course, we would then have to modify the errmgr so that abort took a callback function that it called when the app is free to exit. No perfect solution, I fear.
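The pipe-plus-event scheme described here is essentially the classic self-pipe pattern; a generic illustration with plain libevent follows (this is not the actual errmgr code, and the error payload is reduced to a bare int):
-
#include <unistd.h>
#include <event2/event.h>

static int errpipe[2];

/* Always runs from the event loop, so it is never nested inside the
 * code path that reported the error. */
static void process_error(evutil_socket_t fd, short what, void *arg)
{
    int code;
    if (read(fd, &code, sizeof(code)) == (ssize_t)sizeof(code)) {
        /* ... actually act on the error here ... */
    }
}

void errmgr_setup(struct event_base *base)
{
    pipe(errpipe);
    struct event *ev = event_new(base, errpipe[0], EV_READ | EV_PERSIST,
                                 process_error, NULL);
    event_add(ev, NULL);
}

/* Safe to call from anywhere, including inside another event handler:
 * the write just queues the error, and the event fires later. */
void errmgr_report(int code)
{
    write(errpipe[1], &code, sizeof(code));
}
-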
Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions
Reminder that this RFC goes in later today.

On Wed, Jun 8, 2011 at 10:32 AM, Jeff Squyres wrote:
> This one's a no-brainer, folks. :-)
>
> Josh [re]discovered that we didn't initially support Fortran interfaces for the extensions when he was trying to make a complete implementation for an MPI-3 Forum proposal.
>
> +1
>
> On Jun 8, 2011, at 10:11 AM, Josh Hursey wrote:
>> WHAT: Fortran 77 and 90 support for the Open MPI Extensions
>>
>> WHY: Trunk only supports C.
>>
>> WHERE: build system updates, ompi/mpiext
>>
>> WHEN: Open MPI trunk
>>
>> TIMEOUT: Friday, June 10, 2011 COB
>>
>> Details:
>> ---
>> A bitbucket branch is available here (last sync to r24757 of trunk): https://bitbucket.org/jjhursey/ompi-ext-fortran
>>
>> The current Open MPI trunk supports only C interfaces to Open MPI interface extensions. This branch adds support for f77 and f90. Supporting these three language interfaces enables Fortran applications to take advantage of available interface extensions. Configure detects if the extension supports C, f77, and/or f90 and takes the appropriate action. The C interfaces are required, and the f77/f90 interfaces are optional. This fix does not require changes to any existing extensions.

-- Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI devel] RFC: Resilient ORTE
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
> ORTE should just have one callback provided to the upper layer, and keep it -simple-. If the upper layer wants to toy around with something more complex it must manage the complexity instead of artificially pushing it down to the ORTE layer.

I was thinking some more about this, and wonder if we aren't over-complicating the question.

Do you need to actually control the sequence of callbacks, or just ensure that your callback gets called prior to the default one that calls abort?
Meeting the latter requirement is trivial - subsequent calls to register_callback get pushed onto the top of the callback list. Since the default one always gets registered first (which we can ensure, since it occurs in MPI_Init), it will always be at the bottom of the callback list and hence called last.

Keeping that list in ORTE is simple and probably the right place to do it.

However, if you truly want to control the callback order in detail - then yeah, that should go up in OMPI. I sure don't want to write all that code :-)
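A sketch of that LIFO scheme (illustrative C, not actual ORTE code): registration pushes onto the head of the list, so the default abort callback, registered first in MPI_Init, always fires last:
-
#include <stdlib.h>

typedef void (*fault_cb_t)(void *proc);

typedef struct cb_item {
    fault_cb_t      cb;
    struct cb_item *next;
} cb_item_t;

static cb_item_t *cb_list = NULL;

/* Push onto the head: the newest registration is called first. */
void register_callback(fault_cb_t cb)
{
    cb_item_t *item = malloc(sizeof(*item));
    item->cb   = cb;
    item->next = cb_list;
    cb_list    = item;
}

/* Walk newest-to-oldest; the default abort callback sits at the tail. */
void fire_callbacks(void *proc)
{
    for (cb_item_t *it = cb_list; NULL != it; it = it->next) {
        it->cb(proc);
    }
}
-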
Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions
Committed in r24772: https://svn.open-mpi.org/trac/ompi/changeset/24772

Thanks folks,
Josh

-- Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI devel] RFC: Resilient ORTE
Yeah I do not want the default fatal callback in OMPI. I want to replace it with something that allows OMPI to continue running when there are process failures (if the error handlers associated with the communicators permit such an action). So having the default fatal callback called after mine would not be useful, since I do not want the fatal action.

As long as I can replace that callback, or selectively get rid of it, then I'm ok.

On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain wrote:
> I was thinking some more about this, and wonder if we aren't over-complicating the question.
>
> Do you need to actually control the sequence of callbacks, or just ensure that your callback gets called prior to the default one that calls abort?
Re: [OMPI devel] RFC: Resilient ORTE
So why not have the callback return an int, and your callback returns "go no further"?

On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
> As long as I can replace that callback, or selectively get rid of it, then I'm ok.
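Ralph's suggestion amounts to a short-circuiting walk of the same list; a sketch (with hypothetical return codes) of how a resilience callback could shield the default abort callback beneath it:
-
#define CB_CONTINUE 0   /* hypothetical: keep walking the list */
#define CB_STOP     1   /* hypothetical: "go no further" */

typedef int (*fault_cb_t)(void *proc);

typedef struct cb_item {
    fault_cb_t      cb;
    struct cb_item *next;
} cb_item_t;

/* Populated by register_callback() as in the previous list sketch. */
static cb_item_t *cb_list = NULL;

void fire_callbacks(void *proc)
{
    for (cb_item_t *it = cb_list; NULL != it; it = it->next) {
        if (CB_STOP == it->cb(proc)) {
            break;  /* a resilience callback returns CB_STOP, so the
                     * default abort callback at the tail never runs */
        }
    }
}
-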
Re: [OMPI devel] RFC: Resilient ORTE
We could, but we could also just replace the callback. I will never want to use it in my scenario, and if I did, then I could just call it directly instead of relying on the errmgr to do the right thing. So why add complexity to the errmgr for something that we don't need at the moment?

On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain wrote:
> So why not have the callback return an int, and your callback returns "go no further"?
Re: [OMPI devel] RFC: Resilient ORTE
No issue - just trying to get ahead of the game instead of running into an issue later. We can leave it for now.

On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:
> We could, but we could also just replace the callback. I will never want to use it in my scenario, and if I did, then I could just call it directly instead of relying on the errmgr to do the right thing.