Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way too trusting. ;)
> 
> It's the midwestern boy in me :)

Still need to shake that corn out of your head... :-)

> 
>> 
>> This only works if all components play the game, and even then it is 
>> difficult if you want to allow components to deregister themselves in the 
>> middle of the execution. The problem is that a callback will be the "previous" 
>> for some component, and when you want to remove a callback you have to 
>> inform the "next" component in the callback chain to change its "previous".
> 
> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
> could be dangerous since it takes control from the upper layers, but, 
> conversely, trusting the upper layers to 'do the right thing' with the 
> previous callback is probably too optimistic, esp. for layers that are not 
> designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace the 
> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
> back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?

It doesn't solve the problem that George alluded to - just because you 
overwrite the callback, it doesn't mean that someone else won't overwrite you 
when their component initializes. Only the last one wins - the rest of you lose.

I'm not sure how you guarantee that you win, which is why I'm unclear how this 
callback can really work unless everyone agrees that only one place gets it. 
Put that callback in a base function of a new error handling framework, and 
then let everyone create components within that for handling desired error 
responses?
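
For illustration, that could look something like the following - every name
here is hypothetical, not existing ORTE code - with the base function owning
the single registration and fanning out to the framework's components:
-
/* hypothetical framework base: owns the one ORTE callback and
 * dispatches to every opened component of the new framework */
typedef struct {
    void (*proc_fault)(orte_process_name_t *proc);
} errhandler_component_t;

static errhandler_component_t *components[8];
static int num_components = 0;

static void errhandler_base_fault_cb(orte_process_name_t *proc)
{
    int i;
    for (i = 0; i < num_components; i++) {
        components[i]->proc_fault(proc);  /* each component's response */
    }
}

void errhandler_base_open(void)
{
    /* the only set_fault_callback call anywhere in the code base */
    orte_errmgr.set_fault_callback(errhandler_base_fault_cb);
}
-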


> 
> -- Josh
> 
>> 
>> george.
>> 
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>> 
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> Which is a callback that just calls abort (which is what we want to do
>>> by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>> 
>>> This is what I want to replace. I do -not- want ompi to abort just
>>> because a process failed. So I need a way to replace or remove this
>>> callback, and put in my own callback that 'does the right thing'.
>>> 
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
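>>> 
>>> (For illustration, a sketch of such a replacement - the body of
>>> my_callback here is hypothetical; the point is only that it notes
>>> the failure without aborting:)
>>> -
>>> /* hypothetical replacement: record the failure but do not abort,
>>>  * leaving the decision to the MPI-level error handlers */
>>> void my_callback(orte_process_name_t *proc) {
>>>  opal_output(0, "Process %s failed; continuing",
>>>              ORTE_NAME_PRINT(proc));
>>> }
>>> -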
>>> 
>>> At the point I do not want my_callback to be active any more (say in
>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>> so, with the patch's interface, I would have to know what the previous
>>> callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> This comes at a slight maintenance burden since now there will be two
>>> places in the code that must explicitly reference
>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>> sites would have to be updated.
>>> 
>>> 
>>> If you use the 'sigaction-like' interface then upon registration I
>>> would get the previous handler back (which would point to
>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>> 
>>> And when it comes time to deregister my callback all I need to do is
>>> replace it with the previous callback - which I have a reference to,
>>> but do not need the explicit name of (passing NULL as the second
>>> argument tells the registration function that I don't care about the
>>> current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>> -
>>> 
>>> 
>>> So the API in the patch is fine, and I can work with it. I just
>>> suggested that it might be slightly better to return the previous
>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>> in case we wanted to do something with it later.
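>>> 
>>> (A minimal sketch of that sigaction-like variant - the typedef and
>>> the pointer-to-pointer out-parameter are illustrative, not the
>>> patch's actual signatures:)
>>> -
>>> typedef void (*orte_fault_cbfunc_t)(orte_process_name_t *proc);
>>> 
>>> /* register new_cb; if prev is non-NULL, return the callback that
>>>  * was replaced through it */
>>> int set_fault_callback(orte_fault_cbfunc_t new_cb,
>>>                        orte_fault_cbfunc_t *prev);
>>> 
>>> orte_fault_cbfunc_t prev_callback;
>>> set_fault_callback(my_callback, &prev_callback); /* MPI_Init */
>>> set_fault_callback(prev_callback, NULL);         /* MPI_Finalize */
>>> -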
>>> 
>>> 
>>> What seems to be proposed now is making the errmgr keep a list of all
>>> registered callbacks and call them in some order. This seems odd, and
>>> definitely more complex. Maybe it was just not well explained.
>>> 
>>> Maybe that is just the "computer scientist" in me :)
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain  wrote:
 You mean you want the abort API to point somewhere else, without using a 
 new
 component?
 Perhaps a telecon would help resolve this quicker? I'm available tomorrow 
 or
 anytime next week, if th

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Something else you might want to address in here: the current code sends an RML 
message from the proc calling abort to its local daemon telling the daemon that 
we are exiting due to the app calling "abort". We needed to do this because we 
wanted to flag the proc termination as one induced by the app itself as opposed 
to something like a segfault or termination by signal.

However, the problem is that the app may be calling abort from within an event 
handler. Hence, the RML send (which is currently blocking) will never complete 
once we no longer allow event lib recursion (coming soon). If we use a 
non-blocking send, then we can't know for sure that the message has been sent 
before we terminate.

What we need is a non-messaging way of communicating that this was an ordered 
abort as opposed to a segfault or other failure. Prior to the current method, 
we had the app drop a file that the daemon looked for as an "abort marker", 
but that was ugly as it sometimes caused us to not properly clean up the session 
directory tree.

I'm open to suggestions - perhaps it isn't actually all that critical for us to 
distinguish "aborted by call to abort" from "aborted by signal", and we can 
just have the app commit suicide via self-imposed SIGKILL? It is only the 
message output to the user at the end of the job that differs - and since 
MPI_Abort already provides a message indicating "we called abort", is it really 
necessary that we have orte aware of that distinction?


On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:

> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way too trusting. ;)
> 
> It's the midwestern boy in me :)
> 
>> 
>> This only works if all components play the game, and even then it is 
>> difficult if you want to allow components to deregister themselves in the 
>> middle of the execution. The problem is that a callback will be the "previous" 
>> for some component, and when you want to remove a callback you have to 
>> inform the "next" component in the callback chain to change its "previous".
> 
> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
> could be dangerous since it takes control from the upper layers, but, 
> conversely, trusting the upper layers to 'do the right thing' with the 
> previous callback is probably too optimistic, esp. for layers that are not 
> designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace the 
> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
> back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?
> 
> -- Josh
> 
>> 
>> george.
>> 
>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>> 
>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> Which is a callback that just calls abort (which is what we want to do
>>> by default):
>>> -
>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>> }
>>> -
>>> 
>>> This is what I want to replace. I do -not- want ompi to abort just
>>> because a process failed. So I need a way to replace or remove this
>>> callback, and put in my own callback that 'does the right thing'.
>>> 
>>> The current patch allows me to overwrite the callback when I call:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback);
>>> -
>>> Which is fine with me.
>>> 
>>> At the point I do not want my_callback to be active any more (say in
>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>> so, with the patch's interface, I would have to know what the previous
>>> callback was and do:
>>> -
>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>> -
>>> 
>>> This comes at a slight maintenance burden since now there will be two
>>> places in the code that must explicitly reference
>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>> sites would have to be updated.
>>> 
>>> 
>>> If you use the 'sigaction-like' interface then upon registration I
>>> would get the previous handler back (which would point to
>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>> -
>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>> -
>>> 
>>> And when it comes time to deregister my callback all I need to do is
>>> replace it with the previous callback - which I have a reference to,
>>> but do not need the explicit name of (passing NULL as the second
>>> argument tells the registration function that I don't care about the
>>> current callback):
>>> -
>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>> 

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Matthias Jurenz
It's a Libtool issue (once again) which occurs if a previous build is re-
configured without subsequent "make clean" and the LIBC developer library 
"libutil" is added to LIBS.

The error is simple to reproduce by the following steps:

1. configure
2. make -C ompi/contrib/vt/vt/util
3. configure
or
3. touch ompi/contrib/vt/vt/util/installdirs_conf.h # created by configure
4. make -C ompi/contrib/vt/vt/util
ar: 
/home/jurenz/devel/ompi/v1.5/BUILD_gnu/ompi/contrib/vt/vt/util/.libs/libutil.a: 
No such file or directory
make: *** [libutil.la] Error 9

When re-building VT's libutil, Libtool detects the system's libutil as a 
dependency and tries to find a corresponding Libtool library (*.la). And here 
is the problem: Libtool finds ompi/contrib/vt/vt/util/libutil.la, which is still 
present from the previous build and has nothing to do with the system's 
libutil. Libtool then fails when extracting the archive 
ompi/contrib/vt/vt/util/.libs/libutil.a, which doesn't exist.


There are different ways to fix the problem:

1. Apply the attached patch on ltmain.sh.

This patch excludes the target library name from searching *.la libraries.

2. Rename the VT's libutil

This would prevent name conflicts with dependency libraries.

3. Clear list of dependency libraries when building VT's libutil.

This could be done by adding LIBS= to the Makefile.am in 
ompi/contrib/vt/vt/util/. VT's libutil has no dependencies on other 
libraries except libc.

4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after re-
configure.

Nonsense - it cannot be required of the user.


My favorite suggestion is 1. It would be just another patch in addition to the 
set of Libtool patches invoked by autogen.

What do you think?


Matthias

On Tuesday 07 June 2011 16:56:39 Jeff Squyres wrote:
> You might want to try a new checkout, just in case there's something in
> there that is svn:ignored...?
> 
> (yes, I'm grasping at straws here, but I'm able to build ok with a clean
> checkout...?)
> 
> On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
> > My 'svn status' indicates no differences. I always build using a VPATH,
> > and in this case I did remove the build directory. However, the issue
> > persisted.
> > 
> >  george.
> > 
> > On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:
> >> I've seen VT builds get confused sometimes.  I'm not sure of the exact
> >> cause, but if I get a new checkout, all the problems seem to go away. 
> >> I've never had the time to track it down.
> >> 
> >> Can you get a clean / new checkout and see if that fixes the problem?
> >> 
> >> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
> >>> I can't compile 1.5 if I do not disable VT. Using the following
> >>> configure line:
> >>> 
> >>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug
> >>> --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem
> >>> --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
> >>> 
> >>> I get:
> >>> 
> >>> ar:
> >>> /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libutil
> >>> .a: No such file or directory
> >>> 
> >>> Any ideas?
> >>> 
> >>> george.
> >>> 
> >>> 
> >>> ___
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > 
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
--- config/ltmain.sh.orig	2011-06-09 12:50:08.911201988 +0200
+++ config/ltmain.sh	2011-06-09 12:51:20.530015482 +0200
@@ -5099,7 +5099,7 @@
 	  # Search the libtool library
 	  lib="$searchdir/lib${name}${search_ext}"
 	  if test -f "$lib"; then
-		if test "$search_ext" = ".la"; then
+		if test "$search_ext" = ".la" -a "$lib" != "`pwd`/$outputname"; then
 		  found=yes
 		else
 		  found=no




Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Ralph Castain
I have no issue with uncommenting the code. However, I do see a future littered 
with lots of zombied processes and complaints over poor cleanup again


On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:

> Ah I see what you are getting at now.
> 
> The construction of the list of connected processes is something I, 
> intentionally, did not modify from the current Open MPI code. The list is 
> calculated based on the locally known set of local and remote process groups 
> attached to the communicator. So this is the set of directly connected 
> processes in the specified communicator known to the calling process at the 
> OMPI level.
> 
> ORTE is asked to abort this defined set of processes. Once those processes 
> are terminated then ORTE needs to eventually inform all of the processes (in 
> the jobid(s) specified - maybe other jobids too?) that these processes have 
> failed/aborted. Upon notification of the failed/aborted processes the local 
> process (at the OMPI level) needs to determine if that process loss is 
> critical based upon the error handlers attached to communicators that it 
> shares with the failed/aborted processes.  That should be handled in the 
> callback from the errmgr at the OMPI level, since connectedness is an MPI 
> construct. If the process failure/abort is critical to the local process, 
> then upon notification the local process can call abort on the communicator 
> affected.
> 
> So this has the possibility for a rolling abort effect [the abort of one 
> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From 
> there (depending upon the error handlers at the user level) the system will 
> eventually converge to either some stable subset of processes or all processes 
> aborting, resulting in job termination.
> 
> The rolling abort effect relies heavily upon the ability of the runtime to 
> make sure that all process failures/aborts are eventually known to all alive 
> processes. Since all alive processes will know of the failure/abort, each can 
> then determine whether it is transitively affected by the failure based upon 
> its local list of communicators and associated error handlers. But to 
> complete this aspect of the abort procedure, we do need the callback 
> mechanism from the runtime - but since ORTE (today) will kill the job for 
> OMPI then it is not a big deal for end users since the job will terminate 
> anyway. Once we have the callback, then we can finish tightening up the OMPI 
> layer code.
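> 
> (A sketch of the decision that callback would make - the communicator
> registry and the helper functions are hypothetical, as the real OMPI
> bookkeeping is more involved:)
> -
> extern ompi_communicator_t *comms[]; /* illustrative registry */
> extern int num_comms;
> 
> /* hypothetical OMPI-level fault callback: abort a communicator only
>  * if it contains the failed proc and its error handler is fatal */
> void ompi_fault_cb(orte_process_name_t *failed) {
>     int i;
>     for (i = 0; i < num_comms; i++) {
>         ompi_communicator_t *comm = comms[i];
>         if (comm_contains_proc(comm, failed) &&   /* hypothetical */
>             comm_errhandler_is_fatal(comm)) {     /* hypothetical */
>             ompi_mpi_abort(comm, 1, false);       /* rolling abort */
>         }
>     }
> }
> -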
> 
> It is not perfect, but I think it does address the transitive nature of the 
> connectivity of MPI processes by relying on the runtime to provide uniform 
> notification of failures. I figure that we will need to look over this code 
> again and verify that the implementation of MPI_Comm_disconnect and 
> associated underpinnings do the 'right thing' with regard to updating the 
> communicator structures. But I think that is best addressed as a second set 
> of patches.
> 
> 
> The goal of this patch is to put back in functionality that was commented out 
> during the last reorganization of the errmgr. What will likely follow, once 
> we have notification of failure/abort at the OMPI level, is a cleanup of the 
> connected groups code paths.
> 
> 
> -- Josh
> 
> 
> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
> 
>> What I'm saying is that there is no reason to have any other type of 
>> MPI_Abort if we are not able to compute the set of connected processes. 
>> 
>> With this RFC the processes in the communicator passed to MPI_Abort will abort. 
>> Then the other processes in the same MPI_COMM_WORLD (in fact jobid) will be 
>> notified (if we suppose that ORTE will not distinguish between 
>> aborted and faulty). As a result the entire MPI_COMM_WORLD will be aborted, 
>> if we consider a sane application where everyone uses the same type of error 
>> handler. However, this is not enough. We have to distribute the abort signal 
>> to every other "connected" process, and I don't see how we can compute this 
>> list of connected processes in Open MPI today. It is not that I don't see it 
>> in your patch, it is that the definition of connectivity in the MPI 
>> standard is transitive and relies heavily on a correct implementation of 
>> MPI_Comm_disconnect.
>> 
>> george.
>> 
>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>> 
>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca  wrote:
 If this changes the behavior of MPI_Abort to only abort processes on the 
 specified communicator, how does this not affect the default user 
 experience (when today it aborts everything)?
>>> 
>>> Open MPI does abort everything by default - decided by the runtime at
>>> the moment (but addressed in your RFC). So it does not matter if one
>>> process aborts or if many do. So the behavior of MPI_Abort experienced
>>> by the user will not change. Effectively the only change is an extra
>>> message in the runtime before the process actually calls
>

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Matthias Jurenz
+ attachment

On Friday 10 June 2011 12:00:49 you wrote:
> It's a Libtool issue (once again) which occurs if a previous build is re-
> configured without subsequent "make clean" and the LIBC developer library
> "libutil" is added to LIBS.
> 
> The error is simple to reproduce by the following steps:
> 
> 1. configure
> 2. make -C ompi/contrib/vt/vt/util
> 3. configure
> or
> 3. touch ompi/contrib/vt/vt/util/installdirs_conf.h # created by configure
> 4. make -C ompi/contrib/vt/vt/util
> ar:
> /home/jurenz/devel/ompi/v1.5/BUILD_gnu/ompi/contrib/vt/vt/util/.libs/libuti
> l.a: No such file or directory
> make: *** [libutil.la] Error 9
> 
> When re-building VT's libutil, Libtool detects the system's libutil as a
> dependency and tries to find a corresponding Libtool library (*.la). And
> here is the problem: Libtool finds ompi/contrib/vt/vt/util/libutil.la,
> which is still present from the previous build and has nothing to do with
> the system's libutil. Libtool then fails when extracting the archive
> ompi/contrib/vt/vt/util/.libs/libutil.a, which doesn't exist.
> 
> 
> There are different ways to fix the problem:
> 
> 1. Apply the attached patch on ltmain.sh.
> 
> This patch excludes the target library name from searching *.la libraries.
> 
> 2. Rename the VT's libutil
> 
> This would prevent name conflicts with dependency libraries.
> 
> 3. Clear list of dependency libraries when building VT's libutil.
> 
> This could be done by adding LIBS= to the Makefile.am in
> ompi/contrib/vt/vt/util/. VT's libutil has no dependencies on other
> libraries except libc.
> 
> 4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after
> re-configure.
> 
> Nonsense - it cannot be required of the user.
> 
> 
> My favorite suggestion is 1. It would be just another patch in addition to
> the set of Libtool patches invoked by autogen.
> 
> What do you think?
> 
> 
> Matthias
> 
> On Tuesday 07 June 2011 16:56:39 Jeff Squyres wrote:
> > You might want to try a new checkout, just in case there's something in
> > there that is svn:ignored...?
> > 
> > (yes, I'm grasping at straws here, but I'm able to build ok with a clean
> > checkout...?)
> > 
> > On Jun 7, 2011, at 10:38 AM, George Bosilca wrote:
> > > My 'svn status' indicates no differences. I always build using a VPATH,
> > > and in this case I did remove the build directory. However, the issue
> > > persisted.
> > > 
> > >  george.
> > > 
> > > On Jun 7, 2011, at 10:31 , Jeff Squyres wrote:
> > >> I've seen VT builds get confused sometimes.  I'm not sure of the exact
> > >> cause, but if I get a new checkout, all the problems seem to go away.
> > >> I've never had the time to track it down.
> > >> 
> > >> Can you get a clean / new checkout and see if that fixes the problem?
> > >> 
> > >> On Jun 7, 2011, at 10:27 AM, George Bosilca wrote:
> > >>> I can't compile 1.5 if I do not disable VT. Using the following
> > >>> configure line:
> > >>> 
> > >>> ../ompi/configure --prefix=/home/bosilca/opt/stable/1.5/debug
> > >>> --enable-mpirun-prefix-by-default --with-knem=/usr/local/knem
> > >>> --with-mx=/usr/local/mx-1.2.11 --enable-picky --enable-debug
> > >>> 
> > >>> I get:
> > >>> 
> > >>> ar:
> > >>> /home/bosilca/unstable/1.5/debug/ompi/contrib/vt/vt/util/.libs/libuti
> > >>> l .a: No such file or directory
> > >>> 
> > >>> Any ideas?
> > >>> 
> > >>> george.
> > >>> 
> > >>> 
> > >>> ___
> > >>> devel mailing list
> > >>> de...@open-mpi.org
> > >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > ___
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
--- config/ltmain.sh.orig	2011-06-09 12:50:08.911201988 +0200
+++ config/ltmain.sh	2011-06-09 12:51:20.530015482 +0200
@@ -5099,7 +5099,7 @@
 	  # Search the libtool library
 	  lib="$searchdir/lib${name}${search_ext}"
 	  if test -f "$lib"; then
-		if test "$search_ext" = ".la"; then
+		if test "$search_ext" = ".la" -a "$lib" != "`pwd`/$outputname"; then
 		  found=yes
 		else
 		  found=no



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusting. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-)
>
>>
>>>
>>> This only works if all components play the game, and even then it is 
>>> difficult if you want to allow components to deregister themselves in the 
>>> middle of the execution. The problem is that a callback will be the "previous" 
>>> for some component, and when you want to remove a callback you have to 
>>> inform the "next" component in the callback chain to change its "previous".
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
>> could be dangerous since it takes control from the upper layers, but, 
>> conversely, trusting the upper layers to 'do the right thing' with the 
>> previous callback is probably too optimistic, esp. for layers that are not 
>> designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace the 
>> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
>> back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>
> It doesn't solve the problem that George alluded to - just because you 
> overwrite the callback, it doesn't mean that someone else won't overwrite you 
> when their component initializes. Only the last one wins - the rest of you 
> lose.
>
> I'm not sure how you guarantee that you win, which is why I'm unclear how 
> this callback can really work unless everyone agrees that only one place gets 
> it. Put that callback in a base function of a new error handling framework, 
> and then let everyone create components within that for handling desired 
> error responses?

Yep, that is a problem, but one that we can deal with in the immediate
case. Since OMPI is the only layer registering the callback, when I
replace it in OMPI I will have to make sure that no other place in
OMPI replaces the callback.

If at some point we need more than one callback above ORTE then we may
want to revisit this point. But since we only have one layer on top of
ORTE, it is the responsibility of that layer to be internally
consistent with regard to which callback it wants to be triggered.

If the layers above ORTE want more than one callback I would suggest
that that layer design some mechanism for coordinating these multiple
- possibly conflicting - callbacks (by the way this is policy
management, which can get complex fast as you add more interested
parties). Meaning that if OMPI wanted multiple callbacks to be active
at the same time, then OMPI would create a mechanism for managing
these callbacks, not ORTE. ORTE should just have one callback provided
to the upper layer, and keep it -simple-. If the upper layer wants to
toy around with something more complex it must manage the complexity
instead of artificially pushing it down to the ORTE layer.
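
(For concreteness, a sketch of such an OMPI-side mechanism - hypothetical
names throughout - that keeps ORTE's one-callback contract while
multiplexing above it:)
-
/* hypothetical OMPI-side multiplexer: ORTE still sees exactly one
 * callback; OMPI owns the list and its ordering policy */
#define OMPI_MAX_FAULT_CBS 8
static orte_fault_cbfunc_t fault_cbs[OMPI_MAX_FAULT_CBS];
static int num_fault_cbs = 0;

static void ompi_fault_mux(orte_process_name_t *proc)
{
    int i;
    for (i = 0; i < num_fault_cbs; i++) {
        fault_cbs[i](proc);
    }
}

int ompi_fault_cb_register(orte_fault_cbfunc_t cb)
{
    if (num_fault_cbs >= OMPI_MAX_FAULT_CBS) {
        return OMPI_ERR_OUT_OF_RESOURCE;
    }
    if (0 == num_fault_cbs) {
        /* first registration: hand ORTE its single callback */
        orte_errmgr.set_fault_callback(ompi_fault_mux);
    }
    fault_cbs[num_fault_cbs++] = cb;
    return OMPI_SUCCESS;
}
-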

-- Josh

>>
>> -- Josh
>>
>>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
 So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 Which is a callback that just calls abort (which is what we want to do
 by default):
 -
 void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
 }
 -

 This is what I want to replace. I do -not- want ompi to abort just
 because a process failed. So I need a way to replace or remove this
 callback, and put in my own callback that 'does the right thing'.

 The current patch allows me to overwrite the callback when I call:
 -
 orte_errmgr.set_fault_callback(&my_callback);
 -
 Which is fine with me.

 At the point I do not want my_callback to be active any more (say in
 MPI_Finalize) I would like to replace it with the old callback. To do
 so, with the patch's interface, I would have to know what the previous
 callback was and do:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 This comes at a slight maintenance burden since now there will be two
 places in the code that must explicitly reference
 'ompi_errhandler_runtime_callback' - if it ever changed then both
 sites would have to be updated.


 If you use the 'sigaction-like' interface then upon registration I
 would get the previous handler back (which would point to
 'ompi_errhandler_runtime_callback'), and I can store it for later:
 -
 orte_errmgr.set_fault_callba

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Okay, finally have time to sit down and review this. It looks pretty much 
identical to what was done in ORCM - we just kept "epoch" separate from the 
process name, and use multicast to notify all procs that someone failed. I do 
have a few questions/comments about your proposed patch:

1. I note that in some places you just set peer_name.epoch = proc_name.epoch, 
and in others you make the assignment by calling a new API 
orte_ess.proc_get_epoch(&proc_name). Ditto for proc_set_epoch. What are the 
rules for when each method should be used? Which leads to...

2. I'm puzzled as to why you are storing process state and epoch number in the 
modex as well as in the process name and orte_proc_t struct. This creates a bit 
of a race condition as the two will be out-of-sync for some (probably small) 
period of time, and looks like unnecessary duplication. Is there some reason 
for doing this? We are trying to eliminate duplicate storage because of the 
data confusion and memory issues, hence my question.

3. As a follow-on to #2, I am bothered that we now have the ESS storing proc 
state. That isn't the functional purpose of the ESS - that's a PLM function. Is 
there some reason for doing this in the ESS? Why aren't we just looking at the 
orte_proc_t for that proc and using its state field? I guess I can understand 
if you want to get that via an API (instead of having code to look up the proc_t 
in multiple places), but then let's put it in the PLM please. I note that it is 
only used in the binomial routing code, so why not just put a static function 
in there to get the state of a proc rather than creating another API? (A sketch 
of such a helper follows after #6.)

4. ess_base_open.c: the default orte_ess module appears to be missing an entry 
for proc_set_epoch.

5. I really don't think that notification of proc failure belongs in the 
orted_comm - messages notifying of proc failure should be received in the 
errmgr. This allows people who want to handle things differently (e.g., orcm) 
to create their own errmgr component(s) for daemons and HNP that 
send the messages over their desired messaging system, decide how they want to 
respond, etc. Putting it in orted_comm forces everyone to use only this one 
method, which conflicts with allowing freedom for others to explore alternative 
methods, and frankly, I don't see any strong reason that outweighs that 
limitation.

6. I don't think this errmgr_fault_callback registration is going to work, per 
my response to Josh's RFC. I'll leave the discussion in that thread.
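
(Regarding #3, a sketch of that static helper - written against the existing
orte_job_t/orte_proc_t structures, untested:)
-
/* sketch: resolve a proc's state from its orte_proc_t instead of
 * adding a new ESS API for it */
static orte_proc_state_t get_proc_state(orte_process_name_t *name)
{
    orte_job_t *jdata;
    orte_proc_t *proc;

    if (NULL == (jdata = orte_get_job_data_object(name->jobid))) {
        return ORTE_PROC_STATE_UNDEF;
    }
    proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs,
                                                     name->vpid);
    return (NULL == proc) ? ORTE_PROC_STATE_UNDEF : proc->state;
}
-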


On Jun 6, 2011, at 1:00 PM, George Bosilca wrote:

> WHAT: Allow the runtime to handle fail-stop failures for both runtime 
> (daemons) or application level processes. This patch extends the 
> orte_process_name_t structure with a field to store the process epoch (the 
> number of times it died so far), and add an application failure notification 
> callback function to be registered in the runtime. 
> 
> WHY: Necessary to correctly implement the error handling in the MPI 2.2 
> standard. In addition, such a resilient runtime is a cornerstone for any 
> level of fault tolerance support we want to provide in the future (such as 
> the MPI-3 Run-Through Stabilization or FT-MPI).
> 
> WHEN:
> 
> WHERE: Patch attached to this email, based on trunk r24747.
> TIMEOUT: 2 weeks from now, on Monday 20 June.
> 
> --
> 
> MORE DETAILS:
> 
> Currently the infrastructure required to enable any kind of fault tolerance 
> development in Open MPI (with the exception of checkpoint/restart) is 
> missing. However, before developing any fault tolerance support at the 
> application (MPI) level, we need to have a resilient runtime. The changes in 
> this patch address this lack of support and would allow anyone to implement a 
> fault tolerance protocol at the MPI layer without having to worry about 
> ORTE stabilization.
> 
> This patch will allow the runtime to drop any dead daemons, and re-route all 
> communications around the holes in order to __ALWAYS__ deliver a message as 
> long as the destination process is alive. The application is informed (via a 
> callback) about the loss of the processes with the same jobid. In this patch 
> we do not address the MPI_ERRORS_RETURN type of failures; we focused on the 
> MPI_ERRORS_ARE_FATAL ones. Moreover, we empowered the application level with the 
> decision, instead of making it down in the runtime.
> 
> NEW STUFF:
> 
> Epoch - A counter that tracks the number of times a process has been detected 
> to have terminated, either from a failure or an expected termination. After 
> the termination is detected, the HNP coordinates all other processes' 
> knowledge of the new epoch. Each ORTED will know the epoch of the other 
> processes in the job, but it will not actually store anything until the 
> epochs change. 
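> 
> (In struct terms, a sketch based on the description above - the epoch
> field's type name is an assumption:)
> -
> typedef struct {
>     orte_jobid_t jobid;
>     orte_vpid_t  vpid;
>     orte_epoch_t epoch;  /* new: count of detected terminations of
>                             this (jobid,vpid) */
> } orte_process_name_t;
> -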
> 
> Run-Through Stabilization - When an ORTED (or HNP) detects that another 
> process has terminated, it repairs the routing layer and informs the HNP. The 
> HNP tells all other proc

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Another problem with this patch, that I mentioned to Wesley and George
off list, is that it does not handle the case when mpirun/HNP is also
hosting processes that might fail. In my testing of the patch it
worked fine if mpirun/HNP was -not- hosting any processes, but once it
had to host processes then unexpected behavior occurred when a process
failed. So for those just listening to this thread, Wesley is working
on a revised patch to address this problem that he will post when it
is ready.


As far as the RML issue, doesn't the ORTE state machine branch handle
that case? If it does, then let's push the solution to that problem
until that branch comes around instead of solving it twice.

-- Josh


On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
> Something else you might want to address in here: the current code sends an 
> RML message from the proc calling abort to its local daemon telling the 
> daemon that we are exiting due to the app calling "abort". We needed to do 
> this because we wanted to flag the proc termination as one induced by the app 
> itself as opposed to something like a segfault or termination by signal.
>
> However, the problem is that the app may be calling abort from within an 
> event handler. Hence, the RML send (which is currently blocking) will never 
> complete once we no longer allow event lib recursion (coming soon). If we use 
> a non-blocking send, then we can't know for sure that the message has been 
> sent before we terminate.
>
> What we need is a non-messaging way of communicating that this was an ordered 
> abort as opposed to a segfault or other failure. Prior to the current method, 
> we had the app drop a file that the daemon looked for as an "abort marker", 
> but that was ugly as it sometimes caused us to not properly clean up the 
> session directory tree.
>
> I'm open to suggestions - perhaps it isn't actually all that critical for us 
> to distinguish "aborted by call to abort" from "aborted by signal", and we 
> can just have the app commit suicide via self-imposed SIGKILL? It is only the 
> message output to the user at the end of the job that differs - and since 
> MPI_Abort already provides a message indicating "we called abort", is it 
> really necessary that we have orte aware of that distinction?
>
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way too trusting. ;)
>>
>> It's the midwestern boy in me :)
>>
>>>
>>> This only works if all components play the game, and even then it is 
>>> difficult if you want to allow components to deregister themselves in the 
>>> middle of the execution. The problem is that a callback will be the "previous" 
>>> for some component, and when you want to remove a callback you have to 
>>> inform the "next" component in the callback chain to change its "previous".
>>
>> This is a fair point. I think hiding the ordering of callbacks in the errmgr 
>> could be dangerous since it takes control from the upper layers, but, 
>> conversely, trusting the upper layers to 'do the right thing' with the 
>> previous callback is probably too optimistic, esp. for layers that are not 
>> designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace the 
>> default OMPI callback when I am able to in MPI_Init, and, if I need to, swap 
>> back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>>
>> -- Josh
>>
>>>
>>> george.
>>>
>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>
 So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 Which is a callback that just calls abort (which is what we want to do
 by default):
 -
 void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
 }
 -

 This is what I want to replace. I do -not- want ompi to abort just
 because a process failed. So I need a way to replace or remove this
 callback, and put in my own callback that 'does the right thing'.

 The current patch allows me to overwrite the callback when I call:
 -
 orte_errmgr.set_fault_callback(&my_callback);
 -
 Which is fine with me.

 At the point I do not want my_callback to be active any more (say in
 MPI_Finalize) I would like to replace it with the old callback. To do
 so, with the patch's interface, I would have to know what the previous
 callback was and do:
 -
 orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
 -

 This comes at a slight maintenance burden since now there will be two
 places i

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Josh Hursey
Why would this patch result in zombied processes and poor cleanup?
When ORTE receives notification of a process terminating/aborting, it
triggers the termination of the job (without UTK's RFC), which
should ensure a clean shutdown. This patch just tells ORTE that a few
other processes should be the first to die, which will trigger the
same response in ORTE.

I guess I'm unclear about this concern since it should be a concern in
the current ORTE as well then. I agree that it will be a concern once
we have the OMPI layer handling error management triggered off of a
callback, but that is a different RFC.


Something that might help those listening to this thread. The current
behavior of MPI_Abort in OMPI results in the semantics of:
--
internal_MPI_Abort(MPI_COMM_SELF, exit_code)
--
regardless of the communicator actually passed to the MPI_Abort at the
application level. It should be:
--
internal_MPI_Abort(comm_provided, exit_code)
--

Semantically, this patch just makes the group actually being aborted
match the communicator provided. In practice, the job will
terminate when any process in the job calls abort - so the result (in
today's codebase) will be the same.
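
(Conceptually - with illustrative helper and errmgr entry-point names, not
the patch's actual code - the change is from aborting only the caller to
something like:)
-
/* sketch: name the procs in the communicator's local and remote
 * groups as the abort targets, instead of just the calling process */
int nprocs = ompi_comm_size(comm) + ompi_comm_remote_size(comm);
orte_process_name_t *procs = list_comm_proc_names(comm); /* hypothetical */
orte_errmgr.abort_peers(procs, nprocs); /* assumed entry point */
-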

-- Josh


On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain  wrote:
> I have no issue with uncommenting the code. However, I do see a future 
> littered with lots of zombied processes and complaints over poor cleanup 
> again
>
>
> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>
>> Ah I see what you are getting at now.
>>
>> The construction of the list of connected processes is something I, 
>> intentionally, did not modify from the current Open MPI code. The list is 
>> calculated based on the locally known set of local and remote process groups 
>> attached to the communicator. So this is the set of directly connected 
>> processes in the specified communicator known to the calling process at the 
>> OMPI level.
>>
>> ORTE is asked to abort this defined set of processes. Once those processes 
>> are terminated then ORTE needs to eventually inform all of the processes (in 
>> the jobid(s) specified - maybe other jobids too?) that these processes have 
>> failed/aborted. Upon notification of the failed/aborted processes the local 
>> process (at the OMPI level) needs to determine if that process loss is 
>> critical based upon the error handlers attached to communicators that it 
>> shares with the failed/aborted processes.  That should be handled in the 
>> callback from the errmgr at the OMPI level, since connectedness is an MPI 
>> construct. If the process failure/abort is critical to the local process, 
>> then upon notification the local process can call abort on the communicator 
>> affected.
>>
>> So this has the possibility for a rolling abort effect [the abort of one 
>> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From 
>> there (depending upon the error handlers at the user level) the system will 
>> eventually converge to either some stable subset of processes or all processes 
>> aborting, resulting in job termination.
>>
>> The rolling abort effect relies heavily upon the ability of the runtime to 
>> make sure that all process failures/aborts are eventually known to all alive 
>> processes. Since all alive processes will know of the failure/abort, each can 
>> then determine whether it is transitively affected by the failure based upon 
>> its local list of communicators and associated error handlers. But to 
>> complete this aspect of the abort procedure, we do need the callback 
>> mechanism from the runtime - but since ORTE (today) will kill the job for 
>> OMPI then it is not a big deal for end users since the job will terminate 
>> anyway. Once we have the callback, then we can finish tightening up the OMPI 
>> layer code.
>>
>> It is not perfect, but I think it does address the transitive nature of the 
>> connectivity of MPI processes by relying on the runtime to provide uniform 
>> notification of failures. I figure that we will need to look over this code 
>> again and verify that the implementation of MPI_Comm_disconnect and 
>> associated underpinnings do the 'right thing' with regard to updating the 
>> communicator structures. But I think that is best addressed as a second set 
>> of patches.
>>
>>
>> The goal of this patch is to put back in functionality that was commented 
>> out during the last reorganization of the errmgr. What will likely follow, 
>> once we have notification of failure/abort at the OMPI level, is a cleanup 
>> of the connected groups code paths.
>>
>>
>> -- Josh
>>
>>
>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>
>>> What I'm saying is that there is no reason to have any other type of 
>>> MPI_Abort if we are not able to compute the set of connected processes.
>>>
>>> With this RFC the processes in the communicator passed to MPI_Abort will abort. 
>>> Then the other processes in the same MPI_COMM_WORLD (in fact jobid) w

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:

> Another problem with this patch, that I mentioned to Wesley and George
> off list, is that it does not handle the case when mpirun/HNP is also
> hosting processes that might fail. In my testing of the patch it
> worked fine if mpirun/HNP was -not- hosting any processes, but once it
> had to host processes then unexpected behavior occurred when a process
> failed. So for those just listening to this thread, Wesley is working
> on a revised patch to address this problem that he will post when it
> is ready.

See my other response to the patch - I think we need to understand why we are 
storing state in multiple places as it can create unexpected behavior when 
things are out-of-sync.


> 
> 
> As far as the RML issue, doesn't the ORTE state machine branch handle
> that case? If it does, then let's push the solution to that problem
> until that branch comes around instead of solving it twice.

No, it doesn't - in fact, it's what breaks the current method. Because we no 
longer allow event recursion, the RML message never gets out of the app. Hence 
my question.

I honestly don't think we need to have orte be aware of the distinction between 
"aborted by cmd" and "aborted by signal" as the only diff is in the error 
message. There ought to be some other way of resolving this?


> 
> -- Josh
> 
> 
> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
>> Something else you might want to address in here: the current code sends an 
>> RML message from the proc calling abort to its local daemon telling the 
>> daemon that we are exiting due to the app calling "abort". We needed to do 
>> this because we wanted to flag the proc termination as one induced by the 
>> app itself as opposed to something like a segfault or termination by signal.
>> 
>> However, the problem is that the app may be calling abort from within an 
>> event handler. Hence, the RML send (which is currently blocking) will never 
>> complete once we no longer allow event lib recursion (coming soon). If we 
>> use a non-blocking send, then we can't know for sure that the message has 
>> been sent before we terminate.
>> 
>> What we need is a non-messaging way of communicating that this was an 
>> ordered abort as opposed to a segfault or other failure. Prior to the 
>> current method, we had the app drop a file that the daemon looked for as an 
>> "abort  marker", but that was ugly as it sometimes caused us to not properly 
>> cleanup the session directory tree.
>> 
>> I'm open to suggestions - perhaps it isn't actually all that critical for us 
>> to distinguish "aborted by call to abort" from "aborted by signal", and we 
>> can just have the app commit suicide via self-imposed SIGKILL? It is only 
>> the message output to the user at the end of the job that differs - and 
>> since MPI_Abort already provides a message indicating "we called abort", is 
>> it really necessary that we have orte aware of that distinction?
>> 
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusting. ;)
>>> 
>>> It's the midwestern boy in me :)
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be the "previous" 
 for some component, and when you want to remove a callback you have 
 to inform the "next" component in the callback chain to change its 
 "previous".
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a process failed. So

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:

> Why would this patch result in zombied processes and poor cleanup?
> When ORTE receive notification of a process terminating/aborting then
> it triggers the termination of the job (without UTK's RFC) which
> should ensure a clean shutdown. This patch just tells ORTE that a few
> other processes should be the first to die, which will trigger the
> same response in ORTE.
> 
> I guess I'm unclear about this concern since it should be a concern in
> the current ORTE as well then. I agree that it will be a concern once
> we have the OMPI layer handling error management triggered off of a
> callback, but that is a different RFC.

My comment was to "the future" - i.e., looking to the point where we get 
layered, rolling aborts.

I agree that this specific RFC won't change the current behavior, and as I 
said, I have no issue with it.


> 
> 
> Something that might help those listening to this thread. The current
> behavior of MPI_Abort in OMPI results in the semantics of:
> --
> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
> --
> regardless of the communicator actually passed to the MPI_Abort at the
> application level. It should be:
> --
> internal_MPI_Abort(comm_provided, exit_code)
> --
> 
> Semantically, this patch just makes the group actually being aborted
> match the communicator provided. In practice, the job will
> terminate when any process in the job calls abort - so the result (in
> today's codebase) will be the same.
> 
> -- Josh
> 
> 
> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain  wrote:
>> I have no issue with uncommenting the code. However, I do see a future 
>> littered with lots of zombied processes and complaints over poor cleanup 
>> again
>> 
>> 
>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>> 
>>> Ah I see what you are getting at now.
>>> 
>>> The construction of the list of connected processes is something I, 
>>> intentionally, did not modify from the current Open MPI code. The list is 
>>> calculated based on the locally known set of local and remote process 
>>> groups attached to the communicator. So this is the set of directly 
>>> connected processes in the specified communicator known to the calling 
>>> process at the OMPI level.
>>> 
>>> ORTE is asked to abort this defined set of processes. Once those processes 
>>> are terminated then ORTE needs to eventually inform all of the processes 
>>> (in the jobid(s) specified - maybe other jobids too?) that these processes 
>>> have failed/aborted. Upon notification of the failed/aborted processes the 
>>> local process (at the OMPI level) needs to determine if that process loss 
>>> is critical based upon the error handlers attached to communicators that it 
>>> shares with the failed/aborted processes.  That should be handled in the 
>>> callback from the errmgr at the OMPI level, since connectedness is an MPI 
>>> construct. If the process failure/abort is critical to the local process, 
>>> then upon notification the local process can call abort on the communicator 
>>> affected.
>>> 
>>> So this has the possibility for a rolling abort effect [the abort of one 
>>> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From 
>>> there (depending upon the error handlers at the user level) the system will 
>>> eventually converge to either some stable subset of processes or all 
>>> processes aborting, resulting in job termination.
>>> 
>>> The rolling abort effect relies heavily upon the ability of the runtime to 
>>> make sure that all process failures/aborts are eventually known to all alive 
>>> processes. Since all alive processes will know of the failure/abort, each can 
>>> then determine whether it is transitively affected by the failure based upon 
>>> its local list of communicators and associated error handlers. But to 
>>> complete this aspect of the abort procedure, we do need the callback 
>>> mechanism from the runtime - but since ORTE (today) will kill the job for 
>>> OMPI then it is not a big deal for end users since the job will terminate 
>>> anyway. Once we have the callback, then we can finish tightening up the 
>>> OMPI layer code.
>>> 
>>> It is not perfect, but I think it does address the transitive nature of the 
>>> connectivity of MPI processes by relying on the runtime to provide uniform 
>>> notification of failures. I figure that we will need to look over this code 
>>> again and verify that the implementation of MPI_Comm_disconnect and 
>>> associated underpinnings do the 'right thing' with regard to updating the 
>>> communicator structures. But I think that is best addressed as a second set 
>>> of patches.
>>> 
>>> 
>>> The goal of this patch is to put back in functionality that was commented 
>>> out during the last reorganization of the errmgr. What will likely follow, 
>>> once we have notification of failure/abort at the OMPI level, is a cleanup 
>>> of the connected groups co

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way too trusting. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all components play the game, and even then it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be the "previous" 
 for some component, and when you want to remove a callback you have 
 to inform the "next" component in the callback chain to change its 
 "previous".
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>> you when their component initializes. Only the last one wins - the rest of 
>> you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>> this callback can really work unless everyone agrees that only one place 
>> gets it. Put that callback in a base function of a new error handling 
>> framework, and then let everyone create components within that for handling 
>> desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.

I agree - I was just proposing one way of doing that in the MPI layer so you 
wouldn't have to play policeman on the rest of the code base to ensure nobody 
else inserts a callback without realizing they overwrote yours. I can envision, 
for example, UTK wanting to do something different from you, and perhaps 
committing a callback that unintentionally overrode yours.

Up to you...just making a suggestion.


> 
> -- Josh
> 
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a process failed. So I need a way to replace or remove this
> callback, and put in my own callback that 'does the right thing'.
> 
> The current patch allows me to overwrite the callback when I call:
> -
> orte_errmgr.set_fault_callback(&my_callback);
> -
> Which is fine with me.
> 
> At the point I do not want my_callback to be active any more (say in
> MPI_Finalize) I would like to replace it with the old callback. To do
> so, with the patch's interface, I would have to know what the previous
> callback was and do:
>>>

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain  wrote:
>
> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>
>> Another problem with this patch, that I mentioned to Wesley and George
>> off list, is that it does not handle the case when mpirun/HNP is also
>> hosting processes that might fail. In my testing of the patch it
>> worked fine if mpirun/HNP was -not- hosting any processes, but once it
>> had to host processes then unexpected behavior occurred when a process
>> failed. So for those just listening to this thread, Wesley is working
>> on a revised patch to address this problem that he will post when it
>> is ready.
>
> See my other response to the patch - I think we need to understand why we are 
> storing state in multiple places as it can create unexpected behavior when 
> things are out-of-sync.
>
>
>>
>>
>> As far as the RML issue, doesn't the ORTE state machine branch handle
>> that case? If it does, then let's push the solution to that problem
>> until that branch comes around instead of solving it twice.
>
> No, it doesn't - in fact, it's what breaks the current method. Because we no 
> longer allow event recursion, the RML message never gets out of the app. 
> Hence my question.
>
> I honestly don't think we need to have orte be aware of the distinction 
> between "aborted by cmd" and "aborted by signal" as the only diff is in the 
> error message. There ought to be some other way of resolving this?

MPI_Abort will need to tell ORTE which processes should be 'aborted by
signal' along with the calling process. So there needs to be a
mechanism for that as well. Not sure if I have a good solution to
this in mind just yet.

A thought though, in the state machine version, the process calling
MPI_Abort could post a message to the processing thread and return
from the callback. The callback would have a check at the bottom to
determine if MPI_Abort was triggered within the callback, and just
sleep. The processing thread would progress the RML message and once
finished call exit(). This implies that the application process has a
separate processing thread. But I think we might be able to post the
RML message in the callback, then wait for it to complete outside of
the callback before returning control to the user. :/ Interesting.
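
A rough sketch of that second idea - post the message inside the callback,
then spin on progress outside it before giving control back to the user. The
helpers here (post_abort_msg, message_complete, progress,
run_event_loop_once) are placeholders, not real ORTE/RML calls:
-
#include <stdbool.h>
#include <unistd.h>

extern void post_abort_msg(void);      /* placeholder: non-blocking send  */
extern bool message_complete(void);    /* placeholder: has it gone out?   */
extern void progress(void);            /* placeholder: e.g. opal_progress */
extern void run_event_loop_once(void); /* placeholder: may fire callback  */

static volatile bool abort_posted = false;

/* Inside the event callback: only post the message and set a flag; no
 * blocking here, since the event library cannot recurse. */
void event_callback(void)
{
    post_abort_msg();
    abort_posted = true;
}

int main(void)
{
    run_event_loop_once();
    /* Back outside the callback, it is safe to progress the send to
     * completion before terminating the process. */
    if (abort_posted) {
        while (!message_complete())
            progress();
        _exit(1);
    }
    return 0;
}
-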

-- Josh

>
>
>>
>> -- Josh
>>
>>
>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
>>> Something else you might want to address in here: the current code sends an 
>>> RML message from the proc calling abort to its local daemon telling the 
>>> daemon that we are exiting due to the app calling "abort". We needed to do 
>>> this because we wanted to flag the proc termination as one induced by the 
>>> app itself as opposed to something like a segfault or termination by signal.
>>>
>>> However, the problem is that the app may be calling abort from within an 
>>> event handler. Hence, the RML send (which is currently blocking) will never 
>>> complete once we no longer allow event lib recursion (coming soon). If we 
>>> use a non-blocking send, then we can't know for sure that the message has 
>>> been sent before we terminate.
>>>
>>> What we need is a non-messaging way of communicating that this was an 
>>> ordered abort as opposed to a segfault or other failure. Prior to the 
>>> current method, we had the app drop a file that the daemon looked for as an 
>>> "abort  marker", but that was ugly as it sometimes caused us to not 
>>> properly cleanup the session directory tree.
>>>
>>> I'm open to suggestion - perhaps it isn't actually all that critical for us 
>>> to distinguish "aborted by call to abort" from "aborted by signal", and we 
>>> can just have the app commit suicide via self-imposed SIGKILL? It is only 
>>> the message output  to the user at the end of the job that differs - and 
>>> since MPI_Abort already provides a message indicating "we called abort", is 
>>> it really necessary that we have orte aware of that distinction?
>>>
>>>
>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>

 On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:

> Well, you're way to trusty. ;)

 It's the midwestern boy in me :)

>
> This only works if all component play the game, and even then there it is 
> difficult if you want to allow components to deregister themselves in the 
> middle of the execution. The problem is that a callback will be previous 
> for some component, and that when you want to remove a callback you have 
> to inform the "next"  component on the callback chain to change its 
> previous.

 This is a fair point. I think hiding the ordering of callbacks in the 
 errmgr could be dangerous since it takes control from the upper layers, 
 but, conversely, trusting the upper layers to 'do the right thing' with 
 the previous callback is probably too optimistic, esp. for layers that are 
 not designed together.

 To that I would suggest that you leave the code as is - 

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Jeff Squyres
On Jun 10, 2011, at 5:16 AM, Matthias Jurenz wrote:

> There are different ways to fix the problem:
> 
> 1. Apply the attached patch on ltmain.sh.
> 
> This patch excludes the target library name from searching *.la libraries.

Does your patch work for vpath builds, too?  If so, isn't this something that 
should be submitted upstream?

> 2. Rename the VT's libutil
> 
> This would prevent name conflicts with dependency libraries.

This is my preference; can't it just be renamed to libvtutil or something?

> 3. Clear list of dependency libraries when building VT's libutil.
> 
> This could be done by adding LIBS= to the Makefile.am in 
> ompi/contrib/vt/vt/util/. The VT's libutil has no dependencies to other 
> libraries except libc.

That seems like it would work, but feels a bit hack-ish.

> 4. Perform "make clean" or remove ompi/contrib/vt/vt/util/libutil.la after re-
> configure.
> 
> Nonsense - it cannot be required from the user.

Agreed.

> My favorite suggestion is 1. It would be just another patch in addition to 
> the set of Libtool patches invoked by autogen.

Keep in mind that most (all?) of those are for handling older versions of the 
GNU Autotools, and/or for patches that have been submitted upstream but are not 
part of an official release yet.  Or, they are for v1.5.x where we have "locked 
in" the versions of the GNU autotools for the entire series and won't upgrade, 
even if newer versions fix the things we've put in patches for.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain  wrote:
>> 
>> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>> 
>>> Another problem with this patch, that I mentioned to Wesley and George
>>> off list, is that it does not handle the case when mpirun/HNP is also
>>> hosting processes that might fail. In my testing of the patch it
>>> worked fine if mpirun/HNP was -not- hosting any processes, but once it
>>> had to host processes then unexpected behavior occurred when a process
>>> failed. So for those just listening to this thread, Wesley is working
>>> on a revised patch to address this problem that he will post when it
>>> is ready.
>> 
>> See my other response to the patch - I think we need to understand why we 
>> are storing state in multiple places as it can create unexpected behavior 
>> when things are out-of-sync.
>> 
>> 
>>> 
>>> 
>>> As far as the RML issue, doesn't the ORTE state machine branch handle
>>> that case? If it does, then let's push the solution to that problem
>>> until that branch comes around instead of solving it twice.
>> 
>> No, it doesn't - in fact, it's what breaks the current method. Because we no 
>> longer allow event recursion, the RML message never gets out of the app. 
>> Hence my question.
>> 
>> I honestly don't think we need to have orte be aware of the distinction 
>> between "aborted by cmd" and "aborted by signal" as the only diff is in the 
>> error message. There ought to be some other way of resolving this?
> 
> MPI_Abort will need to tell ORTE which processes should be 'aborted by
> signal' along with the calling process. So there needs to be a
> mechanism for that as well. Not sure if I have a good solution to
> this in mind just yet.

Ah yes - that would require a communication anyway.

> 
> A thought though, in the state machine version, the process calling
> MPI_Abort could post a message to the processing thread and return
> from the callback. The callback would have a check at the bottom to
> determine if MPI_Abort was triggered within the callback, and just
> sleep. The processing thread would progress the RML message and once
> finished call exit(). This implies that the application process has a
> separate processing thread. But I think we might be able to post the
> RML message in the callback, then wait for it to complete outside of
> the callback before returning control to the user. :/ Interesting.

Could work, though it does require a thread. You would have to be tricky about 
it, though, as it is possible the call to "abort" could occur in an event 
handler. If you block in that handler waiting for the message to have been 
sent, it never will be, as the RML uses the event lib to trigger the actual 
send.

I may have a solution to the latter problem. For similar reasons, I've had to 
change the errmgr so it doesn't immediately process errors - otherwise, its 
actions become constrained by the question of "am I in an event handler or 
not". To remove the uncertainty, I'm rigging it so that all errmgr processing 
is done in an event - basically, reporting an error causes the errmgr to push 
the error into a pipe, which triggers an event that actually processes it.

Only way I could deal with the uncertainty. So if that mechanism is in place, 
the only thing you would have to do is (a) call abort, and then (b) cycle 
opal_progress until the errmgr.abort function callback occurred. Of course, we 
would then have to modify the errmgr so that abort took a callback function 
that it called when the app is free to exit.

No perfect solution, I fear.
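
For reference, a self-contained sketch of the pipe-plus-event pattern
described above, written against the libevent API (which Open MPI embeds);
the real errmgr wiring is of course more involved:
-
#include <event2/event.h>
#include <unistd.h>
#include <stdio.h>

static int errpipe[2];

/* Runs in event context: drain one error code and process it here, so
 * the reporter never has to process errors inside its own handler. */
static void on_error(evutil_socket_t fd, short what, void *arg)
{
    int code;
    (void)what;
    if (read(fd, &code, sizeof(code)) == (ssize_t)sizeof(code))
        printf("processing error %d in event context\n", code);
    event_base_loopbreak((struct event_base *)arg); /* demo: stop after one */
}

/* Callable from anywhere - even another event handler - since it only
 * pushes the error into the pipe. */
static void report_error(int code)
{
    if (write(errpipe[1], &code, sizeof(code)) < 0)
        perror("write");
}

int main(void)
{
    struct event_base *base = event_base_new();
    if (pipe(errpipe) != 0)
        return 1;
    struct event *ev = event_new(base, errpipe[0], EV_READ | EV_PERSIST,
                                 on_error, base);
    event_add(ev, NULL);
    report_error(42);          /* push the error into the pipe...        */
    event_base_dispatch(base); /* ...and the event actually processes it */
    event_free(ev);
    event_base_free(base);
    return 0;
}
-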



> 
> -- Josh
> 
>> 
>> 
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain  wrote:
 Something else you might want to address in here: the current code sends 
 an RML message from the proc calling abort to its local daemon telling the 
 daemon that we are exiting due to the app calling "abort". We needed to do 
 this because we wanted to flag the proc termination as one induced by the 
 app itself as opposed to something like a segfault or termination by 
 signal.
 
 However, the problem is that the app may be calling abort from within an 
 event handler. Hence, the RML send (which is currently blocking) will 
 never complete once we no longer allow event lib recursion (coming soon). 
 If we use a non-blocking send, then we can't know for sure that the 
 message has been sent before we terminate.
 
 What we need is a non-messaging way of communicating that this was an 
 ordered abort as opposed to a segfault or other failure. Prior to the 
 current method, we had the app drop a file that the daemon looked for as 
 an "abort  marker", but that was ugly as it sometimes caused us to not 
 properly cleanup the session directory tree.
 
 I'm open to suggestion - perhaps it isn't actually all that critical for 
 us to distinguish "aborted by call 

Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions

2011-06-10 Thread Josh Hursey
Reminder that this RFC goes in later today.

On Wed, Jun 8, 2011 at 10:32 AM, Jeff Squyres  wrote:
> This one's a no-brainer, folks.  :-)
>
> Josh [re]discovered that we didn't initially support Fortran interfaces for 
> the extensions when he was trying to make a complete implementation for an 
> MPI-3 Forum proposal.
>
> +1
>
>
> On Jun 8, 2011, at 10:11 AM, Josh Hursey wrote:
>
>> WHAT: Fortran 77 and 90 support for the Open MPI Extensions
>>
>> WHY: Trunk only supports C.
>>
>> WHERE: build system updates, ompi/mpiext
>>
>> WHEN: Open MPI trunk
>>
>> TIMEOUT: Friday, June 10, 2011 COB
>>
>> Details:
>> ---
>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>  https://bitbucket.org/jjhursey/ompi-ext-fortran
>>
>> The current Open MPI trunk supports only C interfaces to Open MPI
>> interface extensions. This branch adds support for f77 and f90.
>> Supporting these three language interfaces enables Fortran
>> applications to take advantage of available interface extensions.
>> Configure detects if the extension supports C, f77, and/or f90 and
>> takes the appropriate action. The C interfaces are required, and the
>> f77/f90 interfaces are optional. This fix does not require changes to
>> any existing extensions.
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain

On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way to trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all component play the game, and even then there it is 
 difficult if you want to allow components to deregister themselves in the 
 middle of the execution. The problem is that a callback will be previous 
 for some component, and that when you want to remove a callback you have 
 to inform the "next"  component on the callback chain to change its 
 previous.
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with the 
>>> previous callback is probably too optimistic, esp. for layers that are not 
>>> designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to replace 
>>> the default OMPI callback when I am able to in MPI_Init, and, if I need to, 
>>> swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>> you when their component initializes. Only the last one wins - the rest of 
>> you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>> this callback can really work unless everyone agrees that only one place 
>> gets it. Put that callback in a base function of a new error handling 
>> framework, and then let everyone create components within that for handling 
>> desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.

I was thinking some more about this, and wonder if we aren't over-complicating 
the question.

Do you need to actually control the sequence of callbacks, or just ensure that 
your callback gets called prior to the default one that calls abort?

Meeting the latter requirement is trivial - subsequent calls to 
register_callback get pushed onto the top of the callback list. Since the 
default one always gets registered first (which we can ensure since it occurs 
in MPI_Init), it will always be at the bottom of the callback list and hence 
called last.

Keeping that list in ORTE is simple and probably the right place to do it.

However, if you truly want to control the callback order in detail - then yeah, 
that should go up in OMPI. I sure don't want to write all that code :-)
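
A tiny sketch of that scheme - each registration pushes onto the head of a
list, so the first-registered default (fatal) callback stays at the tail and
fires last. Names are illustrative, not the actual ORTE interface:
-
#include <stdlib.h>

typedef void (*fault_cb_t)(int failed_vpid);

struct cb_node { fault_cb_t cb; struct cb_node *next; };
static struct cb_node *cb_head = NULL;

/* Each registration pushes onto the head of the list, so later
 * registrants run earlier and the default stays at the tail. */
static void register_callback(fault_cb_t cb)
{
    struct cb_node *n = malloc(sizeof(*n));
    if (n == NULL)
        return;
    n->cb = cb;
    n->next = cb_head;
    cb_head = n;
}

/* Walk the list head-to-tail: newest callback first, the default
 * (abort) callback - registered first, in MPI_Init - last. */
static void invoke_callbacks(int failed_vpid)
{
    for (struct cb_node *n = cb_head; n != NULL; n = n->next)
        n->cb(failed_vpid);
}
-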


> 
> -- Josh
> 
>>> 
>>> -- Josh
>>> 
 
 george.
 
 On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
 
> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
> -
> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
> -
> 
> Which is a callback that just calls abort (which is what we want to do
> by default):
> -
> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
> }
> -
> 
> This is what I want to replace. I do -not- want ompi to abort just
> because a process failed. So I need a way to replace or remove this
> callback, and put in my own callback that 'does the right thing'.
> 
> The current patch allows me to overwrite the callback when I call:
> --

Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions

2011-06-10 Thread Josh Hursey
Committed in r24772:
  https://svn.open-mpi.org/trac/ompi/changeset/24772

Thanks folks,
Josh

On Fri, Jun 10, 2011 at 12:56 PM, Josh Hursey  wrote:
> Reminder that this RFC goes in later today.
>
> On Wed, Jun 8, 2011 at 10:32 AM, Jeff Squyres  wrote:
>> This one's a no-brainer, folks.  :-)
>>
>> Josh [re]discovered that we didn't initially support Fortran interfaces for 
>> the extensions when he was trying to make a complete implementation for an 
>> MPI-3 Forum proposal.
>>
>> +1
>>
>>
>> On Jun 8, 2011, at 10:11 AM, Josh Hursey wrote:
>>
>>> WHAT: Fortran 77 and 90 support for the Open MPI Extensions
>>>
>>> WHY: Trunk only supports C.
>>>
>>> WHERE: build system updates, ompi/mpiext
>>>
>>> WHEN: Open MPI trunk
>>>
>>> TIMEOUT: Friday, June 10, 2011 COB
>>>
>>> Details:
>>> ---
>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>  https://bitbucket.org/jjhursey/ompi-ext-fortran
>>>
>>> The current Open MPI trunk supports only C interfaces to Open MPI
>>> interface extensions. This branch adds support for f77 and f90.
>>> Supporting these three language interfaces enables Fortran
>>> applications to take advantage of available interface extensions.
>>> Configure detects if the extension supports C, f77, and/or f90 and
>>> takes the appropriate action. The C interfaces are required, and the
>>> f77/f90 interfaces are optional. This fix does not require changes to
>>> any existing extensions.
>>>
>>> --
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Yeah I do not want the default fatal callback in OMPI. I want to
replace it with something that allows OMPI to continue running when
there are process failures (if the error handlers associated with the
communicators permit such an action). So having the default fatal
callback called after mine would not be useful, since I do not want
the fatal action.

As long as I can replace that callback, or selectively get rid of it
then I'm ok.
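
Under the patch's overwrite semantics, that replace-and-restore usage might
look like the following sketch (assuming the ORTE errmgr declarations are in
scope; my_fault_callback and the two hooks are hypothetical):
-
static void my_fault_callback(orte_process_name_t *proc)
{
    /* Decide, from the error handlers on the communicators shared with
     * 'proc', whether this failure is fatal - do NOT always abort. */
}

static void my_layer_init(void)
{
    /* MPI_Init installed the default; overwrite it here. */
    orte_errmgr.set_fault_callback(&my_fault_callback);
}

static void my_layer_finalize(void)
{
    /* Swap the default fatal callback back in. */
    orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
}
-
Note that the restore site must name ompi_errhandler_runtime_callback
explicitly - the maintenance burden raised earlier in the thread.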


On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>
> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>
>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>>>
>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>

 On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:

> Well, you're way to trusty. ;)

 It's the midwestern boy in me :)
>>>
>>> Still need to shake that corn out of your head... :-)
>>>

>
> This only works if all component play the game, and even then there it is 
> difficult if you want to allow components to deregister themselves in the 
> middle of the execution. The problem is that a callback will be previous 
> for some component, and that when you want to remove a callback you have 
> to inform the "next"  component on the callback chain to change its 
> previous.

 This is a fair point. I think hiding the ordering of callbacks in the 
 errmgr could be dangerous since it takes control from the upper layers, 
 but, conversely, trusting the upper layers to 'do the right thing' with 
 the previous callback is probably too optimistic, esp. for layers that are 
 not designed together.

 To that I would suggest that you leave the code as is - registering a 
 callback overwrites the existing callback. That will allow me to replace 
 the default OMPI callback when I am able to in MPI_Init, and, if I need 
 to, swap back in the default version at MPI_Finalize.

 Does that sound like a reasonable way forward on this design point?
>>>
>>> It doesn't solve the problem that George alluded to - just because you 
>>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>>> you when their component initializes. Only the last one wins - the rest of 
>>> you lose.
>>>
>>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>>> this callback can really work unless everyone agrees that only one place 
>>> gets it. Put that callback in a base function of a new error handling 
>>> framework, and then let everyone create components within that for handling 
>>> desired error responses?
>>
>> Yep, that is a problem, but one that we can deal with in the immediate
>> case. Since OMPI is the only layer registering the callback, when I
>> replace it in OMPI I will have to make sure that no other place in
>> OMPI replaces the callback.
>>
>> If at some point we need more than one callback above ORTE then we may
>> want to revisit this point. But since we only have one layer on top of
>> ORTE, it is the responsibility of that layer to be internally
>> consistent with regard to which callback it wants to be triggered.
>>
>> If the layers above ORTE want more than one callback I would suggest
>> that that layer design some mechanism for coordinating these multiple
>> - possibly conflicting - callbacks (by the way this is policy
>> management, which can get complex fast as you add more interested
>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>> at the same time, then OMPI would create a mechanism for managing
>> these callbacks, not ORTE. ORTE should just have one callback provided
>> to the upper layer, and keep it -simple-. If the upper layer wants to
>> toy around with something more complex it must manage the complexity
>> instead of artificially pushing it down to the ORTE layer.
>
> I was thinking some more about this, and wonder if we aren't 
> over-complicating the question.
>
> Do you need to actually control the sequence of callbacks, or just ensure 
> that your callback gets called prior to the default one that calls abort?
>
> Meeting the latter requirement is trivial - subsequent calls to 
> register_callback get pushed onto the top of the callback list. Since the 
> default one always gets registered first (which we can ensure since it occurs 
> in MPI_Init), it will always be at the bottom of the callback list and hence 
> called last.
>
> Keeping that list in ORTE is simple and probably the right place to do it.
>
> However, if you truly want to control the callback order in detail - then 
> yeah, that should go up in  OMPI. I sure don't want to write all that code :-)
>
>
>>
>> -- Josh
>>

 -- Josh

>
> george.
>
> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>
>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>> -
>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>> -

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
So why not have the callback return an int, and have your callback return "go 
no further"?


On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:

> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process failures (if the error handlers associated with the
> communicators permit such an action). So having the default fatal
> callback called after mine would not be useful, since I do not want
> the fatal action.
> 
> As long as I can replace that callback, or selectively get rid of it
> then I'm ok.
> 
> 
> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>> 
>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>> 
>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
 
 On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
 
> 
> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
> 
>> Well, you're way to trusty. ;)
> 
> It's the midwestern boy in me :)
 
 Still need to shake that corn out of your head... :-)
 
> 
>> 
>> This only works if all component play the game, and even then there it 
>> is difficult if you want to allow components to deregister themselves in 
>> the middle of the execution. The problem is that a callback will be 
>> previous for some component, and that when you want to remove a callback 
>> you have to inform the "next"  component on the callback chain to change 
>> its previous.
> 
> This is a fair point. I think hiding the ordering of callbacks in the 
> errmgr could be dangerous since it takes control from the upper layers, 
> but, conversely, trusting the upper layers to 'do the right thing' with 
> the previous callback is probably too optimistic, esp. for layers that 
> are not designed together.
> 
> To that I would suggest that you leave the code as is - registering a 
> callback overwrites the existing callback. That will allow me to replace 
> the default OMPI callback when I am able to in MPI_Init, and, if I need 
> to, swap back in the default version at MPI_Finalize.
> 
> Does that sound like a reasonable way forward on this design point?
 
 It doesn't solve the problem that George alluded to - just because you 
 overwrite the callback, it doesn't mean that someone else won't overwrite 
 you when their component initializes. Only the last one wins - the rest of 
 you lose.
 
 I'm not sure how you guarantee that you win, which is why I'm unclear how 
 this callback can really work unless everyone agrees that only one place 
 gets it. Put that callback in a base function of a new error handling 
 framework, and then let everyone create components within that for 
 handling desired error responses?
>>> 
>>> Yep, that is a problem, but one that we can deal with in the immediate
>>> case. Since OMPI is the only layer registering the callback, when I
>>> replace it in OMPI I will have to make sure that no other place in
>>> OMPI replaces the callback.
>>> 
>>> If at some point we need more than one callback above ORTE then we may
>>> want to revisit this point. But since we only have one layer on top of
>>> ORTE, it is the responsibility of that layer to be internally
>>> consistent with regard to which callback it wants to be triggered.
>>> 
>>> If the layers above ORTE want more than one callback I would suggest
>>> that that layer design some mechanism for coordinating these multiple
>>> - possibly conflicting - callbacks (by the way this is policy
>>> management, which can get complex fast as you add more interested
>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>> at the same time, then OMPI would create a mechanism for managing
>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>> to the upper layer, and keep it -simple-. If the upper layer wants to
>>> toy around with something more complex it must manage the complexity
>>> instead of artificially pushing it down to the ORTE layer.
>> 
>> I was thinking some more about this, and wonder if we aren't 
>> over-complicating the question.
>> 
>> Do you need to actually control the sequence of callbacks, or just ensure 
>> that your callback gets called prior to the default one that calls abort?
>> 
>> Meeting the latter requirement is trivial - subsequent calls to 
>> register_callback get pushed onto the top of the callback list. Since the 
>> default one always gets registered first (which we can ensure since it 
>> occurs in MPI_Init), it will always be at the bottom of the callback list 
>> and hence called last.
>> 
>> Keeping that list in ORTE is simple and probably the right place to do it.
>> 
>> However, if you truly want to control the callback order in detail - then 
>> yeah, that should go up in  OMPI. I sure don't want to write all that code 
>> :-)
>> 
>> 
>>> 
>>> -- Josh
>>> 
> 
> 

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
We could, but we could also just replace the callback. I will never
want to use it in my scenario, and if I did then I could just call it
directly instead of relying on the errmgr to do the right thing. So
why complicate the errmgr with additional complexity for something
that we don't need at the moment?

On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain  wrote:
> So why not have the callback return an int, and your callback returns "go no 
> further"?
>
>
> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>
>> Yeah I do not want the default fatal callback in OMPI. I want to
>> replace it with something that allows OMPI to continue running when
>> there are process failures (if the error handlers associated with the
>> communicators permit such an action). So having the default fatal
>> callback called after mine would not be useful, since I do not want
>> the fatal action.
>>
>> As long as I can replace that callback, or selectively get rid of it
>> then I'm ok.
>>
>>
>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
>>>
>>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>>>
 On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>
> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>
>>
>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>
>>> Well, you're way to trusty. ;)
>>
>> It's the midwestern boy in me :)
>
> Still need to shake that corn out of your head... :-)
>
>>
>>>
>>> This only works if all component play the game, and even then there it 
>>> is difficult if you want to allow components to deregister themselves 
>>> in the middle of the execution. The problem is that a callback will be 
>>> previous for some component, and that when you want to remove a 
>>> callback you have to inform the "next"  component on the callback chain 
>>> to change its previous.
>>
>> This is a fair point. I think hiding the ordering of callbacks in the 
>> errmgr could be dangerous since it takes control from the upper layers, 
>> but, conversely, trusting the upper layers to 'do the right thing' with 
>> the previous callback is probably too optimistic, esp. for layers that 
>> are not designed together.
>>
>> To that I would suggest that you leave the code as is - registering a 
>> callback overwrites the existing callback. That will allow me to replace 
>> the default OMPI callback when I am able to in MPI_Init, and, if I need 
>> to, swap back in the default version at MPI_Finalize.
>>
>> Does that sound like a reasonable way forward on this design point?
>
> It doesn't solve the problem that George alluded to - just because you 
> overwrite the callback, it doesn't mean that someone else won't overwrite 
> you when their component initializes. Only the last one wins - the rest 
> of you lose.
>
> I'm not sure how you guarantee that you win, which is why I'm unclear how 
> this callback can really work unless everyone agrees that only one place 
> gets it. Put that callback in a base function of a new error handling 
> framework, and then let everyone create components within that for 
> handling desired error responses?

 Yep, that is a problem, but one that we can deal with in the immediate
 case. Since OMPI is the only layer registering the callback, when I
 replace it in OMPI I will have to make sure that no other place in
 OMPI replaces the callback.

 If at some point we need more than one callback above ORTE then we may
 want to revisit this point. But since we only have one layer on top of
 ORTE, it is the responsibility of that layer to be internally
 consistent with regard to which callback it wants to be triggered.

 If the layers above ORTE want more than one callback I would suggest
 that that layer design some mechanism for coordinating these multiple
 - possibly conflicting - callbacks (by the way this is policy
 management, which can get complex fast as you add more interested
 parties). Meaning that if OMPI wanted multiple callbacks to be active
 at the same time, then OMPI would create a mechanism for managing
 these callbacks, not ORTE. ORTE should just have one callback provided
 to the upper layer, and keep it -simple-. If the upper layer wants to
 toy around with something more complex it must manage the complexity
 instead of artificially pushing it down to the ORTE layer.
>>>
>>> I was thinking some more about this, and wonder if we aren't 
>>> over-complicating the question.
>>>
>>> Do you need to actually control the sequence of callbacks, or just ensure 
>>> that your callback gets called prior to the default one that calls abort?
>>>
>>> Meeting the latter requirement is trivial - subsequent calls to 
>>> register_callback get pushed onto the top of the callback list. Since the 
>>> default one always gets registe

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
No issue - just trying to get ahead of the game instead of running into an 
issue later.

We can leave it for now.

On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote:

> We could, but we could also just replace the callback. I will never
> want to use it in my scenario, and if I did then I could just call it
> directly instead of relying on the errmgr to do the right thing. So
> why complicate the errmgr with additional complexity for something
> that we don't need at the moment?
> 
> On Fri, Jun 10, 2011 at 4:40 PM, Ralph Castain  wrote:
>> So why not have the callback return an int, and your callback returns "go no 
>> further"?
>> 
>> 
>> On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
>> 
>>> Yeah I do not want the default fatal callback in OMPI. I want to
>>> replace it with something that allows OMPI to continue running when
>>> there are process failures (if the error handlers associated with the
>>> communicators permit such an action). So having the default fatal
>>> callback called after mine would not be useful, since I do not want
>>> the fatal action.
>>> 
>>> As long as I can replace that callback, or selectively get rid of it
>>> then I'm ok.
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain  wrote:
 
 On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
 
> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain  wrote:
>> 
>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>> 
 Well, you're way to trusty. ;)
>>> 
>>> It's the midwestern boy in me :)
>> 
>> Still need to shake that corn out of your head... :-)
>> 
>>> 
 
 This only works if all component play the game, and even then there it 
 is difficult if you want to allow components to deregister themselves 
 in the middle of the execution. The problem is that a callback will be 
 previous for some component, and that when you want to remove a 
 callback you have to inform the "next"  component on the callback 
 chain to change its previous.
>>> 
>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>> errmgr could be dangerous since it takes control from the upper layers, 
>>> but, conversely, trusting the upper layers to 'do the right thing' with 
>>> the previous callback is probably too optimistic, esp. for layers that 
>>> are not designed together.
>>> 
>>> To that I would suggest that you leave the code as is - registering a 
>>> callback overwrites the existing callback. That will allow me to 
>>> replace the default OMPI callback when I am able to in MPI_Init, and, 
>>> if I need to, swap back in the default version at MPI_Finalize.
>>> 
>>> Does that sound like a reasonable way forward on this design point?
>> 
>> It doesn't solve the problem that George alluded to - just because you 
>> overwrite the callback, it doesn't mean that someone else won't 
>> overwrite you when their component initializes. Only the last one wins - 
>> the rest of you lose.
>> 
>> I'm not sure how you guarantee that you win, which is why I'm unclear 
>> how this callback can really work unless everyone agrees that only one 
>> place gets it. Put that callback in a base function of a new error 
>> handling framework, and then let everyone create components within that 
>> for handling desired error responses?
> 
> Yep, that is a problem, but one that we can deal with in the immediate
> case. Since OMPI is the only layer registering the callback, when I
> replace it in OMPI I will have to make sure that no other place in
> OMPI replaces the callback.
> 
> If at some point we need more than one callback above ORTE then we may
> want to revisit this point. But since we only have one layer on top of
> ORTE, it is the responsibility of that layer to be internally
> consistent with regard to which callback it wants to be triggered.
> 
> If the layers above ORTE want more than one callback I would suggest
> that that layer design some mechanism for coordinating these multiple
> - possibly conflicting - callbacks (by the way this is policy
> management, which can get complex fast as you add more interested
> parties). Meaning that if OMPI wanted multiple callbacks to be active
> at the same time, then OMPI would create a mechanism for managing
> these callbacks, not ORTE. ORTE should just have one callback provided
> to the upper layer, and keep it -simple-. If the upper layer wants to
> toy around with something more complex it must manage the complexity
> instead of artificially pushing it down to the ORTE layer.
 
 I was thinking some more about this, and wonder if we aren't 
 over-complicating the question.
 
 Do you need to actually control the sequen