Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
No issue - just trying to get ahead of the game instead of running into an issue later. We can leave it for now. On Jun 10, 2011, at 2:47 PM, Josh Hursey wrote: > We could, but we could also just replace the callback. I will never > want to use it in my scenario, and if I did then I could just

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
We could, but we could also just replace the callback. I will never want to use it in my scenario, and if I did then I could just call it directly instead of relying on the errmgr to do the right thing. So why complicate the errmgr with additional complexity for something that we don't need at the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
So why not have the callback return an int, and your callback returns "go no further"? On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote: > Yeah I do not want the default fatal callback in OMPI. I want to > replace it with something that allows OMPI to continue running when > there are process fai

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Yeah I do not want the default fatal callback in OMPI. I want to replace it with something that allows OMPI to continue running when there are process failures (if the error handlers associated with the communicators permit such an action). So having the default fatal callback called after mine wou

Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions

2011-06-10 Thread Josh Hursey
Committed in r24772: https://svn.open-mpi.org/trac/ompi/changeset/24772 Thanks folks, Josh On Fri, Jun 10, 2011 at 12:56 PM, Josh Hursey wrote: > Reminder that this RFC goes in later today. > > On Wed, Jun 8, 2011 at 10:32 AM, Jeff Squyres wrote: >> This one's a no-brainer, folks.  :-) >> >>

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusty. ;) >>> >>> It's the midwestern boy

Re: [OMPI devel] RFC: Fortran support in Open MPI Extensions

2011-06-10 Thread Josh Hursey
Reminder that this RFC goes in later today. On Wed, Jun 8, 2011 at 10:32 AM, Jeff Squyres wrote: > This one's a no-brainer, folks.  :-) > > Josh [re]discovered that we didn't initially support Fortran interfaces for > the extensions when he was trying to make a complete implementation for an >

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: >> >> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: >> >>> Another problem with this patch, that I mentioned to Wesley and George >>> off list, is that it does not handle the case when mpi

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Jeff Squyres
On Jun 10, 2011, at 5:16 AM, Matthias Jurenz wrote: > There are different ways to fix the problem: > > 1. Apply the attached patch on ltmain.sh. > > This patch excludes the target library name from searching *.la libraries. Does your patch work for vpath builds, too? If so, isn't this somethin

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain wrote: > > On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > >> Another problem with this patch, that I mentioned to Wesley and George >> off list, is that it does not handle the case when mpirun/HNP is also >> hosting processes that might fail. In my

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote: > On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: >> >> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: >> >>> >>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >>> Well, you're way too trusty. ;) >>> >>> It's the midwestern boy

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote: > Why would this patch result in zombied processes and poor cleanup? > When ORTE receives notification of a process terminating/aborting then > it triggers the termination of the job (without UTK's RFC) which > should ensure a clean shutdown. This pa

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote: > Another problem with this patch, that I mentioned to Wesley and George > off list, is that it does not handle the case when mpirun/HNP is also > hosting processes that might fail. In my testing of the patch it > worked fine if mpirun/HNP was -not-

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Josh Hursey
Why would this patch result in zombied processes and poor cleanup? When ORTE receives notification of a process terminating/aborting then it triggers the termination of the job (without UTK's RFC) which should ensure a clean shutdown. This patch just tells ORTE that a few other processes should be t

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
Another problem with this patch, that I mentioned to Wesley and George off list, is that it does not handle the case when mpirun/HNP is also hosting processes that might fail. In my testing of the patch it worked fine if mpirun/HNP was -not- hosting any processes, but once it had to host processes

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Okay, finally have time to sit down and review this. It looks pretty much identical to what was done in ORCM - we just kept "epoch" separate from the process name, and use multicast to notify all procs that someone failed. I do have a few questions/comments about your proposed patch: 1. I note

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Josh Hursey
On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain wrote: > > On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > >> >> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: >> >>> Well, you're way too trusty. ;) >> >> It's the midwestern boy in me :) > > Still need to shake that corn out of your head... :-

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Matthias Jurenz
It's a Libtool issue (once again) which occurs if a previous build is re-configured without subsequent "make clean" and the LIBC developer library "libutil" is added to LIBS. The error is simple to reproduce by the following steps: 1. configure 2. make -C ompi/contrib/vt/vt/util 3. configure or

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Matthias Jurenz
+ attachment On Friday 10 June 2011 12:00:49 you wrote: > It's a Libtool issue (once again) which occurs if a previous build is > re-configured without subsequent "make clean" and the LIBC developer library > "libutil" is added to LIBS. > > The error is simple to reproduce by the following steps

Re: [OMPI devel] RFC: Fix missing code in MPI_Abort functionality

2011-06-10 Thread Ralph Castain
I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote: > Ah I see what you are getting at now. > > The construction of the list of connected proce

Re: [OMPI devel] VT support for 1.5

2011-06-10 Thread Matthias Jurenz
It's a Libtool issue (once again) which occurs if a previous build is re-configured without subsequent "make clean" and the LIBC developer library "libutil" is added to LIBS. The error is simple to reproduce by the following steps: 1. configure 2. make -C ompi/contrib/vt/vt/util 3. configure or

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
Something else you might want to address in here: the current code sends an RML message from the proc calling abort to its local daemon telling the daemon that we are exiting due to the app calling "abort". We needed to do this because we wanted to flag the proc termination as one induced by the

Re: [OMPI devel] RFC: Resilient ORTE

2011-06-10 Thread Ralph Castain
On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote: > > On Jun 9, 2011, at 6:47 PM, George Bosilca wrote: > >> Well, you're way too trusty. ;) > > It's the midwestern boy in me :) Still need to shake that corn out of your head... :-) > >> >> This only works if all components play the game, and