Ted --

Sorry for jumping in late.  Here's my $0.02...

In the runtime, we can do 4 things:

1. Kill just the process that we forked.
2. Kill just the process(es) that call back and identify themselves as MPI 
processes (we don't track this right now, but we could add that functionality).
3. Union of #1 and #2.
4. Kill all processes (including any intermediate processes that are not 
covered by #1 and #2).

In Open MPI 2.x, #4 is the intended behavior.  There may be a bug or two that 
need to be fixed (e.g., in your last mail, I don't see offhand why it waits 
until the MPI process finishes sleeping), but we should be killing the process 
group, which -- unless any of the descendant processes have explicitly left the 
process group -- should hit the entire process tree.
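
To make the process-group point concrete, here is a small shell sketch.  It 
is not Open MPI code, just an illustration of what signalling a process group 
does.  It assumes a Linux-like system where the setsid utility is available; 
the LOG and PIDFILE scratch files and the "wrapper" / "grandchild" roles are 
made up for the demo.

```sh
#!/bin/sh
# Sketch only: shows that a signal sent to a process group reaches both
# the wrapper shell and the process underneath it. Not Open MPI code.
LOG=$(mktemp)       # hypothetical scratch file for this demo
PIDFILE=$(mktemp)   # the wrapper records its own PID (= PGID) here
export LOG PIDFILE

wait_for() {        # poll up to ~10 s for a line to appear in $LOG
    n=0
    until grep -q "$1" "$LOG" || [ "$n" -ge 10 ]; do
        sleep 1; n=$((n + 1))
    done
}

# "Wrapper" stands in for the shell script, "grandchild" for the MPI
# executable. setsid(1) makes the wrapper the leader of a new group.
setsid sh -c '
    echo $$ > "$PIDFILE"
    sh -c "trap \"echo grandchild-got-TERM >> \$LOG; exit 0\" TERM
           echo ready >> \$LOG
           sleep 30 & wait" &
    wait
' &

wait_for ready
pgid=$(cat "$PIDFILE")
kill -s TERM -- "-$pgid"     # negative PID = signal the whole group
wait_for grandchild-got-TERM
result=$(cat "$LOG")
echo "$result"
rm -f "$LOG" "$PIDFILE"
```

The wrapper has no trap, so it simply dies; the grandchild logs that it saw 
the signal even though nobody signalled its PID directly.  A process that 
moved itself into a different process group would not be reached this way.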

Sidenote: there's actually a way to be a bit more aggressive and do a better 
job of ensuring that we kill *all* processes (via creative use of 
PR_SET_CHILD_SUBREAPER), but that's basically a future enhancement / 
optimization.

I think Gilles and Ralph made a good point: if you want to be sure that you 
can clean up after an MPI process terminates (normally or abnormally), you 
should trap signals in your intermediate processes to catch what Open MPI's 
runtime sends, and therefore know that it is time to clean up.
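
A minimal sketch of such a wrapper, assuming a POSIX shell.  The cleanup 
function, the temporary wrapper file, and the "sleep 30" stand-in for the real 
executable are all hypothetical; OMPI_COMM_WORLD_RANK is the variable from 
your dum.sh.  The outer part of the script only simulates the runtime 
delivering SIGTERM so that the wrapper can be tried without mpirun.

```sh
#!/bin/sh
# Sketch only: a dum.sh-style wrapper that traps SIGTERM so it can clean
# up. The wrapper path, cleanup(), and "sleep 30" are made-up stand-ins.
WRAPPER=$(mktemp)
OUT=$(mktemp)

cat > "$WRAPPER" <<'EOF'
#!/bin/sh
cleanup() {
    # real cleanup (scratch files and so on) would go here
    echo "cleanup for rank ${OMPI_COMM_WORLD_RANK:-unknown}"
}
# Forward the signal to the application, clean up, exit with 128+15.
# If the runtime signals the whole process group, the forward is a no-op.
trap 'kill -s TERM "$app_pid" 2>/dev/null; cleanup; exit 143' TERM

sleep 30 &        # stand-in for the real executable; note: no exec
app_pid=$!
echo "started"
wait "$app_pid"   # returns early when a trapped signal arrives
status=$?
cleanup           # also runs when the application exits by itself
exit "$status"
EOF

# Simulate the runtime delivering SIGTERM to the wrapper.
sh "$WRAPPER" > "$OUT" &
wrapper_pid=$!
n=0
until grep -q started "$OUT" || [ "$n" -ge 10 ]; do
    sleep 1; n=$((n + 1))
done
kill -s TERM "$wrapper_pid"
wait "$wrapper_pid"
wrapper_status=$?
output=$(cat "$OUT")
echo "$output (exit status $wrapper_status)"
rm -f "$WRAPPER" "$OUT"
```

The important design choice is running the executable in the background and 
waiting on it, rather than using exec: with exec the shell is replaced and 
there is nothing left to run the trap.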

Hypothetically, this should work in all versions of Open MPI...?

I think Ralph made a pull request that adds an MCA param to change the default 
behavior from #4 to #1.

Note, however, that there is only a short delay between when Open MPI sends 
the SIGTERM and the SIGKILL, so this solution could be racy.  If you find that 
you're running out of time to clean up, we might be able to make the delay 
between the SIGTERM and SIGKILL configurable (e.g., via an MCA param).
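
The race can be sketched from the sender's side as follows.  GRACE is a 
hypothetical value chosen for the demo, not an existing Open MPI parameter, 
and the 5-second "cleanup" is deliberately longer than the grace period so 
that the SIGKILL cuts it short.

```sh
#!/bin/sh
# Sketch only: SIGTERM, a grace period, then SIGKILL. GRACE is a
# hypothetical value; it is not an Open MPI parameter.
GRACE=2
OUT=$(mktemp)

# Victim whose cleanup (5 s) takes longer than the grace period.
sh -c 'trap "kill \$! 2>/dev/null; sleep 5; echo cleanup-finished; exit 0" TERM
       echo started
       sleep 30 & wait' > "$OUT" &
victim=$!

n=0
until grep -q started "$OUT" || [ "$n" -ge 10 ]; do
    sleep 1; n=$((n + 1))
done

kill -s TERM "$victim"
sleep "$GRACE"
kill -s KILL "$victim" 2>/dev/null   # if still alive, cleanup is cut short
wait "$victim"
victim_status=$?
if grep -q cleanup-finished "$OUT"; then finished=yes; else finished=no; fi
echo "cleanup finished: $finished (exit status $victim_status)"
rm -f "$OUT"
```

With GRACE set larger than the cleanup time (say 10), the same script should 
report that the cleanup finished and an exit status of 0.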




> On Jun 16, 2017, at 10:08 AM, Ted Sussman <ted.suss...@adina.com> wrote:
> 
> Hello Gilles and Ralph,
> 
> Thank you for your advice so far.  I appreciate the time that you have spent 
> to educate me about the details of Open MPI.
> 
> But I think that there is something fundamental that I don't understand.  
> Consider Example 2 run with Open MPI 2.1.1. 
> 
> mpirun --> shell for process 0 -->  executable for process 0 --> MPI calls, 
> MPI_Abort
>        --> shell for process 1 -->  executable for process 1 --> MPI calls
> 
> After the MPI_Abort is called, ps shows that both shells are running, and 
> that the executable for process 1 is running (in this case, process 1 is 
> sleeping).  And mpirun does not exit until process 1 is finished sleeping.
> 
> I cannot reconcile this observed behavior with the statement
> 
> >     >     2.x: each process is put into its own process group upon launch. 
> > When we issue a
> >     >     "kill", we issue it to the process group. Thus, every child proc 
> > of that child proc will
> >     >     receive it. IIRC, this was the intended behavior.
> 
> I assume that, for my example, there are two process groups.  The process 
> group for process 0 contains the shell for process 0 and the executable for 
> process 0; and the process group for process 1 contains the shell for process 
> 1 and the executable for process 1.  So what does MPI_ABORT do?  MPI_ABORT 
> does not kill the process group for process 0, since the shell for process 0 
> continues.  And MPI_ABORT does not kill the process group for process 1, 
> since both the shell and executable for process 1 continue.
> 
> If I hit Ctrl-C after MPI_Abort is called, I get the message
> 
> mpirun: abort is already in progress.. hit ctrl-c again to forcibly terminate
> 
> but I don't need to hit Ctrl-C again because mpirun immediately exits.
> 
> Can you shed some light on all of this?
> 
> Sincerely,
> 
> Ted Sussman
> 
> 
> On 15 Jun 2017 at 14:44, r...@open-mpi.org wrote:
> 
> >
> > You have to understand that we have no way of knowing who is making MPI 
> > calls - all we see is
> > the proc that we started, and we know someone of that rank is running (but 
> > we have no way of
> > knowing which of the procs you sub-spawned it is).
> >
> > So the behavior you are seeking only occurred in some earlier release by 
> > sheer accident. Nor will
> > you find it portable as there is no specification directing that behavior.
> >
> > The behavior I’ve provided is to either deliver the signal to _all_ child 
> > processes (including
> > grandchildren etc.), or _only_ the immediate child of the daemon. It won’t 
> > do what you describe -
> > kill the MPI proc underneath the shell, but not the shell itself.
> >
> > What you can eventually do is use PMIx to ask the runtime to selectively 
> > deliver signals to
> > pid/procs for you. We don’t have that capability implemented just yet, I’m 
> > afraid.
> >
> > Meantime, when I get a chance, I can code an option that will record the 
> > pid of the subproc that
> > calls MPI_Init, and then lets you deliver signals to just that proc. No 
> > promises as to when that will
> > be done.
> >
> >
> >     On Jun 15, 2017, at 1:37 PM, Ted Sussman <ted.suss...@adina.com> wrote:
> >
> >     Hello Ralph,
> >
> >     I am just an Open MPI end user, so I will need to wait for the next 
> > official release.
> >
> >     mpirun --> shell for process 0 -->  executable for process 0 --> MPI 
> > calls
> >            --> shell for process 1 -->  executable for process 1 --> MPI 
> > calls
> >                                     ...
> >
> >     I guess the question is, should MPI_ABORT kill the executables or the 
> > shells?  I naively
> >     thought, that, since it is the executables that make the MPI calls, it 
> > is the executables that
> >     should be aborted by the call to MPI_ABORT.  Since the shells don't 
> > make MPI calls, the
> >     shells should not be aborted.
> >
> >     And users might have several layers of shells in between mpirun and the 
> > executable.
> >
> >     So now I will look for the latest version of Open MPI that has the 
> > 1.4.3 behavior.
> >
> >     Sincerely,
> >
> >     Ted Sussman
> >
> >     On 15 Jun 2017 at 12:31, r...@open-mpi.org wrote:
> >
> >     >
> >     > Yeah, things jittered a little there as we debated the “right” 
> > behavior. Generally, when we
> >     see that
> >     > happening it means that a param is required, but somehow we never 
> > reached that point.
> >     >
> >     > See if https://github.com/open-mpi/ompi/pull/3704  helps - if so, I 
> > can schedule it for the next
> >     2.x
> >     > release if the RMs agree to take it
> >     >
> >     > Ralph
> >     >
> >     >     On Jun 15, 2017, at 12:20 PM, Ted Sussman <ted.suss...@adina.com 
> > > wrote:
> >     >
> >     >     Thank you for your comments.
> >     >    
> >     >     Our application relies upon "dum.sh" to clean up after the 
> > process exits, either if the
> >     process
> >     >     exits normally, or if the process exits abnormally because of 
> > MPI_ABORT.  If the process
> >     >     group is killed by MPI_ABORT, this clean up will not be 
> > performed.  If exec is used to launch
> >     >     the executable from dum.sh, then dum.sh is terminated by the 
> > exec, so dum.sh cannot
> >     >     perform any clean up.
> >     >    
> >     >     I suppose that other user applications might work similarly, so 
> > it would be good to have an
> >     >     MCA parameter to control the behavior of MPI_ABORT.
> >     >    
> >     >     We could rewrite our shell script that invokes mpirun, so that 
> > the cleanup that is now done
> >     >     by
> >     >     dum.sh is done by the invoking shell script after mpirun exits.  
> > Perhaps this technique is the
> >     >     preferred way to clean up after mpirun is invoked.
> >     >    
> >     >     By the way, I have also tested with Open MPI 1.10.7, and Open MPI 
> > 1.10.7 has different
> >     >     behavior than either Open MPI 1.4.3 or Open MPI 2.1.1.  In this 
> > explanation, it is important to
> >     >     know that the aborttest executable sleeps for 20 sec.
> >     >    
> >     >     When running example 2:
> >     >    
> >     >     1.4.3: process 1 immediately aborts
> >     >     1.10.7: process 1 doesn't abort and never stops.
> >     >     2.1.1: process 1 doesn't abort, but stops after it is finished sleeping
> >     >    
> >     >     Sincerely,
> >     >    
> >     >     Ted Sussman
> >     >    
> >     >     On 15 Jun 2017 at 9:18, r...@open-mpi.org wrote:
> >     >
> >     >     Here is how the system is working:
> >     >    
> >     >     Master: each process is put into its own process group upon 
> > launch. When we issue a
> >     >     "kill", however, we only issue it to the individual process 
> > (instead of the process group
> >     >     that is headed by that child process). This is probably a bug as 
> > I don't believe that is
> >     >     what we intended, but set that aside for now.
> >     >    
> >     >     2.x: each process is put into its own process group upon launch. 
> > When we issue a
> >     >     "kill", we issue it to the process group. Thus, every child proc 
> > of that child proc will
> >     >     receive it. IIRC, this was the intended behavior.
> >     >    
> >     >     It is rather trivial to make the change (it only involves 3 lines 
> > of code), but I'm not sure
> >     >     of what our intended behavior is supposed to be. Once we clarify 
> > that, it is also trivial
> >     >     to add another MCA param (you can never have too many!) to allow 
> > you to select the
> >     >     other behavior.
> >     >    
> >     >
> >     >     On Jun 15, 2017, at 5:23 AM, Ted Sussman <ted.suss...@adina.com > 
> > wrote:
> >     >    
> >     >     Hello Gilles,
> >     >    
> >     >     Thank you for your quick answer.  I confirm that if exec is used, 
> > both processes
> >     >     immediately
> >     >     abort.
> >     >    
> >     >     Now suppose that the line
> >     >    
> >     >     echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK
> >     >    
> >     >     is added to the end of dum.sh.
> >     >    
> >     >     If Example 2 is run with Open MPI 1.4.3, the output is
> >     >    
> >     >     After aborttest: OMPI_COMM_WORLD_RANK=0
> >     >    
> >     >     which shows that the shell script for the process with rank 0 
> > continues after the
> >     >     abort,
> >     >     but that the shell script for the process with rank 1 does not 
> > continue after the
> >     >     abort.
> >     >    
> >     >     If Example 2 is run with Open MPI 2.1.1, with exec used to invoke
> >     >     aborttest02.exe, then
> >     >     there is no such output, which shows that both shell scripts do 
> > not continue after
> >     >     the abort.
> >     >    
> >     >     I prefer the Open MPI 1.4.3 behavior because our original 
> > application depends
> >     >     upon the
> >     >     Open MPI 1.4.3 behavior.  (Our original application will also 
> > work if both
> >     >     executables are
> >     >     aborted, and if both shell scripts continue after the abort.)
> >     >    
> >     >     It might be too much to expect, but is there a way to recover the 
> > Open MPI 1.4.3
> >     >     behavior
> >     >     using Open MPI 2.1.1?  
> >     >    
> >     >     Sincerely,
> >     >    
> >     >     Ted Sussman
> >     >    
> >     >    
> >     >     On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:
> >     >
> >     >     Ted,
> >     >    
> >     >    
> >     >     fwiw, the 'master' branch has the behavior you expect.
> >     >    
> >     >    
> >     >     meanwhile, you can simply edit your 'dum.sh' script and replace
> >     >    
> >     >     /home/buildadina/src/aborttest02/aborttest02.exe
> >     >    
> >     >     with
> >     >    
> >     >     exec /home/buildadina/src/aborttest02/aborttest02.exe
> >     >    
> >     >    
> >     >     Cheers,
> >     >    
> >     >    
> >     >     Gilles
> >     >    
> >     >    
> >     >     On 6/15/2017 3:01 AM, Ted Sussman wrote:
> >     >     Hello,
> >     >    
> >     >     My question concerns MPI_ABORT, indirect execution of
> >     >     executables by mpirun and Open
> >     >     MPI 2.1.1.  When mpirun runs executables directly, MPI_ABORT
> >     >     works as expected, but
> >     >     when mpirun runs executables indirectly, MPI_ABORT does not
> >     >     work as expected.
> >     >    
> >     >     If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT
> >     >     works as expected in all
> >     >     cases.
> >     >    
> >     >     The examples given below have been simplified as far as possible
> >     >     to show the issues.
> >     >    
> >     >     ---
> >     >    
> >     >     Example 1
> >     >    
> >     >     Consider an MPI job run in the following way:
> >     >    
> >     >     mpirun ... -app addmpw1
> >     >    
> >     >     where the appfile addmpw1 lists two executables:
> >     >    
> >     >     -n 1 -host gulftown ... aborttest02.exe
> >     >     -n 1 -host gulftown ... aborttest02.exe
> >     >    
> >     >     The two executables are executed on the local node gulftown.
> >     >      aborttest02 calls MPI_ABORT
> >     >     for rank 0, then sleeps.
> >     >    
> >     >     The above MPI job runs as expected.  Both processes immediately
> >     >     abort when rank 0 calls
> >     >     MPI_ABORT.
> >     >    
> >     >     ---
> >     >    
> >     >     Example 2
> >     >    
> >     >     Now change the above example as follows:
> >     >    
> >     >     mpirun ... -app addmpw2
> >     >    
> >     >     where the appfile addmpw2 lists shell scripts:
> >     >    
> >     >     -n 1 -host gulftown ... dum.sh
> >     >     -n 1 -host gulftown ... dum.sh
> >     >    
> >     >     dum.sh invokes aborttest02.exe.  So aborttest02.exe is executed
> >     >     indirectly by mpirun.
> >     >    
> >     >     In this case, the MPI job only aborts process 0 when rank 0 calls
> >     >     MPI_ABORT.  Process 1
> >     >     continues to run.  This behavior is unexpected.
> >     >    
> >     >     ----
> >     >    
> >     >     I have attached all files to this E-mail.  Since there are 
> > absolute
> >     >     pathnames in the files, to
> >     >     reproduce my findings, you will need to update the pathnames in 
> > the
> >     >     appfiles and shell
> >     >     scripts.  To run example 1,
> >     >    
> >     >     sh run1.sh
> >     >    
> >     >     and to run example 2,
> >     >    
> >     >     sh run2.sh
> >     >    
> >     >     ---
> >     >    
> >     >     I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In
> >     >     Open MPI 1.4.3, both
> >     >     examples work as expected.  Open MPI 2.0.3 has the same behavior
> >     >     as Open MPI 2.1.1.
> >     >    
> >     >     ---
> >     >    
> >     >     I would prefer that Open MPI 2.1.1 aborts both processes, even
> >     >     when the executables are
> >     >     invoked indirectly by mpirun.  If there is an MCA setting that is
> >     >     needed to make Open MPI
> >     >     2.1.1 abort both processes, please let me know.
> >     >    
> >     >    
> >     >     Sincerely,
> >     >    
> >     >     Theodore Sussman
> >     >    
> >     >    
> >     >     The following section of this message contains a file attachment
> >     >     prepared for transmission using the Internet MIME message format.
> >     >     If you are using Pegasus Mail, or any other MIME-compliant system,
> >     >     you should be able to save it or view it from within your mailer.
> >     >     If you cannot, please ask your system administrator for 
> > assistance.
> >     >    
> >     >       ---- File information -----------
> >     >         File:  config.log.bz2
> >     >         Date:  14 Jun 2017, 13:35
> >     >         Size:  146548 bytes.
> >     >         Type:  Binary
> >     >    
> >     >    
> >     >       ---- File information -----------
> >     >         File:  ompi_info.bz2
> >     >         Date:  14 Jun 2017, 13:35
> >     >         Size:  24088 bytes.
> >     >         Type:  Binary
> >     >    
> >     >    
> >     >       ---- File information -----------
> >     >         File:  aborttest02.tgz
> >     >         Date:  14 Jun 2017, 13:52
> >     >         Size:  4285 bytes.
> >     >         Type:  Binary
> >     >    
> >     >    
> >     >     _______________________________________________
> >     >     users mailing list
> >     >     users@lists.open-mpi.org
> >     >     https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> >     >    
> >     >
> >
> >       
> >
> 
>   


-- 
Jeff Squyres
jsquy...@cisco.com

