Hello Ralph,

I am just an Open MPI end user, so I will need to wait for the next official 
release.

mpirun --> shell for process 0 -->  executable for process 0 --> MPI calls
       --> shell for process 1 -->  executable for process 1 --> MPI calls
                               ...

I guess the question is, should MPI_ABORT kill the executables or the shells?
I naively thought that, since it is the executables that make the MPI calls, it is the
executables that should be aborted by the call to MPI_ABORT.  Since the shells don't
make MPI calls, the shells should not be aborted.

And users might have several layers of shells between mpirun and the
executable.

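For reference, here is one way to inspect the process tree while the job runs and see
which shells and executables end up in which process group (just a sketch; adjust the
grep pattern to match your own script and executable names):

    ps -e -o pid,pgid,ppid,comm | grep -E 'mpirun|dum|aborttest'
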
So now I will look for the latest version of Open MPI that has the 1.4.3 
behavior.

Sincerely,

Ted Sussman

On 15 Jun 2017 at 12:31, r...@open-mpi.org wrote:

>
> Yeah, things jittered a little there as we debated the "right" behavior. 
> Generally, when we see that
> happening it means that a param is required, but somehow we never reached 
> that point.
>
> See if https://github.com/open-mpi/ompi/pull/3704  helps - if so, I can 
> schedule it for the next 2.x
> release if the RMs agree to take it
>
> Ralph
>
>     On Jun 15, 2017, at 12:20 PM, Ted Sussman <ted.suss...@adina.com> wrote:
>
>     Thank you for your comments.
>
>     Our application relies upon "dum.sh" to clean up after the process exits, whether
>     the process exits normally or abnormally because of MPI_ABORT.  If the process
>     group is killed by MPI_ABORT, this cleanup will not be performed.  If exec is used
>     to launch the executable from dum.sh, then dum.sh is terminated by the exec, so
>     dum.sh cannot perform any cleanup.
>
>     I suppose that other user applications might work similarly, so it would be good to
>     have an MCA parameter to control the behavior of MPI_ABORT.
>
>     We could rewrite our shell script that invokes mpirun so that the cleanup that is
>     now done by dum.sh is done by the invoking shell script after mpirun exits.
>     Perhaps this technique is the preferred way to clean up after mpirun is invoked.
>
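>     For example, the top-level script could look something like this (a rough sketch;
>     the cleanup command is just a placeholder for whatever our application actually
>     needs to do):
>
>     #!/bin/sh
>     # Run the MPI job (with the same mpirun options we use today).
>     mpirun -app addmpw2
>     status=$?
>     # Cleanup that used to be at the end of dum.sh now happens here, after
>     # mpirun has returned, so it runs even if the process group was killed.
>     rm -f /tmp/aborttest02.rank*.scratch    # placeholder cleanup
>     exit $status
>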
>     By the way, I have also tested with Open MPI 1.10.7, which behaves differently
>     from both Open MPI 1.4.3 and Open MPI 2.1.1.  For this comparison, it is important
>     to know that the aborttest executable sleeps for 20 sec.
>
>     When running example 2:
>
>     1.4.3: process 1 immediately aborts.
>     1.10.7: process 1 doesn't abort and never stops.
>     2.1.1: process 1 doesn't abort, but stops after it finishes sleeping.
>
>     Sincerely,
>
>     Ted Sussman
>
>     On 15 Jun 2017 at 9:18, r...@open-mpi.org wrote:
>
>     Here is how the system is working:
>
>     Master: each process is put into its own process group upon launch. When we
>     issue a "kill", however, we only issue it to the individual process (instead of the
>     process group that is headed by that child process). This is probably a bug, as I
>     don't believe that is what we intended, but set that aside for now.
>
>     2.x: each process is put into its own process group upon launch. When we issue a
>     "kill", we issue it to the process group. Thus, every child proc of that child proc
>     will receive it. IIRC, this was the intended behavior.
>
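>     In shell terms, the difference is roughly the following (PIDs made up for
>     illustration; this is not the actual ORTE code, just the signalling pattern):
>
>     # master-like behavior: signal only the launched child (e.g. the dum.sh shell)
>     kill -s TERM 12345
>     # 2.x-like behavior: signal the whole process group headed by that child,
>     # so anything it spawned (e.g. aborttest02.exe) gets the signal as well
>     kill -s TERM -- -12345
>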
>     It is rather trivial to make the change (it only involves 3 lines of code), but I'm not
>     sure what our intended behavior is supposed to be. Once we clarify that, it is also
>     trivial to add another MCA param (you can never have too many!) to allow you to
>     select the other behavior.
>
>
>     On Jun 15, 2017, at 5:23 AM, Ted Sussman <ted.suss...@adina.com> wrote:
>
>     Hello Gilles,
>
>     Thank you for your quick answer.  I confirm that if exec is used, both processes
>     immediately abort.
>
>     Now suppose that the line
>
>     echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK
>
>     is added to the end of dum.sh.
>
>     If Example 2 is run with Open MPI 1.4.3, the output is
>
>     After aborttest: OMPI_COMM_WORLD_RANK=0
>
>     which shows that the shell script for the process with rank 0 continues after the
>     abort, but that the shell script for the process with rank 1 does not continue
>     after the abort.
>
>     If Example 2 is run with Open MPI 2.1.1, with exec used to invoke aborttest02.exe,
>     then there is no such output, which shows that neither shell script continues after
>     the abort.
>
>     I prefer the Open MPI 1.4.3 behavior because our original application depends
>     upon it.  (Our original application will also work if both executables are aborted
>     and both shell scripts continue after the abort.)
>
>     It might be too much to expect, but is there a way to recover the Open MPI 1.4.3
>     behavior using Open MPI 2.1.1?
>
>     Sincerely,
>
>     Ted Sussman
>
>
>     On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:
>
>     Ted,
>
>
>     fwiw, the 'master' branch has the behavior you expect.
>
>
>     meanwhile, you can simply edit your 'dum.sh' script and replace
>
>     /home/buildadina/src/aborttest02/aborttest02.exe
>
>     with
>
>     exec /home/buildadina/src/aborttest02/aborttest02.exe
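>
>     That way the shell process is replaced by aborttest02.exe, so the process that
>     mpirun launched is the MPI executable itself.  dum.sh would then look something
>     like this (a sketch; note that nothing placed after the exec line would run):
>
>     #!/bin/sh
>     # exec replaces this shell with the MPI executable
>     exec /home/buildadina/src/aborttest02/aborttest02.exe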
>
>
>     Cheers,
>
>
>     Gilles
>
>
>     On 6/15/2017 3:01 AM, Ted Sussman wrote:
>     Hello,
>
>     My question concerns MPI_ABORT, indirect execution of executables by mpirun,
>     and Open MPI 2.1.1.  When mpirun runs executables directly, MPI_ABORT works as
>     expected, but when mpirun runs executables indirectly, MPI_ABORT does not work
>     as expected.
>
>     If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT
>     works as expected in all
>     cases.
>
>     The examples given below have been simplified as far as possible
>     to show the issues.
>
>     ---
>
>     Example 1
>
>     Consider an MPI job run in the following way:
>
>     mpirun ... -app addmpw1
>
>     where the appfile addmpw1 lists two executables:
>
>     -n 1 -host gulftown ... aborttest02.exe
>     -n 1 -host gulftown ... aborttest02.exe
>
>     The two executables are executed on the local node gulftown.  aborttest02 calls
>     MPI_ABORT for rank 0, then sleeps.
>
>     The above MPI job runs as expected.  Both processes immediately
>     abort when rank 0 calls
>     MPI_ABORT.
>
>     ---
>
>     Example 2
>
>     Now change the above example as follows:
>
>     mpirun ... -app addmpw2
>
>     where the appfile addmpw2 lists shell scripts:
>
>     -n 1 -host gulftown ... dum.sh
>     -n 1 -host gulftown ... dum.sh
>
>     dum.sh invokes aborttest02.exe.  So aborttest02.exe is executed indirectly by
>     mpirun.
>
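>     For reference, dum.sh is essentially a wrapper of this form (a sketch; the real
>     script is among the attached files):
>
>     #!/bin/sh
>     # Launch the MPI executable indirectly: the process started by mpirun is this
>     # shell, and aborttest02.exe runs as its child.
>     /home/buildadina/src/aborttest02/aborttest02.exe
>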
>     In this case, the MPI job only aborts process 0 when rank 0 calls
>     MPI_ABORT.  Process 1
>     continues to run.  This behavior is unexpected.
>
>     ----
>
>     I have attached all files to this E-mail.  Since there are absolute
>     pathnames in the files, to
>     reproduce my findings, you will need to update the pathnames in the
>     appfiles and shell
>     scripts.  To run example 1,
>
>     sh run1.sh
>
>     and to run example 2,
>
>     sh run2.sh
>
>     ---
>
>     I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In
>     Open MPI 1.4.3, both
>     examples work as expected.  Open MPI 2.0.3 has the same behavior
>     as Open MPI 2.1.1.
>
>     ---
>
>     I would prefer that Open MPI 2.1.1 aborts both processes, even
>     when the executables are
>     invoked indirectly by mpirun.  If there is an MCA setting that is
>     needed to make Open MPI
>     2.1.1 abort both processes, please let me know.
>
>
>     Sincerely,
>
>     Theodore Sussman
>
>
>       ---- File information -----------
>         File:  config.log.bz2
>         Date:  14 Jun 2017, 13:35
>         Size:  146548 bytes.
>         Type:  Binary
>
>
>       ---- File information -----------
>         File:  ompi_info.bz2
>         Date:  14 Jun 2017, 13:35
>         Size:  24088 bytes.
>         Type:  Binary
>
>
>       ---- File information -----------
>         File:  aborttest02.tgz
>         Date:  14 Jun 2017, 13:52
>         Size:  4285 bytes.
>         Type:  Binary
>
>


_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
