Hello Gilles,

Thank you for your quick answer.  I confirm that if exec is used, both processes
immediately abort.

Now suppose that the line

echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK

is added to the end of dum.sh.
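
With this change, dum.sh ends up looking something like the following.  This is
only a sketch: the shebang line is an assumption, the pathname is the one quoted
in your reply, and our actual script may contain additional lines.

#!/bin/sh
# run the MPI test program (the line your reply suggests prefixing with exec)
/home/buildadina/src/aborttest02/aborttest02.exe
# new last line: report whether this rank's wrapper script continues after the abort
echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK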

If Example 2 is run with Open MPI 1.4.3, the output is

After aborttest: OMPI_COMM_WORLD_RANK=0

which shows that the shell script for the process with rank 0 continues after the
abort, but the shell script for the process with rank 1 does not.

If Example 2 is run with Open MPI 2.1.1, with exec used to invoke aborttest02.exe,
there is no such output, which shows that neither shell script continues after the
abort.

I prefer the Open MPI 1.4.3 behavior because our original application depends upon
it.  (Our original application will also work if both executables are aborted and
both shell scripts continue after the abort.)

It might be too much to expect, but is there a way to recover the Open MPI 1.4.3
behavior using Open MPI 2.1.1?

Sincerely,

Ted Sussman


On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:

> Ted,
> 
> 
> fwiw, the 'master' branch has the behavior you expect.
> 
> 
> meanwhile, you can simply edit your 'dum.sh' script and replace
> 
> /home/buildadina/src/aborttest02/aborttest02.exe
> 
> with
> 
> exec /home/buildadina/src/aborttest02/aborttest02.exe
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 6/15/2017 3:01 AM, Ted Sussman wrote:
> > Hello,
> >
> > My question concerns MPI_ABORT, indirect execution of executables by mpirun,
> > and Open MPI 2.1.1.  When mpirun runs executables directly, MPI_ABORT works as
> > expected, but when mpirun runs executables indirectly, MPI_ABORT does not work
> > as expected.
> >
> > If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT works as
> > expected in all cases.
> >
> > The examples given below have been simplified as far as possible to show the
> > issues.
> >
> > ---
> >
> > Example 1
> >
> > Consider an MPI job run in the following way:
> >
> > mpirun ... -app addmpw1
> >
> > where the appfile addmpw1 lists two executables:
> >
> > -n 1 -host gulftown ... aborttest02.exe
> > -n 1 -host gulftown ... aborttest02.exe
> >
> > The two executables are executed on the local node gulftown.  aborttest02
> > calls MPI_ABORT for rank 0, then sleeps.
> >
> > The above MPI job runs as expected.  Both processes immediately abort when
> > rank 0 calls MPI_ABORT.
> >
> > ---
> >
> > Example 2
> >
> > Now change the above example as follows:
> >
> > mpirun ... -app addmpw2
> >
> > where the appfile addmpw2 lists shell scripts:
> >
> > -n 1 -host gulftown ... dum.sh
> > -n 1 -host gulftown ... dum.sh
> >
> > dum.sh invokes aborttest02.exe, so aborttest02.exe is executed indirectly by
> > mpirun.
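> >
> > For reference, dum.sh is essentially a small wrapper script along the following
> > lines (only a sketch: the shebang line is an assumption, and the actual script
> > may contain more than this):
> >
> > #!/bin/sh
> > # invoke the MPI test program from inside a shell script, so that mpirun
> > # starts the shell rather than the MPI executable itself
> > /home/buildadina/src/aborttest02/aborttest02.exe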
> >
> > In this case, when rank 0 calls MPI_ABORT, only process 0 aborts; process 1
> > continues to run.  This behavior is unexpected.
> >
> > ---
> >
> > I have attached all files to this E-mail.  Since there are absolute pathnames
> > in the files, to reproduce my findings you will need to update the pathnames
> > in the appfiles and shell scripts.  To run example 1,
> >
> > sh run1.sh
> >
> > and to run example 2,
> >
> > sh run2.sh
> >
> > ---
> >
> > I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In Open MPI
> > 1.4.3, both examples work as expected.  Open MPI 2.0.3 has the same behavior
> > as Open MPI 2.1.1.
> >
> > ---
> >
> > I would prefer that Open MPI 2.1.1 abort both processes, even when the
> > executables are invoked indirectly by mpirun.  If there is an MCA setting
> > needed to make Open MPI 2.1.1 abort both processes, please let me know.
> >
> >
> > Sincerely,
> >
> > Theodore Sussman
> >
> >
> >     ---- File information -----------
> >       File:  config.log.bz2
> >       Date:  14 Jun 2017, 13:35
> >       Size:  146548 bytes.
> >       Type:  Binary
> >
> >     ---- File information -----------
> >       File:  ompi_info.bz2
> >       Date:  14 Jun 2017, 13:35
> >       Size:  24088 bytes.
> >       Type:  Binary
> >
> >     ---- File information -----------
> >       File:  aborttest02.tgz
> >       Date:  14 Jun 2017, 13:52
> >       Size:  4285 bytes.
> >       Type:  Binary
> >
> >

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
