I don't do any setting of process groups.  dum.sh just invokes the executable:

/..../aborttest10.exe
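
Spelled out as a script, that is all dum.sh contains (the path is elided above; the comment states my assumption that nothing touches the process group):

   #!/bin/sh
   # dum.sh -- wrapper started by mpirun; it runs the MPI executable as an
   # ordinary foreground child, which therefore inherits this shell's
   # process group (no setsid/setpgid anywhere)
   /..../aborttest10.exe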


On 19 Jun 2017 at 10:30, r...@open-mpi.org wrote:

> When you fork that process off, do you set its process group? Or is it in the 
> same process group as the shell script?
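
(For anyone who wants to check this directly: ps can print the process group of each pid, e.g.

   ps -o pid,ppid,pgid,comm -p 19565,19566,19567,19568

using the pids from the listing below; matching PGID values mean the processes are in the same process group.)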
> 
> > On Jun 19, 2017, at 10:19 AM, Ted Sussman <ted.suss...@adina.com> wrote:
> > 
> > If I replace the sleep with an infinite loop, I get the same behavior.  One
> > "aborttest" process remains after all the signals are sent.
> > 
> > On 19 Jun 2017 at 10:10, r...@open-mpi.org wrote:
> > 
> >> 
> >> That is typical behavior when you throw something into "sleep" - not much
> >> we can do about it, I think.
> >> 
> >>    On Jun 19, 2017, at 9:58 AM, Ted Sussman <ted.suss...@adina.com> wrote:
> >> 
> >>    Hello,
> >> 
> >>    I have rebuilt Open MPI 2.1.1 on the same computer, including --enable-debug.
> >> 
> >>    I have attached the abort test program aborttest10.tgz.  This version sleeps
> >>    for 5 sec before calling MPI_ABORT, so that I can check the pids using ps.
> >> 
> >>    This is what happens (see run2.sh.out).
> >> 
> >>    Open MPI invokes two instances of dum.sh.  Each instance of dum.sh invokes
> >>    aborttest10.exe.
> >> 
> >>    Pid    Process
> >>    -------------------
> >>    19565  dum.sh
> >>    19566  dum.sh
> >>    19567  aborttest10.exe
> >>    19568  aborttest10.exe
> >> 
> >>    When MPI_ABORT is called, Open MPI sends SIGCONT, SIGTERM and SIGKILL to
> >>    both instances of dum.sh (pids 19565 and 19566).
> >> 
> >>    ps shows that both the shell processes vanish, and that one of the
> >>    aborttest10.exe processes vanishes.  But the other aborttest10.exe remains
> >>    and continues until it is finished sleeping.
> >> 
> >>    Hope that this information is useful.
> >> 
> >>    Sincerely,
> >> 
> >>    Ted Sussman
> >> 
> >> 
> >> 
> >>    On 19 Jun 2017 at 23:06, gil...@rist.or.jp wrote:
> >> 
> >> 
> >>     Ted,
> >>     
> >>    Some traces are missing because you did not configure with --enable-debug.
> >>    I am afraid you have to do it (and you probably want to install that debug
> >>    version in another location, since its performance is not suitable for
> >>    production) in order to get all the logs.
> >>     
> >>    Cheers,
> >>     
> >>    Gilles
> >>     
> >>    ----- Original Message -----
> >>       Hello Gilles,
> >> 
> >>       I retried my example, with the same results as I observed before.  The
> >>       process with rank 1 does not get killed by MPI_ABORT.
> >> 
> >>       I have attached to this E-mail:
> >> 
> >>         config.log.bz2
> >>         ompi_info.bz2  (uses ompi_info -a)
> >>         aborttest09.tgz
> >> 
> >>       This testing is done on a computer running Linux 3.10.0.  This is a
> >>       different computer than the computer that I previously used for testing.
> >>       You can confirm that I am using Open MPI 2.1.1.
> >> 
> >>       tar xvzf aborttest09.tgz
> >>       cd aborttest09
> >>       sh run2.sh
> >> 
> >>       run2.sh contains the command
> >> 
> >>       /opt/openmpi-2.1.1-GNU/bin/mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 10 ./dum.sh
> >> 
> >>       The output from this run is in aborttest09/run2.sh.out.
> >> 
> >>       The output shows that the "default" component is selected by odls.
> >> 
> >>       The only messages from odls are: odls: launch spawning child ...  (two
> >>       messages).  There are no messages from odls with "kill", and I see no
> >>       SENDING SIGCONT / SIGKILL messages.
> >> 
> >>       I am not running from within any batch manager.
> >> 
> >>       Sincerely,
> >> 
> >>       Ted Sussman
> >> 
> >>       On 17 Jun 2017 at 16:02, gil...@rist.or.jp wrote:
> >> 
> >>    Ted,
> >> 
> >>    i do not observe the same behavior you describe with Open MPI 2.1.1
> >> 
> >>    # mpirun -np 2 -mca btl tcp,self --mca odls_base_verbose 5 ./abort.sh
> >> 
> >>    abort.sh 31361 launching abort
> >>    abort.sh 31362 launching abort
> >>    I am rank 0 with pid 31363
> >>    I am rank 1 with pid 31364
> >>    --------------------------------------------------------------------------
> >>    MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >>    with errorcode 1.
> >> 
> >>    NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >>    You may or may not see output from other processes, depending on
> >>    exactly when Open MPI kills them.
> >>    --------------------------------------------------------------------------
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],0]
> >>    [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],0]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31361 SUCCESS
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],1]
> >>    [linux:31356] [[18199,0],0] SENDING SIGCONT TO [[18199,1],1]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 18 TO PID 31362 SUCCESS
> >>    [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],0]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31361 SUCCESS
> >>    [linux:31356] [[18199,0],0] SENDING SIGTERM TO [[18199,1],1]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 15 TO PID 31362 SUCCESS
> >>    [linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],0]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31361 SUCCESS
> >>    [linux:31356] [[18199,0],0] SENDING SIGKILL TO [[18199,1],1]
> >>    [linux:31356] [[18199,0],0] odls:default:SENT KILL 9 TO PID 31362 SUCCESS
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc working on WILDCARD
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],0]
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,1],0] is not alive
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc checking child process [[18199,1],1]
> >>    [linux:31356] [[18199,0],0] odls:kill_local_proc child [[18199,1],1] is not alive
> >> 
> >> 
> >>    Open MPI did kill both shells, and they were indeed killed as evidenced by ps:
> >> 
> >>    #ps -fu gilles --forest
> >>    UID        PID  PPID  C STIME TTY          TIME CMD
> >>    gilles    1564  1561  0 15:39 ?        00:00:01 sshd: gilles@pts/1
> >>    gilles    1565  1564  0 15:39 pts/1    00:00:00  \_ -bash
> >>    gilles   31356  1565  3 15:57 pts/1    00:00:00      \_ /home/gilles/local/ompi-v2.x/bin/mpirun -np 2 -mca btl tcp,self --mca odls_base
> >>    gilles   31364     1  1 15:57 pts/1    00:00:00 ./abort
> >> 
> >> 
> >>    so trapping SIGTERM in your shell and manually killing the MPI task should
> >>    work (as Jeff explained, as long as the shell script is fast enough to do
> >>    that between SIGTERM and SIGKILL)
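
A minimal sketch of that trap-based cleanup, assuming dum.sh runs the executable in the background (the cleanup command is just a placeholder):

   #!/bin/sh
   /..../aborttest10.exe &    # run the MPI executable in the background
   pid=$!
   # on the SIGTERM that mpirun sends: forward it to the MPI process, do the
   # cleanup (placeholder command), and exit before the follow-up SIGKILL
   trap 'kill -s TERM "$pid"; rm -f /tmp/myapp.scratch; exit 1' TERM
   wait "$pid"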
> >> 
> >> 
> >>    if you observe a different behavior, please double check your Open MPI
> >>    version and post the outputs of the same commands.
> >> 
> >>    btw, are you running from a batch manager? If yes, which one?
> >> 
> >>    Cheers,
> >> 
> >>    Gilles
> >> 
> >>    ----- Original Message -----
> >>    Ted,
> >> 
> >>    if you
> >> 
> >>    mpirun --mca odls_base_verbose 10 ...
> >> 
> >>    you will see which processes get killed and how
> >> 
> >>    Best regards,
> >> 
> >> 
> >>    Gilles
> >> 
> >>    ----- Original Message -----
> >>    Hello Jeff,
> >> 
> >>    Thanks for your comments.
> >> 
> >>    I am not seeing behavior #4 on the two computers that I have tested on,
> >>    using Open MPI 2.1.1.
> >> 
> >>    I wonder if you can duplicate my results with the files that I have uploaded.
> >> 
> >>    Regarding what is the "correct" behavior, I am willing to modify my
> >>    application to correspond to Open MPI's behavior (whatever behavior the
> >>    Open MPI developers decide is best) -- provided that Open MPI does in fact
> >>    kill off both shells.
> >> 
> >>    So my highest priority now is to find out why Open MPI 2.1.1 does not kill
> >>    off both shells on my computer.
> >> 
> >>    Sincerely,
> >> 
> >>    Ted Sussman
> >> 
> >>      On 16 Jun 2017 at 16:35, Jeff Squyres (jsquyres) wrote:
> >> 
> >>    Ted --
> >> 
> >>    Sorry for jumping in late.  Here's my $0.02...
> >> 
> >>    In the runtime, we can do 4 things:
> >> 
> >>    1. Kill just the process that we forked.
> >>    2. Kill just the process(es) that call back and identify themselves as
> >>       MPI processes (we don't track this right now, but we could add that
> >>       functionality).
> >>    3. Union of #1 and #2.
> >>    4. Kill all processes (to include any intermediate processes that are not
> >>       included in #1 and #2).
> >> 
> >>    In Open MPI 2.x, #4 is the intended behavior.  There may be a bug or two
> >>    that need to get fixed (e.g., in your last mail, I don't see offhand why
> >>    it waits until the MPI process finishes sleeping), but we should be
> >>    killing the process group, which -- unless any of the descendant processes
> >>    have explicitly left the process group -- should hit the entire process
> >>    tree.
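
(Concretely, "left the process group" means a descendant called setsid()/setpgid(); a hypothetical wrapper line such as

   setsid /..../aborttest10.exe

would start the executable in a new session and process group, out of reach of a kill aimed at the wrapper's process group.)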
> >> 
> >>    Sidenote: there's actually a way to be a bit more aggressive and do a
> >>    better job of ensuring that we kill *all* processes (via creative use of
> >>    PR_SET_CHILD_SUBREAPER), but that's basically a future enhancement /
> >>    optimization.
> >> 
> >>    I think Gilles and Ralph raised a good point: if you want to be sure to be
> >>    able to do cleanup after an MPI process terminates (normally or
> >>    abnormally), you should trap signals in your intermediate processes to
> >>    catch what Open MPI's runtime throws, and therefore know that it is time
> >>    to clean up.
> >> 
> >>    Hypothetically, this should work in all versions of Open MPI...?
> >> 
> >>    I think Ralph made a pull request that adds an MCA param to change the
> >>    default behavior from #4 to #1.
> >> 
> >>    Note, however, that there's a little time between when Open MPI sends the
> >>    SIGTERM and the SIGKILL, so this solution could be racy.  If you find that
> >>    you're running out of time to clean up, we might be able to make the delay
> >>    between the SIGTERM and SIGKILL configurable (e.g., via MCA param).
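
(Whether a given build already exposes a timeout of that sort can be checked against the full parameter list; one hedged way to look:

   ompi_info --all | grep -i sigkill
)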
> >> 
> >> 
> >> 
> >> 
> >>    On Jun 16, 2017, at 10:08 AM, Ted Sussman <ted.suss...@adina.com> wrote:
> >> 
> >>    Hello Gilles and Ralph,
> >> 
> >>    Thank you for your advice so far.  I appreciate the time that you have
> >>    spent to educate me about the details of Open MPI.
> >> 
> >>    But I think that there is something fundamental that I don't understand.
> >>    Consider Example 2 run with Open MPI 2.1.1.
> >> 
> >>    mpirun --> shell for process 0 --> executable for process 0 --> MPI calls, MPI_Abort
> >>           --> shell for process 1 --> executable for process 1 --> MPI calls
> >> 
> >>    After the MPI_Abort is called, ps shows that both shells are running, and
> >>    that the executable for process 1 is running (in this case, process 1 is
> >>    sleeping).  And mpirun does not exit until process 1 is finished sleeping.
> >> 
> >>    I cannot reconcile this observed behavior with the statement
> >> 
> >>         >     2.x: each process is put into its own process group upon
> >>         >     launch. When we issue a "kill", we issue it to the process
> >>         >     group. Thus, every child proc of that child proc will receive
> >>         >     it. IIRC, this was the intended behavior.
> >> 
> >>    I assume that, for my example, there are two process groups.  The process
> >>    group for process 0 contains the shell for process 0 and the executable
> >>    for process 0; and the process group for process 1 contains the shell for
> >>    process 1 and the executable for process 1.  So what does MPI_ABORT do?
> >>    MPI_ABORT does not kill the process group for process 0, since the shell
> >>    for process 0 continues.  And MPI_ABORT does not kill the process group
> >>    for process 1, since both the shell and executable for process 1 continue.
> >> 
> >>    If I hit Ctrl-C after MPI_Abort is called, I get the message
> >> 
> >>    mpirun: abort is already in progress.. hit ctrl-c again to forcibly terminate
> >> 
> >>    but I don't need to hit Ctrl-C again, because mpirun immediately exits.
> >> 
> >>    Can you shed some light on all of this?
> >> 
> >>    Sincerely,
> >> 
> >>    Ted Sussman
> >> 
> >> 
> >>    On 15 Jun 2017 at 14:44, r...@open-mpi.org wrote:
> >> 
> >> 
> >>    You have to understand that we have no way of knowing who is making MPI
> >>    calls - all we see is the proc that we started, and we know someone of
> >>    that rank is running (but we have no way of knowing which of the procs you
> >>    sub-spawned it is).
> >> 
> >>    So the behavior you are seeking only occurred in some earlier release by
> >>    sheer accident.  Nor will you find it portable, as there is no
> >>    specification directing that behavior.
> >> 
> >>    The behavior I've provided is to either deliver the signal to _all_ child
> >>    processes (including grandchildren etc.), or _only_ the immediate child of
> >>    the daemon.  It won't do what you describe - kill the MPI proc underneath
> >>    the shell, but not the shell itself.
> >> 
> >>    What you can eventually do is use PMIx to ask the runtime to selectively
> >>    deliver signals to pids/procs for you.  We don't have that capability
> >>    implemented just yet, I'm afraid.
> >> 
> >>    Meantime, when I get a chance, I can code an option that will record the
> >>    pid of the subproc that calls MPI_Init, and then lets you deliver signals
> >>    to just that proc.  No promises as to when that will be done.
> >> 
> >> 
> >>         On Jun 15, 2017, at 1:37 PM, Ted Sussman <ted.sussman@adina.com> wrote:
> >> 
> >>         Hello Ralph,
> >> 
> >>         I am just an Open MPI end user, so I will need to wait for the next
> >>         official release.
> >> 
> >>         mpirun --> shell for process 0 --> executable for process 0 --> MPI calls
> >>                --> shell for process 1 --> executable for process 1 --> MPI calls
> >>                                         ...
> >> 
> >>         I guess the question is, should MPI_ABORT kill the executables or
> >>         the shells?  I naively thought that, since it is the executables that
> >>         make the MPI calls, it is the executables that should be aborted by
> >>         the call to MPI_ABORT.  Since the shells don't make MPI calls, the
> >>         shells should not be aborted.
> >> 
> >>         And users might have several layers of shells in between mpirun and
> >>         the executable.
> >> 
> >>         So now I will look for the latest version of Open MPI that has the
> >>         1.4.3 behavior.
> >> 
> >>         Sincerely,
> >> 
> >>         Ted Sussman
> >> 
> >>          On 15 Jun 2017 at 12:31, r...@open-mpi.org wrote:
> >> 
> >>         >
> >>         > Yeah, things jittered a little there as we debated the "right"
> >>         > behavior. Generally, when we see that happening it means that a
> >>         > param is required, but somehow we never reached that point.
> >>         >
> >>         > See if https://github.com/open-mpi/ompi/pull/3704 helps - if so, I
> >>         > can schedule it for the next 2.x release if the RMs agree to take it
> >>         >
> >>         > Ralph
> >>          >
> >>         >     On Jun 15, 2017, at 12:20 PM, Ted Sussman <ted.sussman@adina.com> wrote:
> >>          >
> >>         >     Thank you for your comments.
> >>          >   
> >>         >     Our application relies upon "dum.sh" to clean up after the
> >>         >     process exits, either if the process exits normally, or if the
> >>         >     process exits abnormally because of MPI_ABORT.  If the process
> >>         >     group is killed by MPI_ABORT, this clean up will not be
> >>         >     performed.  If exec is used to launch the executable from
> >>         >     dum.sh, then dum.sh is terminated by the exec, so dum.sh cannot
> >>         >     perform any clean up.
> >>         >   
> >>         >     I suppose that other user applications might work similarly,
> >>         >     so it would be good to have an MCA parameter to control the
> >>         >     behavior of MPI_ABORT.
> >>         >   
> >>         >     We could rewrite our shell script that invokes mpirun, so that
> >>         >     the cleanup that is now done by dum.sh is done by the invoking
> >>         >     shell script after mpirun exits.  Perhaps this technique is the
> >>         >     preferred way to clean up after mpirun is invoked.
> >>          >   
> >>         >     By the way, I have also tested with Open MPI 1.10.7, and Open
> >>         >     MPI 1.10.7 has different behavior than either Open MPI 1.4.3 or
> >>         >     Open MPI 2.1.1.  In this explanation, it is important to know
> >>         >     that the aborttest executable sleeps for 20 sec.
> >>         >   
> >>         >     When running example 2:
> >>         >   
> >>         >     1.4.3:  process 1 immediately aborts
> >>         >     1.10.7: process 1 doesn't abort and never stops
> >>         >     2.1.1:  process 1 doesn't abort, but stops after it is finished
> >>         >     sleeping
> >>         >   
> >>         >     Sincerely,
> >>         >   
> >>         >     Ted Sussman
> >>          >   
> >>         >     On 15 Jun 2017 at 9:18, r...@open-mpi.org wrote:
> >>         >
> >>         >     Here is how the system is working:
> >>          >   
> >>         >     Master: each process is put into its own process group upon
> >>         >     launch. When we issue a "kill", however, we only issue it to
> >>         >     the individual process (instead of the process group that is
> >>         >     headed by that child process). This is probably a bug, as I
> >>         >     don't believe that is what we intended, but set that aside for
> >>         >     now.
> >>          >   
> >>         >     2.x: each process is put into its own process group upon
> >>         >     launch. When we issue a "kill", we issue it to the process
> >>         >     group. Thus, every child proc of that child proc will receive
> >>         >     it. IIRC, this was the intended behavior.
> >>          >   
> >>         >     It is rather trivial to make the change (it only involves 3
> >>         >     lines of code), but I'm not sure of what our intended behavior
> >>         >     is supposed to be.  Once we clarify that, it is also trivial to
> >>         >     add another MCA param (you can never have too many!) to allow
> >>         >     you to select the other behavior.
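
(The shell-level equivalent of the two cases: kill -s TERM 31361 signals only that one process, while

   kill -s TERM -- -31361

with a negated id signals every member of process group 31361 -- the shell and any children still in its group. Pid 31361 is borrowed from the odls log earlier in this thread.)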
> >>         >   
> >>         >
> >>         >     On Jun 15, 2017, at 5:23 AM, Ted Sussman <ted.sussman@adina.com> wrote:
> >>         >   
> >>         >     Hello Gilles,
> >>         >   
> >>         >     Thank you for your quick answer.  I confirm that if exec is
> >>         >     used, both processes immediately abort.
> >>         >   
> >>         >     Now suppose that the line
> >>         >   
> >>         >     echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK
> >>         >   
> >>         >     is added to the end of dum.sh.
> >>         >   
> >>         >     If Example 2 is run with Open MPI 1.4.3, the output is
> >>         >   
> >>         >     After aborttest: OMPI_COMM_WORLD_RANK=0
> >>         >   
> >>         >     which shows that the shell script for the process with rank 0
> >>         >     continues after the abort, but that the shell script for the
> >>         >     process with rank 1 does not continue after the abort.
> >>         >   
> >>         >     If Example 2 is run with Open MPI 2.1.1, with exec used to
> >>         >     invoke aborttest02.exe, then there is no such output, which
> >>         >     shows that neither shell script continues after the abort.
> >>         >   
> >>         >     I prefer the Open MPI 1.4.3 behavior because our original
> >>         >     application depends upon the Open MPI 1.4.3 behavior.  (Our
> >>         >     original application will also work if both executables are
> >>         >     aborted, and if both shell scripts continue after the abort.)
> >>         >   
> >>         >     It might be too much to expect, but is there a way to recover
> >>         >     the Open MPI 1.4.3 behavior using Open MPI 2.1.1?
> >>         >   
> >>          >     Sincerely,
> >>         >   
> >>         >     Ted Sussman
> >>         >   
> >>         >   
> >>          >     On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:
> >>         >
> >>         >     Ted,
> >>         >   
> >>          >   
> >>         >     fwiw, the 'master' branch has the behavior you expect.
> >>         >   
> >>         >   
> >>         >     meanwhile, you can simply edit your 'dum.sh' script and replace
> >>         >   
> >>         >     /home/buildadina/src/aborttest02/aborttest02.exe
> >>         >   
> >>         >     with
> >>         >   
> >>         >     exec /home/buildadina/src/aborttest02/aborttest02.exe
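
The reason this helps: exec replaces the shell process with the executable, so the pid that Open MPI signals is the MPI process itself. A sketch of the edited dum.sh:

   #!/bin/sh
   # exec replaces this shell process with the executable; note that no
   # wrapper process survives, so no shell remains afterwards to do cleanup
   exec /home/buildadina/src/aborttest02/aborttest02.exe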
> >>          >   
> >>         >   
> >>         >     Cheers,
> >>         >   
> >>         >   
> >>         >     Gilles
> >>         >   
> >>          >   
> >>         >     On 6/15/2017 3:01 AM, Ted Sussman wrote:
> >>          >     Hello,
> >>         >   
> >>         >     My question concerns MPI_ABORT, indirect execution of
> >>         >     executables by mpirun, and Open MPI 2.1.1.  When mpirun runs
> >>         >     executables directly, MPI_ABORT works as expected, but when
> >>         >     mpirun runs executables indirectly, MPI_ABORT does not work as
> >>         >     expected.
> >>         >   
> >>         >     If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT
> >>         >     works as expected in all cases.
> >>         >   
> >>         >     The examples given below have been simplified as far as
> >>         >     possible to show the issues.
> >>         >   
> >>         >     ---
> >>         >   
> >>          >     Example 1
> >>         >   
> >>          >     Consider an MPI job run in the following way:
> >>         >   
> >>          >     mpirun ... -app addmpw1
> >>         >   
> >>         >     where the appfile addmpw1 lists two executables:
> >>         >   
> >>         >     -n 1 -host gulftown ... aborttest02.exe
> >>         >     -n 1 -host gulftown ... aborttest02.exe
> >>          >   
> >>         >     The two executables are executed on the local node gulftown.
> >>         >     aborttest02 calls MPI_ABORT for rank 0, then sleeps.
> >>         >   
> >>         >     The above MPI job runs as expected.  Both processes immediately
> >>         >     abort when rank 0 calls MPI_ABORT.
> >>         >   
> >>          >     ---
> >>         >   
> >>          >     Example 2
> >>         >   
> >>         >     Now change the above example as follows:
> >>         >   
> >>         >     mpirun ... -app addmpw2
> >>         >   
> >>         >     where the appfile addmpw2 lists shell scripts:
> >>         >   
> >>         >     -n 1 -host gulftown ... dum.sh
> >>         >     -n 1 -host gulftown ... dum.sh
> >>         >   
> >>         >     dum.sh invokes aborttest02.exe.  So aborttest02.exe is executed
> >>         >     indirectly by mpirun.
> >>         >   
> >>         >     In this case, the MPI job only aborts process 0 when rank 0
> >>         >     calls MPI_ABORT.  Process 1 continues to run.  This behavior is
> >>         >     unexpected.
> >>         >   
> >>         >     ----
> >>          >   
> >>         >     I have attached all files to this E-mail.  Since there are
> >>         >     absolute pathnames in the files, to reproduce my findings, you
> >>         >     will need to update the pathnames in the appfiles and shell
> >>         >     scripts.  To run example 1,
> >>          >   
> >>         >     sh run1.sh
> >>          >   
> >>         >     and to run example 2,
> >>         >   
> >>         >     sh run2.sh
> >>         >   
> >>          >     ---
> >>         >   
> >>         >     I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In
> >>         >     Open MPI 1.4.3, both examples work as expected.  Open MPI 2.0.3
> >>         >     has the same behavior as Open MPI 2.1.1.
> >>         >   
> >>         >     ---
> >>          >   
> >>         >     I would prefer that Open MPI 2.1.1 aborts both processes, even
> >>         >     when the executables are invoked indirectly by mpirun.  If
> >>         >     there is an MCA setting that is needed to make Open MPI 2.1.1
> >>         >     abort both processes, please let me know.
> >>          >   
> >>         >   
> >>         >     Sincerely,
> >>         >   
> >>         >     Theodore Sussman
> >>          >   
> >>         >   
> >>         >     (Attachment: config.log.bz2)
> >>         >   
> >>          >   
> >>         >     (Attachment: ompi_info.bz2)
> >>          >   
> >>         >   
> >>         >     (Attachment: aborttest02.tgz)
> >>         >   
> >>          >   
> >> 
> >>    --
> >>    Jeff Squyres
> >>    jsquy...@cisco.com
> >> 
> >>    (Attachment: aborttest10.tgz)
> >> 


