I'm seeing different behaviour between Open MPI 1.8.4 and 2.0.1 with
regard to signal propagation.

With version 1.8.4 mpirun seems to propagate SIGTERM to the tasks it starts
which enables the tasks to handle SIGTERM.

In version 2.0.1 mpirun does not seem to propagate SIGTERM and instead I
suspect it's sending SIGKILL immediately. Because the child tasks are not
given a chance to handle SIGTERM they end up orphaning their child
processes.

I have a pretty simple reproducer which consists of:

   1. A simple MPI application that sleeps for a number of seconds.
   2. A simple bash script which launches mpirun.
   3. A second bash script, run as the MPI task, which launches the
   'child' MPI application (the 'sleep' binary).

Both scripts launch their children in the background, and 'wait' on
completion. They both install signal handlers for SIGTERM.

When SIGTERM is sent to the top level script it is explicitly propagated to
'mpirun' via the signal handler.
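
A minimal sketch of this forwarding pattern, for illustration only (these
are not the actual reproducer scripts, and the function name
run_and_forward_term is made up here). It starts a command in the
background, installs a SIGTERM handler that passes the signal on to that
command, and then waits for it:

```shell
#!/bin/bash
# Hypothetical helper: run a command in the background and forward
# SIGTERM to it, then return the command's exit status.
run_and_forward_term() {
    "$@" &
    local child=$!

    # The PID is expanded now, so the handler does not depend on
    # variable scope at the time the signal arrives.
    trap "kill -TERM $child 2>/dev/null" TERM

    wait "$child"
    local status=$?

    # 'wait' returns a status above 128 when a trapped signal interrupts
    # it, so wait once more to collect the child's real exit status.
    if [ "$status" -gt 128 ]; then
        wait "$child"
        status=$?
    fi

    trap - TERM
    return "$status"
}
```

Under these assumptions the top-level script would call something like
`run_and_forward_term mpirun -np 2 ./wrapper.sh` (wrapper.sh being a
placeholder name for the second script), and the second script would call
the same helper with the 'sleep' binary.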

In Open MPI 1.8.4 SIGTERM is propagated to the child MPI tasks which in
turn explicitly propagate the signal to the child binary processes.

In Open MPI 2.0.1 I see no evidence that SIGTERM is propagated to the child
MPI tasks. Instead those tasks are killed and their children (the
application binaries) are orphaned.

Is the difference in behaviour between the two versions expected?
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
