A question was raised on this list a short while ago about potentially
incorrect behavior by ORTE/OMPI when SIGUSR2 is sent to application
procs. I have spent some time chasing this down, and it does -not-
appear to be an issue within our systems.

What I have found is that if you send a SIGUSR1/2 to mpirun, mpirun and the
daemons correctly transmit the provided signal to the application processes.
Neither mpirun nor the daemons directly respond to it themselves.


If the application process has defined its own signal handler to trap
USR1/2, then that handler is invoked as expected. Everything seems to
work fine - the daemon does -not- get a callback, nor does it take any
action in response to the proc receiving the signal - unless the
process' signal handler orders the process to exit! In this case, the
environment reports to the orted that the process exited during a signal
handler, which results in a terminated-by-signal status.

You can, of course, get around this by simply not exiting from within the
signal handler. Instead, set a flag and return from the handler, then have
an appropriate routine check the flag and exit. I have done that in several
codes and would be happy to advise you on how to do it. With this technique,
you clear the signal and the environment will not report you as
terminated-by-signal.


However, if the application process has -not- defined its own signal
handler, some native environments terminate the process when it receives
SIGUSR1/2! This occurred for me under SLURM on the odin cluster, and under
TM on our RRZ cluster. I cannot say it is a universal situation and would
welcome more feedback from people with access to other environments.

This termination is dutifully reported to the orted, which notes that the
proc was terminated-by-signal. The orted does not check to see -which-
signal was used to terminate the proc.
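
For anyone who wants to reproduce this outside of ORTE, here is a rough
sketch in plain C. It is not the orted code; the helper names
terminating_signal, is_usr_signal, and run_child_raising are hypothetical.
It relies on the fact that the POSIX default action for SIGUSR1 and
SIGUSR2 is to terminate the process, and it shows how a parent can recover
-which- signal ended a child from the waitpid() status.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of the signal that terminated the child, or 0 if the
 * wait status does not describe a termination-by-signal. */
int terminating_signal(int status)
{
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Nonzero if signo is one of the two user-defined signals. */
int is_usr_signal(int signo)
{
    return signo == SIGUSR1 || signo == SIGUSR2;
}

/* Fork a child with no handler for signo, let it raise signo on itself,
 * and store the resulting wait status in *status.
 * Returns 0 on success, -1 on failure. */
int run_child_raising(int signo, int *status)
{
    pid_t pid = fork();

    if (pid < 0) return -1;
    if (pid == 0) {
        signal(signo, SIG_DFL);  /* make sure the default action applies */
        raise(signo);            /* default action for USR1/2: terminate */
        _exit(0);                /* reached only if the signal did not kill us */
    }
    if (waitpid(pid, status, 0) != pid) return -1;
    return 0;
}
```

A parent that wanted to treat SIGUSR1/2 differently from other signals
could test is_usr_signal(terminating_signal(status)) before deciding how
to react.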


By our own design requirements, the response to a termination-by-signal of a
process is to abort the job. If we want to modify that, it would be simple
to say "except if it was a SIGUSR1/2 signal". I have no issue with making
that change, but please note that it -is- a change in our defined behavior,
and a change from what has been our behavior since the beginning of the
project.

Let me know if you want to change the design requirement and we can take
care of it.

Thanks
Ralph

