Ralph,
  Thanks for looking into this.  I do not think that the behaviour needs to
change - it is correct.  However, for some reason this is not how things
were running for me - I wonder what the difference is.  I worked around
this by getting the PIDs of the MPI processes and delivering the signals
directly to them, so I was able to avoid the kill, and this was sufficient
for me.

Thanks again,
Rich


On 4/17/08 3:23 PM, "Ralph Castain" <r...@lanl.gov> wrote:

> The question was raised on this list a short while ago about potentially
> incorrect behavior by ORTE/OMPI in response to SIGUSR2 being sent to
> application procs. I have spent some time chasing this down, and it does
> -not- appear to be an issue within our systems.
> 
> What I have found is that if you send a SIGUSR1/2 to mpirun, mpirun and the
> daemons correctly transmit the provided signal to the application processes.
> Neither mpirun nor the daemons directly respond to it themselves.
> 
> 
> If the application process has defined its own signal handler to trap
> USR1/2, then the application process will successfully do so. Everything
> seems to work fine - the daemon does -not- get a callback, nor does it in
> any way react to the fact that the proc received this signal - unless the
> process' signal handler orders the process to exit! In this case, the
> environment reports to the orted that the process exited during a signal
> handler, which results in a terminated-by-signal status.
> 
> You can, of course, get around this by simply not exiting from within the
> signal handler. Instead, set a flag and return from the handler, then have
> an appropriate routine check the flag and exit. I have done that in several
> codes and would be happy to advise you on how to do it. With this technique,
> you clear the signal and the environment will not report you as
> terminated-by-signal.
> 
> 
> However, if the application process has -not- defined its own signal
> handler, some native environments terminate the process when it receives
> SIGUSR1/2! This occurred for me under SLURM on the odin cluster, and under
> TM on our RRZ cluster. I cannot say it is a universal situation and would
> welcome more feedback from people with access to other environments.
> 
> This termination is dutifully reported to the orted, which notes that the
> proc was terminated-by-signal. The orted does not check to see -which-
> signal was used to terminate the proc.
> 
> 
> By our own design requirements, the response to a termination-by-signal of a
> process is to abort the job. If we want to modify that, it would be simple
> to say "except if it was a SIGUSR1/2 signal". I have no issue with making
> that change, but please note that it -is- a change in our defined behavior,
> and a change from what has been our behavior since the beginning of the
> project.
> 
> Let me know if you want to change the design requirement and we can take
> care of it.
> 
> Thanks
> Ralph
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
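
For reference, the flag-and-return approach described in the quoted message
might look roughly like the C sketch below. It is not taken from ORTE/OMPI or
from anyone's application code; the names on_usr, got_usr and signal_child are
hypothetical, and it only assumes a POSIX system. A parent forks a child,
sends it SIGUSR1, and returns the wait status, so the two cases discussed
above (handler installed vs. default disposition) can be compared.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Flag written by the handler. volatile sig_atomic_t is the type that
 * portable C allows a signal handler to write. */
static volatile sig_atomic_t got_usr = 0;

static void on_usr(int signo)
{
    (void)signo;
    got_usr = 1;            /* record the event and return; no exit here */
}

/* Hypothetical helper: fork a child, send it SIGUSR1, return its wait status.
 * install_handler != 0: the child traps the signal, then exits from ordinary
 *                       code once it sees the flag.
 * install_handler == 0: the child uses the default disposition. */
static int signal_child(int install_handler)
{
    sigset_t block, old, wait_mask;
    int status = 0;

    /* Block SIGUSR1 before fork() so the child cannot receive it before
     * it has finished setting up. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old);

    pid_t pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    if (pid == 0) {                       /* child */
        if (install_handler) {
            struct sigaction sa;
            sa.sa_handler = on_usr;
            sigemptyset(&sa.sa_mask);
            sa.sa_flags = 0;
            sigaction(SIGUSR1, &sa, NULL);

            /* Wait with SIGUSR1 unblocked until the handler sets the flag. */
            wait_mask = old;
            sigdelset(&wait_mask, SIGUSR1);
            while (!got_usr)
                sigsuspend(&wait_mask);

            _exit(0);                     /* normal exit, outside the handler */
        }
        signal(SIGUSR1, SIG_DFL);         /* default action: terminate */
        sigprocmask(SIG_UNBLOCK, &block, NULL);
        for (;;)
            pause();
    }

    /* parent: deliver the signal directly to the child's pid */
    kill(pid, SIGUSR1);
    waitpid(pid, &status, 0);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return status;
}
```

With the handler installed, WIFEXITED(status) should be true for the returned
status. Without it, WIFSIGNALED(status) should be true and WTERMSIG(status)
identifies the signal, which is the information a launcher would need if it
wanted to treat SIGUSR1/2 differently from other terminating signals.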
