What happens if I deliver SIGUSR2 to mpirun?  What I observe (for both
ssh/rsh and torque) is that if I deliver SIGUSR2 to mpirun, the signal does
get propagated to the MPI procs, which do invoke the signal handler I
registered, but the job is terminated right after that.  However, if I
deliver the signal directly to the MPI procs, the signal handler is invoked
and the job continues to run.

So, I think that what was intended to happen is the correct thing, but for
some reason it is not happening.
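
To make the intended behavior concrete: the check Ralph describes below,
plus the exception he offers to add, might look roughly like this.  It is
only a sketch and not the ORTE code; child_death_signal() and
should_abort_job() are hypothetical names, and tolerating SIGUSR1/SIGUSR2
is an assumed policy.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: wait for a child and report how it ended.
 * Returns the signal number that killed it, 0 if it exited on its own,
 * or -1 if waitpid() failed. */
static int child_death_signal(pid_t pid)
{
    int status;
    pid_t r;

    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);   /* proc terminated due to a signal */
    }
    return 0;                      /* normal exit */
}

/* Hypothetical policy: abort the job when a proc died due to a signal,
 * except for the user signals SIGUSR1 and SIGUSR2. */
static int should_abort_job(int death_signal)
{
    if (death_signal <= 0) {
        return 0;   /* not killed by a signal: not this policy's decision */
    }
    return death_signal != SIGUSR1 && death_signal != SIGUSR2;
}
```

A proc that catches SIGUSR2 and keeps running never reaches this path at
all, which is why I would expect the job to survive in that case.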

Rich


On 4/8/08 1:47 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

> I found what Pak said a little confusing as the wait_daemon function doesn't
> actually receive a signal itself - it only detects that a proc has exited
> and checks to see if that happened due to a signal. If so, it flags that
> situation and will order the job aborted.
> 
> So if the proc continues alive, the fact that it was hit with SIGUSR2 will
> not be detected by ORTE nor will anything happen - however, if the OS uses
> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
> signal, we will see that proc terminate due to signal and abort the rest of
> the job.
> 
> We could change it if that is what people want - it is trivial to insert
> code to say "kill everything except if it died due to a certain signal".
> 
> <shrug> up to you folks. Current behavior is what you said you wanted a long
> time ago - nothing has changed in this regard for several years.
> 
> 
> On 4/8/08 11:36 AM, "Pak Lui" <pak....@sun.com> wrote:
> 
>> First, can your user executable install a signal handler that catches
>> SIGUSR2 so the process does not exit? By default on Solaris it is going to
>> exit, unless you catch the signal and have the process do nothing.
>> 
>> from signal(3HEAD)
>>       Name             Value   Default    Event
>>       SIGUSR1          16      Exit       User Signal 1
>>       SIGUSR2          17      Exit       User Signal 2
>> 
>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>> might cause the processes to exit if the orted (or mpirun if it's on
>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>> processes on that node once it receives a signal.
>> 
>> I worked around this for the gridengine PLM. Once gridengine_wait_daemon()
>> receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and
>> returns, without declaring launch_failed, which would kill off the user
>> processes. The signals also get passed to the user processes, which can
>> decide for themselves what to do with them.
>> 
>> SGE needed this for the job kill and job suspension notifications to work
>> properly, since it sends SIGUSR1/2 to mpirun. I believe this is
>> probably what you need in the rsh plm.
>> 
>> Richard Graham wrote:
>>> I am running into a situation where I am trying to deliver a signal
>>> (SIGUSR2) to the MPI procs.  I deliver this to mpirun, which propagates it
>>> to the MPI procs, but then proceeds to kill the children.  Is there an easy
>>> way to get around this?  I am using this mechanism in a situation where I
>>> don't have a debugger, and I am trying to use it to turn on debugging when
>>> I hit a hang, so killing the MPI procs is really not what I want to have
>>> happen.
>>> 
>>> Thanks,
>>> Rich
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
> 
> 
