On 4/8/08 12:10 PM, "Pak Lui" <pak....@sun.com> wrote:

> Richard Graham wrote:
>> What happens if I deliver sigusr2 to mpirun ?  What I observe (for both
>> ssh/rsh and torque) is that if I deliver a sigusr2 to mpirun, the signal does
>> get propagated to the mpi procs, which do invoke the signal handler I
>> registered, but the job is terminated right after that.  However, if I
>> deliver the signal directly to the mpi procs, the signal handler is invoked,
>> and the job continues to run.
> 
> This is exactly what I observed previously when I made the
> gridengine change. It is due to the fact that orterun (aka mpirun) is
> the process that forks and execs the executables on the HNP; on the
> remote nodes, you don't have this problem. So the wait_daemon function
> picks up the signal from mpirun on the HNP, then kills off the children.

I'll look into this, but I don't believe this is true UNLESS something
exits. The wait_daemon function only gets called when a proc terminates - it
doesn't "pick up" a signal on its own. Perhaps we are just having a language
problem here...

In the rsh situation, the daemon "daemonizes" and closes the ssh session
during launch. If the ssh session closed on a signal, then that would return
and indicate that a daemon had failed to start, causing the abort. But that
session is successfully closed PRIOR to the launch of any MPI procs. I note
that we don't "deregister" the waitpid, though, so there may be some issue
there.

However, we most certainly do NOT look for such things in Torque. My guess
is that something is causing a proc/daemon to abort, which then causes the
system to abort the job.
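
For reference, the abort-on-signal rule discussed in this thread, together
with the exemption floated earlier ("kill everything except if it died due to
a certain signal"), could be sketched roughly as below. This is plain POSIX C
and not the actual wait_daemon code: should_abort_job() and
wait_status_after_signal() are hypothetical names, the choice of
SIGUSR1/SIGUSR2 as the exempt signals is an assumption, and so is treating a
non-zero exit code as a reason to abort.

```c
#include <assert.h>     /* only needed for the assert() calls in the usage note */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical policy: return 1 to abort the job, 0 to leave it running.
 * 'status' is a wait status as filled in by waitpid(). */
int should_abort_job(int status)
{
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        /* Assumed exemption list: user signals do not abort the job. */
        if (sig == SIGUSR1 || sig == SIGUSR2) return 0;
        return 1;                      /* any other fatal signal aborts */
    }
    if (WIFEXITED(status)) {
        /* Assumption: a non-zero exit code also counts as a failure. */
        return WEXITSTATUS(status) != 0;
    }
    return 0;
}

/* Helper for trying the policy out: fork a child that dies from 'sig' under
 * the default disposition and store its wait status in *status.
 * Returns 0 on success, -1 on failure. */
int wait_status_after_signal(int sig, int *status)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        signal(sig, SIG_DFL);          /* make sure the default action applies */
        raise(sig);
        _exit(0);                      /* only reached if the signal did not kill us */
    }
    return (waitpid(pid, status, 0) == pid) ? 0 : -1;
}
```

With this sketch, a status obtained via wait_status_after_signal(SIGUSR2, &st)
gives should_abort_job(st) == 0, while the same with SIGTERM gives 1. Note
that a proc killed by SIGUSR2 is still gone, so whether the rest of the job
can usefully continue is a separate question.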

I have tried this on my Mac (got other things going on at the moment on the
distributed machines), and all works as expected. However, that doesn't mean
there isn't a problem in general.

Will investigate shortly, when I have time.

> 
>> 
>> So, I think that what was intended to happen is the correct thing, but for
>> some reason it is not happening.
>> 
>> Rich
>> 
>> 
>> On 4/8/08 1:47 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
>> 
>>> I found what Pak said a little confusing as the wait_daemon function doesn't
>>> actually receive a signal itself - it only detects that a proc has exited
>>> and checks to see if that happened due to a signal. If so, it flags that
>>> situation and will order the job aborted.
>>> 
>>> So if the proc continues alive, the fact that it was hit with SIGUSR2 will
>>> not be detected by ORTE nor will anything happen - however, if the OS uses
>>> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
>>> signal, we will see that proc terminate due to signal and abort the rest of
>>> the job.
>>> 
>>> We could change it if that is what people want - it is trivial to insert
>>> code to say "kill everything except if it died due to a certain signal".
>>> 
>>> <shrug> up to you folks. Current behavior is what you said you wanted a long
>>> time ago - nothing has changed in this regard for several years.
>>> 
>>> 
>>> On 4/8/08 11:36 AM, "Pak Lui" <pak....@sun.com> wrote:
>>> 
>>>> First, can your user executable install a signal handler that catches
>>>> SIGUSR2 so it does not exit? By default on Solaris it is going to exit,
>>>> unless you catch the signal and have the process do nothing.
>>>> 
>>>> from signal(3HEAD)
>>>>       Name             Value   Default    Event
>>>>       SIGUSR1          16      Exit       User Signal 1
>>>>       SIGUSR2          17      Exit       User Signal 2
>>>> 
>>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>>>> might cause the processes to exit if the orted (or mpirun if it's on
>>>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>>>> processes on that node once it receives a signal.
>>>> 
>>>> I worked around this in the gridengine PLM. Once gridengine_wait_daemon()
>>>> sees a SIGUSR1/SIGUSR2 signal, it just acknowledges that a signal was
>>>> returned, without declaring launch_failed, which would kill off the
>>>> user processes. The signals also get passed to the user processes,
>>>> which can decide for themselves what to do with them.
>>>> 
>>>> SGE needed this for the job kill and job suspension notifications to work
>>>> properly, since they send a SIGUSR1/2 to mpirun. I believe this is
>>>> probably what you need in the rsh plm.
>>>> 
>>>> Richard Graham wrote:
>>>>> I am running into a situation where I am trying to deliver a signal to the
>>>>> mpi procs (sigusr2).  I deliver this to mpirun, which propagates it to the
>>>>> mpi procs, but then proceeds to kill the children.  Is there an easy way
>>>>> that I can get around this ?  I am using this mechanism in a situation
>>>>> where I don't have a debugger, and trying to use this to turn on
>>>>> debugging when I hit a hang, so killing the mpi procs is really not
>>>>> what I want to have happen.
>>>>> 
>>>>> Thanks,
>>>>> Rich
>>>>> 
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

