What happens if I deliver SIGUSR2 to mpirun? What I observe (for both ssh/rsh and Torque) is that if I deliver SIGUSR2 to mpirun, the signal does get propagated to the MPI procs, which do invoke the signal handler I registered, but the job is terminated right after that. However, if I deliver the signal directly to the MPI procs, the signal handler is invoked and the job continues to run.
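For reference, the kind of handler described above might look like the following. This is a minimal sketch, not the code actually used in the application: the names `debug_enabled`, `on_sigusr2`, and `install_debug_toggle` are hypothetical, and it assumes a POSIX system with `sigaction`.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag: flipped by the handler, polled by the application
 * (for example, inside a progress loop) to decide whether to emit debug output. */
static volatile sig_atomic_t debug_enabled = 0;

/* The handler does only async-signal-safe work: it toggles the flag. */
static void on_sigusr2(int signo)
{
    (void)signo;
    debug_enabled = !debug_enabled;
}

/* Install the SIGUSR2 handler. Returns 0 on success, -1 on failure. */
static int install_debug_toggle(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls */
    return sigaction(SIGUSR2, &sa, NULL);
}
```

With a handler like this installed, SIGUSR2 no longer takes its default action (terminating the process), so a proc that receives the signal keeps running.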
So, I think that what was intended to happen is the correct thing, but for some reason it is not happening.

Rich

On 4/8/08 1:47 PM, "Ralph H Castain" <r...@lanl.gov> wrote:

> I found what Pak said a little confusing as the wait_daemon function doesn't
> actually receive a signal itself - it only detects that a proc has exited
> and checks to see if that happened due to a signal. If so, it flags that
> situation and will order the job aborted.
>
> So if the proc continues alive, the fact that it was hit with SIGUSR2 will
> not be detected by ORTE nor will anything happen - however, if the OS uses
> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
> signal, we will see that proc terminate due to signal and abort the rest of
> the job.
>
> We could change it if that is what people want - it is trivial to insert
> code to say "kill everything except if it died due to a certain signal".
>
> <shrug> up to you folks. Current behavior is what you said you wanted a long
> time ago - nothing has changed in this regard for several years.
>
>
> On 4/8/08 11:36 AM, "Pak Lui" <pak....@sun.com> wrote:
>
>> First, can your user executable create a signal handler to catch the
>> SIGUSR2 to not exit? By default on Solaris it is going to exit, unless
>> you catch the signal and have the process to do nothing.
>>
>> from signal(3HEAD)
>>      Name        Value   Default   Event
>>      SIGUSR1     16      Exit      User Signal 1
>>      SIGUSR2     17      Exit      User Signal 2
>>
>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>> might cause the processes to exit if the orted (or mpirun if it's on
>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>> processes on that node once it receives a signal.
>>
>> I workaround this for gridengine PLM. Once the gridengine_wait_daemon()
>> receives a SIGUSR1/SIGUSR2 signal, it just lets the signals to
>> acknowledge a signal returns, without declaring the launch_failed which
>> would kill off the user processes. The signals would also get passed to
>> the user processes, and let them decide what to do with the signals
>> themselves.
>>
>> SGE needed this so the job kill or job suspension notification to work
>> properly since they would send a SIGUSR1/2 to mpirun. I believe this is
>> probably what you need in the rsh plm.
>>
>> Richard Graham wrote:
>>> I am running into a situation where I am trying to deliver a signal to the
>>> mpi procs (sigusr2). I deliver this to mpirun, which propagates it to the
>>> mpi procs, but then proceeds to kill the children. Is there an easy way
>>> that I can get around this ? I am using this mechanism in a situation where
>>> I don't have a debugger, and trying to use this to turn on debugging when I
>>> hit a hang, so killing the mpi procs is really not what I want to have
>>> happen.
>>>
>>> Thanks,
>>> Rich
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
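
The wait logic Ralph describes (notice that a proc has exited, check whether it died due to a signal, and optionally exempt certain signals from aborting the job) can be sketched roughly as below. This is an illustration only and is not ORTE's actual code: `is_exempt_signal` and `should_abort_job` are hypothetical names, and the choice of SIGUSR1/SIGUSR2 as the exempt set is an assumption drawn from the discussion above.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical policy: a death caused by one of these signals
 * should NOT abort the rest of the job. */
static int is_exempt_signal(int signo)
{
    return signo == SIGUSR1 || signo == SIGUSR2;
}

/* Given a wait status as filled in by waitpid(), decide whether the
 * job should be aborted. Returns 1 if the proc was terminated by a
 * non-exempt signal, 0 otherwise (normal exit or exempt signal). */
static int should_abort_job(int status)
{
    if (WIFSIGNALED(status)) {
        return !is_exempt_signal(WTERMSIG(status));
    }
    return 0;
}
```

Note that a proc which catches SIGUSR2 and keeps running never produces a wait status at all, so a check like this only matters when the signal actually terminates the proc.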