Just curious -- what's difficult about this? SIGTSTP and SIGCONT can be
caught; is there something preventing us from sending "stop" and
"continue" messages (just like we send "die" messages)?
(If I had to guess, I think the user is asking because some other MPI
implementations provide this kind of behavior.)
Thanks!
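To make the "stop" and "continue" message idea concrete, here is a minimal sketch of what the receiving side might do with such messages. It is not Open MPI code. The command names "stop" and "continue", the COMMAND_TO_SIGNAL table, and the apply_command helper are all hypothetical; only "die" is named in this thread. The sketch assumes a daemon that launched its application processes as direct children and can therefore signal them locally. SIGSTOP itself cannot be caught, which is why the catchable SIGTSTP would be the natural trigger on the launcher side, while the daemon is free to use SIGSTOP on its own children.

```python
import os
import signal
import subprocess
import sys

# Hypothetical command names; only "die" is mentioned in the thread.
COMMAND_TO_SIGNAL = {
    "stop": signal.SIGSTOP,
    "continue": signal.SIGCONT,
    "die": signal.SIGTERM,
}


def apply_command(command, children):
    """Deliver the signal for `command` to each child that has not exited.

    `children` is a list of subprocess.Popen objects started by this
    process. Returns the number of children that were signalled.
    """
    signum = COMMAND_TO_SIGNAL[command]
    count = 0
    for child in children:
        # poll() returns None until the child has exited; a child that is
        # merely stopped still counts as alive.
        if child.poll() is None:
            os.kill(child.pid, signum)
            count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a locally launched application process.
    app = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    apply_command("stop", [app])
    # WUNTRACED makes waitpid report the stop instead of blocking until exit.
    _, status = os.waitpid(app.pid, os.WUNTRACED)
    print("child stopped:", os.WIFSTOPPED(status))
    apply_command("continue", [app])
    apply_command("die", [app])
    app.wait()
```

This only covers processes the daemon started itself. Anything those processes fork on their own would need process groups or similar bookkeeping, which the sketch leaves out.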
________________________________
From: [email protected]
[mailto:[email protected]] On Behalf Of Ralph Castain
Sent: Thursday, June 01, 2006 10:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
Actually, there were some implementation issues that might
prevent this from working and were the reason we didn't implement it
right away. We don't actually transmit the SIGTERM - we capture it in
mpirun and then propagate our own "die" command to the remote processes
and daemons. Fortunately, "die" is very easy to implement.
Unfortunately, "stop" and "continue" are much harder to
implement from inside of a process. We'll have to look at it, but this
may not really be feasible.
Ralph
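The capture-and-propagate pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual mpirun code: the channels here are plain writable file descriptors (one per daemon), and make_handler and install are hypothetical names. A real launcher would use its own messaging layer. The point is that the remote side never receives the signal itself, only a "die" command.

```python
import os
import signal


def make_handler(command, channels):
    """Build a signal handler that sends `command` down every channel.

    `channels` is a list of writable file descriptors, one per daemon
    (an assumption made for this sketch).
    """
    message = (command + "\n").encode()

    def handler(signum, frame):
        # The signal number is deliberately not forwarded; the daemons
        # only ever see the command text.
        for fd in channels:
            os.write(fd, message)

    return handler


def install(channels):
    """Catch termination signals in the launcher and turn them into "die"."""
    handler = make_handler("die", channels)
    signal.signal(signal.SIGTERM, handler)
    # Control-C at a terminal delivers SIGINT, so it is handled as well.
    signal.signal(signal.SIGINT, handler)
```

A "stop" or "continue" command could be sent the same way from a SIGTSTP or SIGCONT handler. The harder part is what the receiving process does with it, since a process cannot simply suspend itself and still remain responsive to a later "continue" message.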
Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is that we didn't do anything to
make it work. :-)
Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes. There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: [email protected]
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
Hi,
I have a question on signals. Normally when I send a SIGTERM
(control-C) to mpirun, the signal seems to get handled in a way that it
is broadcast to the orted and the processes on the execution hosts.
However, when I send a SIGSTOP to mpirun, mpirun itself stops, but the
processes of the user executable continue to run. I could hook up the
debugger to mpirun and orted to see why the two signals are handled
differently, but I am anxious to hear about it here first.
I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. N1GE will try to use these
signals to stop and continue both mpirun and orted (and its processes),
but the signals (SIGSTOP and SIGCONT) don't seem to get propagated to
the remote orted.
I can see there are some issues for implementing this feature on N1GE
because the 'qrsh' interface does not send the signal to orted on the
remote node, but only to 'mpirun'. I am trying to see how to work
around this.
--
Thanks,
- Pak Lui
[email protected]
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel