Jeff Squyres (jsquyres) wrote:
I guess I had in my head that Josh already
working on most of these issues anyway for the checkpoint / restart
work (i.e., all the quiescing stuff). Indeed, if you think about it --
pause/resume is one form of a checkpoint/restart. Hence, if the
checkpoint/restart frameworks are laid out right -- and I think they
are -- pause/resume may just be a component in the checkpoint/restart
frameworks (there's a little hand-waving going on here, of course :-),
but I'm trusting that Josh will jump in if I have any heinously
incorrect assumptions).
Good point - but Josh is only beginning to scratch the surface on the
issues I mentioned. Quite a ways from having something for general use.
This also brings up another [minor] point -- we
don't currently propagate signals out from mpirun to remote processes
(e.g., SIGUSR1). There hasn't really been a need for this yet, so it's
been a pretty low priority.
Sorry for all the confusion, though -- I keyed
off the phrase "there were some implementation issues that might
prevent this from working" in your original e-mail, which I interpreted
as "our implementation prohibits this." :-)
My fault - should have been clearer.
Jeff Squyres (jsquyres) wrote:
Just curious -- what's difficult about this?
SIGTSTP and SIGCONT can be caught; is there something preventing us
from sending "stop" and "continue" messages (just like we send "die"
messages)?
Nothing preventing it at all. The problem lies in what you do when you
receive it. Take the example of a launch that used orted daemons. We
could pass the "stop" or "continue" message to the orted, which could
signal its child processes (i.e., the application processes on that
node) with the appropriate signal. That would stop/continue the child
process just fine - but what about communications that are still
in-progress?? Bad news.
So instead you could pass the application process a "stop" message. The
process could then "quiet" the MPI-based messaging system, reply back
to the orted that all is now quiet, and then the orted could send the
appropriate OS-level signal so the process would truly "stop".
"Continue" is much easier, of course - there is no "quieting" to be
done, so the orted could just issue a "continue" signal to its children.
Great - except we still haven't "stopped" the run-time! What happens if
the registry is in the middle of a notification process (e.g., we hit a
stage gate and all the notification messages are being sent, or someone
is in the middle of a put that causes a set of subscriptions to fire
and send out messages - that may in turn cause additional action on the
remote host)? What about messages being routed through the orteds (once
we get the routing system in-place)?
Well, we now could go through a similar process to first "quiet" the
run-time itself. We would have to ensure that every subsystem completed
its on-going operation and then "stopped". We would of course have to
tell all the remote processes to "stop" first so that new requests
would quit coming in, or else this process would never complete. Note
that this means the remote processes would have to receive and "log"
any notifications that come in from the registry after we tell the
process to "stop", but could not take action on those notices until we
"continue" the process.
So now we have the MPI and run-time layers "quiet". We send a message
to the remote orteds indicating they should go ahead and send their
local application processes an OS-level signal to "stop" so that the OS
knows not to spend cycles on them. Unfortunately, we cannot do the same
for the orteds themselves, so that means that the orteds remain "awake"
and operating, but they can just "spin".
All sounds fine. Now all we have to deal with are: all the race
conditions inherent in what I just described; how to deal with receipt
of asynchronous notifications when we've already been told to stop; the
scenarios where we don't have orted daemons on every node; how to
stop/restart major MPI collectives in mid operation; etc. etc.
Not saying it cannot be done - just indicating that there were reasons
why it wasn't initially done other than "we just didn't get around to
it". :-)
(If I had to guess, I think the user is asking
because some other MPI implementations implement this kind of behavior)
Thanks!
Actually, there were some implementation issues that might prevent this
from working and were the reason we didn't implement it right away. We
don't actually transmit the SIGTERM - we capture it in mpirun and then
propagate our own "die" command to the remote processes and daemons.
Fortunately, "die" is very easy to implement.
Unfortunately, "stop" and "continue" are much harder to implement from
inside of a process. We'll have to look at it, but this may not really
be feasible.
Ralph
Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is because we didn't do any thing
to make it work. :-)
Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes. There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
-----Original Message-----
From: devel-boun...@open-mpi.org
[mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: de...@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
Hi,
I have a question on signals. Normally when I do a SIGTERM
(control-C)
on mpirun, the signal seems to get handled in a way that it
broadcasts
to the orted and processes on the execution hosts. However,
when I send
a SIGSTOP to mpirun, mpirun seems to have stopped, but the
processes of
the user executable continue to run. I guess I could hook up the
debugger to mpirun and orted to see why they are handled differently,
but I guess I anxious to hear about it here.
I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. It'll try to use these
signals to
stop and continue both mpirun and orted (and its processes), but the
signals (SIGSTOP and SIGCONT) don't seem to get propagated to
the remote
orted.
I can see there are some issues for implementing this feature on N1GE
because the 'qrsh' interface does not send the signal to orted on the
remote node, but only to 'mpirun'. I am trying to see how to
work around
this.
--
Thanks,
- Pak Lui
pak....@sun.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
|