Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Jeff Squyres (jsquyres)
Just curious -- what's difficult about this?  SIGTSTP and SIGCONT can be
caught; is there something preventing us from sending "stop" and
"continue" messages (just like we send "die" messages)?
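A small check of which of these signals a process may install a handler for, sketched in Python for illustration only (the thread itself contains no code, and `can_catch` is a hypothetical helper name). It relies on the POSIX rule that SIGSTOP cannot be caught, blocked, or ignored, while SIGTSTP and SIGCONT can:

```python
import signal


def can_catch(signum):
    """Return True if this process is allowed to install a handler for signum."""
    previous = signal.getsignal(signum)
    try:
        signal.signal(signum, lambda received, frame: None)
    except (OSError, ValueError):
        # The kernel refuses handlers for SIGSTOP (and SIGKILL).
        return False
    # Put the earlier disposition back so the check has no lasting effect.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for name in ("SIGTSTP", "SIGCONT", "SIGSTOP"):
        print(name, can_catch(getattr(signal, name)))
```

This must run in the main thread of a POSIX process, since Python only allows handlers to be installed there.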
 
(If I had to guess, I think the user is asking because some other MPI
implementations implement this kind of behavior)
 
Thanks!




From: devel-boun...@open-mpi.org
[mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, June 01, 2006 10:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted


Actually, there were some implementation issues that might
prevent this from working and were the reason we didn't implement it
right away. We don't actually transmit the SIGTERM - we capture it in
mpirun and then propagate our own "die" command to the remote processes
and daemons. Fortunately, "die" is very easy to implement.
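A minimal sketch of the "capture the signal, then propagate our own command" pattern described above. This is not Open MPI code: the `Launcher` class, the `COMMANDS` table, and the daemon names are all hypothetical, and the "stop"/"continue" entries are the commands under discussion in this thread, not ones that exist today.

```python
import signal

# Hypothetical mapping from an OS signal seen by the launcher to the
# run-time command it would broadcast. Only "die" is described in the
# thread as implemented; the other two are the proposal being debated.
COMMANDS = {
    signal.SIGTERM: "die",
    signal.SIGTSTP: "stop",
    signal.SIGCONT: "continue",
}


class Launcher:
    """Stand-in for mpirun: records commands instead of sending them."""

    def __init__(self, daemons):
        self.daemons = list(daemons)
        self.sent = []  # (daemon, command) pairs, in send order

    def on_signal(self, signum, frame=None):
        command = COMMANDS.get(signum)
        if command is None:
            return  # a signal we do not translate
        for daemon in self.daemons:
            self.sent.append((daemon, command))

    def install(self):
        """Register on_signal for every signal in COMMANDS."""
        for signum in COMMANDS:
            signal.signal(signum, self.on_signal)
```

Calling `Launcher(["node0", "node1"]).install()` in a real launcher would route SIGTERM, SIGTSTP, and SIGCONT through `on_signal`; the signal itself never leaves the local process, only the translated command does.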

Unfortunately, "stop" and "continue" are much harder to
implement from inside of a process. We'll have to look at it, but this
may not really be feasible.

Ralph



Jeff Squyres (jsquyres) wrote: 

The main reason that it doesn't work is because we didn't do anything
to make it work.  :-)

Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes.  There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
 

  

-Original Message-
From: devel-boun...@open-mpi.org
[mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: de...@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted

Hi,

I have a question on signals. Normally when I do a SIGTERM (control-C)
on mpirun, the signal seems to get handled in a way that it broadcasts
to the orted and processes on the execution hosts. However, when I send
a SIGSTOP to mpirun, mpirun seems to have stopped, but the processes of
the user executable continue to run. I guess I could hook up the
debugger to mpirun and orted to see why they are handled differently,
but I guess I am anxious to hear about it here.

I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. It'll try to use these signals to
stop and continue both mpirun and orted (and its processes), but the
signals (SIGSTOP and SIGCONT) don't seem to get propagated to the remote
orted.

I can see there are some issues for implementing this feature on N1GE
because the 'qrsh' interface does not send the signal to orted on the
remote node, but only to 'mpirun'. I am trying to see how to work around
this.

-- 

Thanks,

- Pak Lui
pak@sun.com

___
devel mailing list
de...@open-mpi.org

http://www.open-mpi.org/mailman/listinfo.cgi/devel





  



Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Ralph Castain






Jeff Squyres (jsquyres) wrote:

  
  
  Just curious -- what's difficult about this? 
SIGTSTP and SIGCONT can be caught; is there something preventing us
from sending "stop" and "continue" messages (just like we send "die"
messages)?

Nothing preventing it at all. The problem lies in what you do when you
receive it. Take the example of a launch that used orted daemons. We
could pass the "stop" or "continue" message to the orted, which could
signal its child processes (i.e., the application processes on that
node) with the appropriate signal. That would stop/continue the child
process just fine - but what about communications that are still
in-progress?? Bad news.

So instead you could pass the application process a "stop" message. The
process could then "quiet" the MPI-based messaging system, reply back
to the orted that all is now quiet, and then the orted could send the
appropriate OS-level signal so the process would truly "stop".
"Continue" is much easier, of course - there is no "quieting" to be
done, so the orted could just issue a "continue" signal to its children.
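The handshake described above (tell the process to stop, let it quiet its messaging, wait for its acknowledgement, and only then send the OS-level signal) can be sketched as a toy model. Everything here is hypothetical: `AppProcess`, `Daemon`, and the "quiet" reply are illustrative names, and no real signal is delivered; the daemon only records what it would send.

```python
import signal


class AppProcess:
    """Toy application process with some MPI traffic still in flight."""

    def __init__(self, pid, in_flight):
        self.pid = pid
        self.in_flight = in_flight  # number of unfinished transfers
        self.accepting = True

    def handle_stop(self):
        self.accepting = False      # refuse new communication first
        while self.in_flight:       # then drain what was already started
            self.in_flight -= 1
        return "quiet"              # acknowledgement sent back to the daemon


class Daemon:
    """Toy orted: signals a child only after it reports that it is quiet."""

    def __init__(self, children):
        self.children = children
        self.signals = []           # (pid, signal) pairs we would deliver

    def stop_all(self):
        for child in self.children:
            if child.handle_stop() == "quiet":
                self.signals.append((child.pid, signal.SIGSTOP))

    def continue_all(self):
        # No quieting is needed on the way back: just wake the children.
        for child in self.children:
            child.accepting = True
            self.signals.append((child.pid, signal.SIGCONT))
```

The sketch is sequential, so it sidesteps the race conditions that the rest of this message is about; it only shows the ordering of the steps.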

Great - except we still haven't "stopped" the run-time! What happens if
the registry is in the middle of a notification process (e.g., we hit a
stage gate and all the notification messages are being sent, or someone
is in the middle of a put that causes a set of subscriptions to fire
and send out messages - that may in turn cause additional action on the
remote host)? What about messages being routed through the orteds (once
we get the routing system in-place)?

Well, we now could go through a similar process to first "quiet" the
run-time itself. We would have to ensure that every subsystem completed
its on-going operation and then "stopped". We would of course have to
tell all the remote processes to "stop" first so that new requests
would quit coming in, or else this process would never complete. Note
that this means the remote processes would have to receive and "log"
any notifications that come in from the registry after we tell the
process to "stop", but could not take action on those notices until we
"continue" the process.
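The "receive and log, but do not act until continue" requirement can be sketched as a deferred inbox. `NotificationInbox` is a hypothetical name, and the registry is reduced to a plain callback for illustration:

```python
from collections import deque


class NotificationInbox:
    """Delivers notices immediately, or queues them while stopped."""

    def __init__(self, handler):
        self.handler = handler      # callable that acts on one notice
        self.stopped = False
        self.pending = deque()      # notices logged while stopped

    def stop(self):
        self.stopped = True

    def deliver(self, notice):
        if self.stopped:
            self.pending.append(notice)   # log it, take no action yet
        else:
            self.handler(notice)

    def resume(self):
        """Act on everything that arrived while stopped, in arrival order."""
        self.stopped = False
        while self.pending:
            self.handler(self.pending.popleft())
```

For example, with `seen = []` and `inbox = NotificationInbox(seen.append)`, notices delivered between `inbox.stop()` and `inbox.resume()` reach `seen` only after the resume.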

So now we have the MPI and run-time layers "quiet". We send a message
to the remote orteds indicating they should go ahead and send their
local application processes an OS-level signal to "stop" so that the OS
knows not to spend cycles on them. Unfortunately, we cannot do the same
for the orteds themselves, so that means that the orteds remain "awake"
and operating, but they can just "spin".
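The final step, where a parent that stays awake sends its child an OS-level stop signal, looks roughly like this on a POSIX system. It is a sketch using only standard `os`, `signal`, and `subprocess` calls; `pause_child` and `resume_child` are hypothetical helper names, and the sleeping child merely stands in for an application process.

```python
import os
import signal
import subprocess
import sys


def pause_child(pid):
    """Send SIGSTOP and wait until the kernel reports the child as stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume_child(pid):
    """Let a stopped child run again."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    print("stopped:", pause_child(child.pid))
    resume_child(child.pid)
    child.terminate()
    child.wait()
```

Note that the child is resumed before it is terminated: a SIGTERM sent to a stopped process stays pending until the process is continued.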

All sounds fine. Now all we have to deal with are: all the race
conditions inherent in what I just described; how to deal with receipt
of asynchronous notifications when we've already been told to stop; the
scenarios where we don't have orted daemons on every node; how to
stop/restart major MPI collectives in mid operation; etc. etc.

Not saying it cannot be done - just indicating that there were reasons
why it wasn't initially done other than "we just didn't get around to
it".  :-) 



   

Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Pak Lui

Ralph Castain wrote:



Jeff Squyres (jsquyres) wrote:

Just curious -- what's difficult about this?  SIGTSTP and SIGCONT can 
be caught; is there something preventing us from sending "stop" and 
"continue" messages (just like we send "die" messages)?


Nothing preventing it at all. The problem lies in what you do when you 
receive it. Take the example of a launch that used orted daemons. We 
could pass the "stop" or "continue" message to the orted, which could 
signal its child processes (i.e., the application processes on that 
node) with the appropriate signal. That would stop/continue the child 
process just fine - but what about communications that are still 
in-progress?? Bad news.


So instead you could pass the application process a "stop" message. The 
process could then "quiet" the MPI-based messaging system, reply back to 
the orted that all is now quiet, and then the orted could send the 
appropriate OS-level signal so the process would truly "stop". 
"Continue" is much easier, of course - there is no "quieting" to be 
done, so the orted could just issue a "continue" signal to its children.




I agree that stopping orted may not be the behavior that we are looking 
for. Instead, we can send the signals to the application processes, 
since stopping them is what we are interested in.


The idea is to stop the resource consumption by the user processes once 
the stop signal is sent from N1GE. Since orted is an administrative 
daemon rather than a process that is doing work, its resource usage 
probably does not need to be accounted for.


And since 'qrsh' does not deliver a stop signal to orted but only to 
mpirun, it's really up to mpirun to decide where to send the stop 
signal.


Great - except we still haven't "stopped" the run-time! What happens if 
the registry is in the middle of a notification process (e.g., we hit a 
stage gate and all the notification messages are being sent, or someone 
is in the middle of a put that causes a set of subscriptions to fire and 
send out messages - that may in turn cause additional action on the 
remote host)? What about messages being routed through the orteds (once 
we get the routing system in-place)?


Well, we now could go through a similar process to first "quiet" the 
run-time itself. We would have to ensure that every subsystem completed 
its on-going operation and then "stopped". We would of course have to 
tell all the remote processes to "stop" first so that new requests would 
quit coming in, or else this process would never complete. Note that 
this means the remote processes would have to receive and "log" any 
notifications that come in from the registry after we tell the process 
to "stop", but could not take action on those notices until we 
"continue" the process.


So now we have the MPI and run-time layers "quiet". We send a message to 
the remote orteds indicating they should go ahead and send their local 
application processes an OS-level signal to "stop" so that the OS knows 
not to spend cycles on them. Unfortunately, we cannot do the same for 
the orteds themselves, so that means that the orteds remain "awake" and 
operating, but they can just "spin".


All sounds fine. Now all we have to deal with are: all the race 
conditions inherent in what I just described; how to deal with receipt 
of asynchronous notifications when we've already been told to stop; the 
scenarios where we don't have orted daemons on every node; how to 
stop/restart major MPI collectives in mid operation; etc. etc.


Not saying it cannot be done - just indicating that there were reasons 
why it wasn't initially done other than "we just didn't get around to 
it". :-)


Excellent explanations. These issues seem to be non-trivial and I don't 
see that we can resolve them at this point, not even when we make sure 
the run-time communications are in a state of quiescence. It may be 
wise to keep this feature out for now.


 
(If I had to guess, I think the user is asking because some other MPI 
implementations implement this kind of behavior)


I am not sure whether there is high demand from users for this feature, 
but from reading some of the posts on sunsource.net on job suspension, 
I actually don't think other MPI implementations have done this, except 
for ClusterTools, our previous MPI implementation. There are some issues 
involving communication timeouts that you already mentioned, file IO, 
plus others. So it could be messy to implement this feature for parallel 
jobs in general.

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=1418

There are also some workarounds mentioned. One is for the user to put the 
parallel job in a subordinate queue, or to modify the existing queue to 
have a lower priority, instead of using stop to freeze the application 
processes.


 
Thanks!




Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Jeff Squyres (jsquyres)
I guess I had in my head that Josh was already working on most of these
issues anyway for the checkpoint / restart work (i.e., all the quiescing
stuff).  Indeed, if you think about it -- pause/resume is one form of a
checkpoint/restart.  Hence, if the checkpoint/restart frameworks are
laid out right -- and I think they are -- pause/resume may just be a
component in the checkpoint/restart frameworks (there's a little
hand-waving going on here, of course :-), but I'm trusting that Josh
will jump in if I have any heinously incorrect assumptions).
 
This also brings up another [minor] point -- we don't currently
propagate signals out from mpirun to remote processes (e.g., SIGUSR1).
There hasn't really been a need for this yet, so it's been a pretty low
priority.
 
Sorry for all the confusion, though -- I keyed off the phrase "there
were some implementation issues that might prevent this from working" in
your original e-mail, which I interpreted as "our implementation
prohibits this."  :-)
 




Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Jeff Squyres (jsquyres)
I forgot to mention that I completely agree that we don't need (or want)
to pause/resume the orteds.  This is also in total agreement with the
checkpoint/restart philosophy: we are only checkpointing and restarting
the user application(s), not the run-time infrastructure.  There may
still be quiescing issues within ORTE for checkpointing the user
applications (per Josh's work and Ralph's explanations), but there's no
need to actually pause / checkpoint the orteds themselves. 

As a corollary, this means that we likely will not be able to pause /
checkpoint in cases where we don't use orteds.  I'm fine with that.
Currently, the only place where this occurs is on Red Storm, where
pausing doesn't make sense (I'm not conversant enough with the Red Storm
architecture to know if they care about checkpointing, and if so, how
it's handled).


> -Original Message-
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
> Sent: Friday, June 02, 2006 11:37 AM
> To: r...@lanl.gov; Open MPI Developers
> Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
> 
> I agree that stopping orted may not be the behavior that we are looking 
> for. Instead, we can send the signals to the application processes, 
> since stopping them is what we are interested in.
> 
> The idea is to stop the resource consumption by the user processes once 
> the stop signal is sent from N1GE. Since orted is an administrative 
> daemon rather than a process that is doing work, its resource usage 
> probably does not need to be accounted for.
> 
> And since 'qrsh' does not deliver a stop signal to orted but only to 
> mpirun, it's really up to mpirun to decide where to send the stop 
> signal.



Re: [OMPI devel] SIGSTOP and SIGCONT on orted

2006-06-02 Thread Ralph Castain






Jeff Squyres (jsquyres) wrote:

  
  
  
  I guess I had in my head that Josh was already
working on most of these issues anyway for the checkpoint / restart
work (i.e., all the quiescing stuff).  Indeed, if you think about it --
pause/resume is one form of a checkpoint/restart.  Hence, if the
checkpoint/restart frameworks are laid out right -- and I think they
are -- pause/resume may just be a component in the checkpoint/restart
frameworks (there's a little hand-waving going on here, of course :-),
but I'm trusting that Josh will jump in if I have any heinously
incorrect assumptions).

Good point - but Josh is only beginning to scratch the surface on the
issues I mentioned. Quite a ways from having something for general use.

   
  This also brings up another [minor] point -- we
don't currently propagate signals out from mpirun to remote processes
(e.g., SIGUSR1).  There hasn't really been a need for this yet, so it's
been a pretty low priority.
   
  Sorry for all the confusion, though -- I keyed
off the phrase "there were some implementation issues that might
prevent this from working" in your original e-mail, which I interpreted
as "our implementation prohibits this."  :-)

My fault - should have been clearer.

   
  


[OMPI devel] Query on zero-copy sends

2006-06-02 Thread Jonathan Day
Hi,

I'm working on developing some components for OpenMPI,
but am a little unclear as to how to implement
efficient sends and receives. I'm wanting to do
zero-copy two-sided MPI, but as far as I can see, this
is not going to be easy. As best as I can tell, the
receive mechanism copies into a temporary user buffer
then, on actually handling the receive, copies that
into the application's buffer. Would I be correct in
this interpretation?
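To make the distinction in the question concrete, here is a generic Python sketch of a staged receive (temporary buffer, then copy to the application buffer) versus a receive that lands directly in the application buffer. It says nothing about how Open MPI's receive path actually behaves; `staged_receive` and `direct_receive` are hypothetical names, and an in-memory `io.BytesIO` stands in for the wire.

```python
import io


def staged_receive(wire, app_buffer):
    """Two copies: wire -> temporary buffer -> application buffer."""
    temp = wire.read()                   # copy 1, into a temporary object
    app_buffer[:len(temp)] = temp        # copy 2, into the user's buffer
    return len(temp)


def direct_receive(wire, app_buffer):
    """One copy: the data is written straight into the application buffer."""
    return wire.readinto(memoryview(app_buffer))
```

Both functions expect `app_buffer` to be a writable `bytearray` at least as large as the message, and both return the number of bytes received. The second form removes the intermediate copy; a true zero-copy path would additionally need the transport to deliver into user memory.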

I'm also a little hazy on how to get information on
messages being passed. What information on the sending
process is visible to the receiving BTL components?

Finally, I'm assuming that developers have, over time,
produced test harnesses and other useful (for
developers) tools that would have no real value to
general users. Has anyone put together a kit of
development aids for coders of new components?

Jonathan Day

