Couple of other things to help stimulate the thinking:
1. it isn't that OMPI -couldn't- receive a message, but rather that it
-didn't- receive a message. This may or may not indicate that there is a
problem. Could just be an application that doesn't need to communicate for
a while, as per my example. I admit, though, that 10 minutes is a tad
long...but I've seen some bizarre apps around here :-)
2. instead of putting things to sleep or even adjusting the loop rate, you
might want to consider using the orte_notifier capability and notify the
system that the job may be stalled. Or perhaps adding an API to the
orte_errmgr framework to notify it that nothing has been received for
a while, and let people implement different strategies for detecting what
might be "wrong" and what they want to do about it.
My point with this second bullet is that there are other response options
than hardwiring putting the process to sleep. You could let someone know so
a human can decide what, if anything, to do about it, or provide a hook so
that people can explore/utilize different response strategies...or both!
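To make the hook idea concrete, here is a minimal sketch. It is written in Python only to keep it short; the real thing would be C inside ORTE. Every name in it (StallMonitor, register, message_received, check) is hypothetical and does not correspond to any existing orte_notifier or orte_errmgr call. The point is only that the progress loop reports "nothing received for a while" and leaves the response to whatever handlers were registered.

```python
import time
from typing import Callable, List

# Hypothetical names throughout; this is not an existing ORTE interface.
StallHandler = Callable[[float], None]  # receives seconds since last message


class StallMonitor:
    """Tracks idle time and fires registered handlers once per stall."""

    def __init__(self, threshold_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold_s = threshold_s
        self.clock = clock
        self.handlers: List[StallHandler] = []
        self.last_rx = clock()
        self.notified = False

    def register(self, handler: StallHandler) -> None:
        """Add a response strategy (log it, notify an admin, sleep, ...)."""
        self.handlers.append(handler)

    def message_received(self) -> None:
        """Any traffic resets the idle timer and re-arms the notification."""
        self.last_rx = self.clock()
        self.notified = False

    def check(self) -> bool:
        """Call from the progress loop; returns True if handlers fired."""
        idle = self.clock() - self.last_rx
        if idle < self.threshold_s or self.notified:
            return False
        self.notified = True
        for handler in self.handlers:
            handler(idle)
        return True
```

A site that wants the usleep behaviour could register a handler that switches to sleeping, while another site could register one that only emits a notifier message and lets a human decide.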
HTH
Ralph
On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net>
wrote:
I understand your point of view, and mostly share it.
I think the biggest point in my example is that sleep occurs only after (I
was wrong in my previous e-mail) 10 minutes of inactivity, and this value
is fully configurable. I didn't intend to call sleep after 2 seconds. Plus,
as said before, I planned to have the library do show_help() when this
happens (something like : "Open MPI couldn't receive a message for 10
minutes, lowering pressure") so that the application that really needs more
than 10 minutes to receive a message can increase it.
Looking at the tick rate code, I couldn't see how changing it would make
CPU usage drop. If I understand your e-mail correctly, you block in the
kernel using poll(), is that right? So you may well lose 10 us because
of that kernel call, but this is a lot less than the 1 ms I'm currently
losing with usleep. This makes sense, although it would be hard to
implement since every BTL would need this ability.
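For comparison, this is roughly what blocking in the kernel looks like for a transport that does have a file descriptor. It is a minimal Python sketch (Python only for brevity; the BTLs are C), and wait_for_message is a hypothetical helper, not an Open MPI function. A pipe can stand in for the transport when trying it out. It says nothing about shared memory, which has no descriptor to wait on, and that is exactly the hard part.

```python
import select


def wait_for_message(fd: int, timeout_ms: int) -> bool:
    """Block in the kernel until fd is readable or the timeout expires.

    Unlike a fixed usleep(), the call returns as soon as data arrives,
    so the wake-up latency is a system call rather than a full sleep period.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    events = poller.poll(timeout_ms)  # list of (fd, event mask) pairs
    return any(mask & select.POLLIN for _, mask in events)
```

With `read_fd, write_fd = os.pipe()`, calling `wait_for_message(read_fd, 10)` times out on the empty pipe, and returns immediately once something has been written to `write_fd`.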
Thanks for your comments, I will continue to think about it.
Sylvain
On Tue, 9 Jun 2009, Ralph Castain wrote:
My concern with any form of sleep is with the impact on the proc - since
opal_progress might not be running in a separate thread, won't the sleep
apply to the process as a whole? In that case, the process isn't free to
continue computing.
I can envision applications that might call down into the MPI library and
have opal_progress not find anything, but there is nothing wrong. The
application could continue computations just fine. I would hate to see us
put the process to sleep just because the MPI library wasn't busy enough.
Hence my suggestion to just change the tick rate. It would definitely cause
a higher latency for the first message that arrived while in this state,
which is bothersome, but would meet the stated objective without
interfering with the process itself.
LANL has also been looking at this problem of stalled jobs, but from a
different approach. We monitor (using a separate job) progress in terms of
output files changing in size plus other factors as specified by the user.
If we don't see any progress in those terms over some time, then we kill
the job. We chose that path because of the concerns expressed above - e.g.,
on our RR machine, intense computations can be underway on the Cell blades
while the Opteron MPI processes wait for us to reach a communication point.
We -want- those processes spinning away so that, when the comm starts, it
can proceed as quickly as possible.
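As a rough illustration of the "output files changing in size" test, here is a minimal Python sketch. It is not the LANL tool: the function names are hypothetical, and the other user-specified factors and the actual job kill are left out. A separate monitoring job would take a snapshot, wait for the chosen interval, take another, and compare.

```python
import os
from typing import Dict, Iterable


def snapshot_sizes(paths: Iterable[str]) -> Dict[str, int]:
    """Return the current size of each watched file; missing files are -1."""
    sizes: Dict[str, int] = {}
    for path in paths:
        try:
            sizes[path] = os.stat(path).st_size
        except FileNotFoundError:
            sizes[path] = -1
    return sizes


def made_progress(before: Dict[str, int], after: Dict[str, int]) -> bool:
    """True if any watched file changed size between the two snapshots."""
    return any(after.get(path) != size for path, size in before.items())
```

Because the check runs outside the MPI processes, they are free to keep spinning while it runs.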
Just some thoughts...
Ralph
On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:
Sylvain Jeaugey wrote:
Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in a normal
scenario. The idea is just that if an MPI process is blocked (i.e. has not
made progress for, say, 5 minutes, the default in my implementation), we
stop busy polling and have the process drop from 100% CPU usage to 0%.
I do not call sleep() but usleep(). The result is much the same, but it
hurts performance less in case of an (unexpected) restart.
However, the goal of my RFC was also to find out whether there is a cleaner
way to achieve my goal, and from what I read, I guess I should look at the
"tick" rate instead of trying to do my own delaying.
One way around this is to make all blocked communications (even SM) use
poll to block for incoming messages. Jeff and I have discussed this and
had many false starts on it. The biggest issue is coming up with a way to
have blocks on the SM btl converted to the system poll call without
requiring a socket write for every packet.
The usleep solution works but is kind of ugly IMO. I think when I looked
at doing that the overhead increased significantly for certain
communications. Maybe not for toy benchmarks but for less synchronized
processes I saw the usleep adding overhead where I didn't want it to.
--td
Don't worry, I was quite expecting the configure-in requirement. However, I
don't think my patch is good for inclusion, it is only an example to
describe what I want to achieve.
Thanks a lot for your comments,
Sylvain
On Mon, 8 Jun 2009, Ralph Castain wrote:
I'm not entirely convinced this actually achieves your goals, but I can see
some potential benefits. I'm also not sure that power consumption is that
big of an issue that MPI needs to begin chasing "power saver" modes of
operation, but that can be a separate debate some day.
I'm assuming you don't mean that you actually call "sleep()" as this would
be very bad - I'm assuming you just change the opal_progress "tick" rate
instead. True? If not, and you really call "sleep", then I would have to
oppose adding this to the code base pending discussion with others who can
corroborate that this won't cause problems.
Either way, I could live with this so long as it was done as a
"configure-in" capability. Just having the params default to a value that
causes the system to behave similarly to today isn't enough - we still wind
up adding logic into a very critical timing loop for no reason. A simple
configure option of --enable-mpi-progress-monitoring would be sufficient to
protect the code.
HTH
Ralph
On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
What : when nothing has been received for a very long time - e.g. 5
minutes, stop busy polling in opal_progress and switch to a usleep-based
one.
Why : when we have long waits, and especially when an application is
deadlocked, detecting it is not easy and a lot of power is wasted until
the end of the time slice (if there is one).
Where : an example of how it could be implemented is available at
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
Principle
=========
opal_progress() ensures the progression of MPI communication. The current
algorithm is a loop calling progress on all registered components. If the
program is blocked, the loop will busy-poll indefinitely.
Going to sleep after a certain amount of time with nothing received is
useful for two reasons :
- An administrator can easily detect whether a job is deadlocked : all the
processes are sleeping. Currently, all processes are using 100% CPU and
it is very hard to know whether progress is still being made or not.
- When there is nothing to receive, power usage is greatly reduced.
However, it could hurt performance in some cases, typically if we go to
sleep just before the message arrives. This will highly depend on the
parameters you give to the sleep mechanism.
At first, we can start with the following assumption : if the sleep takes T
usec, then starting to sleep only after 10000 x T of inactivity should slow
down receives by less than 0.01%, since a receive that has already waited
10000 x T is delayed by at most one more T.
However, other processes may suffer from you being late, and be delayed by
T usec (which may represent more than 0.01% for them).
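The bound can be checked with a couple of lines. This is only a worked version of the arithmetic, with a hypothetical helper name. It assumes the worst case is exactly one extra sleep period on top of the idle time already spent, and it covers the sleeping process only, not the peers waiting on it.

```python
def worst_case_slowdown(sleep_us: float, trigger_us: float) -> float:
    """Relative delay for a receive that arrives just after the sleep starts.

    The receive has already waited trigger_us and is held up by at most
    one more sleep_us, so the relative cost is sleep_us / trigger_us.
    """
    return sleep_us / trigger_us


T = 1000.0                                       # 1 ms sleep, in microseconds
ratio_rule = worst_case_slowdown(T, 10000 * T)   # 1e-4, i.e. 0.01%
ratio_defaults = worst_case_slowdown(T, 600e6)   # 1 ms sleep, 10 min trigger
```

With a 1 ms sleep and a 10 minute trigger, the ratio is about 1.7e-6, well under the 0.01% target.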
So, the goal of this mechanism is mainly to detect far-too-long waits, and
it should almost never kick in during normal MPI jobs. It could also trigger a
warning message when starting to sleep, or at least a trace in the
notifier.
Details of Implementation
=========================
Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls
before we start the timer (to prevent latency impact). It defaults to -1,
which completely deactivates the sleep (and is therefore equivalent to the
former code). A value of 1000 can be thought of as a starting point to
enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to
low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further unsuccessful
call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is big enough to make the process show 0% CPU in top, but low
enough to preserve a good trigger/duration ratio.
The trigger is deliberately high to keep a good trigger/duration ratio.
Indeed, to prevent delays from causing chain reactions, trigger should be
higher than duration * numprocs.
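The way the three parameters interact can be modelled as follows. This is a Python sketch of the logic as described above, not the patch itself (which is C inside opal_progress). The class, its attribute names, and the injectable clock and sleeper are all hypothetical, and the exact moment at which the timer starts is my reading of the description.

```python
import time
from typing import Callable, Optional


class LowPressureProgress:
    """Model of the count / trigger / duration logic (hypothetical names)."""

    def __init__(self, sleep_count: int = -1, sleep_trigger_s: float = 600.0,
                 sleep_duration_us: int = 1000,
                 clock: Callable[[], float] = time.monotonic,
                 sleeper: Callable[[float], None] = time.sleep) -> None:
        self.sleep_count = sleep_count            # -1 disables the mechanism
        self.sleep_trigger_s = sleep_trigger_s    # seconds before low-pressure mode
        self.sleep_duration_us = sleep_duration_us
        self.clock = clock
        self.sleeper = sleeper
        self.idle_calls = 0
        self.timer_start: Optional[float] = None

    def progress(self, events: int) -> bool:
        """Record one progress pass; returns True if this call slept."""
        if events > 0 or self.sleep_count < 0:
            # Something completed (or the feature is off): reset everything.
            self.idle_calls = 0
            self.timer_start = None
            return False
        self.idle_calls += 1
        if self.idle_calls < self.sleep_count:
            return False                          # too early to read the clock
        if self.timer_start is None:
            self.timer_start = self.clock()       # start the inactivity timer
            return False
        if self.clock() - self.timer_start < self.sleep_trigger_s:
            return False                          # still busy-polling
        self.sleeper(self.sleep_duration_us / 1e6)
        return True
```

In this model the counter keeps the clock out of the fast path: no time call is made until sleep_count consecutive passes have found nothing, which is what limits the latency impact.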
Possible Improvements & Pitfalls
================================
* Trigger could be set automatically at max(trigger, duration * numprocs *
2).
* poll_start and poll_count could be fields of the opal_condition_t struct.
* The sleep section could be factored into a #define and reused in all the
progress paths (I'm not sure my patch is good for progress threads, for
example).
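The automatic trigger from the first bullet is a one-liner, but the units need care: the trigger is given in seconds and the duration in microseconds. A minimal sketch, assuming the formula is meant to compare like with like (the conversion is my assumption, and the function name is hypothetical):

```python
def effective_trigger_s(trigger_s: float, duration_us: float,
                        numprocs: int) -> float:
    """max(trigger, duration * numprocs * 2), with duration converted to s."""
    floor_s = duration_us * numprocs * 2 / 1e6
    return max(trigger_s, floor_s)
```

With the default values (600 s trigger, 1000 us duration), the floor only exceeds the configured trigger above 300,000 processes, so for most jobs the configured value is used unchanged.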
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel