My point with this second bullet is that there are response options other
than hardwiring the process to go to sleep. You could notify someone so
that a human can decide what, if anything, to do about it, or provide a
hook so that people can explore and use different response
strategies... or both!
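
As a rough illustration of the "hook" idea, here is a minimal sketch in C. Every name in it (stall_hook_register, stall_hook_invoke, stall_action_t, notify_only) is hypothetical and invented for this example; none of it is existing Open MPI API. It only shows how a registered callback could decide what happens once a long stall is detected, with "do nothing but notify" as one possible policy.

```c
#include <stddef.h>

/* All names here are hypothetical; nothing below is Open MPI API. */
typedef enum {
    STALL_CONTINUE, /* keep busy-polling */
    STALL_THROTTLE  /* caller may lower polling pressure (e.g. usleep) */
} stall_action_t;

/* Called once the library has seen no progress for idle_seconds. */
typedef stall_action_t (*stall_hook_fn)(double idle_seconds, void *user_data);

static stall_hook_fn registered_hook = NULL;
static void *registered_data = NULL;

/* Install (or clear, with fn == NULL) the stall response policy. */
void stall_hook_register(stall_hook_fn fn, void *user_data)
{
    registered_hook = fn;
    registered_data = user_data;
}

/* Ask the installed policy what to do; default is to keep polling. */
stall_action_t stall_hook_invoke(double idle_seconds)
{
    if (registered_hook == NULL) {
        return STALL_CONTINUE;
    }
    return registered_hook(idle_seconds, registered_data);
}

/* Example policy: count the notification, never throttle. */
stall_action_t notify_only(double idle_seconds, void *user_data)
{
    int *notifications = user_data;
    (void)idle_seconds;
    *notifications += 1;
    return STALL_CONTINUE;
}
```

With no hook registered the process keeps spinning as it does today; a site that does want the usleep behaviour could register a policy returning STALL_THROTTLE instead.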
HTH
Ralph
On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <[email protected]>
wrote:
I understand your point of view, and mostly share it.
I think the biggest point in my example is that sleep occurs only after
10 minutes of inactivity (I was wrong in my previous e-mail), and this
value is fully configurable. I didn't intend to call sleep after 2 seconds.
Plus, as said before, I planned to have the library call show_help() when
this happens (something like : "Open MPI couldn't receive a message for
10 minutes, lowering pressure") so that an application that really needs
more than 10 minutes to receive a message can increase it.
Looking at the tick rate code, I couldn't see how changing it would make
CPU usage drop. If I understand your e-mail correctly, you block in the
kernel using poll(), is that right? So you may well lose 10 us because of
that kernel call, but this is a lot less than the 1 ms I'm currently
losing with usleep. This makes sense, although it would be hard to
implement since all BTLs must have this ability.
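
To make the poll() versus usleep() comparison concrete, here is a minimal sketch of blocking in the kernel until a file descriptor becomes readable or a timeout expires. The helper name wait_readable is hypothetical. It assumes the transport exposes a pollable descriptor, which is exactly what the shared-memory BTL does not have, so this only illustrates the easy case.

```c
#include <poll.h>
#include <unistd.h>

/* Block in the kernel until fd is readable or timeout_ms elapses.
 * Returns 1 if data is ready, 0 on timeout, -1 on error (including an
 * interrupted call, or a hangup/error condition reported on fd). */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) {
        return -1; /* poll itself failed, errno says why */
    }
    if (rc == 0) {
        return 0;  /* nothing arrived within timeout_ms */
    }
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Unlike a fixed usleep(), the call returns as soon as data arrives, so the cost of being in low-pressure mode is the wake-up latency rather than the remainder of the sleep interval.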
Thanks for your comments, I will continue to think about it.
Sylvain
On Tue, 9 Jun 2009, Ralph Castain wrote:
My concern with any form of sleep is with the impact on the proc - since
opal_progress might not be
running in a separate thread, won't the sleep apply to the process as a
whole? In that case, the process
isn't free to continue computing.
I can envision applications that might call down into the MPI library and
have opal_progress not find
anything, but there is nothing wrong. The application could continue
computations just fine. I would hate
to see us put the process to sleep just because the MPI library wasn't
busy enough.
Hence my suggestion to just change the tick rate. It would definitely
cause a higher latency for the
first message that arrived while in this state, which is bothersome, but
would meet the stated objective
without interfering with the process itself.
LANL has also been looking at this problem of stalled jobs, but from a
different approach. We monitor
(using a separate job) progress in terms of output files changing in size
plus other factors as specified
by the user. If we don't see any progress in those terms over some time,
then we kill the job. We chose
that path because of the concerns expressed above - e.g., on our RR
machine, intense computations can be
underway on the Cell blades while the Opteron MPI processes wait for us
to reach a communication point.
We -want- those processes spinning away so that, when the comm starts, it
can proceed as quickly as
possible.
Just some thoughts...
Ralph
On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:
Sylvain Jeaugey wrote:
Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in
a normal scenario.
The idea is just that if an MPI process is blocked (i.e. has not made
progress for, say, 5 minutes, the default in my implementation), we stop
busy polling and have the process drop from 100% CPU usage to 0%.
I do not call sleep() but usleep(). The result is much the same, but it
hurts performance less in case of an (unexpected) restart.
However, the goal of my RFC was also to find out whether there is a
cleaner way to achieve my goal, and from what I read, I guess I should
look at the "tick" rate instead of trying to do my own delaying.
One way around this is to make all blocked communications (even SM) use
poll to block for incoming messages. Jeff and I have discussed this and
had many false starts on it. The biggest issue is coming up with a way to
have blocks on the SM btl converted to the system poll call without
requiring a socket write for every packet.
The usleep solution works but is kind of ugly IMO. I think when I looked
at doing that the overhead increased significantly for certain
communications. Maybe not for toy benchmarks, but for less synchronized
processes I saw the usleep adding overhead where I didn't want it to.
--td
Don't worry, I was quite expecting the configure-in requirement. However,
I don't think my patch is ready for inclusion; it is only an example to
describe what I want to achieve.
Thanks a lot for your comments,
Sylvain
On Mon, 8 Jun 2009, Ralph Castain wrote:
I'm not entirely convinced this actually achieves your
goals, but I
can see some potential benefits. I'm also not sure that
power
consumption is that big of an issue that MPI needs to
begin chasing
"power saver" modes of operation, but that can be a
separate debate
some day.
I'm assuming you don't mean that you actually call
"sleep()" as this
would be very bad - I'm assuming you just change the
opal_progress
"tick" rate instead. True? If not, and you really call
"sleep", then
I would have to oppose adding this to the code base
pending
discussion with others who can corroborate that this
won't cause
problems.
Either way, I could live with this so long as it was
done as a
"configure-in" capability. Just having the params
default to a value
that causes the system to behave similarly to today
isn't enough - we
still wind up adding logic into a very critical timing
loop for no
reason. A simple configure option of
--enable-mpi-progress-monitoring
would be sufficient to protect the code.
HTH
Ralph
On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
What : when nothing has been received for a very long time (e.g. 5
minutes), stop busy polling in opal_progress and switch to a
usleep-based loop.
Why : when we have long waits, and especially when an application is
deadlocked, detecting it is not easy and a lot of power is wasted until
the end of the time slice (if there is one).
Where : an example of how it could be implemented is available at
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
Principle
=========
opal_progress() ensures the progression of MPI communication. The current
algorithm is a loop calling progress on all registered components. If the
program is blocked, the loop will busy-poll indefinitely.
Going to sleep after a certain amount of time with nothing received is
useful for two reasons :
- Administrators can easily detect whether a job is deadlocked : all the
processes are in sleep(). Currently, all processors are using 100% CPU
and it is very hard to know whether progress is still being made.
- When there is nothing to receive, power usage is greatly reduced.
However, it could hurt performance in some cases, typically if we go to
sleep just before the message arrives. This will depend heavily on the
parameters you give to the sleep mechanism.
At first, we can start with the following assumption : if the sleep takes
T usec, then sleeping only after 10000xT of waiting should slow down
receives by less than 0.01%.
However, other processes may suffer from you being late, and be delayed
by T usec (which may represent more than 0.01% for them).
So, the goal of this mechanism is mainly to detect far-too-long waits,
and it should almost never be triggered in normal MPI jobs. It could also
trigger a warning message when starting to sleep, or at least a trace in
the notifier.
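
The estimate above (a sleep of T usec taken only after 10000xT of waiting) is a simple ratio, and it can be checked with a few lines of C. The function name is hypothetical. The sketch only models the delayed process itself; it does not model the knock-on delay other processes may see. With the default values proposed in this RFC (600 s trigger, 1000 us sleep) the ratio comes out at roughly 0.00017%.

```c
/* Worst-case relative slowdown of a receive that completes just after
 * the process went to sleep: it has already waited trigger_us and pays
 * at most one extra sleep of duration_us.
 * Precondition: trigger_us > 0. */
double lp_worst_case_slowdown(double trigger_us, double duration_us)
{
    return duration_us / trigger_us;
}
```

For example, lp_worst_case_slowdown(10000.0 * 1000.0, 1000.0) gives the 0.01% bound (expressed as the fraction 0.0001) quoted in the text.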
Details of Implementation
=========================
Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress()
calls before we start the timer (to prevent latency impact). It defaults
to -1, which completely deactivates the sleep (and is therefore
equivalent to the former code). A value of 1000 can be thought of as a
starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to
low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further
unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is big enough to make the process show 0% CPU in top, but
low enough to preserve a good trigger/duration ratio.
The trigger is deliberately high to keep a good trigger/duration ratio.
Indeed, to prevent delays from causing chain reactions, trigger should be
higher than duration * numprocs.
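
The interaction of the three parameters can be sketched as a small decision function. This is not the code from the bitbucket patch; it is a minimal reading of the description above. The names poll_count and poll_start come from the RFC, while the lp_ prefix, the struct layout, the timer_running flag and the function name are assumptions made for this example. The function only decides how long to sleep, so the caller stays in control of the actual usleep() call.

```c
#include <stdint.h>

/* Mirrors the three RFC parameters; struct and field names are
 * hypothetical. */
typedef struct {
    int64_t sleep_count;       /* idle calls before the timer starts; -1 disables */
    int64_t sleep_trigger_s;   /* seconds of inactivity before sleeping */
    int64_t sleep_duration_us; /* sleep length per further idle call */
} lp_params_t;

typedef struct {
    int64_t poll_count;   /* consecutive idle calls seen so far */
    int64_t poll_start_s; /* when the inactivity timer was started */
    int timer_running;    /* assumption: explicit flag, not in the RFC */
} lp_state_t;

/* Call after each progress pass. events is the number of completed
 * events; now_s is a monotonic clock in seconds. Returns how many
 * microseconds the caller should sleep (0 means keep busy-polling). */
int64_t lp_after_progress(const lp_params_t *p, lp_state_t *s,
                          int events, int64_t now_s)
{
    if (p->sleep_count < 0) {
        return 0; /* mechanism disabled: same behaviour as before */
    }
    if (events > 0) {
        s->poll_count = 0; /* progress was made: reset everything */
        s->timer_running = 0;
        return 0;
    }
    if (!s->timer_running) {
        s->poll_count++;
        if (s->poll_count < p->sleep_count) {
            return 0; /* still in the cheap counting phase */
        }
        s->poll_start_s = now_s; /* enough idle calls: start the timer */
        s->timer_running = 1;
        return 0;
    }
    if (now_s - s->poll_start_s < p->sleep_trigger_s) {
        return 0; /* timer running but trigger not reached */
    }
    return p->sleep_duration_us; /* low-pressure mode */
}
```

A caller would pass a nonzero result to usleep() and otherwise continue spinning. Counting calls first means the clock is never read on the latency-critical path until sleep_count idle passes have already gone by.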
Possible Improvements & Pitfalls
================================
* Trigger could be set automatically at max(trigger, duration * numprocs
* 2).
* poll_start and poll_count could be fields of the opal_condition_t
struct.
* The sleep section may be exported in a #define and replicated in all
the progress paths (I'm not sure my patch is good for progress threads
for example)
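
The first improvement in the list, setting the trigger to max(trigger, duration * numprocs * 2), needs a unit conversion because the trigger is given in seconds and the duration in microseconds. A minimal sketch, with a hypothetical function name and rounding the computed floor up to a whole second (the rounding is an assumption, the RFC does not specify it):

```c
#include <stdint.h>

/* Effective trigger in seconds: the configured value, raised if needed
 * so that it is at least duration * numprocs * 2. */
int64_t lp_effective_trigger_s(int64_t trigger_s, int64_t duration_us,
                               int64_t numprocs)
{
    int64_t floor_us = duration_us * numprocs * 2;
    int64_t floor_s = (floor_us + 999999) / 1000000; /* round up */

    return (trigger_s > floor_s) ? trigger_s : floor_s;
}
```

With the default 1000 us duration, the default 600 s trigger is only overridden once the job exceeds 300000 processes, so for most jobs this rule would leave the configured value unchanged.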
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel