On Tue, 9 Jun 2009, Ralph Castain wrote:

2. instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability and notify the system that the job may be stalled. Or perhaps adding an API to the orte_errmgr framework to notify it that nothing has been received for awhile, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it.
Great remark. What is really needed here is the information that nothing has been received for X minutes. Just having that information available somewhere should be sufficient. We often see users asking whether their application is still progressing, and this would answer their question. It would also address administrators' need to stop deadlocked runs during the night.
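
To make the idea concrete, here is a minimal sketch of what such a check could look like from the progress loop. All names are hypothetical: the threshold would be an MCA parameter, and notify_stall() stands in for whatever hook the orte_notifier / errmgr work ends up providing.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical names, for illustration only. */
static time_t last_progress_time;           /* set to time(NULL) at init */
static bool   stall_reported      = false;
static int    stall_threshold_sec = 600;    /* "X minutes"; would be an MCA param */

/* Placeholder for the real hook (orte_notifier / errmgr callback). */
static void notify_stall(long idle_sec)
{
    fprintf(stderr, "Open MPI: nothing received for %ld seconds\n", idle_sec);
}

/* Would be called from the progress loop with the number of events handled. */
static void check_for_stall(int events_this_tick)
{
    time_t now = time(NULL);

    if (events_this_tick > 0) {
        last_progress_time = now;
        stall_reported = false;      /* progress resumed, re-arm the check */
        return;
    }
    if (!stall_reported && now - last_progress_time >= stall_threshold_sec) {
        notify_stall((long)(now - last_progress_time));
        stall_reported = true;       /* report only once per stall */
    }
}

The important part is that the information is produced once per stall and re-armed as soon as progress resumes, so both users and administration tools can consume it.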

I guess I'll redirect my work on this and couple it with our current effort on logging and administration tools.

Thanks a lot, guys!

Sylvain

My point with this second bullet is that there are other response options than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both!

HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote:
      I understand your point of view, and mostly share it.

      I think the biggest point in my example is that sleep occurs only after (I was wrong in my previous e-mail) 10 minutes of inactivity, and this value is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as said before, I planned to have the library call show_help() when this happens (something like: "Open MPI couldn't receive a message for 10 minutes, lowering pressure") so that an application that really needs more than 10 minutes to receive a message can increase the threshold.

      Looking at the tick rate code, I couldn't see how changing it would make CPU usage drop. If I understand your e-mail correctly, you block in the kernel using poll(), is that right? So, you may well lose 10 us because of that kernel call, but this is a lot less than the 1 ms I'm currently losing with usleep. This makes sense, although it is hard to implement since every BTL must have this ability.
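
      To illustrate the difference, here is a toy sketch in plain C on a single file descriptor (not Open MPI code, just the two waiting strategies side by side):

      #include <poll.h>
      #include <unistd.h>

      /* Variant 1: periodic check plus usleep. A message arriving right after
       * the check waits for the full sleep, so up to ~1 ms is lost. */
      static void wait_with_usleep(int fd)
      {
          struct pollfd p = { .fd = fd, .events = POLLIN };
          while (poll(&p, 1, 0) == 0) {   /* non-blocking check */
              usleep(1000);               /* 1 ms between checks */
          }
      }

      /* Variant 2: block in the kernel. We pay the cost of the poll() system
       * call (on the order of microseconds), but the kernel wakes us as soon
       * as data arrives, so no polling-interval latency is added. */
      static void wait_with_poll(int fd)
      {
          struct pollfd p = { .fd = fd, .events = POLLIN };
          poll(&p, 1, -1);                /* -1 = wait indefinitely */
      }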

      Thanks for your comments, I will continue to think about it.

      Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

      My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing.

      I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough.

      Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself.

      LANL has also been looking at this problem of stalled jobs, but from a different approach. We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible.

      Just some thoughts...
      Ralph


      On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

            Sylvain Jeaugey wrote:
                  Hi Ralph,

                  I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not made progress for - say - 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%.

                  I do not call sleep() but usleep(). The result is much the same, but it hurts performance less in case of an (unexpected) restart.

                  However, the goal of my RFC was also to find out whether there is a cleaner way to achieve this, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying.

            One way around this is to make all blocked communications (even SM) use poll() to block for incoming messages.  Jeff and I have discussed this and had many false starts on it.  The biggest issue is coming up with a way to have blocks on the SM BTL converted to the system poll call without requiring a socket write for every packet.
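
            One commonly used trick - a sketch only, not something we have working - is to pay for the write only when the receiver is actually blocked: the receiver sets a "sleeping" flag in shared memory before it blocks on a pipe, and senders write to that pipe only when they see the flag. Names below are hypothetical:

            #include <poll.h>
            #include <stdatomic.h>
            #include <unistd.h>

            /* Hypothetical per-receiver state: a flag in shared memory plus a
             * pipe used purely as a doorbell. */
            struct sm_receiver {
                _Atomic int sleeping;   /* set by the receiver before blocking */
                int wake_fds[2];        /* [0] read end (receiver), [1] write end (senders) */
            };

            /* Sender: enqueue into the shared-memory queue as usual (not shown),
             * then write to the pipe only if the receiver said it is asleep. */
            static void sm_send_notify(struct sm_receiver *r)
            {
                if (atomic_load(&r->sleeping)) {
                    char c = 1;
                    (void)write(r->wake_fds[1], &c, 1);   /* rare: peer is blocked */
                }
            }

            /* Receiver: advertise that we are about to block, re-check the queue
             * once to close the race, then block in poll() on the doorbell. */
            static void sm_recv_block(struct sm_receiver *r, int (*queue_empty)(void))
            {
                atomic_store(&r->sleeping, 1);
                if (queue_empty()) {
                    struct pollfd p = { .fd = r->wake_fds[0], .events = POLLIN };
                    (void)poll(&p, 1, -1);
                    char c;
                    (void)read(r->wake_fds[0], &c, 1);    /* drain the doorbell */
                }
                atomic_store(&r->sleeping, 0);
            }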

            The usleep solution works but is kind of ugly IMO.  I think when I looked at doing that, the overhead increased significantly for certain communications.  Maybe not for toy benchmarks, but for less synchronized processes I saw the usleep adding overhead where I didn't want it to.

            --td
                  Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve.

                  Thanks a lot for your comments,
                  Sylvain

                  On Mon, 8 Jun 2009, Ralph Castain wrote:

                        I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.

                        I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.

                        Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.

                        HTH
                        Ralph


                        On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

                               What: when nothing has been received for a very long time - e.g. 5 minutes - stop busy polling in opal_progress and switch to a usleep-based loop.

                               Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one).

                               Where: an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

                              Principle
                              =========

                               opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely.

                               Going to sleep after a certain amount of time with nothing received is interesting for two reasons:
                               - Administrators can easily detect whether a job is deadlocked: all the processes are in sleep(). Currently, all processors are at 100% CPU and it is very hard to know whether progress is still happening or not.
                               - When there is nothing to receive, power usage is greatly reduced.

                               However, it could hurt performance in some cases, typically if we go to sleep just before the message arrives. This will depend greatly on the parameters you give to the sleep mechanism.

                               At first, we can start with the following assumption: if the sleep takes T usec, then sleeping only after 10000 x T usec of inactivity means a message arriving just after we go to sleep is delayed by at most T, i.e. Receives are slowed down by less than 0.01%.

                               However, other processes may suffer from you being late, and be delayed by T usec (which may represent more than 0.01% for them).

                               So, the goal of this mechanism is mainly to detect far-too-long waits, and it should almost never be triggered in normal MPI jobs. It could also emit a warning message when starting to sleep, or at least a trace in the notifier.

                              Details of Implementation
                              =========================

                               Three parameters fully control the behaviour of this mechanism:
                               * opal_progress_sleep_count: number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
                               * opal_progress_sleep_trigger: time to wait before going to low-pressure powersave mode. Default: 600 (in seconds) = 10 minutes.
                               * opal_progress_sleep_duration: time we sleep at each further unsuccessful call to opal_progress(). Default: 1000 (in us) = 1 ms.

                               The duration is big enough to make the process show 0% CPU in top, but low enough to preserve a good trigger/duration ratio.

                               The trigger is deliberately high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, the trigger should be higher than duration * numprocs. Put together, the mechanism boils down to the sketch below.
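
                               Illustrative sketch only: the real patch hooks this into opal_progress() and registers the three values as MCA parameters; plain static variables are used here to keep it self-contained.

                               #include <time.h>
                               #include <unistd.h>

                               /* Would be MCA parameters in the real implementation. */
                               static int    opal_progress_sleep_count    = -1;    /* -1 disables the mechanism */
                               static int    opal_progress_sleep_trigger  = 600;   /* seconds before sleeping */
                               static int    opal_progress_sleep_duration = 1000;  /* microseconds per sleep */

                               static int    poll_count;   /* consecutive unsuccessful calls */
                               static time_t poll_start;   /* when the unsuccessful streak began */

                               /* Called at the end of each opal_progress() tick with the number
                                * of events that were actually progressed during that tick. */
                               static void low_pressure_check(int events)
                               {
                                   if (events > 0 || opal_progress_sleep_count < 0) {
                                       poll_count = 0;          /* progress made (or disabled): reset */
                                       return;
                                   }
                                   if (++poll_count < opal_progress_sleep_count) {
                                       return;                  /* too early to even start the timer */
                                   }
                                   if (poll_count == opal_progress_sleep_count) {
                                       poll_start = time(NULL); /* arm the trigger timer */
                                       return;
                                   }
                                   if (time(NULL) - poll_start >= opal_progress_sleep_trigger) {
                                       /* Nothing received for a long time: low-pressure mode.
                                        * A one-time show_help()/notifier warning could go here. */
                                       usleep(opal_progress_sleep_duration);
                                   }
                               }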

                              Possible Improvements & Pitfalls
                              ================================

                               * The trigger could be set automatically to max(trigger, duration * numprocs * 2).

                               * poll_start and poll_count could be fields of the opal_condition_t struct.

                               * The sleep section could be put in a #define and replicated in all the progress paths (I'm not sure my patch is correct for progress threads, for example).

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


