I'll throw in my random $0.02. I'm at the Forum this week, so my
latency on replies here will likely be large.
1. Ashley is correct that we shouldn't sleep. A better solution would
be to block waiting for something to happen (rather than spin). As
Terry mentioned, we pretty much know how to do this -- it's just that
no one has done it yet. The full solution would then be: if we spin
for a while (probably settable via an MCA parameter) with nothing
happening, switch to blocking mode and continue waiting. I'm happy to
pass along how we've imagined this should be done, if you want.
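For illustration, here's a minimal sketch of the spin-then-block idea
(everything here is hypothetical -- try_progress(), progress_fd, and
the spin limit are illustrative names, not existing OMPI code):

    #include <poll.h>
    #include <stdbool.h>

    /* Assumed helper: drives one round of progress and returns true
     * if any event was handled.  Purely illustrative. */
    extern bool try_progress(void);

    static bool wait_for_event(int progress_fd, int spin_limit)
    {
        /* Phase 1: busy-poll for low latency. */
        for (int i = 0; i < spin_limit; ++i) {
            if (try_progress()) {
                return true;
            }
        }
        /* Phase 2: nothing happened for a while -- block in the
         * kernel until the fd is readable instead of burning CPU. */
        struct pollfd pfd = { .fd = progress_fd, .events = POLLIN };
        (void) poll(&pfd, 1, -1);   /* -1 == wait indefinitely */
        return try_progress();
    }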
2. Note that your solution presupposes that one MPI process can detect
that the entire job is deadlocked. That isn't quite possible from
local information alone. What exactly do you want to detect -- that
one process may be imbalanced on its receives (waiting for long
periods of time without doing anything), or that the entire job is
deadlocked? The former may be ok -- it depends on the app. The
latter requires a bit more work -- e.g., if one process detects that
nothing has happened for a long time, it can initiate a
collective/distributed deadlock detection algorithm with all the
other MPI processes in the job. Only if *all* processes agree can you
say "this job is deadlocked, we might as well abort." IIRC, there are
some 3rd party tools / libraries that do this kind of stuff...?
(although it might be cool / useful to incorporate some of this
technology into OMPI itself)
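Purely as a conceptual illustration of the "only if *all* processes
agree" step (this is not OMPI internals -- a real implementation
would have to run out-of-band inside the runtime, since a deadlocked
application cannot call collectives itself):

    #include <mpi.h>

    /* Each process reports whether it has seen no progress for the
     * threshold period; the job is declared deadlocked only if ALL
     * processes agree (logical AND across all ranks). */
    int job_looks_deadlocked(int locally_idle, MPI_Comm comm)
    {
        int all_idle = 0;
        MPI_Allreduce(&locally_idle, &all_idle, 1, MPI_INT,
                      MPI_LAND, comm);
        return all_idle;
    }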
3. As Ralph noted, how exactly do you know when "nothing happens for a
long time" is a bad thing? a) Some codes are structured that way --
they'll have no MPI activity for a long time, even if they have
pending non-blocking receives pre-posted. b) What scope are you
looking at? I.e., are you checking whether nothing happens *within
the span of one blocking MPI call*, or whether nothing happens across
successive calls to opal_progress() (which may be few and far between
after OMPI hits steady state when using non-TCP networks)? It seems
like there would need to be a [thread safe] "reset" at some point --
indicating that something has happened. That reset would occur either
when something has happened, or when a blocking MPI call has exited,
or ...? We need to make sure that that "reset" doesn't get expensive.
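One cheap way such a reset could be done (a sketch assuming C11
atomics; all names are illustrative) is a relaxed event counter that
the watchdog snapshots and later compares:

    #include <stdatomic.h>
    #include <stdint.h>

    static atomic_uint_fast64_t last_event_seq = 0;

    /* Called wherever progress is observed -- a relaxed increment
     * keeps the hot path cheap. */
    static inline void progress_event_seen(void)
    {
        atomic_fetch_add_explicit(&last_event_seq, 1,
                                  memory_order_relaxed);
    }

    /* Watchdog side: take a snapshot now... */
    static inline uint64_t progress_snapshot(void)
    {
        return atomic_load_explicit(&last_event_seq,
                                    memory_order_relaxed);
    }

    /* ...and later ask: has anything happened since the snapshot? */
    static inline int progress_since(uint64_t snapshot)
    {
        return atomic_load_explicit(&last_event_seq,
                                    memory_order_relaxed) != snapshot;
    }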
4. Note, too, that opal_progress() doesn't see *all* progress -- the
openib BTL doesn't use opal_progress to know when OpenFabrics messages
arrive, for example.
On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote:
Couple of other things to help stimulate the thinking:
1. It isn't that OMPI -couldn't- receive a message, but rather that
it -didn't- receive a message. This may or may not indicate that
there is a problem. It could just be an application that doesn't need
to communicate for a while, as per my example. I admit, though, that
10 minutes is a tad long...but I've seen some bizarre apps around
here :-)
2. Instead of putting things to sleep or even adjusting the loop
rate, you might want to consider using the orte_notifier capability
to notify the system that the job may be stalled. Or perhaps add
an API to the orte_errmgr framework to notify it that nothing has
been received for a while, and let people implement different
strategies for detecting what might be "wrong" and what they want to
do about it.
My point with this second bullet is that there are other response
options besides hardwiring a sleep into the process. You could let
someone know so a human can decide what, if anything, to do about
it, or provide a hook so that people can explore/utilize different
response strategies...or both!
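To make the hook idea concrete, one could imagine something along
these lines (this API is purely hypothetical -- the orte_errmgr
framework exists, but no such callback does; it just illustrates
letting components choose the response instead of hardwiring one):

    #include <stdio.h>

    /* Hypothetical stall-policy hook: components register a callback
     * and decide for themselves how to react to a stall. */
    typedef void (*stall_callback_fn_t)(double seconds_idle);

    struct stall_policy {
        const char         *name;     /* e.g. "notify", "sleep", "abort" */
        stall_callback_fn_t on_stall; /* invoked when the threshold trips */
    };

    /* A "tell a human" policy: log the stall and keep spinning. */
    static void notify_only(double seconds_idle)
    {
        fprintf(stderr, "job may be stalled: no progress for %.0f s\n",
                seconds_idle);
    }

    static struct stall_policy default_policy = { "notify", notify_only };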
HTH
Ralph
On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <[email protected]> wrote:
I understand your point of view, and mostly share it.
I think the biggest point in my example is that the sleep occurs only
after (I was wrong in my previous e-mail) 10 minutes of inactivity,
and this value is fully configurable. I didn't intend to call sleep
after 2 seconds. Plus, as said before, I planned to have the library
call show_help() when this happens (something like: "Open MPI
couldn't receive a message for 10 minutes, lowering pressure") so
that an application that really needs more than 10 minutes to
receive a message can increase the threshold.
Looking at the tick rate code, I couldn't see how changing it would
make CPU usage drop. If I understand your e-mail correctly, you
block in the kernel using poll(), is that right? So you may well
lose 10 us on that kernel call, but this is a lot less than the
1 ms I'm currently losing with usleep. This makes sense -- although
it is hard to implement, since every BTL must support this ability.
Thanks for your comments, I will continue to think about it.
Sylvain
On Tue, 9 Jun 2009, Ralph Castain wrote:
My concern with any form of sleep is with the impact on the proc -
since opal_progress might not be running in a separate thread, won't
the sleep apply to the process as a whole? In that case, the process
isn't free to continue computing.
I can envision applications that might call down into the MPI
library and have opal_progress not find anything, but there is
nothing wrong. The application could continue computations just
fine. I would hate to see us put the process to sleep just because
the MPI library wasn't busy enough.
Hence my suggestion to just change the tick rate. It would
definitely cause a higher latency for the first message that arrives
while in this state, which is bothersome, but it would meet the
stated objective without interfering with the process itself.
LANL has also been looking at this problem of stalled jobs, but from
a different approach. We monitor (using a separate job) progress in
terms of output files changing in size plus other factors as
specified by the user. If we don't see any progress in those terms
over some time, then we kill the job. We chose that path because of
the concerns expressed above - e.g., on our RR machine, intense
computations can be underway on the Cell blades while the Opteron
MPI processes wait for us to reach a communication point. We -want-
those processes spinning away so that, when the comm starts, it can
proceed as quickly as possible.
Just some thoughts...
Ralph
On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:
Sylvain Jeaugey wrote:
Hi Ralph,
I'm entirely convinced that MPI doesn't have to save power in a
normal scenario. The idea is just that if an MPI process is blocked
(i.e. has not made progress for -say- 5 minutes, the default in my
implementation), we stop busy polling and have the process drop from
100% CPU usage to 0%.
I do not call sleep() but usleep(). The result is quite the same,
but it hurts performance less in case of an (unexpected) restart.
However, the goal of my RFC was also to find out whether there is a
cleaner way to achieve this, and from what I read, I guess I should
look at the "tick" rate instead of implementing my own delaying.
One way around this is to make all blocked communications (even SM)
use poll() to block for incoming messages. Jeff and I have
discussed this and had many false starts on it. The biggest issue
is coming up with a way to have blocks on the SM BTL converted to
the system poll() call without requiring a socket write for every
packet.
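One possible shape for that (an assumed design, not the actual sm
BTL): the receiver advertises in shared memory that it is about to
block, and a sender pays the pipe-write syscall only when that flag
is set, so the common spinning case stays write-free:

    #include <poll.h>
    #include <unistd.h>
    #include <stdatomic.h>

    struct sm_mailbox {
        atomic_int sleeping;    /* receiver advertises it is blocked  */
        int        wake_fds[2]; /* pipe: [0] read end, [1] write end  */
    };

    static void sm_send_notify(struct sm_mailbox *mb)
    {
        /* Fast path: receiver is spinning, no syscall needed. */
        if (atomic_load(&mb->sleeping)) {
            char byte = 1;
            (void) write(mb->wake_fds[1], &byte, 1);  /* wake it up */
        }
    }

    static void sm_recv_block(struct sm_mailbox *mb)
    {
        atomic_store(&mb->sleeping, 1);
        /* Re-check the message queue here before blocking, to close
         * the lost-wakeup race (queue check omitted in this sketch). */
        struct pollfd pfd = { .fd = mb->wake_fds[0], .events = POLLIN };
        (void) poll(&pfd, 1, -1);
        atomic_store(&mb->sleeping, 0);
        char drain;
        (void) read(mb->wake_fds[0], &drain, 1);  /* consume wakeup */
    }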
The usleep solution works but is kind of ugly IMO. I think when I
looked at doing that, the overhead increased significantly for
certain communications. Maybe not for toy benchmarks, but for less
synchronized processes I saw the usleep adding overhead where I
didn't want it to.
--td
Don't worry, I was quite expecting the configure-in requirement.
However, I don't think my patch is ready for inclusion; it is only an
example to describe what I want to achieve.
Thanks a lot for your comments,
Sylvain
On Mon, 8 Jun 2009, Ralph Castain wrote:
I'm not entirely convinced this actually achieves your goals, but I
can see some potential benefits. I'm also not sure that power
consumption is that big of an issue that MPI needs to begin chasing
"power saver" modes of operation, but that can be a separate debate
some day.
I'm assuming you don't mean that you actually call "sleep()" as this
would be very bad - I'm assuming you just change the opal_progress
"tick" rate instead. True? If not, and you really call "sleep", then
I would have to oppose adding this to the code base pending
discussion with others who can corroborate that this won't cause
problems.
Either way, I could live with this so long as it was done as a
"configure-in" capability. Just having the params default to a value
that causes the system to behave similarly to today isn't enough --
we still wind up adding logic into a very critical timing loop for
no reason. A simple configure option of
--enable-mpi-progress-monitoring would be sufficient to protect the
code.
HTH
Ralph
On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:
What: when nothing has been received for a very long time -- e.g. 5
minutes -- stop busy polling in opal_progress and switch to a
usleep-based loop.
Why: when we have long waits, and especially when an application is
deadlocked, detecting it is not easy, and a lot of power is wasted
until the end of the time slice (if there is one).
Where : an example of how it could be implemented is available at
http://bitbucket.org/jeaugeys/low-pressure-opal-progress/
Principle
=========
opal_progress() ensures the progression of MPI communications. The
current algorithm is a loop calling progress on all registered
components. If the program is blocked, the loop will busy-poll
indefinitely.
Going to sleep after a certain amount of time with nothing received
is interesting for two reasons:
- Administrators can easily detect whether a job is deadlocked: all
the processes are in sleep(). Currently, all processors use 100%
CPU, and it is very hard to know whether progression is still
happening or not.
- When there is nothing to receive, power usage is greatly reduced.
However, it could hurt performance in some cases, typically if we go
to sleep just before the message arrives. This will depend heavily
on the parameters given to the sleep mechanism.
At first, we can start from the following assumption: if each sleep
takes T usec, then sleeping only after 10000xT usec of inactivity
should slow down receives by a factor of less than 0.01% (at most T
added to a wait of at least 10000xT).
However, other processes may suffer from you being late, and be
delayed by T usec (which may represent more than 0.01% for them).
So the goal of this mechanism is mainly to detect far-too-long waits;
it should almost never kick in during normal MPI jobs. It could also
trigger a warning message when starting to sleep, or at least a
trace in the notifier.
Details of Implementation
=========================
Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress()
calls before we start the timer (to prevent latency impact). It
defaults to -1, which completely deactivates the sleep (and is
therefore equivalent to the former code). A value of 1000 can be
thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger : time to wait before going to low-
pressure-powersave mode. Default : 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration : time we sleep at each further
unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.
The duration is big enough to make the process show 0% CPU in top,
but low enough to preserve a good trigger/duration ratio.
The trigger is voluntarily high to keep a good trigger/duration
ratio. Indeed, to prevent delays from causing chain reactions, the
trigger should be higher than duration * numprocs. A minimal sketch
of the whole mechanism follows.
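Here is how the three parameters might fit together (the variable
names mirror the MCA params, but the code is an illustrative sketch,
not the actual patch -- see the bitbucket link above for that):

    #include <time.h>
    #include <unistd.h>

    static int sleep_count    = -1;    /* -1 disables the mechanism   */
    static int sleep_trigger  = 600;   /* seconds before low pressure */
    static int sleep_duration = 1000;  /* usec slept per idle call    */

    static long   idle_calls = 0;
    static time_t idle_since = 0;

    /* Called once per opal_progress() pass in this sketch. */
    static void low_pressure_check(int made_progress)
    {
        if (made_progress || sleep_count < 0) {
            idle_calls = 0;            /* "reset": something happened */
            idle_since = 0;
            return;
        }
        if (++idle_calls < sleep_count) {
            return;                    /* still in the low-latency spin */
        }
        time_t now = time(NULL);
        if (0 == idle_since) {
            idle_since = now;          /* start the inactivity timer */
        } else if (now - idle_since >= sleep_trigger) {
            usleep(sleep_duration);    /* low-pressure powersave mode */
        }
    }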
Possible Improvements & Pitfalls
================================
* The trigger could be set automatically to max(trigger, duration *
numprocs * 2).
* poll_start and poll_count could be fields of the opal_condition_t
struct.
* The sleep section could be factored into a #define and replicated
in all the progress paths (I'm not sure my patch is correct for
progress threads, for example).
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems