Hi Jeff,

Thanks for jumping in.

On Tue, 9 Jun 2009, Jeff Squyres wrote:

2. Note that your solution presupposes that one MPI process can detect that the entire job is deadlocked. This is not quite correct. What exactly do you want to detect -- that one process may be imbalanced on its receives (waiting for long periods of time without doing anything), or that the entire job is deadlocked? The former may be ok -- it depends on the app. The latter requires a bit more work -- e.g., if one process detects that nothing has happened for a long time, it can initiate a collective/distributed deadlock detection algorithm with all the other MPI processes in the job. Only if *all* processes agree can you say "this job is deadlocked, we might as well abort." IIRC, there are some 3rd party tools / libraries that do this kind of stuff...? (although it might be cool / useful to incorporate some of this technology into OMPI itself)
My approach was based on per-process detection. Of course this does not by itself indicate that the job is stuck, but tools like Ganglia will quickly show whether all processes are in the "sleep" state or not (maybe combined with debugging tools, to check that all of them are really in MPI and not blocked in I/O or something). The user or the admin can then decide whether to abort the job or not. The "sleep" was only a way for me to bring the information to the user/admin. But as Ralph stated, a log would be even better in this case (more precise, no performance penalty, ...), although it would need to be coupled with other tools (whereas the sleep was naturally visible in Ganglia).

3. As Ralph noted, how exactly do you know when "nothing happens for a long time" is a bad thing? a) Some codes are structured that way -- they will have no MPI activity for a long time, even with pending non-blocking receives pre-posted. b) Are you looking within the scope of *one* MPI blocking call -- i.e., if nothing happens *within the span of one blocking MPI call* -- or are you looking at whether nothing happens across successive calls to opal_progress() (which may be few and far between after OMPI hits steady state when using non-TCP networks)? It seems like there would need to be a [thread safe] "reset" at some point, indicating that something has happened -- either when something has actually happened, or when a blocking MPI call has exited, or ...? Need to make sure that that "reset" doesn't get expensive.
Uh. This is way more complicated than my patch. From the various reactions, it seems my RFC is misleading. I only work in opal_condition_wait(), which calls opal_progress(). The idea was only to sleep when we have been blocked in an MPI_Wait (or similar) for a long time. So, we sleep only if there is no possible background computation: the MPI process is waiting and basically doing nothing else. The MPI_Test functions will never call sleep. Whether opal_progress() made progress or not does not matter; the only question is: how long have we been in opal_condition_wait()?
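
To make this concrete, here is a minimal sketch of the idea -- illustrative only, with made-up names (trigger_seconds, duration_usec), not the actual patch:

    #include <sys/time.h>
    #include <unistd.h>

    /* Inside the opal_condition_wait() polling loop (sketch): */
    struct timeval start, now;
    gettimeofday(&start, NULL);
    while (0 == c->c_signaled) {            /* still blocked in MPI_Wait & co. */
        opal_progress();                    /* keep progressing communications */
        gettimeofday(&now, NULL);
        if (now.tv_sec - start.tv_sec > trigger_seconds) {
            usleep(duration_usec);          /* low-pressure mode: ~0% CPU */
        }
    }

The timer measures time spent inside this one blocking call, so it is naturally reset each time the condition is signaled.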

So, what I would want to do now is to replace the sleep with a message sent to the HNP indicating "I'm blocked for X minutes", then X minutes later "I'm blocked for 2X minutes", etc.

The HNP would then aggregate those messages and when every process has sent one, log "Everyone is blocked for X minutes", then (I presume) X minutes later, "Everyone is blocked for 2X minutes", etc.

I would then let users, admins, or admin tools decide whether or not to abort the job.

If a process finally receives something, it should send a message to the HNP indicating that it is no longer blocked -- or maybe just looking at the logs to see whether block times continue to increase would suffice.
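
In pseudo-code, the protocol could look like the following -- a rough sketch only; send_to_hnp() and the message tags stand in for whatever ORTE mechanism we would actually use:

    /* Per-process side (hypothetical helper names): */
    int periods = 0;
    while (blocked_in_opal_condition_wait()) {
        wait_for_minutes(X);
        if (still_blocked()) {
            periods++;
            send_to_hnp(TAG_BLOCKED, my_rank, periods * X); /* blocked NX min */
        } else {
            send_to_hnp(TAG_UNBLOCKED, my_rank, 0);         /* receives resumed */
            break;
        }
    }

    /* HNP side: count TAG_BLOCKED reports per period; once all np
     * processes have reported period N, log "Everyone is blocked for
     * NX minutes" and wait for period N+1. */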

Since I'm only working on opal_condition_wait(), deadlocks in applications using only MPI_Test calls will not be detected (but is that even possible in the first place?).

Sylvain

On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote:

Couple of other things to help stimulate the thinking:

1. It isn't that OMPI -couldn't- receive a message, but rather that it -didn't- receive a message. This may or may not indicate that there is a problem. It could just be an application that doesn't need to communicate for a while, as per my example. I admit, though, that 10 minutes is a tad long... but I've seen some bizarre apps around here :-)

2. Instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability to notify the system that the job may be stalled. Or perhaps add an API to the orte_errmgr framework to notify it that nothing has been received for a while, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it.

My point with this second bullet is that there are response options other than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies... or both!
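
For illustration, such a hook might look roughly like this -- all names invented to show the shape of the idea, not an actual orte_errmgr API:

    /* Hypothetical errmgr callback, implemented by each component: */
    typedef void (*orte_errmgr_stall_fn_t)(int seconds_idle);

    /* The progress engine, on detecting a long idle period, would call
     * something like orte_errmgr.stall_detected(idle_seconds); each
     * errmgr component then decides what to do: log it, raise a
     * notifier event, abort the job, or ignore it. */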

HTH
Ralph


On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey <sylvain.jeau...@bull.net> wrote:
I understand your point of view, and mostly share it.

I think the biggest point in my example is that the sleep occurs only after (I was wrong in my previous e-mail) 10 minutes of inactivity, and this value is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as said before, I planned to have the library call show_help() when this happens (something like: "Open MPI couldn't receive a message for 10 minutes, lowering pressure") so that an application that really needs more than 10 minutes to receive a message can increase the threshold.

Looking at the tick rate code, I couldn't see how changing it would make CPU usage drop. If I understand your e-mail correctly, you block in the kernel using poll(), is that right? So, you may well lose 10 us because of that kernel call, but this is a lot less than the 1 ms I'm currently losing with usleep. This makes sense, although it is hard to implement since every btl must have this ability.
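
For reference, here is a minimal sketch of the two blocking styles being compared (message_arrived() and btl_fd are placeholders, not OMPI symbols):

    #include <poll.h>
    #include <unistd.h>

    /* usleep-based: added latency is bounded below by the sleep duration */
    while (!message_arrived())
        usleep(1000);              /* up to ~1 ms added latency, ~0% CPU */

    /* poll-based: the kernel wakes us as soon as the fd becomes readable */
    struct pollfd pfd = { .fd = btl_fd, .events = POLLIN };
    poll(&pfd, 1, -1);             /* ~10 us wakeup cost, ~0% CPU */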

Thanks for your comments, I will continue to think about it.

Sylvain


On Tue, 9 Jun 2009, Ralph Castain wrote:

My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing.

I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough.

Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself.

LANL has also been looking at this problem of stalled jobs, but from a different approach. We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible.

Just some thoughts...
Ralph


On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote:

Sylvain Jeaugey wrote:
Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e., has not made progress for, say, 5 minutes -- the default in my implementation), we stop busy-polling and have the process drop from 100% CPU usage to 0%.

I do not call sleep() but usleep(). The result is quite the same, but it hurts performance less in case of an (unexpected) restart.

However, the goal of my RFC was also to find out whether there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying.

One way around this is to make all blocked communications (even SM) use poll() to block for incoming messages. Jeff and I have discussed this and had many false starts on it. The biggest issue is coming up with a way to have blocking on the SM btl converted to the system poll() call without requiring a socket write for every packet.
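
One classic way to avoid the per-packet write, sketched here with invented names (this is not existing OMPI code, just the usual lost-wakeup-safe pattern):

    /* Receiver, before blocking: */
    receiver->waiting = true;              /* advertise "about to block" */
    opal_atomic_mb();                      /* make the flag visible first */
    if (sm_queue_nonempty(receiver)) {
        receiver->waiting = false;         /* re-check to avoid a lost wakeup */
    } else {
        struct pollfd pfd = { .fd = receiver->pipe_fd[0], .events = POLLIN };
        poll(&pfd, 1, -1);                 /* block until a sender pokes us */
        receiver->waiting = false;         /* (then drain the byte) */
    }

    /* Sender, after enqueueing a fragment: */
    opal_atomic_mb();
    if (receiver->waiting)                 /* write only if receiver blocks */
        write(receiver->pipe_fd[1], "x", 1);

The write then happens only for the rare fragments that arrive while the receiver is actually blocked, not for every packet.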

The usleep solution works but is kind of ugly IMO. I think when I looked at doing that, the overhead increased significantly for certain communications. Maybe not for toy benchmarks, but for less-synchronized processes I saw the usleep adding overhead where I didn't want it to.

--td
Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve.

Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote:

I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.

I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.

Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.

HTH
Ralph


On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote:

What: when nothing has been received for a very long time (e.g., 5 minutes), stop busy-polling in opal_progress and switch to a usleep-based loop.

Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy, and a lot of power is wasted until the end of the time slice (if there is one).

Where: an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=========

opal_progress() ensures the progression of MPI communications. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely.
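
Roughly, the current structure is (simplified; the real opal_progress() also handles event-library ticks and yielding):

    /* simplified view of opal_progress() */
    void opal_progress(void)
    {
        size_t i;
        for (i = 0; i < callbacks_len; ++i) {
            (void) (callbacks[i])();  /* progress each registered component */
        }
    }
    /* a blocked process calls this in a tight loop -> 100% CPU */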

Going to sleep after a certain amount of time with nothing received is interesting for two things:
- Administrators can easily detect whether a job is deadlocked: all the processes are sleeping. Currently, all processors use 100% CPU, and it is very hard to know whether progress is still happening or not.
- When there is nothing to receive, power usage is greatly reduced.

However, it could hurt performance in some cases, typically if we go to sleep just before a message arrives. This depends heavily on the parameters given to the sleep mechanism.

At first, we can start with the following assumption: if each sleep takes T usec, then sleeping only after 10000xT usec of inactivity should slow down receives by a factor of less than 0.01%. Indeed, in the worst case a message arrives just as we start sleeping, adding at most T usec of latency to a wait that already lasted 10000xT usec -- a relative slowdown of at most 1/10000 = 0.01%.

However, other processes may suffer from your being late and be delayed by T usec (which may represent more than 0.01% for them).

So, the goal of this mechanism is mainly to detect far-too-long waits; it should almost never trigger in normal MPI jobs. It could also emit a warning message when starting to sleep, or at least a trace in the notifier.

Details of Implementation
=========================

Three parameters fully control the behaviour of this mechanism :
* opal_progress_sleep_count : number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism. * opal_progress_sleep_trigger : time to wait before going to low-pressure-powersave mode. Default : 600 (in seconds) = 10 minutes. * opal_progress_sleep_duration : time we sleep at each further unsuccessful call to opal_progress(). Default : 1000 (in us) = 1 ms.

The duration is large enough to make the process show 0% CPU in top, but small enough to preserve a good trigger/duration ratio.

The trigger is deliberately high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, the trigger should be higher than duration * numprocs.

Possible Improvements & Pitfalls
================================

* Trigger could be set automatically at max(trigger, duration * numprocs * 2).

* poll_start and poll_count could be fields of the opal_condition_t struct.

* The sleep section could be factored into a #define and replicated in all the progress paths (I'm not sure my patch is correct for progress threads, for example).

--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
