Re: [OMPI devel] [RFC] Low pressure OPAL progress
On Tue, 2009-06-09 at 07:28 -0400, Terry Dontje wrote: > The biggest issue is coming up with a > way to have blocks on the SM btl converted to the system poll call > without requiring a socket write for every packet. For what it's worth, you don't need a socket write for every (local) packet; all you need is to send your local peers a message when you are about to sleep. This can be implemented with a shared memory word, so no extra comms are required. The sender can then send a message using whatever means it does currently, check if the bit is set, and send a "wakeup" message via a socket if the remote process is sleeping. You need to be careful to get the ordering right or you end up with deadlocks, and you need to establish a "remote wakeup" mechanism, although this is easily done with sockets. You don't even need to communicate over the socket; all it's for is to cause your peer to return from poll/select so it can query the shared memory state. Signals would also likely work, however they tend to present other problems in my experience. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] [RFC] Low pressure OPAL progress
Hi Jeff, Thanks for jumping in. On Tue, 9 Jun 2009, Jeff Squyres wrote: 2. Note that your solution presupposes that one MPI process can detect that the entire job is deadlocked. This is not quite correct. What exactly do you want to detect -- that one process may be imbalanced on its receives (waiting for long periods of time without doing anything), or that the entire job is deadlocked? The former may be ok -- it depends on the app. If the latter, it requires a bit more work -- e.g., if one process detects that nothing has happened for a long time, it can initiate a collective/distributed deadlock detection algorithm with all the other MPI processes in the job. Only if *all* processes agree, then you can say "this job is deadlocked, we might as well abort." IIRC, there are some 3rd party tools / libraries that do this kind of stuff...? (although it might be cool / useful to incorporate some of this technology into OMPI itself) My approach was based on a per-process detection. Of course this does not indicate that the job is stuck, but tools like ganglia will quickly show you whether all processes are in the "sleep" state or not (maybe combined with debugging tools, to see if all are really in MPI, not blocked in an I/O or something). Then, the user or the admin can take a decision whether to abort the job or not. The "sleep" was only a way for me to bring the information to the user/admin. But as Ralph stated, a log would be even better in this case (more precise, no performance penalty, ..), also it needs to be coupled with other tools (whereas the sleep was naturally coupled with ganglia). 3. As Ralph noted, how exactly do you know when "nothing happens for a long time" is a bad thing? a) some codes are structured that way -- that they'll have no MPI activity for a long time, even if they have pending non-blocking receives pre-posted. b) are you looking within the scope of *one* MPI blocking call? 
I.e., if nothing happens *within the span of one blocking MPI call*, or are you looking if nothing happens across successive calls to opal_progress() (which may be few and far between after OMPI hits steady state when using non-TCP networks)? It seems like there would need to be a [thread safe] "reset" at some point -- indicating that something has happened. That either would be when something has happened, or that a blocking MPI call has exited, or ? Need to make sure that that "reset" doesn't get expensive. Uh. This is way more complicated than my patch. From the various reactions, it seems my RFC is misleading. I only work in opal_condition_wait(), which calls opal_progress(). The idea was only to sleep when we had been blocked in an MPI Wait (or similar) for a long time. So, we sleep only if there is no possible background computation : the MPI process is waiting, and basically doing nothing else. MPI_Test functions will never call sleep. The fact that opal_progress() did progress or not does not matter, the only question is : how long have we been in opal_condition_wait() ? So, what I would want to do now is to replace the sleep by a message sent to the HNP indicating "I'm blocked for X minutes", then X minutes later "I'm blocked for 2X minutes", etc. The HNP would then aggregate those messages and when every process has sent one, log "Everyone is blocked for X minutes", then (I presume) X minutes later, "Everyone is blocked for 2X minutes", etc. I would then let users, admin or admin tools decide whether or not to abort the job. If someone finally receives something, it should send a message to the HNP indicating that it is no longer blocked, or maybe just looking at logs should suffice to see if block times continue to increase or not. Since I'm only working on opal_condition_wait(), deadlocks in applications using only MPI_Test calls will not be detected (but is that possible in the first place ?). 
Sylvain On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote: Couple of other things to help stimulate the thinking: 1. it isn't that OMPI -couldn't- receive a message, but rather that it -didn't- receive a message. This may or may not indicate that there is a problem. Could just be an application that doesn't need to communicate for awhile, as per my example. I admit, though, that 10 minutes is a tad long...but I've seen some bizarre apps around here :-) 2. instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability and notify the system that the job may be stalled. Or perhaps adding an API to the orte_errmgr framework to notify it that nothing has been received for awhile, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it. My point with this second bullet is that there are other response options than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both!
Re: [OMPI devel] [RFC] Low pressure OPAL progress
On Jun 9, 2009, at 8:31 AM, Jeff Squyres (jsquyres) wrote: 4. Note, too, that opal_progress() doesn't see *all* progress - the openib BTL doesn't use opal_progress to know when OpenFabrics messages arrive, for example. Wait, I lied -- sorry. opal_progress will call the bml progress, which then calls each of the btl processes (or we changed it so that opal_progress directly calls each btl progres -- I forget which). So technically opal_progress will see that "something" happened. But what that "something" is is unknown -- it could be a control message, or somesuch. (I don't recall offhand what the openib btl's progress function returns) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [RFC] Low pressure OPAL progress
I'll throw in my random $0.02. I'm at the Forum this week, so my latency on replies here will likely be large. 1. Ashley is correct that we shouldn't sleep. A better solution would be to block waiting for something to happen (rather than spin). As Terry mentioned, we pretty much know how to do this -- it's just that no one has done it yet. The full solution would then be: if we spin for a while (probably MCA-param settable) with nothing happening, switch to the blocking mode and continue waiting. I'm happy to pass on the information on how we've imagined that this should be done, if you want. 2. Note that your solution presupposes that one MPI process can detect that the entire job is deadlocked. This is not quite correct. What exactly do you want to detect -- that one process may be imbalanced on its receives (waiting for long periods of time without doing anything), or that the entire job is deadlocked? The former may be ok -- it depends on the app. If the latter, it requires a bit more work -- e.g., if one process detects that nothing has happened for a long time, it can initiate a collective/distributed deadlock detection algorithm with all the other MPI processes in the job. Only if *all* processes agree, then you can say "this job is deadlocked, we might as well abort." IIRC, there are some 3rd party tools / libraries that do this kind of stuff...? (although it might be cool / useful to incorporate some of this technology into OMPI itself) 3. As Ralph noted, how exactly do you know when "nothing happens for a long time" is a bad thing? a) some codes are structured that way -- that they'll have no MPI activity for a long time, even if they have pending non-blocking receives pre-posted. b) are you looking within the scope of *one* MPI blocking call? 
I.e., if nothing happens *within the span of one blocking MPI call*, or are you looking if nothing happens across successive calls to opal_progress() (which may be few and far between after OMPI hits steady state when using non-TCP networks)? It seems like there would need to be a [thread safe] "reset" at some point -- indicating that something has happened. That either would be when something has happened, or that a blocking MPI call has exited, or ? Need to make sure that that "reset" doesn't get expensive. 4. Note, too, that opal_progress() doesn't see *all* progress - the openib BTL doesn't use opal_progress to know when OpenFabrics messages arrive, for example. On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote: Couple of other things to help stimulate the thinking: 1. it isn't that OMPI -couldn't- receive a message, but rather that it -didn't- receive a message. This may or may not indicate that there is a problem. Could just be an application that doesn't need to communicate for awhile, as per my example. I admit, though, that 10 minutes is a tad long...but I've seen some bizarre apps around here :-) 2. instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability and notify the system that the job may be stalled. Or perhaps adding an API to the orte_errmgr framework to notify it that nothing has been received for awhile, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it. My point with this second bullet is that there are other response options than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey > wrote: I understand your point of view, and mostly share it. 
I think the biggest point in my example is that sleep occurs only after (I was wrong in my previous e-mail) 10 minutes of inactivity, and this value is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as said before, I planned to have the library do show_help() when this happens (something like : "Open MPI couldn't receive a message for 10 minutes, lowering pressure") so that the application that really needs more than 10 minutes to receive a message can increase it. Looking at the tick rate code, I couldn't see how changing it would make CPU usage drop. If I understand correctly your e-mail, you block in the kernel using poll(), is that right ? So, you may well lose 10 us because of that kernel call, but this is a lot less than the 1 ms I'm currently losing with usleep. This makes sense - although being hard to implement since all btl must have this ability. Thanks for your comments, I will continue to think about it. Sylvain On Tue, 9 Jun 2009, Ralph Castain wrote: My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole?
Re: [OMPI devel] [RFC] Low pressure OPAL progress
On Tue, 9 Jun 2009, Ralph Castain wrote: 2. instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability and notify the system that the job may be stalled. Or perhaps adding an API to the orte_errmgr framework to notify it that nothing has been received for awhile, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it. Great remark. What is really needed here is the information of "nothing received for X minutes". Just having the information somewhere should be sufficient. We often see users asking if their application is still progressing, and this should answer their questions. This would also address the need of administrators to stop deadlocked runs during the night. I guess I'll redirect my work on this and couple it with our current effort on logging and administration tools coupling. Thanks a lot guys ! Sylvain My point with this second bullet is that there are other response options than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote: I understand your point of view, and mostly share it. I think the biggest point in my example is that sleep occurs only after (I was wrong in my previous e-mail) 10 minutes of inactivity, and this value is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as said before, I planned to have the library do show_help() when this happens (something like : "Open MPI couldn't receive a message for 10 minutes, lowering pressure") so that the application that really needs more than 10 minutes to receive a message can increase it. Looking at the tick rate code, I couldn't see how changing it would make CPU usage drop. 
If I understand correctly your e-mail, you block in the kernel using poll(), is that right ? So, you may well lose 10 us because of that kernel call, but this is a lot less than the 1 ms I'm currently losing with usleep. This makes sense - although being hard to implement since all btl must have this ability. Thanks for your comments, I will continue to think about it. Sylvain On Tue, 9 Jun 2009, Ralph Castain wrote: My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing. I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough. Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself. LANL has also been looking at this problem of stalled jobs, but from a different approach. We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible. Just some thoughts... 
Ralph On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: Sylvain Jeaugey wrote: Hi Ralph, I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%. I do not call sleep() but usleep(). The result is quite the same, but hurts performance less in case of an (unexpected) restart. However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying. One way around this is to make all blocked communications (even SM) use poll to block for incoming messages.
Re: [OMPI devel] [RFC] Low pressure OPAL progress
Couple of other things to help stimulate the thinking: 1. it isn't that OMPI -couldn't- receive a message, but rather that it -didn't- receive a message. This may or may not indicate that there is a problem. Could just be an application that doesn't need to communicate for awhile, as per my example. I admit, though, that 10 minutes is a tad long...but I've seen some bizarre apps around here :-) 2. instead of putting things to sleep or even adjusting the loop rate, you might want to consider using the orte_notifier capability and notify the system that the job may be stalled. Or perhaps adding an API to the orte_errmgr framework to notify it that nothing has been received for awhile, and let people implement different strategies for detecting what might be "wrong" and what they want to do about it. My point with this second bullet is that there are other response options than hardwiring putting the process to sleep. You could let someone know so a human can decide what, if anything, to do about it, or provide a hook so that people can explore/utilize different response strategies...or both! HTH Ralph On Tue, Jun 9, 2009 at 6:52 AM, Sylvain Jeaugey wrote: > I understand your point of view, and mostly share it. > > I think the biggest point in my example is that sleep occurs only after (I > was wrong in my previous e-mail) 10 minutes of inactivity, and this value is > fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as > said before, I planned to have the library do show_help() when this happens > (something like : "Open MPI couldn't receive a message for 10 minutes, > lowering pressure") so that the application that really needs more than 10 > minutes to receive a message can increase it. > > Looking at the tick rate code, I couldn't see how changing it would make > CPU usage drop. If I understand correctly your e-mail, you block in the > kernel using poll(), is that right ? 
So, you may well lose 10 us because of > that kernel call, but this is a lot less than the 1 ms I'm currently losing > with usleep. This makes sense - although being hard to implement since all > btl must have this ability. > > Thanks for your comments, I will continue to think about it. > > Sylvain > > > On Tue, 9 Jun 2009, Ralph Castain wrote: > > My concern with any form of sleep is with the impact on the proc - since >> opal_progress might not be running in a separate thread, won't the sleep >> apply to the process as a whole? In that case, the process isn't free to >> continue computing. >> >> I can envision applications that might call down into the MPI library and >> have opal_progress not find anything, but there is nothing wrong. The >> application could continue computations just fine. I would hate to see us >> put the process to sleep just because the MPI library wasn't busy enough. >> >> Hence my suggestion to just change the tick rate. It would definitely >> cause a higher latency for the first message that arrived while in this >> state, which is bothersome, but would meet the stated objective without >> interfering with the process itself. >> >> LANL has also been looking at this problem of stalled jobs, but from a >> different approach. We monitor (using a separate job) progress in terms of >> output files changing in size plus other factors as specified by the user. >> If we don't see any progress in those terms over some time, then we kill the >> job. We chose that path because of the concerns expressed above - e.g., on >> our RR machine, intense computations can be underway on the Cell blades >> while the Opteron MPI processes wait for us to reach a communication point. >> We -want- those processes spinning away so that, when the comm starts, it >> can proceed as quickly as possible. >> >> Just some thoughts... 
>> Ralph >> >> >> On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: >> >> Sylvain Jeaugey wrote: >>> Hi Ralph, I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%. I do not call sleep() but usleep(). The result is quite the same, but hurts performance less in case of an (unexpected) restart. However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying. One way around this is to make all blocked communications (even SM) >>> use poll to block for incoming messages. Jeff and I have discussed this and >>> had many false starts on it. The biggest issue is coming up with a way to >>> have blocks on the SM btl converted to the system poll call without >>> requiring a socket write for every packet. >>> >>> The usleep solution works but is kind of ugly IMO.
Re: [OMPI devel] [RFC] Low pressure OPAL progress
On Mon, 2009-06-08 at 17:50 +0200, Sylvain Jeaugey wrote: > Principle > = > > opal_progress() ensures the progression of MPI communication. The current > algorithm is a loop calling progress on all registered components. If the > program is blocked, the loop will busy-poll indefinetely. I have some experience here due to implementing this feature (blocking waits) on Quadrics hardware. You're right that it can have benefits, and yielding the CPU when "idle" is a good thing in the general case. The "correct" way for a process to relinquish the cpu is to block in a select() or poll() call until data is received, whereupon it can wake up and continue working; the major problem each and every MPI implementation has is that select() only works for tcp/ip and not for shared memory or any of the more exotic networks. IMHO it would be much preferred to solve this problem properly and block in the wakeable select() rather than usleep(). In my experience, when done correctly, the performance is affected surprisingly little; it can often even lead to increased performance. We had full coverage, however, so were able to sleep early and wake up in a timely manner on receiving any message. Yielding even one cpu per node from the application occasionally gives any background/os processing a chance to run without impacting the performance of the application, so enabling blocking waits can lead to quicker runtimes. > Going to sleep after a certain amount of time with nothing received is > interesting for two things : > > - Administrator can easily detect whether a job is deadlocked : all the > processes are in sleep(). Currently, all processors are using 100% cpu and > it is very hard to know if progression is still happening or not. This is a valuable thing to know, however I don't view the proposed solution as the correct one; if this were the problem you were aiming to solve, I'd recommend a different approach, more like the llnl solution that Ralph described. Yours, Ashley Pittman. 
-- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI devel] [RFC] Low pressure OPAL progress
I understand your point of view, and mostly share it. I think the biggest point in my example is that sleep occurs only after (I was wrong in my previous e-mail) 10 minutes of inactivity, and this value is fully configurable. I didn't intend to call sleep after 2 seconds. Plus, as said before, I planned to have the library do show_help() when this happens (something like : "Open MPI couldn't receive a message for 10 minutes, lowering pressure") so that the application that really needs more than 10 minutes to receive a message can increase it. Looking at the tick rate code, I couldn't see how changing it would make CPU usage drop. If I understand correctly your e-mail, you block in the kernel using poll(), is that right ? So, you may well lose 10 us because of that kernel call, but this is a lot less than the 1 ms I'm currently losing with usleep. This makes sense - although being hard to implement since all btl must have this ability. Thanks for your comments, I will continue to think about it. Sylvain On Tue, 9 Jun 2009, Ralph Castain wrote: My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing. I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough. Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself. LANL has also been looking at this problem of stalled jobs, but from a different approach. 
We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible. Just some thoughts... Ralph On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: Sylvain Jeaugey wrote: Hi Ralph, I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%. I do not call sleep() but usleep(). The result is quite the same, but hurts performance less in case of an (unexpected) restart. However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying. One way around this is to make all blocked communications (even SM) use poll to block for incoming messages. Jeff and I have discussed this and had many false starts on it. The biggest issue is coming up with a way to have blocks on the SM btl converted to the system poll call without requiring a socket write for every packet. The usleep solution works but is kind of ugly IMO. I think when I looked at doing that the overhead increased significantly for certain communications. Maybe not for toy benchmarks, but for less synchronized processes I saw the usleep adding overhead where I didn't want it to. --td Don't worry, I was quite expecting the configure-in requirement. 
However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve. Thanks a lot for your comments, Sylvain On Mon, 8 Jun 2009, Ralph Castain wrote: I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day. I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems. Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason.
Re: [OMPI devel] [RFC] Low pressure OPAL progress
My concern with any form of sleep is with the impact on the proc - since opal_progress might not be running in a separate thread, won't the sleep apply to the process as a whole? In that case, the process isn't free to continue computing. I can envision applications that might call down into the MPI library and have opal_progress not find anything, but there is nothing wrong. The application could continue computations just fine. I would hate to see us put the process to sleep just because the MPI library wasn't busy enough. Hence my suggestion to just change the tick rate. It would definitely cause a higher latency for the first message that arrived while in this state, which is bothersome, but would meet the stated objective without interfering with the process itself. LANL has also been looking at this problem of stalled jobs, but from a different approach. We monitor (using a separate job) progress in terms of output files changing in size plus other factors as specified by the user. If we don't see any progress in those terms over some time, then we kill the job. We chose that path because of the concerns expressed above - e.g., on our RR machine, intense computations can be underway on the Cell blades while the Opteron MPI processes wait for us to reach a communication point. We -want- those processes spinning away so that, when the comm starts, it can proceed as quickly as possible. Just some thoughts... Ralph On Jun 9, 2009, at 5:28 AM, Terry Dontje wrote: Sylvain Jeaugey wrote: Hi Ralph, I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not performed progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%. I do not call sleep() but usleep(). The result is quite the same, but hurts performance less in case of an (unexpected) restart. 
However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying. One way around this is to make all blocked communications (even SM) use poll to block for incoming messages. Jeff and I have discussed this and had many false starts on it. The biggest issue is coming up with a way to have blocks on the SM btl converted to the system poll call without requiring a socket write for every packet. The usleep solution works but is kind of ugly IMO. I think when I looked at doing that the overhead increased significantly for certain communications. Maybe not for toy benchmarks, but for less synchronized processes I saw the usleep adding overhead where I didn't want it to. --td Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve. Thanks a lot for your comments, Sylvain On Mon, 8 Jun 2009, Ralph Castain wrote: I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day. I'm assuming you don't mean that you actually call "sleep()" as this would be very bad - I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems. Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough - we still wind up adding logic into a very critical timing loop for no reason. 
A simple configure option of --enable-mpi- progress-monitoring would be sufficient to protect the code. HTH Ralph On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote: What : when nothing has been received for a very long time - e.g. 5 minutes, stop busy polling in opal_progress and switch to a usleep-based one. Why : when we have long waits, and especially when an application is deadlock'ed, detecting it is not easy and a lot of power is wasted until the end of the time slice (if there is one). Where : an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/ Principle = opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinetely. Going to sleep after a certain amount of time with nothing received is in
Re: [OMPI devel] [RFC] Low pressure OPAL progress
Sylvain Jeaugey wrote: [...] I guess I should look at the "tick" rate instead of trying to do my own delaying.

One way around this is to make all blocked communications (even SM) use poll to block for incoming messages. Jeff and I have discussed this and had many false starts on it. The biggest issue is coming up with a way to have blocks on the SM btl converted to the system poll call without requiring a socket write for every packet. The usleep solution works but is kind of ugly IMO. I think when I looked at doing that, the overhead increased significantly for certain communications. Maybe not for toy benchmarks, but for less synchronized processes I saw the usleep adding overhead where I didn't want it to.

--td

On Mon, 8 Jun 2009, Ralph Castain wrote: [...]
Re: [OMPI devel] [RFC] Low pressure OPAL progress
Hi Ralph,

I'm entirely convinced that MPI doesn't have to save power in a normal scenario. The idea is just that if an MPI process is blocked (i.e. has not made progress for, say, 5 minutes, the default in my implementation), we stop busy polling and have the process drop from 100% CPU usage to 0%. I do not call sleep() but usleep(). The result is much the same, but it hurts performance less in case of an (unexpected) restart.

However, the goal of my RFC was also to know if there was a cleaner way to achieve my goal, and from what I read, I guess I should look at the "tick" rate instead of trying to do my own delaying.

Don't worry, I was quite expecting the configure-in requirement. However, I don't think my patch is good for inclusion; it is only an example to describe what I want to achieve.

Thanks a lot for your comments,
Sylvain

On Mon, 8 Jun 2009, Ralph Castain wrote: [...]
Re: [OMPI devel] [RFC] Low pressure OPAL progress
I'm not entirely convinced this actually achieves your goals, but I can see some potential benefits. I'm also not sure that power consumption is that big of an issue that MPI needs to begin chasing "power saver" modes of operation, but that can be a separate debate some day.

I'm assuming you don't mean that you actually call "sleep()", as this would be very bad; I'm assuming you just change the opal_progress "tick" rate instead. True? If not, and you really call "sleep", then I would have to oppose adding this to the code base pending discussion with others who can corroborate that this won't cause problems.

Either way, I could live with this so long as it was done as a "configure-in" capability. Just having the params default to a value that causes the system to behave similarly to today isn't enough; we still wind up adding logic into a very critical timing loop for no reason. A simple configure option of --enable-mpi-progress-monitoring would be sufficient to protect the code.

HTH
Ralph

On Jun 8, 2009, at 9:50 AM, Sylvain Jeaugey wrote: [...]

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] [RFC] Low pressure OPAL progress
What: when nothing has been received for a very long time (e.g. 5 minutes), stop busy polling in opal_progress and switch to a usleep-based loop.

Why: when we have long waits, and especially when an application is deadlocked, detecting it is not easy, and a lot of power is wasted until the end of the time slice (if there is one).

Where: an example of how it could be implemented is available at http://bitbucket.org/jeaugeys/low-pressure-opal-progress/

Principle
=
opal_progress() ensures the progression of MPI communication. The current algorithm is a loop calling progress on all registered components. If the program is blocked, the loop will busy-poll indefinitely. Going to sleep after a certain amount of time with nothing received is interesting for two reasons:

- An administrator can easily detect whether a job is deadlocked: all the processes are in sleep(). Currently, all processors are using 100% CPU and it is very hard to know whether progression is still happening or not.
- When there is nothing to receive, power usage is highly reduced.

However, it could hurt performance in some cases, typically if we go to sleep just before the message arrives. This will highly depend on the parameters you give to the sleep mechanism. At first, we can start with the following assumption: if the sleep takes T usec, then sleeping after 1xT should slow down receives by a factor of less than 0.01%. However, other processes may suffer from you being late and be delayed by T usec (which may represent more than 0.01% for them). So, the goal of this mechanism is mainly to detect far-too-long waits; it should almost never kick in on normal MPI jobs. It could also trigger a warning message when starting to sleep, or at least a trace in the notifier.

Details of Implementation
=
Three parameters fully control the behaviour of this mechanism:

* opal_progress_sleep_count: number of unsuccessful opal_progress() calls before we start the timer (to prevent latency impact). It defaults to -1, which completely deactivates the sleep (and is therefore equivalent to the former code). A value of 1000 can be thought of as a starting point to enable this mechanism.
* opal_progress_sleep_trigger: time to wait before going to low-pressure-powersave mode. Default: 600 (in seconds) = 10 minutes.
* opal_progress_sleep_duration: time we sleep at each further unsuccessful call to opal_progress(). Default: 1000 (in us) = 1 ms.

The duration is big enough to make the process show 0% CPU in top, but low enough to preserve a good trigger/duration ratio. The trigger is voluntarily high to keep a good trigger/duration ratio. Indeed, to prevent delays from causing chain reactions, trigger should be higher than duration * numprocs.

Possible Improvements & Pitfalls
=
* Trigger could be set automatically at max(trigger, duration * numprocs * 2).
* poll_start and poll_count could be fields of the opal_condition_t struct.
* The sleep section may be exported in a #define and reported in all the progress paths (I'm not sure my patch is good for progress threads, for example).