Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Hi Ralph,

I tried with the trunk and it makes no difference for me.

Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the build of the latter in your configure options -- why?). Btw, all components have a 0 priority and none is defined to be the default component. Which one is the default then? binomial (as the first in alphabetical order)? Can you check which one you are using and try with binomial explicitly chosen?

Thanks for your time,
Sylvain

On Thu, 26 Nov 2009, Ralph Castain wrote:

Hi Sylvain

Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE, no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app that calls MPI_Init immediately upon execution.

So I have to conclude this is a problem in your setup/config. Are you sure you didn't --enable-progress-threads?? That is the only way I can recreate this behavior.

I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code.

Ralph

On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:

Hi Ralph,

Thanks for your efforts. I will look at our configuration and see how it may differ from yours.

Here is a patch which helps reproduce the bug even with a small number of nodes.

diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
         ORTE_ERROR_LOG(ret);
         goto CLEANUP;
     }
+{ /* Add delay to reproduce bug */
+char * str = getenv("ORTE_RELAY_DELAY");
+int sec = str ? atoi(str) : 0;
+if (sec) {
+sleep(sec);
+}
+}
 }

 CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.

We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

Hi Sylvain

I've spent several hours trying to replicate the behavior you described on clusters up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your configuration? I'm wondering if there is something critical about the config that may be causing the problem (perhaps we have a problem in our default configuration). Also, is there anything else you can tell us about your configuration? How many ppn triggers it, or do you always get the behavior every time you launch over a certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" param that will force the situation you encountered - i.e., will ensure that the relay is still being sent when the daemon receives the first collective input. We can then use that to try and force replication of the behavior you are encountering.
Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:

On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:

> Thank you Ralph for this precious help.
>
> I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return.

That is basically the idea I proposed, just done in a slightly different place.

> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I don't think that blocking calls like the one in daemon_collective should be allowed. This also applies to the blocking one in send_relay. [Well, actually, one is okay, two may lead to interlocking.]

Well, that would be problematic - you will find "progressed_wait" used repeatedly in the code. Removing them all would take a -lot- of effort and a major rewrite. I'm not yet convi
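As a minimal illustration of the requeue-and-return idea discussed in this thread: the sketch below was written for this digest and is not the actual orted code. The names pending_msg_t, process_msg and launch_done are invented, and the real implementation defers work through the OPAL event/progress machinery rather than a hand-rolled list; the point is only that the handler requeues and returns instead of blocking in a progressed_wait-style loop.

/* Illustrative sketch only -- not the real ORTE code.  If a daemon
 * collective arrives while the local launch is still in progress, the
 * message is put back on a pending list and the handler returns, so it
 * never blocks inside the progress loop. */
#include <stddef.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct pending_msg {
    struct pending_msg *next;
    int id;                              /* stand-in for the message payload */
} pending_msg_t;

static pending_msg_t *pending_head = NULL;
static bool launch_complete = false;     /* set once local procs are launched */

/* Hypothetical handler invoked from the progress loop for each message. */
static void process_msg(pending_msg_t *msg)
{
    if (!launch_complete) {
        /* Launch still running: requeue and return immediately. */
        msg->next = pending_head;
        pending_head = msg;
        return;
    }
    printf("processing collective message %d\n", msg->id);
}

/* Called once the launch finishes: replay anything that was deferred. */
static void launch_done(void)
{
    launch_complete = true;
    while (pending_head != NULL) {
        pending_msg_t *msg = pending_head;
        pending_head = msg->next;
        process_msg(msg);
    }
}

int main(void)
{
    pending_msg_t early = { NULL, 1 };
    process_msg(&early);   /* arrives before the launch completes: deferred */
    launch_done();         /* launch done: the deferred message is processed */
    return 0;
}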
Re: [OMPI devel] SC09 OMPI-related slides
Jeff,

Thanks for making these papers and presentations available.

Ken

Kenneth A. Lloyd
CEO - Director of Systems Science
Watt Systems Technologies Inc.
Albuquerque, NM USA
kenneth.lloyd[at]wattsys.com
kenneth.lloyd[at]incose.org
kenneth.lloyd[at]nmug.net
www.wattsys.com
http://www.linkedin.com/pub/kenneth-lloyd/7/9a/824
http://kenscomplex.blogspot.com/

This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.

> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Tuesday, November 24, 2009 6:08 AM
> To: Open MPI Developers List
> Subject: [OMPI devel] SC09 OMPI-related slides
>
> If you had any papers or presentations on, about, or relating
> to Open MPI, please send me a copy so that we can post them here:
>
> http://www.open-mpi.org/papers/sc-2009/
>
> Thanks!
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
On Nov 27, 2009, at 8:23 AM, Sylvain Jeaugey wrote:

> Hi Ralph,
>
> I tried with the trunk and it makes no difference for me.

Strange

> Looking at potential differences, I found out something strange. The bug may have something to do with the "routed" framework. I can reproduce the bug with binomial and direct, but not with cm and linear (you disabled the build of the latter in your configure options -- why?).

You won't with cm because there is no relay. Likewise, direct doesn't have a relay - so I'm really puzzled how you can see this behavior when using the direct component???

I disable components in my build to save memory. Every component we open costs us memory that may or may not be recoverable during the course of execution.

> Btw, all components have a 0 priority and none is defined to be the default component. Which one is the default then? binomial (as the first in alphabetical order)?

I believe you must have a severely corrupted version of the code. The binomial component has priority 70, so it will be selected as the default. Linear has priority 40, though it will only be selected if you say ^binomial. CM and radix have special selection code in them so they will only be selected when specified. Direct and slave have priority 0 to ensure they will only be selected when specified.

> Can you check which one you are using and try with binomial explicitly chosen?

I am using binomial for all my tests.

From what you are describing, I think you either have a corrupted copy of the code, are picking up mis-matched versions, or something strange, as your experiences don't match what anyone else is seeing.

Remember, the phase you are discussing here has nothing to do with the native launch environment. This is dealing with the relative timing of the application launch versus relaying the launch message itself - i.e., the daemons are already up and running before any of this starts. Thus, this "problem" has nothing to do with how we launch the daemons. So, if it truly were a problem in the code, we would see it on every environment - torque, slurm, ssh, etc.

We routinely launch jobs spanning hundreds to thousands of nodes without problem. If this timing problem was as you have identified, then we would see this constantly. Yet nobody is seeing it, and I cannot reproduce it even with your reproducer.

I honestly don't know what to suggest at this point. Any chance you are picking up mis-matched OMPI versions on your backend nodes or something? Tried fresh checkouts of the code? Is this a code base you have modified, or are you seeing this with the "stock" code from the repo?

Just fishing at this point - can't find anything wrong! :-/

Ralph

> Thanks for your time,
> Sylvain
>
> On Thu, 26 Nov 2009, Ralph Castain wrote:
>
>> Hi Sylvain
>>
>> Well, I hate to tell you this, but I cannot reproduce the "bug" even with this code in ORTE no matter what value of ORTE_RELAY_DELAY I use. The system runs really slow as I increase the delay, but it completes the job just fine. I ran jobs across 16 nodes on a slurm machine, 1-4 ppn, a "hello world" app that calls MPI_Init immediately upon execution.
>>
>> So I have to conclude this is a problem in your setup/config. Are you sure you didn't --enable-progress-threads?? That is the only way I can recreate this behavior.
>>
>> I plan to modify the relay/message processing method anyway to clean it up. But there doesn't appear to be anything wrong with the current code.
>> Ralph
>>
>> On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
>>
>>> Hi Ralph,
>>>
>>> Thanks for your efforts. I will look at our configuration and see how it may differ from yours.
>>>
>>> Here is a patch which helps reproduce the bug even with a small number of nodes.
>>>
>>> diff -r b622b9e8f1ac orte/orted/orted_comm.c
>>> --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
>>> +++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
>>> @@ -126,6 +126,13 @@
>>>         ORTE_ERROR_LOG(ret);
>>>         goto CLEANUP;
>>>     }
>>> +{ /* Add delay to reproduce bug */
>>> +char * str = getenv("ORTE_RELAY_DELAY");
>>> +int sec = str ? atoi(str) : 0;
>>> +if (sec) {
>>> +sleep(sec);
>>> +}
>>> +}
>>>  }
>>>
>>>  CLEANUP:
>>>
>>> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>>>
>>> During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.
>>>
>>> We usually reproduce the bug with one ppn (faster to spawn).
>>>
>>> Sylvain
>>>
>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>
>>>> Hi Sylvain
>>>>
>>>> I've spent several hours trying to replicate the behavior you described on cluste
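For readers unfamiliar with how the default routed component gets chosen, the following sketch mimics the priority scheme Ralph describes above (binomial 70, linear 40, direct and slave 0; cm and radix are omitted because they use special selection logic). It is a simplified stand-in written for this digest: the struct, table, and select_routed function are not the real orte/mca/routed selection code, only an illustration of why binomial wins by default and direct runs only when explicitly requested.

/* Simplified illustration of priority-based component selection; the real
 * MCA framework does this through component query functions, not a static
 * table.  Priorities follow the values quoted above. */
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *name;
    int priority;               /* highest priority wins unless overridden */
} routed_component_t;

static const routed_component_t components[] = {
    { "binomial", 70 },         /* default */
    { "linear",   40 },         /* chosen only if binomial is excluded */
    { "direct",    0 },         /* only when explicitly requested */
    { "slave",     0 },         /* only when explicitly requested */
};

/* If the user named a component (e.g. "-mca routed direct"), take that one;
 * otherwise return the highest-priority entry. */
static const routed_component_t *select_routed(const char *requested)
{
    const routed_component_t *best = NULL;
    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        if (requested != NULL) {
            if (strcmp(components[i].name, requested) == 0) {
                return &components[i];
            }
        } else if (best == NULL || components[i].priority > best->priority) {
            best = &components[i];
        }
    }
    return best;                /* NULL if an unknown name was requested */
}

int main(void)
{
    printf("default: %s\n", select_routed(NULL)->name);     /* binomial */
    printf("forced:  %s\n", select_routed("direct")->name); /* direct */
    return 0;
}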
Re: [OMPI devel] RFC: Add extra_state field to ompi_request_t
Brian,

This is a pretty big change to be done on such short notice, especially over the Thanksgiving weekend. I do have a lot of concerns about this approach, but I lack the time to expand on this right now. I'll be back at work on Monday and I'll give detailed information. Please delay the deadline until at least Wednesday.

Thanks,
george.

On Nov 25, 2009, at 11:52, Barrett, Brian W wrote:

> WHAT: Add a void* extra_state field to ompi_request_t
>
> WHY: When we added the req_complete_cb field so that internal pieces of OMPI who generated requests (such as the OSC components using the PML) could be async notified when the request completed (ie, the PML request the OSC component had initiated was finished), we neglected to add any type of "extra state" associated with that request/callback. So the completion callback is almost worthless, because the upper layer has a hard time figuring out which thing it was working on can now progress due to the given (lower?) request completing.
>
> WHERE: One line in each of ompi/request/request.[hc].
>
> WHEN: ASAP
>
> TIMEOUT: Sunday, Nov 29.
>
> More Details
>
> This is probably not even worth an RFC, which is why I'm not giving a very long timeout (that, and if I don't get this done during the holiday weekend, it will never get done). The changes are a single line in request.h adding a void* extra_state variable to the ompi_request_t and another single line in request.c to initialize the field to NULL.
>
> While looking for some other code, I stumbled upon the OSC changes I made a long time ago to try to use req_complete_cb instead of registering a progress function. The code is actually a lot cleaner that way, and means no progress functions for the one-sided components.
>
> The down side is that it adds another 8 bytes to ompi_request_t, which is already larger than I'd like. But on the flip side, we have an 8 byte field (the callback) which is totally unusable without the extra_state field.
>
> Brian
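To make the motivation concrete, here is a small sketch of the proposal as described in the RFC. The struct below is a heavily simplified stand-in for ompi_request_t (the real field types and callback signature differ), and osc_op_t is an invented example of the "extra state" an upper layer such as a one-sided component would attach to the request so its completion callback can find its context.

/* Sketch only: simplified stand-ins, not the real OMPI definitions. */
#include <stdio.h>

typedef struct request request_t;
typedef int (*complete_cb_t)(request_t *req);

struct request {
    complete_cb_t req_complete_cb;   /* existing completion callback */
    void         *extra_state;       /* proposed field: caller's context */
};

/* Hypothetical upper-layer operation tracking its own progress. */
typedef struct {
    int outstanding;                 /* lower-level requests still pending */
} osc_op_t;

static int osc_complete_cb(request_t *req)
{
    /* Without extra_state, this callback has no way to know which
     * higher-level operation the completed request belongs to. */
    osc_op_t *op = (osc_op_t *) req->extra_state;
    op->outstanding--;
    return 0;
}

int main(void)
{
    osc_op_t op = { .outstanding = 1 };
    request_t req = { .req_complete_cb = osc_complete_cb, .extra_state = &op };

    req.req_complete_cb(&req);       /* simulate the PML completing it */
    printf("outstanding = %d\n", op.outstanding);   /* prints 0 */
    return 0;
}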