Re: [OMPI devel] Hang in collectives involving shared memory
Ashley Pittman wrote:
> Do you have a stack trace of your hung application to hand, in particular
> when you say "All processes have made the same call to MPI_Allreduce. The
> processes are all in opal_progress, called (with intervening calls) by
> MPI_Allreduce." do the intervening calls include mca_coll_sync_bcast,
> ompi_coll_tuned_barrier_intra_dec_fixed and
> ompi_coll_tuned_barrier_intra_recursivedoubling?

I don't have a stack trace handy, and today is pretty full. I'll try and make some time to document what I've got in the next few days.

I was able to hang a C translation of Ralph's reproducer as well.

- Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
Re: [OMPI devel] Hang in collectives involving shared memory
Ashley Pittman wrote:
> Whilst the fact that it appears to only happen on your machine implies
> it's not a general problem with OpenMPI, the fact that it happens in the
> same location/rep count every time does swing the blame back the other
> way.

This sounds a _lot_ like the problem I was seeing; my initial message is appended here. If it's the same thing, then it occurs not only on the big machines here that Ralph was talking about, but also on very vanilla Fedora 7 and 9 boxes. I was able to hang Ralph's reproducer on an 8 core Dell, Fedora 9, kernel 2.6.27(.4-78.2.53.fc9.x86_64). I don't think it's just the one machine and its configuration.

- Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico

Developers,

This is my first post to the openmpi developers list. I think I've run across a race condition in your latest release. Since my demonstrator is somewhat large and cumbersome, I'd like to know if you already know about this issue before we start the process of providing code and details.

Basics: openmpi 1.3.2, Fedora 9, 2 x86_64 quad-core cpus in one machine.

Symptoms: our code hangs, always in the same vicinity, usually at the same place, 10-25% of the time. Sometimes more often, sometimes less.

Our code has run reliably with many MPI implementations for years. We haven't added anything recently that is a likely culprit. While we have our own issues, this doesn't feel like one of ours.

We see that there is new code in the shared memory transport between 1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (nor 1.2.9), only with 1.3.2. If we switch to tcp for transport (with mpirun --mca btl tcp,self ...) we don't see any hangs. Running using --mca btl sm,self results in hangs. If we sprinkle a few calls (3) to MPI_Barrier in the vicinity of the problem, we no longer see hangs.

We demonstrate this with 4 processes. When we attach a debugger to the hung processes, we see that the hang results from an MPI_Allreduce. All processes have made the same call to MPI_Allreduce. The processes are all in opal_progress, called (with intervening calls) by MPI_Allreduce.

My question is: have you seen anything like this before? If not, what do we do next?

Thanks.

- Bryan
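The failing configuration described above reduces to a tight loop over a collective on the shared-memory BTL. A minimal sketch of that pattern (hypothetical code, not Bryan's application or Ralph's actual reproducer; the iteration count, datatype, and reduction are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double in = 1.0, out = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Repeatedly call a collective; the reported hangs occur
           intermittently, always in the same vicinity of the loop. */
        for (i = 0; i < 100000; i++)
            MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("completed all iterations: out=%f\n", out);

        MPI_Finalize();
        return 0;
    }

Running this as "mpirun -np 4 --mca btl sm,self ./allreduce_loop" matches the configuration reported to hang; "--mca btl tcp,self" matches the configuration reported not to.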
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
>
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944
>
> Will be back after I look at the tool.

Have you made any progress?

Whilst the fact that it appears to only happen on your machine implies it's not a general problem with OpenMPI, the fact that it happens in the same location/rep count every time does swing the blame back the other way. Perhaps it's some special configure or runtime option you are setting? One thing that springs to mind is that the numa-maps could be exposing some timing problem with shared memory calls; however, this doesn't sit well with it always failing on the same iteration.

Can you provide stack traces from when it's hung, and crucially, are they the same for every hang? If you change the coll_sync_barrier_before value to make it hang on a different repetition, does this change the stack trace at all? Likewise, once you have applied the collectives patch, is the collective state the same for every hang, and how does this differ when you change the coll_sync_barrier_before variable?

It would be useful to see stack traces and collective state from the three collectives you report as causing problems - MPI_Bcast, MPI_Reduce and MPI_Allgather - because as I said before these three collectives have radically different communication patterns.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
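For reference, coll_sync_barrier_before is an MCA parameter of Open MPI's coll/sync component, so the experiment Ashley proposes can be done per run without rebuilding, along the lines of "mpirun -np 4 --mca coll_sync_barrier_before 500 ./app" (the parameter name is taken from this thread; the value 500 is arbitrary).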
Re: [OMPI devel] Hang in collectives involving shared memory
Hi Ralph,

I managed to have a deadlock after a whole night, but not the same one you have: after a quick analysis, process 0 seems to be blocked in the very first send through shared memory. Still maybe a bug, but not the same as yours IMO.

I also figured out that libnuma support was not in my library, so I rebuilt the lib, and this doesn't seem to change anything: same execution speed, same memory footprint, and of course same the-bug-does-not-appear :-(.

So, no luck so far in reproducing your problem. I guess you're the only one able to make progress on this (since you seem to have a real reproducer).

Sylvain

On Wed, 10 Jun 2009, Sylvain Jeaugey wrote:
> Hum, very glad that padb works with Open MPI, I couldn't live without it.
> In my opinion, the best debug tool for parallel applications, and more
> importantly, the only one that scales. [...]
Re: [OMPI devel] Hang in collectives involving shared memory
Well, it would - except then -all- the procs would run real slow! :-)

Still, might be a reasonable diagnostic step to try...will give it a shot.

On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <bogdan.coste...@iwr.uni-heidelberg.de> wrote:
> Based on this reasoning, disabling libnuma support in your OpenMPI build
> should also solve the problem, or do I interpret things the wrong way? [...]
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 10 Jun 2009, Ralph Castain wrote:
> I appreciate the input and have captured it in the ticket. Since this
> appears to be a NUMA-related issue, the lack of NUMA support in your
> setup makes the test difficult to interpret.

Based on this reasoning, disabling libnuma support in your OpenMPI build should also solve the problem, or do I interpret things the wrong way?

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
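For anyone wanting to try this experiment: NUMA support is chosen at configure time, so it would mean rebuilding Open MPI with something like "./configure --without-libnuma ..." (assuming the standard configure switch for the maffinity:libnuma component; check ./configure --help for the exact spelling in your tree).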
Re: [OMPI devel] Hang in collectives involving shared memory
Much appreciated!

Per some of my other comments on this thread and on the referenced ticket, can you tell me what kernel you have on that machine? I assume you have NUMA support enabled, given that chipset?

Thanks!
Ralph

On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey wrote:
> Hum, very glad that padb works with Open MPI, I couldn't live without it.
> In my opinion, the best debug tool for parallel applications, and more
> importantly, the only one that scales. [...]
Re: [OMPI devel] Hang in collectives involving shared memory
I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret.

I agree, though, that this is likely something peculiar to our particular setup. Of primary concern is that it might be related to the relatively old kernel (2.6.18) on these machines. There has been a lot of change since that kernel was released, and some of those changes may be relevant to this problem. Unfortunately, upgrading the kernel will take persuasive argument.

We are going to try and run the reproducers on machines with more modern kernels to see if we get different behavior. Please feel free to follow this further on the ticket.

Thanks again!
Ralph

On Wed, Jun 10, 2009 at 11:29 AM, Bogdan Costescu <bogdan.coste...@iwr.uni-heidelberg.de> wrote:
> I wasn't able to reproduce this. [...] This, together with the earlier
> post also describing a negative result, points to a problem related to
> your particular setup...
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 10 Jun 2009, Ralph Castain wrote:
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944

I wasn't able to reproduce this. I have run with the following setup:

- OS is Scientific Linux 5.1 with a custom compiled kernel based on 2.6.22.19, but (due to circumstances that I can't control):

  checking if MCA component maffinity:libnuma can compile... no

- Intel compiler 10.1
- OpenMPI 1.3.2
- nodes have 2 CPUs of type E5440 (quad core), 16GB RAM and a ConnectX IB DDR

I've used the platform file that you have provided, but took out the references to PanFS and fixed the paths. I've also used the MCA file that you have provided.

I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished successfully with m=50 several times. This, together with the earlier post also describing a negative result, points to a problem related to your particular setup...

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI devel] Hang in collectives involving shared memory
Hum, very glad that padb works with Open MPI, I couldn't live without it. In my opinion, the best debug tool for parallel applications, and more importantly, the only one that scales.

About the issue: I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each; nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR). So my feeling about this is that it may be very hardware related. Especially if you use the hierarch component, some transactions will be done through RDMA on one side and read directly through shared memory on the other side, which can, depending on the hardware, produce very different timings and bugs. Did you try with a different collective component (i.e., not hierarch)? Or with another interconnect? [Yes, of course, if it is a race condition, we might well avoid the bug because timings will be different, but that's still information.]

Perhaps everything I'm saying makes no sense or you already thought about this; anyway, if you want me to try different things, just let me know.

Sylvain

On Wed, 10 Jun 2009, Ralph Castain wrote:
> Hi Ashley
>
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944 [...]
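On the component experiment Sylvain suggests: collective components can be excluded at run time with the usual MCA selection syntax, e.g. "mpirun --mca coll ^hierarch ..." to rule hierarch out of the selection (standard Open MPI MCA ^-exclusion syntax; whether hierarch is active at all depends on the local priority settings).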
Re: [OMPI devel] Hang in collectives involving shared memory
Hi Ashley

Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph

On Wed, Jun 10, 2009 at 8:51 AM, Ashley Pittman wrote:
> Ralph,
>
> If I may say this is exactly the type of problem the tool I have been
> working on recently aims to help with and I'd be happy to help you
> through it. [...]
Re: [OMPI devel] Hang in collectives involving shared memory
Ralph,

If I may say, this is exactly the type of problem the tool I have been working on recently aims to help with, and I'd be happy to help you through it.

Firstly, I'd say of the three collectives you mention, MPI_Allgather, MPI_Reduce and MPI_Bcast, one exhibits a many-to-many, one a many-to-one, and the last a one-to-many communication pattern. The scenario of a root process falling behind and getting swamped in comms is a plausible one for MPI_Reduce only, but doesn't hold water with the other two. You also don't mention if the loop is over a single collective or if you have a loop calling a number of different collectives each iteration.

padb, the tool I've been working on, has the ability to look at parallel jobs and report on the state of collective comms, and should help you narrow down on erroneous processes and those simply blocked waiting for comms. I'd recommend using it to look at maybe four or five instances where the application has hung and look for any common features between them.

Let me know if you are willing to try this route. The code is downloadable from http://padb.pittman.org.uk and if you want the full collective functionality you'll need to patch Open MPI with the patch from http://padb.pittman.org.uk/extensions.html

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
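A usage note on padb (option names here are from memory and may differ between versions; see http://padb.pittman.org.uk for the current syntax): the typical workflow is to list active jobs with "padb --show-jobs" and then dump stack traces and collective state for one of them with "padb --full-report=<jobid>", repeating across several hangs to compare, as Ashley suggests.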
[OMPI devel] Hang in collectives involving shared memory
Hi folks

As mentioned in today's telecon, we at LANL are continuing to see hangs when running even small jobs that involve shared memory in collective operations. This has been the topic of discussion before, but I bring it up again because (a) the problem is beginning to become epidemic across our application codes, and (b) repeated testing provides more info and (most importantly) confirms that this problem -does not- occur under 1.2.x - it is strictly a 1.3.2 problem (we haven't checked to see if it is in 1.3.0 or 1.3.1).

The condition is caused when the application performs a loop over collective operations such as MPI_Allgather, MPI_Reduce, and MPI_Bcast. This list is not intended to be exhaustive, but only represents the ones for which we have solid and repeatable data. The symptoms are a "hanging" job, typically (but not always!) associated with fully-consumed memory. The loops do not have to involve substantial amounts of memory (the Bcast loop hangs after moving a whole 32 Mbytes, total), nor involve high loop counts. They only have to repeatedly call the collective.

Disabling the shared memory BTL is enough to completely resolve the problem. However, this creates an undesirable performance penalty we would like to avoid, if possible.

Our current solution is to use the "sync" collective to occasionally insert an MPI_Barrier into the code "behind the scenes" - i.e., to add an MPI_Barrier call every N calls to "problem" collectives (a sketch of the effect appears below). The argument in favor of this was that the hang is caused by consuming memory due to "unexpected messages", caused principally by the root process in the collective running slower than the other procs. Thus, the notion goes, the root process continues to fall further and further behind, consuming ever more memory until it simply cannot progress. Adding the barrier operation forces the other procs to "hold" until the root process can catch up, thereby relieving the memory backlog.

The sync collective has worked for us, but we are now finding a very disconcerting behavior - namely, that the precise value of N required to avoid hanging (a) is very, very sensitive and can still let the app hang even when changed by small amounts, (b) fluctuates between runs on an unpredictable basis, and (c) can be different for different collectives. These new problems surfaced this week when we found that a job that previously ran fine with one value of coll_sync_barrier_before suddenly hung when a loop over MPI_Bcast was added to the code. Further investigation found that the value of N required to make the new loop work is significantly different from the prior value that made Allgather work, creating an exhaustive search for a "sweet spot" for N. Clearly, as codes grow in complexity, this simply is not going to work.

It seems to me that we have to begin investigating -why- the 1.3.2 code is encountering this problem whereas the 1.2.x code is not. From our rough measurements, there is some speed difference between the two releases, so perhaps we are now getting fast enough to create the problem - I don't think we know enough yet to really claim this is true. At this time, we really don't know -why- one process is running slow, or even if it is -always- the root process that is doing so...nor have we confirmed (to my knowledge) that our original analysis of the problem is correct!

We would appreciate any help with this problem.
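To make the workaround concrete, its effect at the application level is roughly the following (a hedged sketch only; the real coll/sync component interposes the barrier inside the library, driven by the coll_sync_barrier_before MCA parameter, and the interval N, buffer size, and loop bound here are illustrative):

    #include <mpi.h>

    #define N 1000  /* hypothetical barrier interval; in the real component
                       this is the coll_sync_barrier_before parameter */

    /* Broadcast loop with a periodic barrier, so a slow (root) process can
       drain its backlog of unexpected messages before memory fills up. */
    static void bcast_loop(char *buf, int count, int niter)
    {
        int i;
        for (i = 0; i < niter; i++) {
            if (i > 0 && i % N == 0)
                MPI_Barrier(MPI_COMM_WORLD);  /* let laggards catch up */
            MPI_Bcast(buf, count, MPI_BYTE, 0, MPI_COMM_WORLD);
        }
    }

    int main(int argc, char **argv)
    {
        static char buf[1024];
        MPI_Init(&argc, &argv);
        bcast_loop(buf, (int)sizeof(buf), 32768);  /* ~32 Mbytes total, as in
                                                      the reported Bcast loop */
        MPI_Finalize();
        return 0;
    }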
I gathered from today's telecon that others are also encountering this, so perhaps there is enough general pain to stimulate a team effort to resolve it!

Thanks
Ralph