Re: [OMPI devel] [RFC] Low pressure OPAL progress
Hi Jeff,

Thanks for jumping in.

On Tue, 9 Jun 2009, Jeff Squyres wrote:
> 2. Note that your solution presupposes that one MPI process can detect
> that the entire job is deadlocked. This is not quite correct. What
> exactly do you want to detect -- that one process may be imbalanced on
> its receives (waiting for long periods of time without doing anything),
> or that the entire job is deadlocked? The former may be ok -- it depends
> on the app. If the latter, it requires a bit more work -- e.g., if one
> process detects that nothing has happened for a long time, it can
> initiate a collective/distributed deadlock detection algorithm with all
> the other MPI processes in the job. Only if *all* processes agree can
> you say "this job is deadlocked, we might as well abort." IIRC, there
> are some 3rd-party tools/libraries that do this kind of stuff...?
> (although it might be cool / useful to incorporate some of this
> technology into OMPI itself)

My approach was based on per-process detection. Of course this alone does not indicate that the job is stuck, but tools like Ganglia will quickly show whether all processes are in the "sleep" state or not (maybe combined with debugging tools, to see whether they are all really in MPI and not blocked in I/O or something else). The user or the admin can then decide whether to abort the job. The "sleep" was only a way for me to bring the information to the user/admin. As Ralph stated, a log would be even better in this case (more precise, no performance penalty, ...), although it needs to be coupled with other tools (whereas the sleep was naturally coupled with Ganglia).

> 3. As Ralph noted, how exactly do you know when "nothing happens for a
> long time" is a bad thing? a) some codes are structured that way -- they
> will have no MPI activity for a long time, even if they have pending
> non-blocking receives pre-posted. b) are you looking within the scope of
> *one* MPI blocking call? I.e., if nothing happens *within the span of
> one blocking MPI call*, or are you looking if nothing happens across
> successive calls to opal_progress() (which may be few and far between
> after OMPI hits steady state when using non-TCP networks)? It seems like
> there would need to be a [thread-safe] "reset" at some point --
> indicating that something has happened. That could be when something has
> happened, or when a blocking MPI call has exited, or ? Need to make sure
> that that "reset" doesn't get expensive.

Uh, this is way more complicated than my patch. From the various reactions, it seems my RFC was misleading. I only work in opal_condition_wait(), which calls opal_progress(). The idea was only to sleep once we have been blocked in an MPI wait (or similar) for a long time. So we sleep only when there is no possible background computation: the MPI process is waiting and basically doing nothing else. MPI_Test functions will never trigger the sleep. Whether opal_progress() made progress or not does not matter; the only question is: how long have we been in opal_condition_wait()?

What I would like to do now is replace the sleep with a message sent to the HNP indicating "I have been blocked for X minutes", then X minutes later "I have been blocked for 2X minutes", and so on (a rough sketch of this reporting loop follows at the end of this message). The HNP would then aggregate those messages and, once every process has sent one, log "Everyone is blocked for X minutes", then (I presume) X minutes later "Everyone is blocked for 2X minutes", etc. I would then let users, admins or admin tools decide whether or not to abort the job.

If a process finally receives something, it should send a message to the HNP indicating that it is no longer blocked; or maybe just looking at the logs would suffice to see whether the reported block times continue to increase. Since I am only working in opal_condition_wait(), deadlocks in applications using only MPI_Test calls will not be detected (but is that even possible in the first place?).

Sylvain

On Jun 9, 2009, at 6:43 AM, Ralph Castain wrote:
> Couple of other things to help stimulate the thinking:
>
> 1. it isn't that OMPI -couldn't- receive a message, but rather that it
> -didn't- receive a message. This may or may not indicate that there is a
> problem. Could just be an application that doesn't need to communicate
> for awhile, as per my example. I admit, though, that 10 minutes is a tad
> long...but I've seen some bizarre apps around here :-)
>
> 2. instead of putting things to sleep or even adjusting the loop rate,
> you might want to consider using the orte_notifier capability and
> notifying the system that the job may be stalled. Or perhaps adding an
> API to the orte_errmgr framework to notify it that nothing has been
> received for awhile, and letting people implement different strategies
> for detecting what might be "wrong" and what they want to do about it.
>
> My point with this second bullet is that there are other response
> options than hardwiring putting the process to sleep. You could let
> someone know so a huma
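Below is a minimal, self-contained sketch of the reporting loop described above. It is not the actual opal_condition_wait() implementation; the names wait_with_block_reporting, progress_once, notify_blocked and BLOCK_REPORT_INTERVAL are hypothetical stand-ins, and the interval is shortened to seconds so the demo actually runs.

/* Hypothetical sketch of the "report instead of sleep" idea -- not the
 * real opal_condition_wait().  All names here are made up. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_REPORT_INTERVAL 2   /* seconds here; X minutes in the proposal */

/* Stand-in for one pass of opal_progress(): returns true once the
 * condition we are waiting on is finally satisfied. */
static bool progress_once(void)
{
    static int calls = 0;
    return ++calls > 7;           /* pretend we unblock after a while */
}

/* Stand-in for "send a message to the HNP". */
static void notify_blocked(int intervals)
{
    printf("blocked for %d x interval\n", intervals);
}

static void wait_with_block_reporting(void)
{
    time_t start = time(NULL);
    int reports_sent = 0;

    while (!progress_once()) {
        double elapsed = difftime(time(NULL), start);

        /* After each further interval of continuous blocking, escalate
         * the report: "blocked for X", "blocked for 2X", ... */
        if (elapsed >= (double)(reports_sent + 1) * BLOCK_REPORT_INTERVAL)
            notify_blocked(++reports_sent);

        sleep(1);  /* demo pacing only; the real loop keeps calling opal_progress() */
    }

    if (reports_sent > 0)
        printf("no longer blocked\n");   /* clear the condition at the HNP */
}

int main(void)
{
    wait_with_block_reporting();
    return 0;
}

In the real proposal, notify_blocked() would send an ORTE message to the HNP rather than printing, and the HNP would aggregate the reports across all processes before logging anything.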
[OMPI devel] Does open MPI support nodes behind NAT or Firewall
Hi everyone,

I wanted to know whether Open MPI supports nodes that are behind a NAT or a firewall. If it doesn't do this by default, can anyone let me know how I should go about making Open MPI work with NAT and firewalls?

LEO
Re: [OMPI devel] Hang in collectives involving shared memory
Ralph,

If I may say so, this is exactly the type of problem the tool I have been working on recently aims to help with, and I'd be happy to help you through it.

Firstly, of the three collectives you mention (MPI_Allgather, MPI_Reduce and MPI_Bcast), one exhibits a many-to-many, one a many-to-one and the last a one-to-many communication pattern. The scenario of a root process falling behind and getting swamped in comms is a plausible one for MPI_Reduce only, but doesn't hold water for the other two. You also don't mention whether the loop is over a single collective or whether each iteration calls a number of different collectives.

padb, the tool I've been working on, has the ability to look at parallel jobs and report on the state of collective comms, and should help narrow down which processes are erroneous and which are simply blocked waiting for comms. I'd recommend using it to look at maybe four or five instances where the application has hung and look for any common features between them.

Let me know if you are willing to try this route and we can talk; the code is downloadable from http://padb.pittman.org.uk and if you want the full collective functionality you'll need to patch Open MPI with the patch from http://padb.pittman.org.uk/extensions.html

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
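For concreteness, here is a minimal sketch of the kind of loop under discussion, with each iteration calling several different collectives. It only illustrates the three communication patterns mentioned above; it is not the reproducer attached to the ticket.

/* Illustrative only: a loop mixing the three collectives under discussion.
 * This is not the reproducer from ticket #1944. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, iter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *gathered = malloc((size_t)size * sizeof(int));
    if (gathered == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int local = rank, sum = 0, seed = 0;

    for (iter = 0; iter < 1000; iter++) {
        /* many-to-many: every rank receives a contribution from every rank */
        MPI_Allgather(&local, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);

        /* many-to-one: only the root receives (and could fall behind) */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        /* one-to-many: the root sends to everyone */
        if (rank == 0)
            seed = iter;
        MPI_Bcast(&seed, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }

    free(gathered);
    MPI_Finalize();
    return 0;
}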
Re: [OMPI devel] Hang in collectives involving shared memory
Hi Ashley

Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts?

https://svn.open-mpi.org/trac/ompi/ticket/1944

Will be back after I look at the tool.

Thanks again
Ralph
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
>
> Thanks! I would definitely be interested and will look at the tool.

Great. My plan was to introduce the tool to this list today or tomorrow anyway, but this problem falls right in its target area so I brought it up early.

> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944

One thing that springs to mind: does the Fortran reproducer hang on other machines if you use the same process geometry? That would tell us whether we are looking at a pure Open MPI problem or a wider issue, potentially eliminating any questions about NUMA memory layout.

> Will be back after I look at the tool.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI devel] Hang in collectives involving shared memory
Hum, very glad that padb works with Open MPI; I couldn't live without it. In my opinion it is the best debug tool for parallel applications and, more importantly, the only one that scales.

About the issue: I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each; nodes are twin 2.93 GHz Nehalem, IB is Mellanox QDR).

So my feeling is that it may be very hardware related. Especially if you use the hierarch component, some transactions will be done through RDMA on one side and read directly through shared memory on the other side, which can, depending on the hardware, produce very different timings and bugs. Did you try with a different collective component (i.e. not hierarch)? Or with another interconnect? [Yes, of course, if it is a race condition, we might well avoid the bug because timings will be different, but that's still information.]

Perhaps all I'm saying makes no sense or you already thought about this; anyway, if you want me to try different things, just let me know.

Sylvain
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 10 Jun 2009, Ralph Castain wrote:
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
> https://svn.open-mpi.org/trac/ompi/ticket/1944

I wasn't able to reproduce this. I have run with the following setup:

- OS is Scientific Linux 5.1 with a custom compiled kernel based on 2.6.22.19, but (due to circumstances that I can't control):
  checking if MCA component maffinity:libnuma can compile... no
- Intel compiler 10.1
- Open MPI 1.3.2
- nodes have 2 CPUs of type E5440 (quad core), 16 GB RAM and a ConnectX IB DDR

I've used the platform file that you provided, but took out the references to PanFS and fixed the paths. I've also used the MCA file that you provided.

I have run with nodes=1:ppn=8 and nodes=2:ppn=8 and the test finished successfully with m=50 several times. This, together with the earlier post also describing a negative result, points to a problem related to your particular setup...

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI devel] Hang in collectives involving shared memory
I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret. I agree, though, that this is likely something peculiar to our particular setup.

Of primary concern is that it might be related to the relatively old kernel (2.6.18) on these machines. There has been a lot of change since that kernel was released, and some of those changes may be relevant to this problem. Unfortunately, upgrading the kernel will take persuasive argument. We are going to try to run the reproducers on machines with more modern kernels to see if we get different behavior.

Please feel free to follow this further on the ticket.

Thanks again!
Ralph
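If it helps while chasing the NUMA/binding angle, a small standalone diagnostic along these lines can be run under mpirun on the affected machines to record which cores each rank is allowed to use. This is a generic Linux sketch (glibc sched_getaffinity), not part of Open MPI, and is only meant as an illustration.

/* Generic Linux diagnostic (not part of Open MPI): print each rank's
 * CPU affinity mask, handy when suspecting binding/NUMA effects. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        char cpus[1024] = "";
        size_t used = 0;

        /* Collect the list of CPUs this process may run on. */
        for (int cpu = 0; cpu < CPU_SETSIZE && used < sizeof(cpus) - 8; cpu++) {
            if (CPU_ISSET(cpu, &mask))
                used += (size_t)snprintf(cpus + used, sizeof(cpus) - used, "%d ", cpu);
        }
        printf("rank %d may run on CPUs: %s\n", rank, cpus);
    } else {
        perror("sched_getaffinity");
    }

    MPI_Finalize();
    return 0;
}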
Re: [OMPI devel] Hang in collectives involving shared memory
Much appreciated!

Per some of my other comments on this thread and on the referenced ticket, can you tell me what kernel you have on that machine? I assume you have NUMA support enabled, given that chipset?

Thanks!
Ralph

On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey wrote:
> About the issue: I couldn't reproduce it on my platform (tried 2 nodes
> with 2 to 8 processes each; nodes are twin 2.93 GHz Nehalem, IB is
> Mellanox QDR).
[OMPI devel] padb and orte
All,

As mentioned in another thread, I've recently ported padb, a command-line job inspection tool (kind of like a parallel debugger), to ORTE and Open MPI. Padb is an existing, stable product which has worked for a number of years on Slurm and RMS; ORTE support is new and not widely tested yet, although it works for all the cases I've tried.

For those who haven't used it, padb is an open source command-line tool which, among other things, can collect stack traces, display MPI message queues and present a lot of process information about parallel jobs to the user in an accessible way. Ideally padb will find its place within the day-to-day workings of Open MPI developers and become a recommended tool for users as well. It also has a mode where it can be launched automatically to gather information about job hangs without human intervention; I'd be willing to work with the Open MPI team to integrate this into the MTT code if desired.

I would encourage you to download it and try it out. If it works for you and you like it, that's great; if not, let me know and I'll do what I can to fix it. There is a website and public mailing lists for padb issues, or I am happy to discuss ORTE-specific issues on this list. The website is at http://padb.pittman.org.uk and I welcome any feedback, either here, off-list or on either of the padb mailing lists.

Yours,

Ashley Pittman,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI devel] Hang in collectives involving shared memory
On Wed, 10 Jun 2009, Ralph Castain wrote:
> I appreciate the input and have captured it in the ticket. Since this
> appears to be a NUMA-related issue, the lack of NUMA support in your
> setup makes the test difficult to interpret.

Based on this reasoning, disabling libnuma support in your Open MPI build should also solve the problem, or do I interpret things the wrong way?

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI devel] Hang in collectives involving shared memory
Well, it would - except then -all- the procs would run real slow! :-)

Still, it might be a reasonable diagnostic step to try... will give it a shot.