Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Hi Ralph,

Thanks for your efforts. I will look at your configuration and see how it may differ from ours.

Here is a patch which helps reproduce the bug even with a small number of nodes.

diff -r b622b9e8f1ac orte/orted/orted_comm.c
--- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
+++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
@@ -126,6 +126,13 @@
             ORTE_ERROR_LOG(ret);
             goto CLEANUP;
         }
+        { /* Add delay to reproduce bug */
+            char * str = getenv("ORTE_RELAY_DELAY");
+            int sec = str ? atoi(str) : 0;
+            if (sec) {
+                sleep(sec);
+            }
+        }
     }
 
 CLEANUP:

Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.

During our experiments, the bug disappeared when we added a delay before calling MPI_Init. So, configurations where processes are launched slowly or take some time before MPI_Init should be immune to this bug.

We usually reproduce the bug with one ppn (faster to spawn).

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

> Hi Sylvain
>
> I've spent several hours trying to replicate the behavior you described on
> clusters up to a couple of hundred nodes (all running slurm), without
> success. I'm becoming increasingly convinced that this is a configuration
> issue as opposed to a code issue.
>
> I have enclosed the platform file I use below. Could you compare it to your
> configuration? I'm wondering if there is something critical about the config
> that may be causing the problem (perhaps we have a problem in our default
> configuration).
>
> Also, is there anything else you can tell us about your configuration? How
> many ppn triggers it, or do you always get the behavior every time you
> launch over a certain number of nodes?
>
> Meantime, I will look into this further. I am going to introduce a "slow
> down" param that will force the situation you encountered - i.e., will
> ensure that the relay is still being sent when the daemon receives the first
> collective input. We can then use that to try and force replication of the
> behavior you are encountering.
>
> Thanks
> Ralph
>
> enable_dlopen=no
> enable_pty_support=no
> with_blcr=no
> with_openib=yes
> with_memory_manager=no
> enable_mem_debug=yes
> enable_mem_profile=no
> enable_debug_symbols=yes
> enable_binaries=yes
> with_devel_headers=yes
> enable_heterogeneous=no
> enable_picky=yes
> enable_debug=yes
> enable_shared=yes
> enable_static=yes
> with_slurm=yes
> enable_contrib_no_build=libnbc,vt
> enable_visibility=yes
> enable_memchecker=no
> enable_ipv6=no
> enable_mpi_f77=no
> enable_mpi_f90=no
> enable_mpi_cxx=no
> enable_mpi_cxx_seek=no
> enable_mca_no_build=pml-dr,pml-crcp2,crcp
> enable_io_romio=no
>
> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>
>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>
>>> Thank you Ralph for this precious help.
>>>
>>> I set up a quick-and-dirty patch basically postponing process_msg (hence
>>> daemon_collective) until the launch is done. In process_msg, I therefore
>>> requeue a process_msg handler and return.
>>
>> That is basically the idea I proposed, just done in a slightly different
>> place.
>>
>>> In this "all-must-be-non-blocking-and-done-through-opal_progress"
>>> algorithm, I don't think that blocking calls like the one in
>>> daemon_collective should be allowed. This also applies to the blocking
>>> one in send_relay. [Well, actually, one is okay, 2 may lead to
>>> interlocking.]
>>
>> Well, that would be problematic - you will find "progressed_wait" used
>> repeatedly in the code. Removing them all would take a -lot- of effort and
>> a major rewrite. I'm not yet convinced it is required. There may be
>> something strange in how you are set up, or your cluster - like I said, this
>> is the first report of a problem we have had, and people with much bigger
>> slurm clusters have been running this code every day for over a year.
>>
>>> If you have time to do a nicer patch, it would be great and I would be
>>> happy to test it. Otherwise, I will try to implement your idea properly
>>> next week (with my limited knowledge of orted).
>>
>> Either way is fine - I'll see if I can get to it.
>>
>> Thanks
>> Ralph
>>
>>> For the record, here is the patch I'm currently testing at large scale:
>>>
>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
>>> @@ -687,14 +687,6 @@
>>>          opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>      }
>>>
>>> -    /* it may be possible to get here prior to having actually finished processing our
>>> -     * local launch msg due to the race condition between different nodes and when
>>> -     * they start their individual procs. Hence, we have to first ensure that we
>>> -     * -have- finished processing the launch msg, or else we won't know whether
>>> -     * or not to wait before sending this on
>>> -     */
>>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>> -
>>>      /* unpack the collective type */
>>>      n
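The "requeue the handler and return" idea discussed above can be illustrated with a toy, single-threaded event queue. This is only a sketch under stated assumptions: `event_loop_t`, `loop_post`, `loop_progress`, `finish_launch`, and `process_collective` are hypothetical names invented for the illustration, not ORTE or OPAL APIs, and the real orted code is driven by its event library rather than a fixed-size ring buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

typedef struct event_loop event_loop_t;
typedef void (*handler_fn)(event_loop_t *loop, void *arg);

typedef struct {
    handler_fn fn;
    void *arg;
} event_t;

/* Hypothetical toy loop: a ring buffer of pending handlers plus the
 * state the handlers care about. */
struct event_loop {
    event_t queue[QUEUE_CAP];
    size_t head;
    size_t count;
    bool launch_msg_processed;
    int collectives_handled;
    int requeues;
};

/* Append a handler to the queue; returns false if the queue is full. */
static bool loop_post(event_loop_t *loop, handler_fn fn, void *arg)
{
    if (loop->count == QUEUE_CAP) {
        return false;
    }
    size_t tail = (loop->head + loop->count) % QUEUE_CAP;
    loop->queue[tail].fn = fn;
    loop->queue[tail].arg = arg;
    loop->count++;
    return true;
}

/* Run exactly one pending handler; returns false when nothing is queued. */
static bool loop_progress(event_loop_t *loop)
{
    if (loop->count == 0) {
        return false;
    }
    event_t ev = loop->queue[loop->head];
    loop->head = (loop->head + 1) % QUEUE_CAP;
    loop->count--;
    ev.fn(loop, ev.arg);
    return true;
}

/* Stands in for "the local launch message has been fully processed". */
static void finish_launch(event_loop_t *loop, void *arg)
{
    (void)arg;
    loop->launch_msg_processed = true;
}

/* Instead of blocking inside the handler until the launch is done,
 * put the handler back on the queue and return to the loop. */
static void process_collective(event_loop_t *loop, void *arg)
{
    if (!loop->launch_msg_processed) {
        bool ok = loop_post(loop, process_collective, arg);
        assert(ok);
        (void)ok;
        loop->requeues++;
        return;
    }
    loop->collectives_handled++;
}
```

Posting `process_collective` before `finish_launch` and then calling `loop_progress` until it returns false handles the collective after one requeue, without any handler waiting inside another. Note that a bare requeue like this would spin if nothing else were ever queued; a real implementation would presumably rely on the event library to wake the handler.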
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Thanks! I'll give it a try.

My tests are all conducted with fast launches (just running slurm on large clusters) and using an mpi hello world that calls mpi_init at first instruction. I'll see if adding the delay causes it to misbehave.
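For anyone adapting the ORTE_RELAY_DELAY reproducer, a slightly more defensive way to read the variable might look like the sketch below. It is not part of the posted patch: `parse_delay_seconds`, `maybe_delay_relay`, and the `MAX_DELAY_SEC` cap are hypothetical names and values chosen for illustration. Unlike `atoi`, `strtol` lets the caller reject trailing garbage and out-of-range input.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed upper bound so a typo cannot stall a daemon for hours. */
#define MAX_DELAY_SEC 60L

/* Parse a delay in whole seconds; any malformed, negative, or
 * out-of-range value is treated as "no delay". */
static unsigned int parse_delay_seconds(const char *value)
{
    if (value == NULL || *value == '\0') {
        return 0;
    }
    char *end = NULL;
    errno = 0;
    long sec = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' || sec <= 0) {
        return 0;
    }
    return (unsigned int)(sec > MAX_DELAY_SEC ? MAX_DELAY_SEC : sec);
}

/* Sleep for the requested time, mirroring what the reproducer does
 * right after the relay is sent. */
static void maybe_delay_relay(void)
{
    unsigned int sec = parse_delay_seconds(getenv("ORTE_RELAY_DELAY"));
    if (sec > 0) {
        sleep(sec);
    }
}
```

With this variant, setting ORTE_RELAY_DELAY to 1 still sleeps for one second, while an unset or malformed value leaves the launch path untouched.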
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
BTW: does this reproduce on the trunk and/or 1.3.4 as well? I'm wondering because we know the 1.5 branch is skewed relative to the trunk. Could well be a bug sitting over there.
[OMPI devel] Fwd: Call for participation: MPI Forum User Survey
The MPI Forum announced at its SC09 BOF that they are soliciting community feedback to help guide the MPI-3 standards process. A survey is available online at the following URL:

http://mpi-forum.questionpro.com/
Password: mpi3

In this survey, the MPI Forum is asking as many people as possible for feedback on the MPI-3 process -- what features to include, what features to not include, etc.

We encourage you to forward this survey on to as many interested and relevant parties as possible.

It will take approximately 10 minutes to complete the questionnaire. No question in the survey is mandatory; feel free to only answer the questions which are relevant to you and your applications. Your answers will help the MPI Forum guide its process to create a genuinely useful MPI-3 standard.

This survey closes December 31, 2009.

Your survey responses will be strictly confidential and data from this research will be reported only in the aggregate. Your information will be coded and will remain confidential. If you have questions at any time about the survey or the procedures, you may contact the MPI Forum via email to mpi-comme...@mpi-forum.org.

Thank you very much for your time and support.

--
Jeff Squyres
jsquy...@cisco.com