[hwloc-devel] Create success (hwloc r1.0a1r1365)
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot: hwloc 1.0a1r1365
Start time: Thu Nov 19 21:01:04 EST 2009
End time: Thu Nov 19 21:02:58 EST 2009

Your friendly daemon,
Cyrador
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Hi Sylvain

I've spent several hours trying to replicate the behavior you described on clusters of up to a couple of hundred nodes (all running slurm), without success. I'm becoming increasingly convinced that this is a configuration issue as opposed to a code issue.

I have enclosed the platform file I use below. Could you compare it to your configuration? I'm wondering if there is something critical about the config that may be causing the problem (perhaps we have a problem in our default configuration).

Also, is there anything else you can tell us about your configuration? How many ppn triggers it, or do you always get the behavior every time you launch over a certain number of nodes?

Meantime, I will look into this further. I am going to introduce a "slow down" param that will force the situation you encountered - i.e., will ensure that the relay is still being sent when the daemon receives the first collective input. We can then use that to try and force replication of the behavior you are encountering.

Thanks
Ralph

enable_dlopen=no
enable_pty_support=no
with_blcr=no
with_openib=yes
with_memory_manager=no
enable_mem_debug=yes
enable_mem_profile=no
enable_debug_symbols=yes
enable_binaries=yes
with_devel_headers=yes
enable_heterogeneous=no
enable_picky=yes
enable_debug=yes
enable_shared=yes
enable_static=yes
with_slurm=yes
enable_contrib_no_build=libnbc,vt
enable_visibility=yes
enable_memchecker=no
enable_ipv6=no
enable_mpi_f77=no
enable_mpi_f90=no
enable_mpi_cxx=no
enable_mpi_cxx_seek=no
enable_mca_no_build=pml-dr,pml-crcp2,crcp
enable_io_romio=no

On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:

>
> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>
>> Thank you Ralph for this precious help.
>>
>> I set up a quick-and-dirty patch basically postponing process_msg (hence
>> daemon_collective) until the launch is done. In process_msg, I therefore
>> requeue a process_msg handler and return.
>
> That is basically the idea I proposed, just done in a slightly different place
>
>>
>> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm,
>> I don't think that blocking calls like the one in daemon_collective should
>> be allowed. This also applies to the blocking one in send_relay. [Well,
>> actually, one is okay, 2 may lead to interlocking.]
>
> Well, that would be problematic - you will find "progressed_wait" used
> repeatedly in the code. Removing them all would take a -lot- of effort and a
> major rewrite. I'm not yet convinced it is required. There may be something
> strange in how you are set up, or your cluster - like I said, this is the
> first report of a problem we have had, and people with much bigger slurm
> clusters have been running this code every day for over a year.
>
>>
>> If you have time to do a nicer patch, it would be great and I would be happy
>> to test it. Otherwise, I will try to implement your idea properly next week
>> (with my limited knowledge of orted).
>
> Either way is fine - I'll see if I can get to it.
>
> Thanks
> Ralph
>
>>
>> For the record, here is the patch I'm currently testing at large scale :
>>
>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
>> @@ -687,14 +687,6 @@
>>          opal_list_append(&orte_local_jobdata, &jobdat->super);
>>      }
>>
>> -    /* it may be possible to get here prior to having actually finished processing our
>> -     * local launch msg due to the race condition between different nodes and when
>> -     * they start their individual procs. Hence, we have to first ensure that we
>> -     * -have- finished processing the launch msg, or else we won't know whether
>> -     * or not to wait before sending this on
>> -     */
>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>> -
>>      /* unpack the collective type */
>>      n = 1;
>>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>> @@ -894,6 +886,28 @@
>>
>>      proc = &mev->sender;
>>      buf = mev->buffer;
>> +
>> +    jobdat = NULL;
>> +    for (item = opal_list_get_first(&orte_local_jobdata);
>> +         item != opal_list_get_end(&orte_local_jobdata);
>> +         item = opal_list_get_next(item)) {
>> +        jobdat = (orte_odls_job_t*)item;
>> +
>> +        /* is this the specified job? */
>> +        if (jobdat->jobid == proc->jobid) {
>> +            break;
>> +        }
>> +    }
>> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>> +        /* it may be possible to get here prior to having actually finished processing our
>> +         * local launch msg due to the race condition between different nodes and when
>> +         * they start their individual procs. Hence, we have to first ensure that we
>> +         * -have- finished
Re: [hwloc-devel] Crash with ignoring HWLOC_OBJ_NODE in 0.9.2
Michael Raymond, on Thu 19 Nov 2009 14:33:49 -0600, wrote:
> --- hwloc-0.9.2/src/topology-linux.c 2009-11-03 16:40:31.0 -0600
> +++ hwloc-new//src/topology-linux.c 2009-11-19 14:20:43.630035434 -0600
> @@ -536,6 +536,10 @@
>    struct dirent *dirent;
>    hwloc_obj_t node;
>
> +  if (topology->ignored_types[HWLOC_OBJ_NODE] == HWLOC_IGNORE_TYPE_ALWAYS) {
> +    return;
> +  }
> +
>    dir = hwloc_opendir(path, topology->backend_params.sysfs.root_fd);
>    if (dir)
>      {

Mmm, indeed. And it will happen on other OSes where we get the distances too, e.g. Solaris. Does the attached, more generic patch properly fix it too?

> Also I'm concerned about the value of CPUSET_MASK_LEN in
> hwloc_admin_disable_set_from_cpuset(). It's only 64 characters but our
> Linux boxes can have up to 2048 processors. I don't think there's any harm
> in bumping that up a little.

Mmm, even better, we can avoid using a constant size completely. I've committed a fix.

Samuel

Index: src/topology.c
===================================================================
--- src/topology.c (revision 1364)
+++ src/topology.c (working copy)
@@ -298,6 +298,9 @@
   if (getenv("HWLOC_IGNORE_DISTANCES"))
     return;

+  if (topology->ignored_types[HWLOC_OBJ_NODE] == HWLOC_IGNORE_TYPE_ALWAYS)
+    return;
+
 #ifdef HWLOC_DEBUG
   hwloc_debug("node distance matrix:\n");
   hwloc_debug(" ");
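
The committed fix for the CPUSET_MASK_LEN concern is not shown in this thread. As a rough illustration of what "avoid using a constant size completely" can look like, here is a minimal sketch that reads one line of arbitrary length by growing the buffer on demand. The helper name read_line_dynamic and the starting capacity are assumptions made for this example, not hwloc code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of hwloc): read one line of any length.
 * Returns a malloc'ed, NUL-terminated string without the trailing newline,
 * or NULL at end of file or on allocation failure. The caller frees it. */
char *read_line_dynamic(FILE *f)
{
    size_t cap = 64;   /* start at the old fixed size, grow when it fills up */
    size_t len = 0;
    char *buf;

    assert(f != NULL);
    buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), f) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = '\0';          /* complete line: drop the newline */
            return buf;
        }
        if (len + 1 == cap) {           /* buffer full, line continues */
            char *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }

    if (len == 0) {                     /* nothing read before end of file */
        free(buf);
        return NULL;
    }
    return buf;                         /* last line had no newline */
}
```

With a reader like this, a mask line for 2048 processors (about 512 hex digits plus separators) is handled the same way as a 64-character one, and the caller only has to free the result.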
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Thank you Ralph for this precious help.

I set up a quick-and-dirty patch basically postponing process_msg (hence daemon_collective) until the launch is done. In process_msg, I therefore requeue a process_msg handler and return.

In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, I don't think that blocking calls like the one in daemon_collective should be allowed. This also applies to the blocking one in send_relay. [Well, actually, one is okay, 2 may lead to interlocking.]

If you have time to do a nicer patch, it would be great and I would be happy to test it. Otherwise, I will try to implement your idea properly next week (with my limited knowledge of orted).

For the record, here is the patch I'm currently testing at large scale :

diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
--- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
+++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
@@ -687,14 +687,6 @@
         opal_list_append(&orte_local_jobdata, &jobdat->super);
     }

-    /* it may be possible to get here prior to having actually finished processing our
-     * local launch msg due to the race condition between different nodes and when
-     * they start their individual procs. Hence, we have to first ensure that we
-     * -have- finished processing the launch msg, or else we won't know whether
-     * or not to wait before sending this on
-     */
-    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
-
     /* unpack the collective type */
     n = 1;
     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
@@ -894,6 +886,28 @@

     proc = &mev->sender;
     buf = mev->buffer;
+
+    jobdat = NULL;
+    for (item = opal_list_get_first(&orte_local_jobdata);
+         item != opal_list_get_end(&orte_local_jobdata);
+         item = opal_list_get_next(item)) {
+        jobdat = (orte_odls_job_t*)item;
+
+        /* is this the specified job? */
+        if (jobdat->jobid == proc->jobid) {
+            break;
+        }
+    }
+    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
+        /* it may be possible to get here prior to having actually finished processing our
+         * local launch msg due to the race condition between different nodes and when
+         * they start their individual procs. Hence, we have to first ensure that we
+         * -have- finished processing the launch msg. Requeue this event until it is done.
+         */
+        int tag = mev->tag;
+        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
+        return;
+    }

     /* is the sender a local proc, or a daemon relaying the collective? */
     if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {

Sylvain

On Thu, 19 Nov 2009, Ralph Castain wrote:

Very strange. As I said, we routinely launch jobs spanning several hundred nodes without problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc

That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one possibility would be to make all the communications flow through the daemon cmd processor in orte/orted_comm.c. This is the way it used to work until I reorganized the code a year ago for other reasons that never materialized.

Unfortunately, the daemon collective has to wait until the local launch cmd has been completely processed so it can know whether or not to wait for contributions from local procs before sending along the collective message, so this kinda limits our options.

About the only other thing you could do would be to not send the relay at all until -after- processing the local launch cmd. You can then remove the "wait" in the daemon collective as you will know how many local procs are involved, if any. I used to do it that way and it guarantees it will work. The negative is that we lose some launch speed as the next nodes in the tree don't get the launch message until this node finishes launching all its procs.

The way around that, of course, would be to:

1. process the launch message, thus extracting the number of any local procs and setting up all data structures...but do -not- launch the procs at this time (as this is what takes all the time)

2. send the relay - the daemon collective can now proceed without a "wait" in it

3. now launch the local procs

It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you do it, please contribute it back to the trunk.

Ralph

On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:

I would say I use the default settings, i.e. I don't set anything "special" at configure.

I'm launching my processes with SLURM (salloc + mpirun).

Sylvain

On Wed, 18 Nov 2009, Ralph Castain wrote:

How did you configure OMPI?

What launch mechanism
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
Very strange. As I said, we routinely launch jobs spanning several hundred nodes without problem. You can see the platform files for that setup in contrib/platform/lanl/tlcc

That said, it is always possible you are hitting some kind of race condition we don't hit. In looking at the code, one possibility would be to make all the communications flow through the daemon cmd processor in orte/orted_comm.c. This is the way it used to work until I reorganized the code a year ago for other reasons that never materialized.

Unfortunately, the daemon collective has to wait until the local launch cmd has been completely processed so it can know whether or not to wait for contributions from local procs before sending along the collective message, so this kinda limits our options.

About the only other thing you could do would be to not send the relay at all until -after- processing the local launch cmd. You can then remove the "wait" in the daemon collective as you will know how many local procs are involved, if any. I used to do it that way and it guarantees it will work. The negative is that we lose some launch speed as the next nodes in the tree don't get the launch message until this node finishes launching all its procs.

The way around that, of course, would be to:

1. process the launch message, thus extracting the number of any local procs and setting up all data structures...but do -not- launch the procs at this time (as this is what takes all the time)

2. send the relay - the daemon collective can now proceed without a "wait" in it

3. now launch the local procs

It would be a fairly simple reorganization of the code in the orte/mca/odls area. I can do it this weekend if you like, or you can do it - either way is fine, but if you do it, please contribute it back to the trunk.

Ralph

On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:

> I would say I use the default settings, i.e. I don't set anything "special"
> at configure.
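
The three-step reordering Ralph describes in this message (parse the launch message, send the relay, then launch the local procs) can be sketched as a toy model. This is only an illustration of the ordering under stated assumptions: the struct, its fields and every toy_ function are hypothetical names for this example, and none of it is the actual orte/mca/odls code.

```c
#include <assert.h>
#include <stddef.h>

enum step { STEP_PARSE = 1, STEP_RELAY, STEP_LAUNCH };

/* Toy stand-in for the per-job bookkeeping; every field name is hypothetical. */
struct toy_job {
    int num_local_procs;       /* known as soon as the message is parsed */
    int launch_msg_processed;  /* 1 once a collective may rely on the count */
    int processed_at_relay;    /* snapshot of the flag when the relay goes out */
    int procs_started;
    enum step trace[3];        /* order in which the steps actually ran */
    size_t ntrace;
};

static void record(struct toy_job *job, enum step s)
{
    assert(job->ntrace < 3);
    job->trace[job->ntrace++] = s;
}

/* Step 1: extract the local proc count and set up data structures only. */
void toy_parse_launch_msg(struct toy_job *job, int procs_in_msg)
{
    job->num_local_procs = procs_in_msg;
    job->launch_msg_processed = 1;
    record(job, STEP_PARSE);
}

/* Step 2: forward the launch message to the next nodes in the tree. */
void toy_send_relay(struct toy_job *job)
{
    job->processed_at_relay = job->launch_msg_processed;
    record(job, STEP_RELAY);
}

/* Step 3: the slow part, actually starting the local procs. */
void toy_launch_local_procs(struct toy_job *job)
{
    assert(job->launch_msg_processed);
    job->procs_started = job->num_local_procs;
    record(job, STEP_LAUNCH);
}

/* The proposed ordering: bookkeeping first, relay second, launch last. */
void toy_handle_launch_cmd(struct toy_job *job, int procs_in_msg)
{
    toy_parse_launch_msg(job, procs_in_msg);
    toy_send_relay(job);
    toy_launch_local_procs(job);
}
```

The point of the ordering is that the flag a collective would otherwise wait on is already set when the relay is sent, so a collective arriving during the relay has nothing to block on, while nodes further down the tree still receive the launch message before this node spends time starting its procs.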
Re: [OMPI devel] Finalize without Detach???
So is there any reason OMPI should not auto-detach buffers at Finalize? I understand that technically we don't have to, but not detaching incurs false performance degradations that make OMPI look significantly slower than other MPIs for no real reason. So unless there is some really good reason we shouldn't detach, I would think detaching would make sense.

--td

George Bosilca wrote:

There is no such statement in the MPI Standard. In fact, one of the examples uses exactly this: automatic detach at MPI_Finalize (example on page 310 of MPI 2.2). However, as the standard doesn't enforce a specific behavior, each MPI implementation can interpret/implement it differently. Therefore, by expecting the buffer detach at Finalize, users open themselves to "inconsistent" behaviors depending on the MPI library used. Conversely, by explicitly calling detach, the behavior is well defined in all cases.

george.

On Nov 18, 2009, at 15:41 , Eugene Loh wrote:

George Bosilca wrote:

The proper practice based on the MPI Standard will be to call the detach function before finalize.

I don't find this described anywhere in the standard. To what chapter/verse should I point a user to convince them that detach before finalize is the proper thing to do?

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm
I would say I use the default settings, i.e. I don't set anything "special" at configure.

I'm launching my processes with SLURM (salloc + mpirun).

Sylvain

On Wed, 18 Nov 2009, Ralph Castain wrote:

How did you configure OMPI?

What launch mechanism are you using - ssh?

On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:

I don't think so, and I'm not doing it explicitly at least. How do I know ?

Sylvain

On Tue, 17 Nov 2009, Ralph Castain wrote:

We routinely launch across thousands of nodes without a problem...I have never seen it stick in this fashion.

Did you build and/or are using ORTE threaded by any chance? If so, that definitely won't work.

On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:

Hi all,

We are currently experiencing problems at launch on the 1.5 branch on a relatively large number of nodes (at least 80). Some processes are not spawned and orted processes are deadlocked.

When MPI processes are calling MPI_Init before send_relay is complete, the send_relay function and the daemon_collective function are doing a nice interlock :

Here is the scenario :

> send_relay
performs the send tree :
  > orte_rml_oob_send_buffer
    > orte_rml_oob_send
      > opal_wait_condition
Waiting on completion from send thus calling opal_progress()
        > opal_progress()
But since a collective request arrived from the network, entered :
          > daemon_collective
However, daemon_collective is waiting for the job to be initialized (wait on jobdat->launch_msg_processed) before continuing, thus calling :
            > opal_progress()

At this time, the send may complete, but since we will never go back to orte_rml_oob_send, we will never perform the launch (setting jobdat->launch_msg_processed to 1).

I may try to solve the bug (this is quite a top priority problem for me), but maybe people who are more familiar with orted than I am may propose a nice and clean solution ...

For those who like real (and complete) gdb stacks, here they are :

#0 0x003b7fed4f38 in poll () from /lib64/libc.so.6
#1 0x7fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
#2 0x7fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
#3 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#4 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#5 0x7fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
#6 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
#7 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#8 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#9 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#10 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#11 0x7fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
#12 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
#13 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#14 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#15 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#16 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#17 0x7fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
#18 0x7fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
#19 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#20 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
#21 0x7fd0de5d556b in opal_event_loop (flags=1) at event.c:746
#22 0x7fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
#23 0x7fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
#24 0x7fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
#25 0x7fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
#26 0x7fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
#27 0x7fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
#28 0x7fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
#29 0x7fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
#30 0x7fd0de5d556b in opal_event_loop (flags=0) at event.c:746
#31 0x7fd0de5d5418 in opal_event_dispatch () at event.c:682
#32 0x7fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
#33 0x004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62

Thanks in advance,
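
The interlock in the scenario and stack above comes from a handler that waits by re-entering the progress loop while the frame that would satisfy the wait sits below it on the stack. The following is a minimal single-threaded sketch of that pattern and of the requeue alternative discussed later in the thread. It is a toy model under stated assumptions: the event queue, all function names and the SPIN_LIMIT escape hatch are invented for this example and are not ORTE or OPAL code.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP  16
#define SPIN_LIMIT 1000   /* toy-only escape hatch; the real wait has none */

typedef void (*handler_fn)(void);

static handler_fn queue[QUEUE_CAP];
static size_t head, tail;

int launch_msg_processed, send_done, collective_done, stuck, requeues;

void post(handler_fn h)
{
    assert(tail - head < QUEUE_CAP);
    queue[tail++ % QUEUE_CAP] = h;
}

/* Run at most one pending event; handlers may call this re-entrantly. */
int progress(void)
{
    if (head == tail)
        return 0;
    handler_fn h = queue[head++ % QUEUE_CAP];
    h();
    return 1;
}

void send_complete(void) { send_done = 1; }

/* Blocking variant: waits inside the handler, on top of launch_cmd's frame. */
void collective_blocking(void)
{
    for (int spins = 0; !launch_msg_processed; spins++) {
        if (spins > SPIN_LIMIT) {   /* would spin forever without this */
            stuck = 1;
            return;
        }
        progress();
    }
    collective_done = 1;
}

/* Requeue variant: never waits; re-posts itself until the launch is done. */
void collective_requeue(void)
{
    if (!launch_msg_processed) {
        requeues++;
        post(collective_requeue);
        return;
    }
    collective_done = 1;
}

/* Launch command: relay (a send completed via progress), then mark done. */
void launch_cmd(void)
{
    post(send_complete);
    while (!send_done)
        progress();
    launch_msg_processed = 1;
}

/* Reset state, deliver the launch command and one collective, drain events. */
void run_scenario(handler_fn collective)
{
    head = tail = 0;
    launch_msg_processed = send_done = collective_done = stuck = requeues = 0;
    post(launch_cmd);
    post(collective);
    while (progress())
        ;
}
```

Calling run_scenario(collective_blocking) ends with stuck set and the collective never completed, because the flag can only be set by the launch_cmd frame underneath the waiting handler. Calling run_scenario(collective_requeue) lets the stack unwind, so the send completes, the flag is set, and the re-posted collective then runs normally.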