Hi Sylvain
Well, I hate to tell you this, but I cannot reproduce the "bug" even with this
code in ORTE, no matter what value of ORTE_RELAY_DELAY I use. The system runs
really slowly as I increase the delay, but it completes the job just fine. I ran
jobs across 16 nodes on a Slurm machine, 1-4 ppn, with a "hello world" app that
calls MPI_Init immediately upon execution.
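For reference, the test app is nothing more than a minimal MPI hello world
along these lines (a sketch, not the exact source):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);               /* called immediately on startup */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("hello from rank %d\n", rank);
    MPI_Finalize();
    return 0;
}
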
So I have to conclude this is a problem in your setup/config. Are you sure you
didn't configure with --enable-progress-threads? That is the only way I can
recreate this behavior.
I plan to modify the relay/message processing method anyway to clean it up. But
there doesn't appear to be anything wrong with the current code.
Ralph
On Nov 20, 2009, at 6:55 AM, Sylvain Jeaugey wrote:
> Hi Ralph,
>
> Thanks for your efforts. I will look at your configuration and see how it may
> differ from ours.
>
> Here is a patch which helps reproduce the bug even with a small number of
> nodes.
>
> diff -r b622b9e8f1ac orte/orted/orted_comm.c
> --- a/orte/orted/orted_comm.c Wed Nov 18 09:27:55 2009 +0100
> +++ b/orte/orted/orted_comm.c Fri Nov 20 14:47:39 2009 +0100
> @@ -126,6 +126,13 @@
>              ORTE_ERROR_LOG(ret);
>              goto CLEANUP;
>          }
> +        { /* Add delay to reproduce bug */
> +            char * str = getenv("ORTE_RELAY_DELAY");
> +            int sec = str ? atoi(str) : 0;
> +            if (sec) {
> +                sleep(sec);
> +            }
> +        }
>      }
> 
> CLEANUP:
>
> Just set ORTE_RELAY_DELAY to 1 (second) and you should reproduce the bug.
>
> During our experiments, the bug disappeared when we added a delay before
> calling MPI_Init. So, configurations where processes are launched slowly or
> take some time before MPI_Init should be immune to this bug.
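>
> For example, something as crude as the following in the test program is
> enough to hide the bug (the 5-second value is only illustrative):
>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     sleep(5);                 /* any delay before MPI_Init hides the race */
>     MPI_Init(&argc, &argv);
>     /* ... */
>     MPI_Finalize();
>     return 0;
> }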
>
> We usually reproduce the bug with one ppn (faster to spawn).
>
> Sylvain
>
> On Thu, 19 Nov 2009, Ralph Castain wrote:
>
>> Hi Sylvain
>>
>> I've spent several hours trying to replicate the behavior you described on
>> clusters up to a couple of hundred nodes (all running slurm), without
>> success. I'm becoming increasingly convinced that this is a configuration
>> issue as opposed to a code issue.
>>
>> I have enclosed the platform file I use below. Could you compare it to your
>> configuration? I'm wondering if there is something critical about the config
>> that may be causing the problem (perhaps we have a problem in our default
>> configuration).
>>
>> Also, is there anything else you can tell us about your configuration? How
>> many ppn does it take to trigger it, or do you get the behavior every time
>> you launch over a certain number of nodes?
>>
>> Meantime, I will look into this further. I am going to introduce a "slow
>> down" param that will force the situation you encountered - i.e., will
>> ensure that the relay is still being sent when the daemon receives the first
>> collective input. We can then use that to try and force replication of the
>> behavior you are encountering.
>>
>> Thanks
>> Ralph
>>
>> enable_dlopen=no
>> enable_pty_support=no
>> with_blcr=no
>> with_openib=yes
>> with_memory_manager=no
>> enable_mem_debug=yes
>> enable_mem_profile=no
>> enable_debug_symbols=yes
>> enable_binaries=yes
>> with_devel_headers=yes
>> enable_heterogeneous=no
>> enable_picky=yes
>> enable_debug=yes
>> enable_shared=yes
>> enable_static=yes
>> with_slurm=yes
>> enable_contrib_no_build=libnbc,vt
>> enable_visibility=yes
>> enable_memchecker=no
>> enable_ipv6=no
>> enable_mpi_f77=no
>> enable_mpi_f90=no
>> enable_mpi_cxx=no
>> enable_mpi_cxx_seek=no
>> enable_mca_no_build=pml-dr,pml-crcp2,crcp
>> enable_io_romio=no
>>
>> On Nov 19, 2009, at 8:08 AM, Ralph Castain wrote:
>>
>>>
>>> On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:
>>>
>>>> Thank you Ralph for this precious help.
>>>>
>>>> I set up a quick-and-dirty patch that basically postpones process_msg
>>>> (and hence daemon_collective) until the launch is done. In process_msg, I
>>>> therefore requeue a process_msg handler and return.
>>>
>>> That is basically the idea I proposed, just done in a slightly different
>>> place
>>>
>>>>
>>>> In this "all-must-be-non-blocking-and-done-through-opal_progress"
>>>> algorithm, I don't think that blocking calls like the one in
>>>> daemon_collective should be allowed. This also applies to the blocking one
>>>> in send_relay. [Well, actually, one is okay; two may lead to interlocking.]
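>>>>
>>>> To illustrate the interlock (a rough sketch with made-up *_sketch names
>>>> and flags, not the real ORTE code):
>>>>
>>>> /* Two blocking waits nested through the same progress engine. */
>>>> static int send_complete        = 0;   /* set when the relay send finishes  */
>>>> static int launch_msg_processed = 0;   /* set after the launch cmd is done  */
>>>>
>>>> static void daemon_collective_sketch(void);
>>>>
>>>> static void opal_progress_sketch(void)
>>>> {
>>>>     static int collective_pending = 1;
>>>>     /* first call: dispatch the collective that arrived from the network */
>>>>     if (collective_pending) {
>>>>         collective_pending = 0;
>>>>         daemon_collective_sketch();
>>>>     }
>>>> }
>>>>
>>>> static void daemon_collective_sketch(void)
>>>> {
>>>>     while (!launch_msg_processed)      /* blocking wait #2: never satisfied */
>>>>         opal_progress_sketch();        /* because send_relay never resumes  */
>>>> }
>>>>
>>>> static void send_relay_sketch(void)
>>>> {
>>>>     while (!send_complete)             /* blocking wait #1 */
>>>>         opal_progress_sketch();        /* enters daemon_collective_sketch() */
>>>> }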
>>>
>>> Well, that would be problematic - you will find "progressed_wait" used
>>> repeatedly in the code. Removing them all would take a -lot- of effort and
>>> a major rewrite. I'm not yet convinced it is required. There may be
>>> something strange in how you are set up, or in your cluster - like I said,
>>> this is the first report of a problem we have had, and people with much
>>> bigger Slurm clusters have been running this code every day for over a year.
>>>
>>>>
>>>> If you have time to do a nicer patch, that would be great and I would be
>>>> happy to test it. Otherwise, I will try to implement your idea properly
>>>> next week (with my limited knowledge of orted).
>>>
>>> Either way is fine - I'll see if I can get to it.
>>>
>>> Thanks
>>> Ralph
>>>
>>>>
>>>> For the record, here is the patch I'm currently testing at large scale:
>>>>
>>>> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>>> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c	Mon Nov 09 13:29:16 2009 +0100
>>>> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c	Wed Nov 18 09:27:55 2009 +0100
>>>> @@ -687,14 +687,6 @@
>>>>          opal_list_append(&orte_local_jobdata, &jobdat->super);
>>>>      }
>>>>
>>>> -    /* it may be possible to get here prior to having actually finished processing our
>>>> -     * local launch msg due to the race condition between different nodes and when
>>>> -     * they start their individual procs. Hence, we have to first ensure that we
>>>> -     * -have- finished processing the launch msg, or else we won't know whether
>>>> -     * or not to wait before sending this on
>>>> -     */
>>>> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
>>>> -
>>>>      /* unpack the collective type */
>>>>      n = 1;
>>>>      if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
>>>> @@ -894,6 +886,28 @@
>>>>
>>>>      proc = &mev->sender;
>>>>      buf = mev->buffer;
>>>> +
>>>> +    jobdat = NULL;
>>>> +    for (item = opal_list_get_first(&orte_local_jobdata);
>>>> +         item != opal_list_get_end(&orte_local_jobdata);
>>>> +         item = opal_list_get_next(item)) {
>>>> +        jobdat = (orte_odls_job_t*)item;
>>>> +
>>>> +        /* is this the specified job? */
>>>> +        if (jobdat->jobid == proc->jobid) {
>>>> +            break;
>>>> +        }
>>>> +    }
>>>> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
>>>> +        /* it may be possible to get here prior to having actually finished processing our
>>>> +         * local launch msg due to the race condition between different nodes and when
>>>> +         * they start their individual procs. Hence, we have to first ensure that we
>>>> +         * -have- finished processing the launch msg. Requeue this event until it is done.
>>>> +         */
>>>> +        int tag = mev->tag;
>>>> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
>>>> +        return;
>>>> +    }
>>>>
>>>>      /* is the sender a local proc, or a daemon relaying the collective? */
>>>>      if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
>>>>
>>>> Sylvain
>>>>
>>>> On Thu, 19 Nov 2009, Ralph Castain wrote:
>>>>
>>>>> Very strange. As I said, we routinely launch jobs spanning several
>>>>> hundred nodes without problem. You can see the platform files for that
>>>>> setup in contrib/platform/lanl/tlcc
>>>>>
>>>>> That said, it is always possible you are hitting some kind of race
>>>>> condition we don't hit. In looking at the code, one possibility would be
>>>>> to make all the communications flow through the daemon cmd processor in
>>>>> orte/orted_comm.c. This is the way it used to work until I reorganized
>>>>> the code a year ago for other reasons that never materialized.
>>>>>
>>>>> Unfortunately, the daemon collective has to wait until the local launch
>>>>> cmd has been completely processed so it can know whether or not to wait
>>>>> for contributions from local procs before sending along the collective
>>>>> message, so this kinda limits our options.
>>>>>
>>>>> About the only other thing you could do would be to not send the relay at
>>>>> all until -after- processing the local launch cmd. You can then remove
>>>>> the "wait" in the daemon collective as you will know how many local procs
>>>>> are involved, if any.
>>>>>
>>>>> I used to do it that way and it guarantees it will work. The negative is
>>>>> that we lose some launch speed as the next nodes in the tree don't get
>>>>> the launch message until this node finishes launching all its procs.
>>>>>
>>>>> The way around that, of course, would be to:
>>>>>
>>>>> 1. process the launch message, thus extracting the number of any local
>>>>> procs and setting up all data structures...but do -not- launch the procs
>>>>> at this time (as this is what takes all the time)
>>>>>
>>>>> 2. send the relay - the daemon collective can now proceed without a
>>>>> "wait" in it
>>>>>
>>>>> 3. now launch the local procs
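>>>>>
>>>>> In rough pseudo-C (a sketch only; setup_job_data, launch_local_procs and
>>>>> launch_msg_t are placeholder names, not the actual odls API):
>>>>>
>>>>> typedef struct launch_msg launch_msg_t;
>>>>>
>>>>> static void setup_job_data(launch_msg_t *msg);  /* count local procs, build jobdat */
>>>>> static void send_relay(launch_msg_t *msg);      /* forward the msg down the tree   */
>>>>> static void launch_local_procs(void);           /* fork/exec local procs (slow)    */
>>>>>
>>>>> static void process_launch_cmd(launch_msg_t *msg)
>>>>> {
>>>>>     setup_job_data(msg);      /* 1. process the msg, but don't launch yet       */
>>>>>     send_relay(msg);          /* 2. relay now; daemon collective needs no wait  */
>>>>>     launch_local_procs();     /* 3. only now start the local procs              */
>>>>> }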
>>>>>
>>>>> It would be a fairly simple reorganization of the code in the
>>>>> orte/mca/odls area. I can do it this weekend if you like, or you can do
>>>>> it - either way is fine, but if you do it, please contribute it back to
>>>>> the trunk.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>>>>>
>>>>>> I would say I use the default settings, i.e. I don't set anything
>>>>>> "special" at configure time.
>>>>>>
>>>>>> I'm launching my processes with SLURM (salloc + mpirun).
>>>>>>
>>>>>> Sylvain
>>>>>>
>>>>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>>>>>
>>>>>>> How did you configure OMPI?
>>>>>>>
>>>>>>> What launch mechanism are you using - ssh?
>>>>>>>
>>>>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>>>>>
>>>>>>>> I don't think so, and I'm not doing it explicitly, at least. How do I
>>>>>>>> know?
>>>>>>>>
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>>>>>
>>>>>>>>> We routinely launch across thousands of nodes without a problem...I
>>>>>>>>> have never seen it stick in this fashion.
>>>>>>>>>
>>>>>>>>> Did you build and/or are using ORTE threaded by any chance? If so,
>>>>>>>>> that definitely won't work.
>>>>>>>>>
>>>>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We are currently experiencing problems at launch on the 1.5 branch
>>>>>>>>>> with relatively large numbers of nodes (at least 80). Some processes
>>>>>>>>>> are not spawned and the orted processes are deadlocked.
>>>>>>>>>>
>>>>>>>>>> When MPI processes call MPI_Init before send_relay is complete, the
>>>>>>>>>> send_relay function and the daemon_collective function end up in a
>>>>>>>>>> nice interlock:
>>>>>>>>>>
>>>>>>>>>> Here is the scenario:
>>>>>>>>>>> send_relay
>>>>>>>>>> performs the send tree:
>>>>>>>>>>> orte_rml_oob_send_buffer
>>>>>>>>>>> orte_rml_oob_send
>>>>>>>>>>> opal_wait_condition
>>>>>>>>>> Waiting on completion of the send, thus calling opal_progress():
>>>>>>>>>>> opal_progress()
>>>>>>>>>> But since a collective request arrived from the network, it entered:
>>>>>>>>>>> daemon_collective
>>>>>>>>>> However, daemon_collective is waiting for the job to be initialized
>>>>>>>>>> (waiting on jobdat->launch_msg_processed) before continuing, thus
>>>>>>>>>> calling:
>>>>>>>>>>> opal_progress()
>>>>>>>>>>
>>>>>>>>>> At this time, the send may complete, but since we will never go back
>>>>>>>>>> to orte_rml_oob_send, we will never perform the launch (which would
>>>>>>>>>> set jobdat->launch_msg_processed to 1).
>>>>>>>>>>
>>>>>>>>>> I may try to solve the bug myself (this is quite a high-priority
>>>>>>>>>> problem for me), but maybe people who are more familiar with orted
>>>>>>>>>> than I am can propose a nice and clean solution...
>>>>>>>>>>
>>>>>>>>>> For those who like real (and complete) gdb stacks, here they are:
>>>>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, tv=0x7fff0d977880) at poll.c:167
>>>>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) at event.c:823
>>>>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) at event.c:839
>>>>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at runtime/opal_progress.c:189
>>>>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) at ../../../../opal/threads/condition.h:99
>>>>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at rml_oob_send.c:270
>>>>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at orted/orted_comm.c:127
>>>>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at event.c:667
>>>>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) at event.c:839
>>>>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at orted/orted_main.c:769
>>>>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at orted.c:62
>>>>>>>>>>
>>>>>>>>>> Thanks in advance,
>>>>>>>>>> Sylvain