Re: [OMPI devel] Deadlocks with new (routed) orted launch algorithm

Ralph Castain Thu, 19 Nov 2009 10:08:48 -0500

On Nov 19, 2009, at 7:52 AM, Sylvain Jeaugey wrote:

> Thank you Ralph for this precious help.
> 
> I set up a quick-and-dirty patch basically postponing process_msg (hence 
> daemon_collective) until the launch is done. In process_msg, I therefore 
> requeue a process_msg handler and return.


That is basically the idea I proposed, just done in a slightly different place.
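
For readers following along, here is a minimal sketch of the requeue-instead-of-block idea as a toy event loop. It is not ORTE code: event_queue, queue_push, handle_launch, handle_collective and run_event_loop are hypothetical stand-ins, and only the launch_msg_processed flag name is borrowed from the patch below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

typedef struct event event_t;
typedef void (*handler_fn)(const event_t *ev);

struct event {
    handler_fn handler;
    int tag;
};

/* Toy stand-ins for the daemon's state; none of this is ORTE API. */
event_t event_queue[QUEUE_CAP];
size_t queue_head, queue_count;
bool launch_msg_processed;
int requeue_count, collectives_handled;

/* Append an event to the circular queue. */
void queue_push(handler_fn handler, int tag) {
    assert(queue_count < QUEUE_CAP);
    size_t tail = (queue_head + queue_count) % QUEUE_CAP;
    event_queue[tail].handler = handler;
    event_queue[tail].tag = tag;
    queue_count++;
}

/* Models the end of local launch processing. */
void handle_launch(const event_t *ev) {
    (void)ev;
    launch_msg_processed = true;
}

/* Models process_msg: never block, just requeue until the launch is done. */
void handle_collective(const event_t *ev) {
    if (!launch_msg_processed) {
        requeue_count++;
        queue_push(handle_collective, ev->tag);
        return;
    }
    collectives_handled++;
}

/* Single, non-nested dispatch loop: each handler returns before the next runs. */
void run_event_loop(void) {
    while (queue_count > 0) {
        event_t ev = event_queue[queue_head];
        queue_head = (queue_head + 1) % QUEUE_CAP;
        queue_count--;
        ev.handler(&ev);
    }
}
```

Pushing a collective event ahead of the launch event and then calling run_event_loop() requeues the collective once and handles it after the launch handler has run. One caveat of this toy version: if the launch event never arrives, the loop keeps requeueing forever.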

> 
> In this "all-must-be-non-blocking-and-done-through-opal_progress" algorithm, 
> I don't think that blocking calls like the one in daemon_collective should be 
> allowed. This also applies to the blocking one in send_relay. [Well, 
> actually, one is okay, 2 may lead to interlocking.]

Well, that would be problematic - you will find "progressed_wait" used 
repeatedly in the code. Removing them all would take a -lot- of effort and a 
major rewrite. I'm not yet convinced it is required. There may be something 
strange in how you are set up, or your cluster - like I said, this is the first 
report of a problem we have had, and people with much bigger slurm clusters 
have been running this code every day for over a year.
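
To make the concern concrete, here is a small model of how two nested progressed waits can interlock. This is only a sketch under simplifying assumptions: progress(), progressed_wait(), launch_cmd() and run_scenario() are hypothetical toy functions, not the real opal_progress or ORTE_PROGRESSED_WAIT, and the MAX_SPINS bound exists only so the demo terminates where the real code would spin indefinitely.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SPINS 1000  /* demo-only bound; a real progressed wait has none */

bool send_complete;
bool launch_msg_processed;
bool collective_pending;
bool deadlock_detected;

void progress(void);

/* Spin on progress() until *flag is set; false means we gave up. */
bool progressed_wait(const bool *flag) {
    for (int spins = 0; !*flag; spins++) {
        if (spins == MAX_SPINS) {
            return false;
        }
        progress();
    }
    return true;
}

/* Toy daemon_collective: waits for the launch message to be processed. */
void daemon_collective(void) {
    if (!progressed_wait(&launch_msg_processed)) {
        deadlock_detected = true;
    }
}

/* Toy progress engine: completes the send and dispatches a pending collective. */
void progress(void) {
    send_complete = true;
    if (collective_pending) {
        collective_pending = false;
        daemon_collective();  /* re-entrant call from inside the outer wait */
    }
}

/* Toy launch command: relay first, mark the launch processed afterwards. */
void launch_cmd(void) {
    progressed_wait(&send_complete);  /* models the blocking send in send_relay */
    launch_msg_processed = true;      /* only reached once the wait above returns */
}

bool run_scenario(bool collective_arrives_early) {
    send_complete = false;
    launch_msg_processed = false;
    deadlock_detected = false;
    collective_pending = collective_arrives_early;
    launch_cmd();
    return deadlock_detected;
}
```

In this model, run_scenario(true) reports the interlock because the inner wait sits on top of the outer one on the stack, so the flag it needs cannot be set until it returns. run_scenario(false) completes normally, which matches the observation that a single blocking wait is fine and the trouble starts with two.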

> 
> If you have time to do a nicer patch, it would be great and I would be happy 
> to test it. Otherwise, I will try to implement your idea properly next week 
> (with my limited knowledge of orted).

Either way is fine - I'll see if I can get to it.

Thanks
Ralph

> 
> For the record, here is the patch I'm currently testing at large scale:
> 
> diff -r ec68298b3169 -r b622b9e8f1ac orte/mca/grpcomm/bad/grpcomm_bad_module.c
> --- a/orte/mca/grpcomm/bad/grpcomm_bad_module.c Mon Nov 09 13:29:16 2009 +0100
> +++ b/orte/mca/grpcomm/bad/grpcomm_bad_module.c Wed Nov 18 09:27:55 2009 +0100
> @@ -687,14 +687,6 @@
>         opal_list_append(&orte_local_jobdata, &jobdat->super);
>     }
> 
> -    /* it may be possible to get here prior to having actually finished processing our
> -     * local launch msg due to the race condition between different nodes and when
> -     * they start their individual procs. Hence, we have to first ensure that we
> -     * -have- finished processing the launch msg, or else we won't know whether
> -     * or not to wait before sending this on
> -     */
> -    ORTE_PROGRESSED_WAIT(jobdat->launch_msg_processed, 0, 1);
> -
>     /* unpack the collective type */
>     n = 1;
>     if (ORTE_SUCCESS != (rc = opal_dss.unpack(data, &jobdat->collective_type, &n, ORTE_GRPCOMM_COLL_T))) {
> @@ -894,6 +886,28 @@
> 
>     proc = &mev->sender;
>     buf = mev->buffer;
> +
> +    jobdat = NULL;
> +    for (item = opal_list_get_first(&orte_local_jobdata);
> +         item != opal_list_get_end(&orte_local_jobdata);
> +         item = opal_list_get_next(item)) {
> +        jobdat = (orte_odls_job_t*)item;
> +
> +        /* is this the specified job? */
> +        if (jobdat->jobid == proc->jobid) {
> +            break;
> +        }
> +    }
> +    if (NULL == jobdat || jobdat->launch_msg_processed != 1) {
> +        /* it may be possible to get here prior to having actually finished processing our
> +         * local launch msg due to the race condition between different nodes and when
> +         * they start their individual procs. Hence, we have to first ensure that we
> +         * -have- finished processing the launch msg. Requeue this event until it is done.
> +         */
> +        int tag = mev->tag;
> +        ORTE_MESSAGE_EVENT(proc, buf, tag, process_msg);
> +        return;
> +    }
> 
>     /* is the sender a local proc, or a daemon relaying the collective? */
>     if (ORTE_PROC_MY_NAME->jobid == proc->jobid) {
> 
> Sylvain
> 
> On Thu, 19 Nov 2009, Ralph Castain wrote:
> 
>> Very strange. As I said, we routinely launch jobs spanning several hundred 
>> nodes without problem. You can see the platform files for that setup in 
>> contrib/platform/lanl/tlcc
>> 
>> That said, it is always possible you are hitting some kind of race condition 
>> we don't hit. In looking at the code, one possibility would be to make all 
>> the communications flow through the daemon cmd processor in 
>> orte/orted_comm.c. This is the way it used to work until I reorganized the 
>> code a year ago for other reasons that never materialized.
>> 
>> Unfortunately, the daemon collective has to wait until the local launch cmd 
>> has been completely processed so it can know whether or not to wait for 
>> contributions from local procs before sending along the collective message, 
>> so this kinda limits our options.
>> 
>> About the only other thing you could do would be to not send the relay at 
>> all until -after- processing the local launch cmd. You can then remove the 
>> "wait" in the daemon collective as you will know how many local procs are 
>> involved, if any.
>> 
>> I used to do it that way and it guarantees it will work. The negative is 
>> that we lose some launch speed as the next nodes in the tree don't get the 
>> launch message until this node finishes launching all its procs.
>> 
>> The way around that, of course, would be to:
>> 
>> 1.  process the launch message, thus extracting the number of any local 
>> procs and setting up all data structures...but do -not- launch the procs at 
>> this time (as this is what takes all the time)
>> 
>> 2. send the relay - the daemon collective can now proceed without a "wait" 
>> in it
>> 
>> 3. now launch the local procs
>> 
>> It would be a fairly simple reorganization of the code in the orte/mca/odls 
>> area. I can do it this weekend if you like, or you can do it - either way is 
>> fine, but if you do it, please contribute it back to the trunk.
>> 
>> Ralph
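
The three-step ordering above can be sketched as follows. This is a toy model under stated assumptions, not the actual orte/mca/odls code: job_state_t, parse_launch_msg, send_relay, launch_local_procs, expected_local_contributions and handle_launch_cmd are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-job state; not the real orte_odls_job_t. */
typedef struct {
    int num_local_procs;
    bool launch_msg_processed;
    bool relay_sent;
    bool processed_before_relay;
    int procs_started;
} job_state_t;

/* Step 1: extract the local proc count and set up data structures, no launch yet. */
void parse_launch_msg(job_state_t *job, int local_procs) {
    job->num_local_procs = local_procs;
    job->launch_msg_processed = true;
}

/* Step 2: forward the launch message down the tree. */
void send_relay(job_state_t *job) {
    /* Record whether step 1 had already finished when the relay went out. */
    job->processed_before_relay = job->launch_msg_processed;
    job->relay_sent = true;
}

/* Step 3: the slow part, starting the local procs. */
void launch_local_procs(job_state_t *job) {
    for (int i = 0; i < job->num_local_procs; i++) {
        job->procs_started++;
    }
}

/* What a collective needs to know; -1 means "not known yet, would have to wait". */
int expected_local_contributions(const job_state_t *job) {
    return job->launch_msg_processed ? job->num_local_procs : -1;
}

void handle_launch_cmd(job_state_t *job, int local_procs) {
    parse_launch_msg(job, local_procs);
    send_relay(job);
    launch_local_procs(job);
}
```

Because the proc count is recorded before the relay goes out, a collective arriving at any point after step 1 can read the expected number of local contributions directly instead of waiting, while the next nodes in the tree still receive the launch message before the local procs are started.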
>> 
>> 
>> On Nov 19, 2009, at 1:39 AM, Sylvain Jeaugey wrote:
>> 
>>> I would say I use the default settings, i.e. I don't set anything "special" 
>>> at configure.
>>> 
>>> I'm launching my processes with SLURM (salloc + mpirun).
>>> 
>>> Sylvain
>>> 
>>> On Wed, 18 Nov 2009, Ralph Castain wrote:
>>> 
>>>> How did you configure OMPI?
>>>> 
>>>> What launch mechanism are you using - ssh?
>>>> 
>>>> On Nov 17, 2009, at 9:01 AM, Sylvain Jeaugey wrote:
>>>> 
>>>>> I don't think so, and I'm not doing it explicitly at least. How do I 
>>>>> know?
>>>>> 
>>>>> Sylvain
>>>>> 
>>>>> On Tue, 17 Nov 2009, Ralph Castain wrote:
>>>>> 
>>>>>> We routinely launch across thousands of nodes without a problem...I have 
>>>>>> never seen it stick in this fashion.
>>>>>> 
>>>>>> Did you build, or are you using, ORTE threaded by any chance? If so, that 
>>>>>> definitely won't work.
>>>>>> 
>>>>>> On Nov 17, 2009, at 9:27 AM, Sylvain Jeaugey wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> We are currently experiencing problems at launch on the 1.5 branch on a 
>>>>>>> relatively large number of nodes (at least 80). Some processes are not 
>>>>>>> spawned and orted processes are deadlocked.
>>>>>>> 
>>>>>>> When MPI processes are calling MPI_Init before send_relay is complete, 
>>>>>>> the send_relay function and the daemon_collective function are doing a 
>>>>>>> nice interlock:
>>>>>>> 
>>>>>>> Here is the scenario:
>>>>>>> > send_relay
>>>>>>> performs the send tree:
>>>>>>> > orte_rml_oob_send_buffer
>>>>>>> > orte_rml_oob_send
>>>>>>> > opal_condition_wait
>>>>>>> Waiting on completion from send thus calling opal_progress()
>>>>>>>  > opal_progress()
>>>>>>> But since a collective request arrived from the network, entered:
>>>>>>>    > daemon_collective
>>>>>>> However, daemon_collective is waiting for the job to be initialized 
>>>>>>> (wait on jobdat->launch_msg_processed) before continuing, thus calling:
>>>>>>>      > opal_progress()
>>>>>>> 
>>>>>>> At this time, the send may complete, but since we will never go back to 
>>>>>>> orte_rml_oob_send, we will never perform the launch (setting 
>>>>>>> jobdat->launch_msg_processed to 1).
>>>>>>> 
>>>>>>> I may try to solve the bug (this is quite a top priority problem for 
>>>>>>> me), but maybe people who are more familiar with orted than I am may 
>>>>>>> propose a nice and clean solution ...
>>>>>>> 
>>>>>>> For those who like real (and complete) gdb stacks, here they are:
>>>>>>> #0  0x0000003b7fed4f38 in poll () from /lib64/libc.so.6
>>>>>>> #1  0x00007fd0de5d861a in poll_dispatch (base=0x930230, arg=0x91a4b0, 
>>>>>>> tv=0x7fff0d977880) at poll.c:167
>>>>>>> #2  0x00007fd0de5d586f in opal_event_base_loop (base=0x930230, flags=1) 
>>>>>>> at event.c:823
>>>>>>> #3  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #4  0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>> runtime/opal_progress.c:189
>>>>>>> #5  0x00007fd0dd340a02 in daemon_collective (sender=0x97af50, 
>>>>>>> data=0x97b010) at grpcomm_bad_module.c:696
>>>>>>> #6  0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>> data=0x97af20) at grpcomm_bad_module.c:901
>>>>>>> #7  0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>> event.c:667
>>>>>>> #8  0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) 
>>>>>>> at event.c:839
>>>>>>> #9  0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #10 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>> runtime/opal_progress.c:189
>>>>>>> #11 0x00007fd0dd340a02 in daemon_collective (sender=0x979700, 
>>>>>>> data=0x9676e0) at grpcomm_bad_module.c:696
>>>>>>> #12 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>> data=0x9796d0) at grpcomm_bad_module.c:901
>>>>>>> #13 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>> event.c:667
>>>>>>> #14 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) 
>>>>>>> at event.c:839
>>>>>>> #15 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #16 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>> runtime/opal_progress.c:189
>>>>>>> #17 0x00007fd0dd340a02 in daemon_collective (sender=0x97b420, 
>>>>>>> data=0x97b4e0) at grpcomm_bad_module.c:696
>>>>>>> #18 0x00007fd0dd341809 in process_msg (fd=-1, opal_event=1, 
>>>>>>> data=0x97b3f0) at grpcomm_bad_module.c:901
>>>>>>> #19 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>> event.c:667
>>>>>>> #20 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=1) 
>>>>>>> at event.c:839
>>>>>>> #21 0x00007fd0de5d556b in opal_event_loop (flags=1) at event.c:746
>>>>>>> #22 0x00007fd0de5aeb6d in opal_progress () at 
>>>>>>> runtime/opal_progress.c:189
>>>>>>> #23 0x00007fd0dd969a8a in opal_condition_wait (c=0x97b210, m=0x97b1a8) 
>>>>>>> at ../../../../opal/threads/condition.h:99
>>>>>>> #24 0x00007fd0dd96a4bf in orte_rml_oob_send (peer=0x7fff0d9785a0, 
>>>>>>> iov=0x7fff0d978530, count=1, tag=1, flags=16) at rml_oob_send.c:153
>>>>>>> #25 0x00007fd0dd96ac4d in orte_rml_oob_send_buffer 
>>>>>>> (peer=0x7fff0d9785a0, buffer=0x7fff0d9786b0, tag=1, flags=0) at 
>>>>>>> rml_oob_send.c:270
>>>>>>> #26 0x00007fd0de86ed2a in send_relay (buf=0x7fff0d9786b0) at 
>>>>>>> orted/orted_comm.c:127
>>>>>>> #27 0x00007fd0de86f6de in orte_daemon_cmd_processor (fd=-1, 
>>>>>>> opal_event=1, data=0x965fc0) at orted/orted_comm.c:308
>>>>>>> #28 0x00007fd0de5d5334 in event_process_active (base=0x930230) at 
>>>>>>> event.c:667
>>>>>>> #29 0x00007fd0de5d597a in opal_event_base_loop (base=0x930230, flags=0) 
>>>>>>> at event.c:839
>>>>>>> #30 0x00007fd0de5d556b in opal_event_loop (flags=0) at event.c:746
>>>>>>> #31 0x00007fd0de5d5418 in opal_event_dispatch () at event.c:682
>>>>>>> #32 0x00007fd0de86e339 in orte_daemon (argc=19, argv=0x7fff0d979ca8) at 
>>>>>>> orted/orted_main.c:769
>>>>>>> #33 0x00000000004008e2 in main (argc=19, argv=0x7fff0d979ca8) at 
>>>>>>> orted.c:62
>>>>>>> 
>>>>>>> Thanks in advance,
>>>>>>> Sylvain
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> [email protected]
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
