Posted a possible fix to the intercomm hang. See 
https://github.com/open-mpi/ompi/pull/2061
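
The idea, as described in my earlier message quoted below, is to switch the CID agreement step from iallgatherv to a non-blocking bcast rooted at the local leader. A rough, MPI-level sketch of that general pattern (illustrative only: the value 42 and the use of MPI_COMM_WORLD here are placeholders, this is not the actual ompi/communicator/comm_cid.c code):

/* sketch of the "leader decides, ibcast distributes" pattern */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, agreed_cid = -1;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* pretend the local leader (rank 0) has already negotiated the new CID
     * with the remote leader */
    if (0 == rank) agreed_cid = 42;

    /* every rank learns the agreed value from the leader; unlike iallgatherv,
     * there are no per-rank recvcounts to keep consistent */
    MPI_Ibcast(&agreed_cid, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d got cid %d\n", rank, agreed_cid);
    MPI_Finalize();
    return 0;
}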

-Nathan


> On Sep 7, 2016, at 6:53 AM, Nathan Hjelm <hje...@me.com> wrote:
> 
> Looking at the code now. This code was more or less directly translated from 
> the blocking version. I wouldn’t be surprised if there is an error that I 
> didn’t catch with MTT on my laptop.
> 
> That said, there is an old comment about not using bcast, to avoid a possible 
> deadlock. Since the collective is now non-blocking, that is no longer a problem. 
> The simple answer is to use ibcast instead of iallgather. I will work on that fix now.
> 
> -Nathan
> 
>> On Sep 7, 2016, at 3:02 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>> 
>> Thanks guys,
>> 
>> 
>> So I was finally able to reproduce the bug on my (oversubscribed) VM with 
>> TCP.
>> 
>> 
>> MPI_Intercomm_merge (indirectly) invokes iallgatherv with incorrect arguments:
>> 1,main (MPI_Issend_rtoa_c.c:196)
>> 1,  MPITEST_get_communicator (libmpitest.c:3544)
>> 1,    PMPI_Intercomm_merge (pintercomm_merge.c:131)
>> 1,      ompi_comm_activate (comm_cid.c:514)
>> 1,        ompi_request_wait_completion (request.h:397)
>> 1,          opal_progress (opal_progress.c:221)
>> 1,            ompi_comm_request_progress (comm_request.c:132)
>> 1,              ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
>> 1,                ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
>> 1,                  ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)
>> 
>> 
>> Global tasks 0 and 1 are each root (local rank 0) of the intercomm, in 
>> groups A and B respectively.
>>
>> They both invoke iallgatherv with scount=1, but context->rcounts[0]=0 (it 
>> should be 1).
>> Per the man page:
>> "The type signature associated with sendcount, sendtype, at process j must 
>> be equal to the type signature associated with recvcounts[j], recvtype at 
>> any other process."
>>
>> So if the initial intention was not to gather only on the roots, then this 
>> is not possible with iallgatherv.
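>>
>> For reference, a minimal standalone sketch of a correctly matched 
>> inter-communicator MPI_Iallgatherv (hypothetical 2-rank setup, not the test 
>> code itself): each root sends one element, so the remote side's recvcounts[0] 
>> must be 1, which is exactly what is violated here.
>>
>> #include <mpi.h>
>> #include <stdio.h>
>>
>> int main(int argc, char *argv[]) {
>>     int rank, size, sendval, recvval;
>>     int recvcounts[1] = { 1 };   /* the remote group sends 1 element;
>>                                   * 0 here is the mismatch described above */
>>     int displs[1] = { 0 };
>>     MPI_Comm local, inter;
>>     MPI_Request req;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>     if (2 != size) MPI_Abort(MPI_COMM_WORLD, 1);
>>
>>     /* split MPI_COMM_WORLD into two singleton groups A and B,
>>      * then build an inter-communicator between them */
>>     MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &local);
>>     MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - rank, 0, &inter);
>>
>>     sendval = rank;
>>     MPI_Iallgatherv(&sendval, 1, MPI_INT, &recvval, recvcounts, displs,
>>                     MPI_INT, inter, &req);
>>     MPI_Wait(&req, MPI_STATUS_IGNORE);
>>     printf("rank %d received %d from the remote group\n", rank, recvval);
>>
>>     MPI_Comm_free(&inter);
>>     MPI_Comm_free(&local);
>>     MPI_Finalize();
>>     return 0;
>> }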
>> 
>> What happens then is that iallgatherv isends the data (scount > 0), but no 
>> matching irecv is posted (rcounts[0] == 0).
>> Then the intercomm is destroyed, and the message is received later by 
>> opal_progress on a communicator that does not exist (any more).
>> This message is hence stored by pml/ob1 in the 
>> non_existing_communicator_pending list.
>> /* btw, can someone kindly explain to me the rationale for this?
>> Is there any valid case in which a message can be received on a communicator 
>> that does not exist yet?
>> If the only valid case is a communicator that does not exist any more, should 
>> the message simply be discarded? */
>> 
>> Much later in the test, a new communicator is created with the same CID as 
>> the intercomm, and a hang can occur; I can only suspect that the message left 
>> in the non_existing_communicator_pending list causes it.
>> 
>> 
>> Bottom line: I think the root cause is an invalid invocation of iallgatherv.
>> Nathan, could you please have a look?
>> 
>> 
>> FWIW, during my investigation, I was able to get rid of the hang by *not* 
>> recycling CIDs, with the patch below.
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
>> index f453ca1..7195aa2 100644
>> --- a/ompi/communicator/comm_init.c
>> +++ b/ompi/communicator/comm_init.c
>> @@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
>>     max = opal_pointer_array_get_size(&ompi_mpi_communicators);
>>     for ( i=3; i<max; i++ ) {
>>         comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
>> -        if ( NULL != comm ) {
>> +        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
>>             /* Communicator has not been freed before finalize */
>>             OBJ_RELEASE(comm);
>>             comm=(ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
>> @@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
>>          NULL != opal_pointer_array_get_item(&ompi_mpi_communicators,
>>                                              comm->c_contextid)) {
>>         opal_pointer_array_set_item ( &ompi_mpi_communicators,
>> -                                      comm->c_contextid, NULL);
>> +                                      comm->c_contextid, (void *)0x1);
>>     }
>> 
>>     /* reset the ompi_comm_f_to_c_table entry */
>> diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> index 5f3f8fd..1d0f881 100644
>> --- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> +++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> @@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
>> 
>>     /* communicator pointer */
>>     comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
>> -    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
>> +    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
>>         /* This is a special case. A message for a not yet existing
>>          * communicator can happens. Instead of doing a matching we
>>          * will temporarily add it the a pending queue in the PML.
>> 
>> On 9/7/2016 2:28 AM, George Bosilca wrote:
>>> I can make MPI_Issend_rtoa deadlock with vader and sm.
>>> 
>>>  George.
>>> 
>>> 
>>> On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org <r...@open-mpi.org> 
>>> wrote:
>>> FWIW: those tests hang for me with TCP (I don’t have openib on my cluster). 
>>> I'll check it with your change as well.
>>> 
>>> 
>>>> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>> 
>>>> Ralph,
>>>> 
>>>> 
>>>> This looks like another hang :-(
>>>> 
>>>> 
>>>> I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores 
>>>> per socket) with InfiniBand, and I always observe the same hang at the same place.
>>>> 
>>>> 
>>>> Surprisingly, I do not get any hang if I blacklist the openib btl.
>>>> 
>>>> 
>>>> The patch below can be used to avoid the hang with InfiniBand, or for 
>>>> debugging purposes.
>>>>
>>>> The hang occurs in communicator 6, and if I skip the tests on communicator 2, 
>>>> no hang happens.
>>>> 
>>>> The hang occurs on an intercomm:
>>>> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm, and
>>>> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.
>>>>
>>>> Task 0 MPI_Issends to task 1, task 1 MPI_Irecvs from task 0, and then 
>>>> both hang in MPI_Wait().
>>>>
>>>> Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling 
>>>> that the hang only occurs with the openib btl, since vader should be used here.
>>>> 
>>>> 
>>>> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> index 8b26f84..b9a704b 100644
>>>> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>>>> 
>>>>     for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>>>>          comm_count++) {
>>>>         comm_index = MPITEST_get_comm_index(comm_count);
>>>>         comm_type = MPITEST_get_comm_type(comm_count);
>>>> +        if (2 == comm_count) continue;
>>>> 
>>>>         /*
>>>> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>>>>                      * left sub-communicator
>>>>                      */
>>>> 
>>>> +                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
>>>> +                        /* insert a breakpoint here */
>>>> +                    }
>>>>          * Reset a bunch of variables that will be set when we get our
>>>> 
>>>> 
>>>> As a side note, which is very unlikely to be related to this issue, I noticed 
>>>> the following program works fine, though it would be reasonable to expect a hang.
>>>> The root cause is that MPI_Send uses the eager protocol, and although the 
>>>> communicators used by MPI_Send and MPI_Recv are different, they have the 
>>>> same (recycled) CID.
>>>>
>>>> FWIW, the test also completes with MPICH.
>>>>
>>>> If not already done, should we provide an option not to recycle CIDs?
>>>> Or flush unexpected/unmatched messages when a communicator is freed?
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> 
>>>> Gilles
>>>> 
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>> 
>>>> /* send a message (eager mode) in a communicator, and then
>>>> * receive it in another communicator, but with the same CID
>>>> */
>>>> int main(int argc, char *argv[]) {
>>>>    int rank, size;
>>>>    int b;
>>>>    MPI_Comm comm;
>>>> 
>>>>    MPI_Init(&argc, &argv);
>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>    if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>>>> 
>>>>    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>>    if (0 == rank) {
>>>>        b = 0x55555555;
>>>>        /* eager send: completes even though no matching receive is posted */
>>>>        MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>>>>    }
>>>>    MPI_Comm_free(&comm);
>>>> 
>>>>    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>>    if (1 == rank) {
>>>>        b = 0xAAAAAAAA;
>>>>        /* this receive is on a different communicator that reuses the CID
>>>>         * of the freed one, so it matches the pending eager message */
>>>>        MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>>>>        if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>>>>    }
>>>>    MPI_Comm_free(&comm);
>>>> 
>>>>    MPI_Finalize();
>>>> 
>>>>    return 0;
>>>> }
>>>> 
>>>> 
>>>> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>>>>> OK, I will double-check tomorrow that this was the very same hang I fixed 
>>>>> earlier.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>>> I was just looking at the overnight MTT report, and these were present 
>>>>> going back a long way in both branches. They are in the Intel test suite.
>>>>> 
>>>>> If you have already addressed them, then thanks!
>>>>> 
>>>>>> On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet 
>>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>> 
>>>>>> Ralph,
>>>>>> 
>>>>>> I fixed a hang earlier today in master, and the PR for v2.x is at 
>>>>>> https://github.com/open-mpi/ompi-release/pull/1368
>>>>>> 
>>>>>> Can you please make sure you are running the latest master?
>>>>>>
>>>>>> Which test suite do these tests come from?
>>>>>> I will have a look tomorrow to see if the hang is still there.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> r...@open-mpi.org wrote:
>>>>>>> Hey folks
>>>>>>> 
>>>>>>> All of the tests that involve either ISsend_ator, SSend_ator, 
>>>>>>> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone 
>>>>>>> know what these tests do, and why we never seem to pass them?
>>>>>>> 
>>>>>>> Do we care?
>>>>>>> Ralph
>>>>>>> 

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
