Looking at the code now. This code was more or less directly translated from 
the blocking version. I wouldn’t be surprised if there is an error that I 
didn’t catch with MTT on my laptop.

That said, there is an old comment about not using bcast in order to avoid a
possible deadlock. Since the collective is now non-blocking, that is no longer
a problem. The simple answer is to use ibcast instead of iallgatherv; I will
work on that fix now.
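
For illustration only, a minimal standalone sketch of the idea at the MPI
level (an assumption about the shape of the fix, not the actual comm_cid.c
change):

/* sketch: the root simply ibcasts its value; there is no per-rank
 * recvcounts bookkeeping to keep consistent, and since the collective is
 * non-blocking the old deadlock concern about bcast no longer applies */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) value = 42;   /* only the root holds the result */

    MPI_Ibcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    printf("rank %d got %d\n", rank, value);
    MPI_Finalize();
    return 0;
}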

-Nathan

> On Sep 7, 2016, at 3:02 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Thanks guys,
> 
> 
> so i was finally able to reproduce the bug on my (oversubscribed) VM with tcp.
> 
> 
> MPI_Intercomm_merge (indirectly) incorrectly invokes iallgatherv.
> 1,main (MPI_Issend_rtoa_c.c:196)
> 1,  MPITEST_get_communicator (libmpitest.c:3544)
> 1,    PMPI_Intercomm_merge (pintercomm_merge.c:131)
> 1,      ompi_comm_activate (comm_cid.c:514)
> 1,        ompi_request_wait_completion (request.h:397)
> 1,          opal_progress (opal_progress.c:221)
> 1,            ompi_comm_request_progress (comm_request.c:132)
> 1,              ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
> 1,                ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
> 1,                  ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)
> 
> 
> global tasks 0 and 1 are each rank 0 (the root) of groups A and B,
> respectively, of an intercomm.
> 
> they both invoke iallgatherv with scount=1, but context->rcounts[0]=0
> (it should be 1), whereas per the man page:
> "The type signature associated with sendcount, sendtype, at process j must be
> equal to the type signature associated with recvcounts[j], recvtype at any
> other process."
> 
> so if the initial intention was not to gather only on the roots, then this is
> not possible with iallgatherv.
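> 
> for illustration only, a minimal standalone sketch of that rule (not the
> comm_cid.c code): every rank's recvcounts[j] must match what rank j
> actually sends, so recvcounts[0] = 0 while rank 0 sends one int is
> exactly the kind of mismatch described above.
> 
> #include <mpi.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     int rank, size, i;
>     MPI_Request req;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> 
>     int sendbuf = rank;
>     int *recvbuf = malloc(size * sizeof(int));
>     int *recvcounts = malloc(size * sizeof(int));
>     int *displs = malloc(size * sizeof(int));
> 
>     for (i = 0; i < size; i++) {
>         recvcounts[i] = 1;   /* rank i sends 1 int, so this must be 1 */
>         displs[i] = i;
>     }
> 
>     MPI_Iallgatherv(&sendbuf, 1, MPI_INT,
>                     recvbuf, recvcounts, displs, MPI_INT,
>                     MPI_COMM_WORLD, &req);
>     MPI_Wait(&req, MPI_STATUS_IGNORE);
> 
>     free(recvbuf); free(recvcounts); free(displs);
>     MPI_Finalize();
>     return 0;
> }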
> 
> what happens then is that iallgatherv isends data (scount > 0), but no
> matching irecv is posted (rcounts[0] == 0).
> then the intercomm is destroyed, and the message is received later by
> opal_progress on a communicator that does not exist (any more).
> this message is hence stored by pml/ob1 in the
> non_existing_communicator_pending list.
> /* btw, can someone kindly explain the rationale for this?
> is there any valid case in which a message can be received on a communicator
> that does not exist yet?
> if the only valid case is a communicator that does not exist any more, should
> the message simply be discarded? */
> 
> much later in the test, a new communicator is created with the same CID as
> the intercomm, and a hang can occur;
> i can only suspect that the message in the
> non_existing_communicator_pending list causes it.
> 
> 
> bottom line, i think the root cause is a bad invocation of iallgatherv.
> Nathan, could you please have a look?
> 
> 
> fwiw, during my investigations, i was able to get rid of the hang by *not* 
> recycling CIDs
> with the patch below.
> 
> 
> Cheers,
> 
> Gilles
> 
> diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
> index f453ca1..7195aa2 100644
> --- a/ompi/communicator/comm_init.c
> +++ b/ompi/communicator/comm_init.c
> @@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
>      max = opal_pointer_array_get_size(&ompi_mpi_communicators);
>      for ( i=3; i<max; i++ ) {
>          comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
> -        if ( NULL != comm ) {
> +        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
>              /* Communicator has not been freed before finalize */
>              OBJ_RELEASE(comm);
>              comm=(ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
> @@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
>           NULL != opal_pointer_array_get_item(&ompi_mpi_communicators,
>                                               comm->c_contextid)) {
>          opal_pointer_array_set_item ( &ompi_mpi_communicators,
> -                                      comm->c_contextid, NULL);
> +                                      comm->c_contextid, (void *)0x1);
>      }
>  
>      /* reset the ompi_comm_f_to_c_table entry */
> diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> index 5f3f8fd..1d0f881 100644
> --- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> +++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> @@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
>  
>      /* communicator pointer */
>      comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
> -    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
> +    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
>          /* This is a special case. A message for a not yet existing
>           * communicator can happens. Instead of doing a matching we
>           * will temporarily add it the a pending queue in the PML.
> 
> On 9/7/2016 2:28 AM, George Bosilca wrote:
>> I can make MPI_Issend_rtoa deadlock with vader and sm.
>> 
>>   George.
>> 
>> 
>> On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> FWIW: those tests hang for me with TCP (I don’t have openib on my cluster). 
>> I’ll check it with your change as well
>> 
>> 
>>> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>> 
>>> Ralph,
>>> 
>>> 
>>> this looks like another hang :-(
>>> 
>>> 
>>> i ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores
>>> per socket) with infiniband, and i always observe the same hang at the
>>> same place.
>>> 
>>> surprisingly, i do not get any hang if i blacklist the openib btl.
>>> 
>>> the patch below can be used to avoid the hang with infiniband, or for
>>> debugging purposes.
>>> 
>>> the hang occurs in communicator 6, and if i skip tests on communicator 2,
>>> no hang happens.
>>> 
>>> the hang occurs on an intercomm:
>>> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm,
>>> and task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.
>>> 
>>> task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0, and then
>>> both hang in MPI_Wait().
>>> 
>>> surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
>>> that the hang only occurs with the openib btl, since vader should be
>>> used here.
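>>> 
>>> for reference, a minimal standalone sketch of that communication pattern
>>> (the intercomm here is built by splitting MPI_COMM_WORLD, an assumption
>>> made just for illustration; this is not the Intel test itself):
>>> 
>>> #include <mpi.h>
>>> 
>>> int main(int argc, char *argv[])
>>> {
>>>     int rank, size, b = 0;
>>>     MPI_Comm local, inter;
>>>     MPI_Request req;
>>> 
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);
>>> 
>>>     /* group A = even world ranks, group B = odd world ranks */
>>>     MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
>>>     /* remote leader: world rank 1 for group A, world rank 0 for group B */
>>>     MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank % 2) ? 0 : 1, 0, &inter);
>>> 
>>>     if (0 == rank) {            /* rank 0 of group A */
>>>         b = 1;
>>>         MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);  /* to rank 0 of B */
>>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>>     } else if (1 == rank) {     /* rank 0 of group B */
>>>         MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);   /* from rank 0 of A */
>>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>>     }
>>> 
>>>     MPI_Comm_free(&inter);
>>>     MPI_Comm_free(&local);
>>>     MPI_Finalize();
>>>     return 0;
>>> }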
>>> 
>>> 
>>> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
>>> index 8b26f84..b9a704b 100644
>>> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
>>> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
>>> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>>>  
>>>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
>>>           comm_count++) {
>>>          comm_index = MPITEST_get_comm_index(comm_count);
>>>          comm_type = MPITEST_get_comm_type(comm_count);
>>> +        if (2 == comm_count) continue;
>>>  
>>>          /*
>>> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>>>                       * left sub-communicator
>>>                       */
>>>  
>>> +                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
>>> +                        /* insert a breakpoint here */
>>> +                    }
>>>           * Reset a bunch of variables that will be set when we get our
>>> 
>>> 
>>> as a side note, which is very unlikely related to this issue, i noticed
>>> that the following program works fine, though it would be reasonable to
>>> expect a hang.
>>> the root cause is that MPI_Send uses the eager protocol, and though the
>>> communicators used by MPI_Send and MPI_Recv are different, they have the
>>> same (recycled) CID.
>>> 
>>> fwiw, the test also completes with mpich.
>>> 
>>> if not already done, should we provide an option not to recycle CIDs?
>>> or flush unexpected/unmatched messages when a communicator is freed?
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> Gilles
>>> 
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> 
>>> /* send a message (eager mode) in a communicator, and then
>>>  * receive it in another communicator, but with the same CID
>>>  */
>>> int main(int argc, char *argv[]) {
>>>     int rank, size;
>>>     int b;
>>>     MPI_Comm comm;
>>> 
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>>> 
>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>     if (0 == rank) {
>>>         b = 0x55555555;
>>>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>>>     }
>>>     MPI_Comm_free(&comm);
>>> 
>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>     if (1 == rank) {
>>>         b = 0xAAAAAAAA;
>>>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>>>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>>>     }
>>>     MPI_Comm_free(&comm);
>>> 
>>>     MPI_Finalize();
>>> 
>>>     return 0;
>>> }
>>> 
>>> 
>>> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>>>> ok, i will double-check tomorrow that this was the very same hang i fixed earlier
>>>> 
>>>> Cheers,
>>>> 
>>>> Gilles
>>>> 
>>>> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> I was just looking at the overnight MTT report, and these were present 
>>>> going back a long ways in both branches. They are in the Intel test suite.
>>>> 
>>>> If you have already addressed them, then thanks!
>>>> 
>>>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet 
>>>> > <gilles.gouaillar...@gmail.com> wrote:
>>>> >
>>>> > Ralph,
>>>> >
>>>> > I fixed a hang earlier today in master, and the PR for v2.x is at 
>>>> > https://github.com/open-mpi/ompi-release/pull/1368
>>>> >
>>>> > Can you please make sure you are running the latest master ?
>>>> >
>>>> > Which testsuite do these tests come from ?
>>>> > I will have a look tomorrow if the hang is still there
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Gilles
>>>> >
>>>> > r...@open-mpi.org wrote:
>>>> >> Hey folks
>>>> >>
>>>> >> All of the tests that involve either ISsend_ator, SSend_ator, 
>>>> >> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone 
>>>> >> know what these tests do, and why we never seem to pass them?
>>>> >>
>>>> >> Do we care?
>>>> >> Ralph
>>>> >>
