Looking at the code now. This code was more or less directly translated from the blocking version. I wouldn’t be surprised if there is an error that I didn’t catch with MTT on my laptop.
That said, there is an old comment about not using bcast to avoid a possible deadlock. Since the collective is now non-blocking, that is no longer a problem. The simple answer is to use ibcast instead of iallgather. Will work on that fix now.

-Nathan

> On Sep 7, 2016, at 3:02 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Thanks guys,
>
> so i was finally able to reproduce the bug on my (oversubscribed) VM with tcp.
>
> MPI_Intercomm_merge (indirectly) incorrectly invokes iallgatherv:
>
> 1, main (MPI_Issend_rtoa_c.c:196)
> 1, MPITEST_get_communicator (libmpitest.c:3544)
> 1, PMPI_Intercomm_merge (pintercomm_merge.c:131)
> 1, ompi_comm_activate (comm_cid.c:514)
> 1, ompi_request_wait_completion (request.h:397)
> 1, opal_progress (opal_progress.c:221)
> 1, ompi_comm_request_progress (comm_request.c:132)
> 1, ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
> 1, ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
> 1, ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)
>
> global tasks 0 and 1 are both root task 0 of an intercomm on groups A and B.
> they both invoke iallgatherv with scount=1, but context->rcounts[0]=0 (it should be 1).
>
> per the man page:
> "The type signature associated with sendcount, sendtype, at process j must be
> equal to the type signature associated with recvcounts[j], recvtype at any
> other process."
>
> so if the initial intention was not to gather only on roots, then this is not
> possible with iallgatherv.
>
> what happens then is that iallgatherv isends data (scount > 0), but no matching
> irecv is posted (rcounts[0] == 0). then the intercomm is destroyed, and the
> message is received later by opal_progress on a communicator that does not
> exist (any more). this message is hence stored by pml/ob1 in the
> non_existing_communicator_pending list.
>
> /* btw, can someone kindly explain to me the rationale for this?
>  * is there any valid case in which a message can be received on a communicator
>  * that does not exist yet?
>  * if the only valid case is the communicator not existing any more, should the
>  * message simply be discarded? */
>
> much later in the test, a new communicator is created with the same cid as the
> intercomm, and a hang can occur. i can only suspect the message in the
> non_existing_communicator_pending list causes that.
>
> bottom line, i think the root cause is a bad invocation of iallgatherv.
> Nathan, could you please have a look?
>
> fwiw, during my investigations, i was able to get rid of the hang by *not*
> recycling CIDs with the patch below.
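For reference, here is a minimal, self-contained sketch of the rule Gilles quotes from the man page: on an intercommunicator, the (sendcount, sendtype) signature of remote rank j must match recvcounts[j] at every receiver. The intercommunicator construction and counts below are illustrative only (this is not the comm_cid.c code); making recvcounts[0] zero while remote rank 0 sends one element would recreate the "isend with no matching irecv" situation described above.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, remote_size, sendval, i;
    int *recvbuf, *recvcounts, *displs;
    MPI_Comm local, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* illustrative two-group intercommunicator: even ranks vs odd ranks */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank % 2) ? 0 : 1, 0, &inter);

    MPI_Comm_remote_size(inter, &remote_size);
    recvbuf    = malloc(remote_size * sizeof(int));
    recvcounts = malloc(remote_size * sizeof(int));
    displs     = malloc(remote_size * sizeof(int));
    for (i = 0; i < remote_size; i++) {
        /* every remote rank sends 1 int, so recvcounts[i] must be 1 everywhere;
         * a recvcounts[0] of 0 here would mean remote rank 0 isends data for
         * which no irecv is ever posted -- the mismatch described above */
        recvcounts[i] = 1;
        displs[i] = i;
    }

    sendval = rank;
    MPI_Iallgatherv(&sendval, 1, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT, inter, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(displs); free(recvcounts); free(recvbuf);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}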
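Similarly, Nathan's "ibcast instead of iallgather" remark refers to Open MPI internals; the sketch below is only a hedged illustration of the data movement that implies (it is not the actual comm_cid.c change): the two group leaders exchange one value point-to-point over the intercommunicator, then each leader broadcasts it non-blockingly inside its own group.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, local_rank, my_val, remote_val;
    MPI_Comm local, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* same illustrative even/odd intercommunicator as in the previous sketch */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, (rank % 2) ? 0 : 1, 0, &inter);
    MPI_Comm_rank(local, &local_rank);

    my_val = rank % 2;   /* hypothetical per-group value, e.g. a proposed CID */
    remote_val = -1;

    if (0 == local_rank) {
        /* on an intercommunicator, rank 0 here addresses rank 0 of the remote group */
        MPI_Sendrecv(&my_val, 1, MPI_INT, 0, 0,
                     &remote_val, 1, MPI_INT, 0, 0,
                     inter, MPI_STATUS_IGNORE);
    }

    /* each leader then ibcasts the remote group's value within its own group */
    MPI_Ibcast(&remote_val, 1, MPI_INT, 0, local, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}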
>
> Cheers,
>
> Gilles
>
> diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
> index f453ca1..7195aa2 100644
> --- a/ompi/communicator/comm_init.c
> +++ b/ompi/communicator/comm_init.c
> @@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
>      max = opal_pointer_array_get_size(&ompi_mpi_communicators);
>      for ( i=3; i<max; i++ ) {
>          comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
> -        if ( NULL != comm ) {
> +        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
>              /* Communicator has not been freed before finalize */
>              OBJ_RELEASE(comm);
>              comm=(ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
> @@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
>           NULL != opal_pointer_array_get_item(&ompi_mpi_communicators, comm->c_contextid)) {
>          opal_pointer_array_set_item ( &ompi_mpi_communicators,
> -                                      comm->c_contextid, NULL);
> +                                      comm->c_contextid, (void *)0x1);
>      }
>
>      /* reset the ompi_comm_f_to_c_table entry */
>
> diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> index 5f3f8fd..1d0f881 100644
> --- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> +++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
> @@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
>
>      /* communicator pointer */
>      comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
> -    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
> +    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
>          /* This is a special case. A message for a not yet existing
>           * communicator can happens. Instead of doing a matching we
>           * will temporarily add it the a pending queue in the PML.
>
> On 9/7/2016 2:28 AM, George Bosilca wrote:
>> I can make MPI_Issend_rtoa deadlock with vader and sm.
>>
>> George.
>>
>> On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>> FWIW: those tests hang for me with TCP (I don't have openib on my cluster).
>> I'll check it with your change as well.
>>
>>> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Ralph,
>>>
>>> this looks like another hang :-(
>>>
>>> i ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores per socket)
>>> with infiniband, and i always observe the same hang at the same place.
>>>
>>> surprisingly, i do not get any hang if i blacklist the openib btl.
>>>
>>> the patch below can be used to avoid the hang with infiniband or for debugging purposes.
>>>
>>> the hang occurs in communicator 6, and if i skip tests on communicator 2, no hang happens.
>>>
>>> the hang occurs on an intercomm:
>>> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
>>> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
>>> task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0, and then both hang in MPI_Wait().
>>>
>>> surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling that the hang
>>> only occurs with the openib btl, since vader should be used here.
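To make the runs Gilles describes concrete (the binary path and the mapping option are assumptions, not taken from his setup), the openib btl can be blacklisted from the command line with the btl MCA parameter:

# hangs with the openib btl enabled (illustrative command line)
mpirun -np 32 --map-by ppr:16:node ./MPI_Issend_rtoa_c

# no hang once the openib btl is blacklisted
mpirun -np 32 --map-by ppr:16:node --mca btl ^openib ./MPI_Issend_rtoa_c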
>>>
>>> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
>>> index 8b26f84..b9a704b 100644
>>> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
>>> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
>>> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>>>
>>>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes(); comm_count++) {
>>>          comm_index = MPITEST_get_comm_index(comm_count);
>>>          comm_type = MPITEST_get_comm_type(comm_count);
>>> +        if (2 == comm_count) continue;
>>>
>>>          /*
>>> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>>>           * left sub-communicator
>>>           */
>>>
>>> +        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
>>> +            /* insert a breakpoint here */
>>> +        }
>>>          * Reset a bunch of variables that will be set when we get our
>>>
>>> as a side note, which is very unlikely related to this issue, i noticed that the
>>> following program works fine, though it is reasonable to expect a hang.
>>>
>>> the root cause is MPI_Send uses the eager protocol, and though the communicators
>>> used by MPI_Send and MPI_Recv are different, they have the same (recycled) CID.
>>>
>>> fwiw, the test also completes with mpich.
>>>
>>> if not already done, should we provide an option not to recycle CIDs ?
>>> or flush unexpected/unmatched messages when a communicator is free'd ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> /* send a message (eager mode) in a communicator, and then
>>>  * receive it in another communicator, but with the same CID
>>>  */
>>> int main(int argc, char *argv[]) {
>>>     int rank, size;
>>>     int b;
>>>     MPI_Comm comm;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>>>
>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>     if (0 == rank) {
>>>         b = 0x55555555;
>>>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>>>     }
>>>     MPI_Comm_free(&comm);
>>>
>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>     if (1 == rank) {
>>>         b = 0xAAAAAAAA;
>>>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>>>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>>>     }
>>>     MPI_Comm_free(&comm);
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>>>> ok, will double check tomorrow this was the very same hang i fixed earlier
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> I was just looking at the overnight MTT report, and these were present going
>>>> back a long ways in both branches. They are in the Intel test suite.
>>>>
>>>> If you have already addressed them, then thanks!
>>>>
>>>> > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>> >
>>>> > Ralph,
>>>> >
>>>> > I fixed a hang earlier today in master, and the PR for v2.x is at
>>>> > https://github.com/open-mpi/ompi-release/pull/1368
>>>> >
>>>> > Can you please make sure you are running the latest master ?
>>>> >
>>>> > Which testsuite do these tests come from ?
>>>> > I will have a look tomorrow if the hang is still there
>>>> >
>>>> > Cheers,
>>>> >
>>>> > Gilles
>>>> >
>>>> > r...@open-mpi.org wrote:
>>>> >> Hey folks
>>>> >>
>>>> >> All of the tests that involve either ISsend_ator, SSend_ator, ISsend_rtoa,
>>>> >> or SSend_rtoa are hanging on master and v2.x.
>>>> >> Does anyone know what these tests do, and why we never seem to pass them?
>>>> >>
>>>> >> Do we care?
>>>> >> Ralph

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel