Posted a possible fix to the intercomm hang. See https://github.com/open-mpi/ompi/pull/2061

-Nathan
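[For readers skimming the thread: a minimal sketch of the direction Nathan describes below, distributing the agreed value with a non-blocking broadcast instead of an iallgather-style exchange. The function and names here are illustrative assumptions, not the actual comm_cid.c code.]

#include <mpi.h>

/* Illustrative sketch only (assumed names, not the actual comm_cid.c
 * code): the local leader already holds the CID value agreed with the
 * remote leader, and distributes it to its local group with a
 * non-blocking broadcast. The old code avoided bcast because the
 * blocking version could deadlock; with MPI_Ibcast the caller can
 * keep progressing other operations while the bcast completes. */
static int distribute_cid_nb(int *agreed_cid, MPI_Comm local_comm,
                             MPI_Request *req)
{
    /* rank 0 of local_comm is assumed to be the group leader */
    return MPI_Ibcast(agreed_cid, 1, MPI_INT, 0, local_comm, req);
}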
> On Sep 7, 2016, at 6:53 AM, Nathan Hjelm <hje...@me.com> wrote:
>
> Looking at the code now. This code was more or less directly translated from
> the blocking version. I wouldn't be surprised if there is an error that I
> didn't catch with MTT on my laptop.
>
> That said, there is an old comment about not using bcast to avoid a possible
> deadlock. Since the collective is now non-blocking, that is no longer a
> problem. The simple answer is to use ibcast instead of iallgather. Will work
> on that fix now.
>
> -Nathan
>
>> On Sep 7, 2016, at 3:02 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>
>> Thanks guys,
>>
>> so I was finally able to reproduce the bug on my (oversubscribed) VM with tcp.
>>
>> MPI_Intercomm_merge (indirectly) invokes iallgatherv incorrectly:
>>
>> 1, main (MPI_Issend_rtoa_c.c:196)
>> 1, MPITEST_get_communicator (libmpitest.c:3544)
>> 1, PMPI_Intercomm_merge (pintercomm_merge.c:131)
>> 1, ompi_comm_activate (comm_cid.c:514)
>> 1, ompi_request_wait_completion (request.h:397)
>> 1, opal_progress (opal_progress.c:221)
>> 1, ompi_comm_request_progress (comm_request.c:132)
>> 1, ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
>> 1, ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
>> 1, ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)
>>
>> Global tasks 0 and 1 are both root (task 0) of the intercomm, in groups A and B respectively.
>> They both invoke iallgatherv with scount=1, but context->rcounts[0]=0 (it should be 1).
>> Per the man page:
>> "The type signature associated with sendcount, sendtype, at process j must
>> be equal to the type signature associated with recvcounts[j], recvtype at
>> any other process."
>>
>> So if the initial intention was to gather only on the roots, that is not possible with iallgatherv.
>>
>> What happens then is that iallgatherv isends data (scount > 0), but no matching
>> irecv is posted (rcounts[0] == 0).
>> Then the intercomm is destroyed, and the message is received later by opal_progress
>> on a communicator that no longer exists.
>> This message is hence stored by pml/ob1 in the non_existing_communicator_pending list.
>> /* btw, can someone kindly explain the rationale for this?
>> Is there any valid case in which a message can be received on a communicator
>> that does not exist yet?
>> If the only valid case is a communicator that no longer exists, should
>> the message simply be discarded? */
>>
>> Much later in the test, a new communicator is created with the same CID as
>> the intercomm, and a hang can occur.
>> I can only suspect the message in the non_existing_communicator_pending list
>> causes that.
>>
>> Bottom line, I think the root cause is a bad invocation of iallgatherv.
>> Nathan, could you please have a look?
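[To make the man-page requirement concrete, here is a minimal sketch of a correct MPI_Iallgatherv call on an intracommunicator: every recvcounts[j] equals the sendcount used by rank j. In the broken path above, the roots instead use scount=1 while rcounts[0]=0, leaving the root's isend unmatched. This example is illustrative only and is not taken from the test suite.]

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Every rank contributes one int, and recvcounts[j] on every process
 * equals the sendcount used by rank j, as the standard requires.
 * The broken path above violates this: sendcount is 1 on the roots
 * while recvcounts[0] is 0, so the root's isend is never matched. */
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendcount = 1;
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int j = 0; j < size; j++) {
        recvcounts[j] = 1;   /* must match rank j's sendcount */
        displs[j] = j;
    }

    MPI_Request req;
    MPI_Iallgatherv(&rank, sendcount, MPI_INT,
                    recvbuf, recvcounts, displs, MPI_INT,
                    MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (0 == rank) printf("gathered %d contributions\n", size);

    free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}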
>> FWIW, during my investigations, I was able to get rid of the hang by *not*
>> recycling CIDs, with the patch below.
>>
>> Cheers,
>>
>> Gilles
>>
>> diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
>> index f453ca1..7195aa2 100644
>> --- a/ompi/communicator/comm_init.c
>> +++ b/ompi/communicator/comm_init.c
>> @@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
>>      max = opal_pointer_array_get_size(&ompi_mpi_communicators);
>>      for ( i=3; i<max; i++ ) {
>>          comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
>> -        if ( NULL != comm ) {
>> +        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
>>              /* Communicator has not been freed before finalize */
>>              OBJ_RELEASE(comm);
>>              comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
>> @@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
>>          NULL != opal_pointer_array_get_item(&ompi_mpi_communicators, comm->c_contextid)) {
>>          opal_pointer_array_set_item ( &ompi_mpi_communicators,
>> -                                      comm->c_contextid, NULL);
>> +                                      comm->c_contextid, (void *)0x1);
>>      }
>>
>>      /* reset the ompi_comm_f_to_c_table entry */
>> diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> index 5f3f8fd..1d0f881 100644
>> --- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> +++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
>> @@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
>>
>>      /* communicator pointer */
>>      comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
>> -    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
>> +    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
>>          /* This is a special case. A message for a not yet existing
>>           * communicator can happens. Instead of doing a matching we
>>           * will temporarily add it the a pending queue in the PML.
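[Restating the idea behind this workaround as a standalone sketch; the table and function names below are hypothetical stand-ins, not the actual opal_pointer_array API beyond what the diff shows. A destroyed communicator leaves a tombstone in its CID slot instead of NULL, so an allocator that only reuses NULL slots never recycles the CID, and lookups treat the tombstone as "no such communicator".]

#include <stddef.h>

/* Hypothetical sketch of the tombstone trick in the diff above. */
#define COMM_TOMBSTONE ((void *)0x1)

static void *cid_table[1024];        /* stand-in for ompi_mpi_communicators */

static void comm_destruct(int cid)
{
    cid_table[cid] = COMM_TOMBSTONE; /* was: cid_table[cid] = NULL */
}

static int cid_is_free(int cid)
{
    return NULL == cid_table[cid];   /* tombstoned slots are never reused */
}

static void *comm_lookup(int cid)
{
    void *comm = cid_table[cid];
    return (COMM_TOMBSTONE == comm) ? NULL : comm;
}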
>> On 9/7/2016 2:28 AM, George Bosilca wrote:
>>> I can make MPI_Issend_rtoa deadlock with vader and sm.
>>>
>>> George.
>>>
>>> On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>> FWIW: those tests hang for me with TCP (I don't have openib on my cluster).
>>> I'll check it with your change as well.
>>>
>>>> On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>
>>>> Ralph,
>>>>
>>>> this looks like another hang :-(
>>>>
>>>> I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores
>>>> per socket) with infiniband, and I always observe the same hang at the same place.
>>>>
>>>> Surprisingly, I do not get any hang if I blacklist the openib btl.
>>>>
>>>> The patch below can be used to avoid the hang with infiniband, or for
>>>> debugging purposes.
>>>>
>>>> The hang occurs in communicator 6, and if I skip the tests on communicator 2,
>>>> no hang happens.
>>>>
>>>> The hang occurs on an intercomm:
>>>> task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm,
>>>> task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.
>>>> Task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0, and then
>>>> both hang in MPI_Wait().
>>>>
>>>> Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
>>>> that the hang only occurs with the openib btl, since vader should be used here.
>>>>
>>>> diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> index 8b26f84..b9a704b 100644
>>>> --- a/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
>>>> @@ -173,8 +177,9 @@ int main(int argc, char *argv[])
>>>>      for (comm_count = 0; comm_count < MPITEST_num_comm_sizes(); comm_count++) {
>>>>          comm_index = MPITEST_get_comm_index(comm_count);
>>>>          comm_type = MPITEST_get_comm_type(comm_count);
>>>> +        if (2 == comm_count) continue;
>>>>
>>>>      /*
>>>> @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
>>>>       * left sub-communicator
>>>>       */
>>>>
>>>> +        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
>>>> +            /* insert a breakpoint here */
>>>> +        }
>>>>      * Reset a bunch of variables that will be set when we get our
>>>>
>>>> As a side note, which is very unlikely to be related to this issue, I noticed
>>>> that the following program works fine, though it would be reasonable to expect a hang.
>>>> The root cause is that MPI_Send uses the eager protocol, and though the
>>>> communicators used by MPI_Send and MPI_Recv are different, they have the
>>>> same (recycled) CID.
>>>>
>>>> FWIW, the test also completes with mpich.
>>>>
>>>> If not already done, should we provide an option not to recycle CIDs?
>>>> Or flush unexpected/unmatched messages when a communicator is freed?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> #include <stdio.h>
>>>> #include <mpi.h>
>>>>
>>>> /* send a message (eager mode) in a communicator, and then
>>>>  * receive it in another communicator, but with the same CID
>>>>  */
>>>> int main(int argc, char *argv[]) {
>>>>     int rank, size;
>>>>     int b;
>>>>     MPI_Comm comm;
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);
>>>>
>>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>>     if (0 == rank) {
>>>>         b = 0x55555555;
>>>>         MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
>>>>     }
>>>>     MPI_Comm_free(&comm);
>>>>
>>>>     MPI_Comm_dup(MPI_COMM_WORLD, &comm);
>>>>     if (1 == rank) {
>>>>         b = 0xAAAAAAAA;
>>>>         MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
>>>>         if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
>>>>     }
>>>>     MPI_Comm_free(&comm);
>>>>
>>>>     MPI_Finalize();
>>>>
>>>>     return 0;
>>>> }
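[On the "flush unexpected/unmatched messages when a communicator is freed" idea raised above, a rough sketch of what that could look like. The list and structure names below are hypothetical, not the real pml/ob1 internals, and whether discarding is always safe is exactly the open question Gilles poses.]

#include <stdlib.h>

/* Hypothetical sketch of flushing unmatched eager messages when a
 * communicator is freed, so a later communicator that recycles the
 * same CID cannot accidentally match them. None of these names are
 * the real pml/ob1 structures. */
struct pending_frag {
    int                  ctx;      /* CID the fragment was sent on */
    void                *payload;
    struct pending_frag *next;
};

static struct pending_frag *non_existing_communicator_pending;

static void flush_pending_for_cid(int cid)
{
    struct pending_frag **cur = &non_existing_communicator_pending;
    while (*cur) {
        if ((*cur)->ctx == cid) {
            struct pending_frag *dead = *cur;
            *cur = dead->next;    /* unlink ... */
            free(dead->payload);  /* ... and discard the message */
            free(dead);
        } else {
            cur = &(*cur)->next;
        }
    }
}

/* A communicator destructor would call flush_pending_for_cid(cid)
 * before returning the CID to the allocator. */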
>>>> On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
>>>>> OK, I will double check tomorrow that this was the very same hang I fixed earlier.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On Monday, September 5, 2016, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>>> I was just looking at the overnight MTT report, and these were present
>>>>> going back a long way in both branches. They are in the Intel test suite.
>>>>>
>>>>> If you have already addressed them, then thanks!
>>>>>
>>>>>> On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet
>>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> I fixed a hang earlier today in master, and the PR for v2.x is at
>>>>>> https://github.com/open-mpi/ompi-release/pull/1368
>>>>>>
>>>>>> Can you please make sure you are running the latest master?
>>>>>>
>>>>>> Which test suite do these tests come from?
>>>>>> I will have a look tomorrow if the hang is still there.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> r...@open-mpi.org wrote:
>>>>>>> Hey folks
>>>>>>>
>>>>>>> All of the tests that involve either ISsend_ator, SSend_ator,
>>>>>>> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does anyone
>>>>>>> know what these tests do, and why we never seem to pass them?
>>>>>>>
>>>>>>> Do we care?
>>>>>>> Ralph

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel