Thanks guys,

So I was finally able to reproduce the bug on my (oversubscribed) VM with TCP.


MPI_Intercomm_merge (indirectly) incorrectly invokes iallgatherv.

1,main (MPI_Issend_rtoa_c.c:196)
1,  MPITEST_get_communicator (libmpitest.c:3544)
1,    PMPI_Intercomm_merge (pintercomm_merge.c:131)
1,      ompi_comm_activate (comm_cid.c:514)
1,        ompi_request_wait_completion (request.h:397)
1,          opal_progress (opal_progress.c:221)
1,            ompi_comm_request_progress (comm_request.c:132)
1,              ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
1,                ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
1,                  ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)


Global tasks 0 and 1 are each rank 0 (the root) of the intercomm, in group A and group B respectively.

They both invoke iallgatherv with scount=1, but context->rcounts[0]=0 (it should be 1).
Per the man page:
"The type signature associated with sendcount, sendtype, at process j must be equal to the type signature associated with recvcounts[j], recvtype at any other process."

So if the initial intention was not to gather only on the roots, then this is not possible with iallgatherv.
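
To illustrate the matching rule on an intercomm, here is a minimal standalone sketch (my own illustration, not the code in comm_cid.c; the intercomm is built with MPI_Comm_split + MPI_Intercomm_create just for the example): with one rank per group each sending one int, recvcounts[0] on each side must be 1, because it describes what rank 0 of the *remote* group sends.

#include <stdio.h>
#include <mpi.h>

/* two groups of one rank each, joined by an intercomm; each side sends
 * one int, so rcounts[0] must be 1 (it mirrors the remote sendcount) */
int main(int argc, char *argv[]) {
    int rank, size, sbuf, rbuf = -1;
    int rcounts[1] = { 1 };   /* rank 0 of the remote group sends 1 element */
    int rdispls[1] = { 0 };
    MPI_Comm local, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 != size) MPI_Abort(MPI_COMM_WORLD, 1);

    /* group A = rank 0, group B = rank 1; each rank is its group leader */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &local);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - rank, 0, &inter);

    sbuf = rank;
    MPI_Iallgatherv(&sbuf, 1, MPI_INT, &rbuf, rcounts, rdispls, MPI_INT,
                    inter, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d received %d from the remote group\n", rank, rbuf);

    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}

I used 2 MPI tasks only to keep recvcounts trivial; with larger remote groups, recvcounts has one entry per remote rank.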

With these mismatched counts, iallgatherv isends the data (scount > 0), but no matching irecv is ever posted (rcounts[0] == 0).
Then the intercomm is destroyed,
and the message is received later by opal_progress on a communicator that does not exist (any more); the message is hence stored by pml/ob1 in the non_existing_communicator_pending list.
/* btw, can someone kindly explain the rationale for this?
Is there any valid case in which a message can be received on a communicator that does not exist yet? If the only valid case is a communicator that does not exist any more, should the message simply be discarded? */

Much later in the test, a new communicator is created with the same CID as the intercomm, and a hang can occur; I can only suspect that the message sitting in the non_existing_communicator_pending list causes it.


Bottom line: I think the root cause is a bad invocation of iallgatherv.
Nathan, could you please have a look?


FWIW, during my investigation I was able to get rid of the hang by *not* recycling CIDs, with the patch below.


Cheers,

Gilles

diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
index f453ca1..7195aa2 100644
--- a/ompi/communicator/comm_init.c
+++ b/ompi/communicator/comm_init.c
@@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
     max = opal_pointer_array_get_size(&ompi_mpi_communicators);
     for ( i=3; i<max; i++ ) {
         comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
-        if ( NULL != comm ) {
+        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
             /* Communicator has not been freed before finalize */
             OBJ_RELEASE(comm);
         comm=(ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
@@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
          NULL != opal_pointer_array_get_item(&ompi_mpi_communicators,
                                              comm->c_contextid)) {
         opal_pointer_array_set_item ( &ompi_mpi_communicators,
-                                      comm->c_contextid, NULL);
+                                      comm->c_contextid, (void *)0x1);
     }

     /* reset the ompi_comm_f_to_c_table entry */
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 5f3f8fd..1d0f881 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,

     /* communicator pointer */
     comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
-    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
+    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
         /* This is a special case. A message for a not yet existing
          * communicator can happens. Instead of doing a matching we
          * will temporarily add it the a pending queue in the PML.

On 9/7/2016 2:28 AM, George Bosilca wrote:
I can make MPI_Issend_rtoa deadlock with vader and sm.

  George.


On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org wrote:

    FWIW: those tests hang for me with TCP (I don’t have openib on my
    cluster). I’ll check it with your change as well


    On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet
    <gil...@rist.or.jp> wrote:

    Ralph,


    This looks like another hang :-(


    I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node,
    8 cores per socket) with InfiniBand, and I always observe the same
    hang at the same place.


    Surprisingly, I do not get any hang if I blacklist the openib btl.


    The patch below can be used to avoid the hang with InfiniBand, or
    for debugging purposes.

    The hang occurs in communicator 6, and if I skip the tests on
    communicator 2, no hang happens.

    The hang occurs on an intercomm:

    Task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm.

    Task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.

    Task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0,
    and then both hang in MPI_Wait().

    Surprisingly, tasks 0 and 1 run on the same node, so it is very
    puzzling that the hang only occurs with the openib btl, since vader
    should be used here.
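
    For reference, here is a minimal standalone sketch of that pattern
    (my own illustration, not the test's code; the intercomm is built
    with MPI_Comm_split + MPI_Intercomm_create just for the example):

    #include <stdio.h>
    #include <mpi.h>

    /* split MPI_COMM_WORLD into group A (even ranks) and group B (odd
     * ranks), then have rank 0 of each group exchange one message with
     * MPI_Issend / MPI_Irecv and wait on it */
    int main(int argc, char *argv[]) {
        int rank, size, color, lrank, b = 0;
        MPI_Comm local, inter;
        MPI_Request req = MPI_REQUEST_NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

        color = rank % 2;   /* group A = even world ranks, group B = odd */
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);
        /* the local leaders are world ranks 0 and 1 */
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - color, 0, &inter);

        MPI_Comm_rank(local, &lrank);
        if (0 == lrank && 0 == color) {
            MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);  /* to rank 0 of group B */
        } else if (0 == lrank && 1 == color) {
            MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);   /* from rank 0 of group A */
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* the test hangs at its equivalent of this wait */

        MPI_Comm_free(&inter);
        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }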


    diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
    index 8b26f84..b9a704b 100644
    --- a/intel_tests/src/MPI_Issend_rtoa_c.c
    +++ b/intel_tests/src/MPI_Issend_rtoa_c.c
    @@ -173,8 +177,9 @@ int main(int argc, char *argv[])

         for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
              comm_count++) {
             comm_index = MPITEST_get_comm_index(comm_count);
             comm_type = MPITEST_get_comm_type(comm_count);
    +        if (2 == comm_count) continue;

             /*
    @@ -312,6 +330,9 @@ int main(int argc, char *argv[])
                          * left sub-communicator
                          */

    +                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
    +                        /* insert a breakpoint here */
    +                    }
              * Reset a bunch of variables that will be set when we get our



    As a side note, which is very unlikely to be related to this issue,
    I noticed the following program works fine, though it would be
    reasonable to expect a hang.

    The root cause is that MPI_Send uses the eager protocol, and though
    the communicators used by MPI_Send and MPI_Recv are different, they
    have the same (recycled) CID.

    FWIW, the test also completes with MPICH.


    If not already done, should we provide an option not to recycle
    CIDs?

    Or flush unexpected/unmatched messages when a communicator is freed?


    Cheers,


    Gilles


    #include <stdio.h>
    #include <mpi.h>

    /* send a message (eager mode) in a communicator, and then
     * receive it in an other communicator, but with the same CID
     */
    int main(int argc, char *argv[]) {
        int rank, size;
        int b;
        MPI_Comm comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        if (0 == rank) {
            b = 0x55555555;
            MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
        }
        MPI_Comm_free(&comm);

        MPI_Comm_dup(MPI_COMM_WORLD, &comm);
        if (1 == rank) {
            b = 0xAAAAAAAA;
            MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
            if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
        }
        MPI_Comm_free(&comm);

        MPI_Finalize();

        return 0;
    }
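
    (It can be built and run with e.g. "mpicc reproducer.c -o reproducer &&
    mpirun -np 2 ./reproducer"; the file name is just a placeholder.)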


    On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
    OK, I will double check tomorrow whether this was the very same hang
    I fixed earlier.

    Cheers,

    Gilles

    On Monday, September 5, 2016, r...@open-mpi.org wrote:

        I was just looking at the overnight MTT report, and these
        were present going back a long ways in both branches. They
        are in the Intel test suite.

        If you have already addressed them, then thanks!

        > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet
        <gilles.gouaillar...@gmail.com> wrote:
        >
        > Ralph,
        >
        > I fixed a hang earlier today in master, and the PR for
        v2.x is at
        https://github.com/open-mpi/ompi-release/pull/1368
        >
        > Can you please make sure you are running the latest master ?
        >
        > Which testsuite do these tests come from ?
        > I will have a look tomorrow if the hang is still there
        >
        > Cheers,
        >
        > Gilles
        >
        > r...@open-mpi.org wrote:
        >> Hey folks
        >>
        >> All of the tests that involve either ISsend_ator,
        SSend_ator, ISsend_rtoa, or SSend_rtoa are hanging on master
        and v2.x. Does anyone know what these tests do, and why we
        never seem to pass them?
        >>
        >> Do we care?
        >> Ralph
        >>




_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
