Ralph,

This looks like another hang :-(


I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8 cores per socket) with InfiniBand, and I always observe the same hang at the same place.


Surprisingly, I do not get any hang if I blacklist the openib btl.
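
For reference, I exclude it with something along these lines (the MCA parameter is the relevant part; the rest of the command line is just my setup):

    mpirun -np 32 --mca btl ^openib ./MPI_Issend_rtoa_c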


The patch below can be used to avoid the hang with InfiniBand, or for debugging purposes.

The hang occurs in communicator 6, and if I skip the tests on communicator 2, no hang happens.

The hang occurs on an intercomm:

Task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm.

Task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.

Task 0 does an MPI_Issend to task 1, task 1 does an MPI_Irecv from task 0, and then both hang in MPI_Wait().

Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling that the hang only occurs with the openib btl, since vader should be used here.
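
To make the pattern concrete, here is a minimal sketch of what ends up happening (this is only an illustration, not the Intel test itself; the group split and the choice of leaders are my assumptions):

#include <mpi.h>

int main(int argc, char *argv[]) {
    int wrank, size, color, buf = 42;
    MPI_Comm local, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

    /* two disjoint groups: A = even world ranks, B = odd world ranks */
    color = wrank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, wrank, &local);
    /* local leaders are world ranks 0 and 1; each group names the other's leader */
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - color, 0, &inter);

    if (0 == wrank) {
        /* rank 0 of group A sends to rank 0 of the remote group (group B) */
        MPI_Issend(&buf, 1, MPI_INT, 0, 0, inter, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* this is where the hang shows up */
    } else if (1 == wrank) {
        /* rank 0 of group B receives from rank 0 of the remote group (group A) */
        MPI_Irecv(&buf, 1, MPI_INT, 0, 0, inter, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* this is where the hang shows up */
    }

    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();

    return 0;
}

In the actual test the intercomm is presumably built by the MPITEST framework rather than by MPI_Intercomm_create directly, but the Issend/Irecv/Wait part is the same pattern.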


diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
index 8b26f84..b9a704b 100644
--- a/intel_tests/src/MPI_Issend_rtoa_c.c
+++ b/intel_tests/src/MPI_Issend_rtoa_c.c
@@ -173,8 +177,9 @@ int main(int argc, char *argv[])

     for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
          comm_count++) {
         comm_index = MPITEST_get_comm_index(comm_count);
         comm_type = MPITEST_get_comm_type(comm_count);
+        if (2 == comm_count) continue;

         /*
@@ -312,6 +330,9 @@ int main(int argc, char *argv[])
                      * left sub-communicator
                      */

+                    if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
+                        /* insert a breakpoint here */
+                    }
          * Reset a bunch of variables that will be set when we get our



As a side note, which is very unlikely to be related to this issue, I noticed the following program works fine, though it would be reasonable to expect a hang.

The root cause is that MPI_Send uses the eager protocol, and though the communicators used by MPI_Send and MPI_Recv are different, they have the same (recycled) CID.

FWIW, the test also completes with MPICH.


If not already done, should we provide an option not to recycle CIDs?

Or flush unexpected/unmatched messages when a communicator is freed?
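
Just to illustrate the second idea, here is a purely hypothetical sketch (the types and function below are mine, not Open MPI's actual internals): when a communicator is freed, anything still sitting in the unexpected queue for its CID would be discarded before that CID can be reused.

#include <stdlib.h>

/* hypothetical unexpected-message queue entry, keyed by the CID it was sent on */
struct unexpected_msg {
    int cid;
    int src, tag;
    struct unexpected_msg *next;
};

/* drop every queued unexpected message that was sent on the given CID */
static void flush_unexpected_for_cid(struct unexpected_msg **queue, int cid)
{
    struct unexpected_msg **cur = queue;
    while (NULL != *cur) {
        if (cid == (*cur)->cid) {
            struct unexpected_msg *stale = *cur;
            *cur = stale->next;   /* unlink and discard the stale message */
            free(stale);
        } else {
            cur = &(*cur)->next;
        }
    }
}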


Cheers,


Gilles


#include <stdio.h>
#include <mpi.h>

/* send a message (eager mode) in a communicator, and then
 * receive it in another communicator, but with the same CID
 */
int main(int argc, char *argv[]) {
    int rank, size;
    int b;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (0 == rank) {
        b = 0x55555555;
        /* eager send: returns at rank 0 while the message waits unmatched at rank 1 */
        MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
    }
    MPI_Comm_free(&comm);

    /* this new communicator gets the same (recycled) CID as the one just freed */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (1 == rank) {
        b = 0xAAAAAAAA;
        /* matches the message sent on the previous, already freed, communicator */
        MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
        if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
    }
    MPI_Comm_free(&comm);

    MPI_Finalize();

    return 0;
}
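
For reference, I build and run it with something like this (the file name is arbitrary):

    mpicc cid_recycle.c -o cid_recycle
    mpirun -np 2 ./cid_recycle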


On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
ok, will double check tomorrow this was the very same hang i fixed earlier

Cheers,

Gilles

On Monday, September 5, 2016, r...@open-mpi.org wrote:

    I was just looking at the overnight MTT report, and these were
    present going back a long ways in both branches. They are in the
    Intel test suite.

    If you have already addressed them, then thanks!

    > On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
    >
    > Ralph,
    >
    > I fixed a hang earlier today in master, and the PR for v2.x is at https://github.com/open-mpi/ompi-release/pull/1368
    >
    > Can you please make sure you are running the latest master ?
    >
    > Which testsuite do these tests come from ?
    > I will have a look tomorrow if the hang is still there
    >
    > Cheers,
    >
    > Gilles
    >
    > r...@open-mpi.org wrote:
    >> Hey folks
    >>
    >> All of the tests that involve either ISsend_ator, SSend_ator,
    ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does
    anyone know what these tests do, and why we never seem to pass them?
    >>
    >> Do we care?
    >> Ralph
    >>




_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
