Ralph,
this looks like another hang :-(
I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node, 8
cores per socket) with InfiniBand,
and I always observe the same hang at the same place.
Surprisingly, I do not get any hang if I blacklist the openib btl.
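(For reference, that can be done on the command line, e.g.
mpirun --mca btl ^openib ...)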
The patch below can be used to avoid the hang with InfiniBand, or for
debugging purposes.
The hang occurs in communicator 6, and if I skip the tests on communicator
2, no hang happens.
The hang occurs on an intercomm (see the minimal sketch after the patch below):
- task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm
- task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm
Task 0 does an MPI_Issend to task 1, task 1 does a matching MPI_Irecv
from task 0, and then both hang in MPI_Wait().
Surprisingly, tasks 0 and 1 run on the same node, so it is very puzzling
that the hang only occurs with the openib btl, since vader should be
used here.
diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
index 8b26f84..b9a704b 100644
--- a/intel_tests/src/MPI_Issend_rtoa_c.c
+++ b/intel_tests/src/MPI_Issend_rtoa_c.c
@@ -173,8 +177,9 @@ int main(int argc, char *argv[])
     for (comm_count = 0; comm_count < MPITEST_num_comm_sizes();
          comm_count++) {
         comm_index = MPITEST_get_comm_index(comm_count);
         comm_type = MPITEST_get_comm_type(comm_count);
+        if (2 == comm_count) continue;

         /*
@@ -312,6 +330,9 @@ int main(int argc, char *argv[])
          * left sub-communicator
          */
+        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
+            /* insert a breakpoint here */
+        }

         /*
          * Reset a bunch of variables that will be set when we get our
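To make the intercomm pattern described above concrete, here is a minimal
sketch of the communication that hangs. This is a hypothetical illustration
written from the description above, not the actual intel test; the even/odd
group split and the MPI_Intercomm_create leader arguments are my assumptions.

#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, b = 0;
    MPI_Comm intra, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* even world ranks form group A, odd ranks form group B, so
     * world rank 0 is rank 0 of group A and world rank 1 is rank 0 of group B */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &intra);
    /* local leader is rank 0 in both groups; the remote leader (a rank in
     * MPI_COMM_WORLD) is world rank 1 for group A and world rank 0 for group B */
    MPI_Intercomm_create(intra, 0, MPI_COMM_WORLD, (rank % 2) ? 0 : 1, 0, &inter);

    if (0 == rank) {
        /* rank 0 of group A sends to rank 0 of the remote group (group B) */
        MPI_Issend(&b, 1, MPI_INT, 0, 0, inter, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* hangs here */
    } else if (1 == rank) {
        /* rank 0 of group B receives from rank 0 of the remote group (group A) */
        MPI_Irecv(&b, 1, MPI_INT, 0, 0, inter, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* hangs here */
    }

    MPI_Comm_free(&inter);
    MPI_Comm_free(&intra);
    MPI_Finalize();
    return 0;
}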
As a side note, and very unlikely related to this issue, I noticed
the following program works fine,
though it would be reasonable to expect a hang.
The root cause is that MPI_Send uses the eager protocol, and though
the communicators used by MPI_Send and MPI_Recv
are different, they have the same (recycled) CID, so the late receive
still matches the unexpected message.
FWIW, the test also completes with mpich.
If not already done, should we provide an option not to recycle CIDs?
Or flush unexpected/unmatched messages when a communicator is freed?
Cheers,
Gilles
#include <stdio.h>
#include <mpi.h>

/* send a message (eager mode) in a communicator, and then
 * receive it in another communicator, but with the same CID
 */
int main(int argc, char *argv[]) {
    int rank, size;
    int b;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (0 == rank) {
        b = 0x55555555;
        /* eager send: completes locally before any matching recv is posted */
        MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
    }
    MPI_Comm_free(&comm);

    /* this new communicator recycles the CID of the one just freed */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (1 == rank) {
        b = 0xAAAAAAAA;
        /* matches the message sent on the previous (freed) communicator */
        MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
        if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
    }
    MPI_Comm_free(&comm);

    MPI_Finalize();
    return 0;
}
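It can be built and run with something like the following (the file name
is just an example):
mpicc cid_recycle.c -o cid_recycle && mpirun -np 2 ./cid_recycle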
On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
ok, I will double-check tomorrow that this was the very same hang I fixed
earlier
Cheers,
Gilles
On Monday, September 5, 2016, r...@open-mpi.org wrote:
I was just looking at the overnight MTT report, and these were
present going back a long ways in both branches. They are in the
Intel test suite.
If you have already addressed them, then thanks!
> On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> I fixed a hang earlier today in master, and the PR for v2.x is
at https://github.com/open-mpi/ompi-release/pull/1368
>
> Can you please make sure you are running the latest master?
>
> Which test suite do these tests come from?
> I will have a look tomorrow to see if the hang is still there
>
> Cheers,
>
> Gilles
>
> r...@open-mpi.org wrote:
>> Hey folks
>>
>> All of the tests that involve either ISsend_ator, SSend_ator,
ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does
anyone know what these tests do, and why we never seem to pass them?
>>
>> Do we care?
>> Ralph
>>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel