Thanks guys,
So I was finally able to reproduce the bug on my (oversubscribed) VM with TCP.
MPI_Intercomm_merge (indirectly) incorrectly invokes iallgatherv.
1, main (MPI_Issend_rtoa_c.c:196)
1, MPITEST_get_communicator (libmpitest.c:3544)
1, PMPI_Intercomm_merge (pintercomm_merge.c:131)
1, ompi_comm_activate (comm_cid.c:514)
1, ompi_request_wait_completion (request.h:397)
1, opal_progress (opal_progress.c:221)
1, ompi_comm_request_progress (comm_request.c:132)
1, ompi_comm_allreduce_inter_leader_reduce (comm_cid.c:699)
1, ompi_comm_allreduce_inter_allgather (comm_cid.c:723)
1, ompi_coll_libnbc_iallgatherv_inter (nbc_iallgatherv.c:173)
Global tasks 0 and 1 are both root (task 0) of the intercomm, in groups A and B respectively.
They both invoke iallgatherv with scount=1, but context->rcounts[0]=0 (it should be 1).
Per the man page:
"The type signature associated with sendcount, sendtype, at process j
must be equal to the type signature associated with recvcounts[j],
recvtype at any other process."
So if the initial intention was not to gather only on the roots, then this is not possible with iallgatherv.
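To make the rule concrete, here is a minimal standalone sketch of an inter-communicator MPI_Iallgatherv (this is *not* the comm_cid.c code path; the intercomm construction and variable names are just mine): every remote rank sends one MPI_INT, so recvcounts[j] must be 1 for every remote rank j. Passing recvcounts[0] = 0, as in the trace above, would leave the underlying isend without a matching irecv.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, i, remote_size;
    MPI_Comm local, inter;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    /* split MPI_COMM_WORLD into group A (even ranks) and group B (odd ranks),
     * then build an intercomm between them */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);
    MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, 1 - (rank % 2), 0, &inter);
    MPI_Comm_remote_size(inter, &remote_size);

    int sendbuf = rank;
    int *recvbuf = malloc(remote_size * sizeof(int));
    int *recvcounts = malloc(remote_size * sizeof(int));
    int *displs = malloc(remote_size * sizeof(int));
    for (i = 0; i < remote_size; i++) {
        /* every remote rank sends sendcount = 1, so recvcounts[i] must be 1;
         * recvcounts[0] = 0 (the reported value) would leave remote rank 0's
         * isend without a matching irecv */
        recvcounts[i] = 1;
        displs[i] = i;
    }

    MPI_Iallgatherv(&sendbuf, 1, MPI_INT, recvbuf, recvcounts, displs,
                    MPI_INT, inter, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(recvbuf); free(recvcounts); free(displs);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&local);
    MPI_Finalize();
    return 0;
}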
What happens then is that iallgatherv isends data (scount > 0), but no matching irecv is posted (rcounts[0] == 0).
Then the intercomm is destroyed, and the message is later received by opal_progress on a communicator that does not exist (any more).
This message is hence stored by pml/ob1 on the non_existing_communicator_pending list.
/* btw, can someone kindly explain the rationale for this to me?
Is there any valid case in which a message can be received on a
communicator that does not exist yet?
If the only valid case is that the communicator does not exist any more,
should the message simply be discarded? */
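For what it is worth, my reading of that special case (a schematic only; the structure and function names below are mine, not the actual ob1 ones) is a park-and-replay pattern: a fragment whose context id does not resolve to a live communicator is parked, and it is replayed once some communicator with that cid is created. That is also exactly how a stale fragment left over from a freed communicator could later be matched against an unrelated communicator that recycles the cid.

#include <stdint.h>
#include <stddef.h>

typedef struct pending_frag {
    struct pending_frag *next;
    uint32_t ctx;                /* context id (cid) carried by the fragment */
    /* ... payload ... */
} pending_frag_t;

/* stand-in for the non_existing_communicator_pending list */
pending_frag_t *pending = NULL;

/* called when the cid of an incoming fragment does not resolve to a live communicator */
void park_fragment(pending_frag_t *frag)
{
    frag->next = pending;
    pending = frag;
}

/* called when a communicator with context id ctx becomes live:
 * parked fragments carrying that cid are fed back into matching */
void replay_pending(uint32_t ctx, void (*match)(pending_frag_t *))
{
    pending_frag_t **p = &pending;
    while (NULL != *p) {
        if (ctx == (*p)->ctx) {
            pending_frag_t *frag = *p;
            *p = frag->next;
            match(frag);     /* a stale fragment from a freed communicator with a
                              * recycled cid gets matched here against the new,
                              * unrelated communicator */
        } else {
            p = &(*p)->next;
        }
    }
}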
Much later in the test, a new communicator is created with the same CID as the intercomm, and a hang can occur.
I can only suspect that the message in the non_existing_communicator_pending list causes it.
Bottom line: I think the root cause is a bad invocation of iallgatherv.
Nathan, could you please have a look?
FWIW, during my investigation I was able to get rid of the hang by
*not* recycling CIDs, with the patch below.
Cheers,
Gilles
diff --git a/ompi/communicator/comm_init.c b/ompi/communicator/comm_init.c
index f453ca1..7195aa2 100644
--- a/ompi/communicator/comm_init.c
+++ b/ompi/communicator/comm_init.c
@@ -297,7 +297,7 @@ int ompi_comm_finalize(void)
     max = opal_pointer_array_get_size(&ompi_mpi_communicators);
     for ( i=3; i<max; i++ ) {
         comm = (ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
-        if ( NULL != comm ) {
+        if ( NULL != comm && (ompi_communicator_t *)0x1 != comm) {
             /* Communicator has not been freed before finalize */
             OBJ_RELEASE(comm);
             comm=(ompi_communicator_t *)opal_pointer_array_get_item(&ompi_mpi_communicators, i);
@@ -435,7 +435,7 @@ static void ompi_comm_destruct(ompi_communicator_t* comm)
         NULL != opal_pointer_array_get_item(&ompi_mpi_communicators, comm->c_contextid)) {
         opal_pointer_array_set_item ( &ompi_mpi_communicators,
-                                      comm->c_contextid, NULL);
+                                      comm->c_contextid, (void *)0x1);
     }

     /* reset the ompi_comm_f_to_c_table entry */
diff --git a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
index 5f3f8fd..1d0f881 100644
--- a/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
+++ b/ompi/mca/pml/ob1/pml_ob1_recvfrag.c
@@ -128,7 +128,7 @@ void mca_pml_ob1_recv_frag_callback_match(mca_btl_base_module_t* btl,
     /* communicator pointer */
     comm_ptr = ompi_comm_lookup(hdr->hdr_ctx);
-    if(OPAL_UNLIKELY(NULL == comm_ptr)) {
+    if(OPAL_UNLIKELY(NULL == comm_ptr || (ompi_communicator_t *)0x1 == comm_ptr)) {
        /* This is a special case. A message for a not yet existing
         * communicator can happens. Instead of doing a matching we
         * will temporarily add it the a pending queue in the PML.
On 9/7/2016 2:28 AM, George Bosilca wrote:
I can make MPI_Issend_rtoa deadlock with vader and sm.
George.
On Tue, Sep 6, 2016 at 12:06 PM, r...@open-mpi.org wrote:
FWIW: those tests hang for me with TCP (I don’t have openib on my
cluster). I’ll check it with your change as well
On Sep 6, 2016, at 1:29 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
Ralph,
This looks like another hang :-(
I ran MPI_Issend_rtoa_c on 32 tasks (2 nodes, 2 sockets per node,
8 cores per socket) with InfiniBand,
and I always observe the same hang at the same place.
Surprisingly, I do not get any hang if I blacklist the openib btl.
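(For reference, I blacklist it with the standard MCA exclusion syntax; the task count and binary path below are just my local invocation:)

mpirun --mca btl ^openib -np 32 ./MPI_Issend_rtoa_c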
The patch below can be used to avoid the hang with InfiniBand, or for
debugging purposes.
The hang occurs on communicator 6, and if I skip the tests on
communicator 2, no hang happens.
The hang occurs on an intercomm:
task 0 (from MPI_COMM_WORLD) has rank 0 in group A of the intercomm,
task 1 (from MPI_COMM_WORLD) has rank 0 in group B of the intercomm.
Task 0 MPI_Issend's to task 1, task 1 MPI_Irecv's from task 0,
and then both hang in MPI_Wait().
Surprisingly, tasks 0 and 1 run on the same node, so it is very
puzzling that the hang only occurs with the openib btl,
since vader should be used here.
diff --git a/intel_tests/src/MPI_Issend_rtoa_c.c b/intel_tests/src/MPI_Issend_rtoa_c.c
index 8b26f84..b9a704b 100644
--- a/intel_tests/src/MPI_Issend_rtoa_c.c
+++ b/intel_tests/src/MPI_Issend_rtoa_c.c
@@ -173,8 +177,9 @@ int main(int argc, char *argv[])
     for (comm_count = 0; comm_count < MPITEST_num_comm_sizes(); comm_count++) {
         comm_index = MPITEST_get_comm_index(comm_count);
         comm_type = MPITEST_get_comm_type(comm_count);
+        if (2 == comm_count) continue;

         /*
@@ -312,6 +330,9 @@ int main(int argc, char *argv[])
          * left sub-communicator
          */
+        if (6 == comm_count && 12 == length_count && MPITEST_current_rank < 2) {
+            /* insert a breakpoint here */
+        }

          * Reset a bunch of variables that will be set when we get our
As a side note, which is very unlikely related to this issue, I
noticed the following program works fine,
though it would be reasonable to expect a hang.
The root cause is that MPI_Send uses the eager protocol, and though
the communicators used by MPI_Send and MPI_Recv
are different, they have the same (recycled) CID.
FWIW, the test also completes with MPICH.
If not already done, should we provide an option not to recycle
CIDs?
Or flush unexpected/unmatched messages when a communicator is
free'd?
Cheers,
Gilles
#include <stdio.h>
#include <mpi.h>
/* send a message (eager mode) in a communicator, and then
 * receive it in another communicator, but with the same CID
 */
int main(int argc, char *argv[]) {
    int rank, size;
    int b;
    MPI_Comm comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (2 > size) MPI_Abort(MPI_COMM_WORLD, 1);

    /* eager send on the first dup'ed communicator */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (0 == rank) {
        b = 0x55555555;
        MPI_Send(&b, 1, MPI_INT, 1, 0, comm);
    }
    MPI_Comm_free(&comm);

    /* receive on a second dup'ed communicator that recycles the same CID */
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    if (1 == rank) {
        b = 0xAAAAAAAA;
        MPI_Recv(&b, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
        if (0x55555555 != b) MPI_Abort(MPI_COMM_WORLD, 2);
    }
    MPI_Comm_free(&comm);

    MPI_Finalize();
    return 0;
}
On 9/6/2016 12:03 AM, Gilles Gouaillardet wrote:
OK, I will double-check tomorrow that this was the very same hang I
fixed earlier.
Cheers,
Gilles
On Monday, September 5, 2016, r...@open-mpi.org wrote:
I was just looking at the overnight MTT report, and these
were present going back a long ways in both branches. They
are in the Intel test suite.
If you have already addressed them, then thanks!
> On Sep 5, 2016, at 7:48 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> I fixed a hang earlier today in master, and the PR for v2.x is at
> https://github.com/open-mpi/ompi-release/pull/1368
>
> Can you please make sure you are running the latest master?
>
> Which test suite do these tests come from?
> I will have a look tomorrow if the hang is still there.
>
> Cheers,
>
> Gilles
>
> r...@open-mpi.org wrote:
>> Hey folks
>>
>> All of the tests that involve either ISsend_ator, SSend_ator,
>> ISsend_rtoa, or SSend_rtoa are hanging on master and v2.x. Does
>> anyone know what these tests do, and why we never seem to pass them?
>>
>> Do we care?
>> Ralph
>>