Folks,

While debugging the dynamic/intercomm_create test from the ibm test suite, I
found something odd.

I ran it *without* any batch manager on a VM (one socket and four CPUs):
mpirun -np 1 ./dynamic/intercomm_create

It hangs by default.
It works with --mca coll ^ml.

Basically, the test does the following:
- task 0 spawns task 1
- task 0 spawns task 2
- a communicator is created for the 3 tasks via MPI_Intercomm_create()

MPI_Intercomm_create() calls ompi_comm_get_rprocs() which calls
ompi_proc_set_locality()

Then, on task 1, ompi_proc_set_locality() calls
opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which
fails, and this is OK.
It then calls
opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which
fails, and this is *not* OK.

/* on task 2, the first fetch for "task 1" fails but the second succeeds */

My analysis is that when task 2 was created, it updated its
opal_dstore_nonpeer with info from "task 1", which had previously been
spawned by task 0.
When task 1 was spawned, task 2 did not exist yet, and hence task 1's
opal_dstore_nonpeer contains no reference to task 2.
When task 2 was later spawned, the opal_dstore_nonpeer of task 1 was not
updated, hence the failure.
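
To make the suspected ordering problem concrete, here is a toy model in C. It
is only a sketch of the analysis above, not Open MPI code: demo_store_t,
demo_spawn() and demo_fetch() are hypothetical names, and the "store" is just
a table of booleans. The assumption being modelled is that a newly spawned
task learns about the tasks that already exist, while existing tasks are never
told about the newcomer.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_TASKS 8

/* Hypothetical per-task store: known[i] is true when this task holds
 * data about task i. This is NOT the real opal_dstore. */
typedef struct {
    bool known[DEMO_MAX_TASKS];
} demo_store_t;

static demo_store_t demo_stores[DEMO_MAX_TASKS];
static int demo_ntasks = 0;

/* Spawn a new task. The new task records every task that already exists,
 * but (the suspected problem) no existing task is updated.
 * Returns the new task id, or -1 if the table is full. */
static int demo_spawn(void)
{
    if (demo_ntasks >= DEMO_MAX_TASKS) {
        return -1;
    }
    int id = demo_ntasks++;
    for (int i = 0; i < DEMO_MAX_TASKS; i++) {
        demo_stores[id].known[i] = (i < id);
    }
    return id;
}

/* Does `task` hold data about `peer`? */
static bool demo_fetch(int task, int peer)
{
    return demo_stores[task].known[peer];
}
```

Spawning three tasks in a row with demo_spawn() gives ids 0, 1 and 2. After
that, demo_fetch(2, 1) is true but demo_fetch(1, 2) is false, which matches the
asymmetry seen in the test.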

(On task 1, proc_flags of task 2 has an incorrect locality; this likely
confuses coll ml and hangs the test.)

Should task 1 have received new information when task 2 was spawned?
Should task 2 have sent information to task 1 when it was spawned?
Should task 1 have (tried to) get fresh information before invoking
MPI_Intercomm_create()?

Incidentally, I found that ompi_proc_set_locality() calls opal_dstore.store
with identifier &proc. The argument is &proc->proc_name everywhere else, so
this is likely a bug/typo. The attached patch fixes this.
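
For readers who want to see why &proc and &proc->proc_name give different
keys, here is a small self-contained C sketch. The types are hypothetical
stand-ins (demo_identifier_t and demo_proc_t are not the real Open MPI
definitions), and the sketch assumes an identifier has the same size as a
pointer, which is why uintptr_t is used.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real Open MPI definitions.
 * uintptr_t is used so the key has the same size as a pointer. */
typedef uintptr_t demo_identifier_t;

typedef struct {
    int               refcount;   /* placeholder for leading fields */
    demo_identifier_t proc_name;  /* the identifier intended as the key */
} demo_proc_t;

/* Read a key the way a store might: copy the bytes the pointer designates. */
static demo_identifier_t demo_key(const void *id)
{
    demo_identifier_t key;
    memcpy(&key, id, sizeof(key));
    return key;
}

/* Key seen with the pre-patch call: &proc is the address of the local
 * pointer variable, so the "identifier" is the pointer value itself. */
static demo_identifier_t demo_buggy_key(demo_proc_t *proc)
{
    return demo_key(&proc);
}

/* Key seen with the patched call: the process name. */
static demo_identifier_t demo_fixed_key(demo_proc_t *proc)
{
    return demo_key(&proc->proc_name);
}
```

With a demo_proc_t whose proc_name is 42, demo_fixed_key() returns 42, while
demo_buggy_key() returns the address of the structure. In this sketch, a value
stored under the buggy key would not be found by a later fetch that uses
&proc->proc_name.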

Thanks in advance for your feedback,

Gilles
Index: ompi/proc/proc.c
===================================================================
--- ompi/proc/proc.c	(revision 31891)
+++ ompi/proc/proc.c	(working copy)
@@ -231,7 +231,7 @@
     kvn.key = strdup(OPAL_DSTORE_LOCALITY);
     kvn.type = OPAL_HWLOC_LOCALITY_T;
     kvn.data.uint16 = locality;
-    ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc, &kvn);
+    ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc->proc_name, &kvn);
     OBJ_DESTRUCT(&kvn);
     /* set the proc's local value as well */
     proc->proc_flags = locality;
