Folks, while debugging the dynamic/intercomm_create from the ibm test suite, i found something odd.
i ran *without* any batch manager on a VM (one socket and four cpus):

    mpirun -np 1 ./dynamic/intercomm_create

it hangs by default
it works with --mca coll ^ml

basically:
- task 0 spawns task 1
- task 0 spawns task 2
- a communicator is created for the 3 tasks via MPI_Intercomm_create()

MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which calls ompi_proc_set_locality().

then, on task 1, ompi_proc_set_locality() calls
    opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...)
which fails, and this is OK.
it then calls
    opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...)
which fails, and this is *not* OK.

/* on task 2, the first fetch for "task 1" fails but the second succeeds */

my analysis is that when task 2 was created, it updated its opal_dstore_nonpeer with info from "task 1", which had previously been spawned by task 0.
when task 1 was spawned, task 2 did not exist yet, and hence opal_dstore_nonpeer of task 1 contained no reference to task 2.
but when task 2 was spawned, opal_dstore_nonpeer of task 1 was not updated, hence the failure.
(on task 1, proc_flags of task 2 has an incorrect locality; this likely confuses coll ml and hangs the test)

- should task 1 have received new information when task 2 was spawned ?
- should task 2 have sent information to task 1 when it was spawned ?
- should task 1 have (tried to) get fresh information before invoking MPI_Intercomm_create() ?

incidentally, i found that ompi_proc_set_locality() calls opal_dstore.store with identifier &proc (the argument is &proc->proc_name everywhere else), so this is likely a bug/typo. the attached patch fixes this.

Thanks in advance for your feedback,

Gilles
Index: ompi/proc/proc.c
===================================================================
--- ompi/proc/proc.c	(revision 31891)
+++ ompi/proc/proc.c	(working copy)
@@ -231,7 +231,7 @@
     kvn.key = strdup(OPAL_DSTORE_LOCALITY);
     kvn.type = OPAL_HWLOC_LOCALITY_T;
     kvn.data.uint16 = locality;
-    ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc, &kvn);
+    ret = opal_dstore.store(opal_dstore_internal, (opal_identifier_t*)&proc->proc_name, &kvn);
     OBJ_DESTRUCT(&kvn);
     /* set the proc's local value as well */
     proc->proc_flags = locality;
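
To illustrate why the identifier passed to the store call in the patch matters, here is a minimal toy sketch. It does not use the real OPAL dstore API: toy_store_t, toy_proc_t, toy_store(), toy_fetch() and the two store_locality_*() helpers are all hypothetical stand-ins, and the identifier is assumed to be a 64-bit integer. The pre-patch cast is modelled with memcpy so the example stays well-defined; the point is only that a value stored under the bytes of the pointer variable cannot be found again under proc->proc_name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins only; these are NOT the real OPAL types or functions. */
typedef uint64_t toy_identifier_t;

typedef struct {
    toy_identifier_t proc_name;  /* assumed: the identifier lives inside the proc */
    uint16_t proc_flags;
} toy_proc_t;

#define TOY_STORE_MAX 8

typedef struct {
    toy_identifier_t ids[TOY_STORE_MAX];
    uint16_t values[TOY_STORE_MAX];
    int count;
} toy_store_t;

/* Store `value` under the identifier `id` points to; 0 on success, -1 if full. */
static int toy_store(toy_store_t *s, const toy_identifier_t *id, uint16_t value)
{
    assert(s != NULL && id != NULL);
    if (s->count >= TOY_STORE_MAX) {
        return -1;
    }
    s->ids[s->count] = *id;
    s->values[s->count] = value;
    s->count++;
    return 0;
}

/* Look up the identifier `id` points to; 0 on success, -1 if not found. */
static int toy_fetch(const toy_store_t *s, const toy_identifier_t *id, uint16_t *value)
{
    assert(s != NULL && id != NULL && value != NULL);
    for (int i = 0; i < s->count; i++) {
        if (s->ids[i] == *id) {
            *value = s->values[i];
            return 0;
        }
    }
    return -1;
}

/* Models the pre-patch call: the key is built from the bytes of the pointer
 * variable `proc` itself (what a cast of &proc would read), not from the
 * proc's name. */
static int store_locality_buggy(toy_store_t *s, toy_proc_t *proc, uint16_t locality)
{
    toy_identifier_t key = 0;
    size_t n = sizeof(proc) < sizeof(key) ? sizeof(proc) : sizeof(key);
    memcpy(&key, &proc, n);
    return toy_store(s, &key, locality);
}

/* Models the patched call: the key is the proc's own identifier. */
static int store_locality_fixed(toy_store_t *s, toy_proc_t *proc, uint16_t locality)
{
    return toy_store(s, &proc->proc_name, locality);
}
```

With a zero-initialised toy_store_t and a toy_proc_t whose proc_name is some small value, store_locality_buggy() succeeds but a later toy_fetch() keyed on &proc->proc_name finds nothing, whereas the entry written by store_locality_fixed() is found. Note this only models the typo in the patch; it says nothing about the separate question of why opal_dstore_nonpeer on task 1 has no entry for task 2.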