I finally got it :-)

/* I previously got it "almost" right ... */

Here is what happens in job 2 (with trunk):
MPI_Intercomm_create calls ompi_comm_get_rprocs, which calls ompi_proc_unpack
=> ompi_proc_unpack stores job 3 info into opal_dstore_peer
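For reference, a hypothetical sketch of the store that ompi_proc_unpack ends
up doing for each newly discovered proc (here: a job 3 proc seen by job 2).
This is not the actual trunk code: the store() signature, the key and the
cpuset variable are assumptions made just to illustrate where the data lands;
only the fetch calls in the patch below are taken from the real sources.

    /* sketch only: how a remote proc's entry could end up in opal_dstore_peer */
    static int sketch_store_remote_proc(ompi_proc_t *newproc, const char *cpuset)
    {
        int ret;
        opal_value_t kv;

        OBJ_CONSTRUCT(&kv, opal_value_t);
        kv.key = strdup(OPAL_DSTORE_CPUSET);   /* example key only */
        kv.type = OPAL_STRING;
        kv.data.string = strdup(cpuset);       /* assumed to come from the unpacked buffer */

        /* this is the point at issue: the info lands in opal_dstore_peer,
         * while ompi_proc_set_locality later looks only in
         * opal_dstore_internal and opal_dstore_nonpeer */
        ret = opal_dstore.store(opal_dstore_peer,
                                (opal_identifier_t*)&newproc->proc_name,
                                &kv);
        OBJ_DESTRUCT(&kv);
        return ret;
    }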


Then ompi_comm_get_rprocs calls ompi_proc_set_locality(job 3)
=> ompi_proc_set_locality tries to fetch job 3 info from
opal_dstore_internal (not found), then from opal_dstore_nonpeer (not found
either), and then fails.
This is simply a consequence of ompi_proc_unpack having stored job 3 info
in opal_dstore_peer rather than in opal_dstore_nonpeer.

I do not understand which of opal_dstore_peer and opal_dstore_nonpeer
should be used and when, so I wrote a defensive patch (fetch from
opal_dstore_nonpeer first, and then from opal_dstore_peer if nothing was
found).
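
The attached patch applies that fallback inline, in two places inside
ompi_proc_set_locality. A standalone helper expressing the same idea could
look like the sketch below (the helper name is made up; the fetch signature
is the one visible in the patch):

    /* sketch: fetch a key for a proc from opal_dstore_nonpeer, falling back
     * to opal_dstore_peer when the first fetch fails */
    static int fetch_nonpeer_then_peer(opal_identifier_t *id,
                                       const char *key,
                                       opal_list_t *vals)
    {
        int ret = opal_dstore.fetch(opal_dstore_nonpeer, id, key, vals);
        if (OMPI_SUCCESS != ret) {
            /* not in nonpeer: try the peer dstore before giving up */
            ret = opal_dstore.fetch(opal_dstore_peer, id, key, vals);
        }
        return ret;
    }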

Could someone please review this and comment on/fix it if needed?
(For example: store into opal_dstore_nonpeer instead of opal_dstore_peer,
*or*
fetch from opal_dstore_peer instead of opal_dstore_nonpeer,
and/or something else.)

With the patch, locality is correctly set, coll ml receives correct
information, and the test no longer hangs when mpirun is invoked without
--mca coll ^ml on a single-node, single-socket VM.

Bottom line: job 2 *did* receive job 3's information, but it stored that
information in one opal_dstore and tried to fetch it from another!

v1.8 is unaffected since it has only one dstore.

Cheers,

Gilles


On Wed, May 28, 2014 at 4:51 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Hmmm... I did some digging, and the best I can tell is that the root cause
> is that the second job ("b" in the test program) is never actually calling
> connect_accept!  It looks like a change may have occurred in
> Intercomm_create that causes it not to recognize the need to do so.
>
> Anyone confirm that diagnosis?
>
> FWIW: job 1 clearly receives and has all the required info in the correct
> places - it is ready to provide it to job 2, if/when job 2 actually calls
> connect_accept.
>
> On May 27, 2014, at 10:13 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> > Hi Gilles
> >
> > I concur on the typo and have fixed it - thanks for catching it. I'll
> > have to look into the problem you reported, as it had been fixed in the
> > past and was working the last time I checked. The info required for this
> > 3-way connect/accept is supposed to be in the modex provided by the
> > common communicator.
> >
> > On May 27, 2014, at 3:51 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >
> >> Folks,
> >>
> >> while debugging the dynamic/intercomm_create test from the ibm test
> >> suite, I found something odd.
> >>
> >> I ran *without* any batch manager on a VM (one socket and four cpus):
> >> mpirun -np 1 ./dynamic/intercomm_create
> >>
> >> It hangs by default;
> >> it works with --mca coll ^ml.
> >>
> >> Basically (a minimal sketch of the pattern follows the list):
> >> - task 0 spawns task 1
> >> - task 0 spawns task 2
> >> - a communicator is created for the 3 tasks via MPI_Intercomm_create()
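> >>
> >> A hypothetical sketch of that pattern (parent side only, not the actual
> >> ibm test source; the command name, tag and process counts are made up):
> >>
> >>     MPI_Comm ab, ac;              /* intercomms from the two spawns        */
> >>     MPI_Comm ab_intra, ac_intra;  /* merged intracomms                     */
> >>     MPI_Comm abc;                 /* final intercomm joining all 3 tasks   */
> >>
> >>     /* task 0 spawns task 1, then task 2 */
> >>     MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
> >>                    0, MPI_COMM_SELF, &ab, MPI_ERRCODES_IGNORE);
> >>     MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
> >>                    0, MPI_COMM_SELF, &ac, MPI_ERRCODES_IGNORE);
> >>
> >>     /* the children obtain their side via MPI_Comm_get_parent() and merge
> >>      * with high=1, so task 0 ends up as rank 0 in both intracomms */
> >>     MPI_Intercomm_merge(ab, 0, &ab_intra);   /* {task 0, task 1} */
> >>     MPI_Intercomm_merge(ac, 0, &ac_intra);   /* {task 0, task 2} */
> >>
> >>     /* join {task 0, task 1} with {task 2}: task 0 bridges to task 2
> >>      * (rank 1 in ac_intra); tasks 1 and 2 make the matching calls */
> >>     MPI_Intercomm_create(ab_intra, 0, ac_intra, 1, 42, &abc);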
> >>
> >> MPI_Intercomm_create() calls ompi_comm_get_rprocs(), which calls
> >> ompi_proc_set_locality().
> >>
> >> Then, on task 1, ompi_proc_set_locality() calls
> >> opal_dstore.fetch(opal_dstore_internal, "task 2"->proc_name, ...), which
> >> fails, and this is OK;
> >> then
> >> opal_dstore.fetch(opal_dstore_nonpeer, "task 2"->proc_name, ...), which
> >> fails, and this is *not* OK.
> >>
> >> /* on task 2, the first fetch for "task 1" fails but the second succeeds */
> >>
> >> My analysis is that when task 2 was created, it updated its
> >> opal_dstore_nonpeer with info about "task 1", which had previously been
> >> spawned by task 0.
> >> When task 1 was spawned, task 2 did not exist yet, and hence task 1's
> >> opal_dstore_nonpeer contained no reference to task 2.
> >> But when task 2 was spawned, task 1's opal_dstore_nonpeer was not
> >> updated, hence the failure.
> >>
> >> (On task 1, the proc_flags of task 2 have incorrect locality; this
> >> likely confuses coll ml and hangs the test.)
> >>
> >> Should task 1 have received new information when task 2 was spawned?
> >> Should task 2 have sent information to task 1 when it was spawned?
> >> Should task 1 have (tried to) fetch fresh information before invoking
> >> MPI_Intercomm_create()?
> >>
> >> Incidentally, I found that ompi_proc_set_locality calls opal_dstore.store
> >> with identifier &proc (the argument is &proc->proc_name everywhere else),
> >> so this is likely a bug/typo. The attached patch fixes this.
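> >>
> >> In other words (a sketch only; the dstore handle and the kv argument are
> >> illustrative, just to show the difference between the two calls):
> >>
> >>     /* before: the ompi_proc_t pointer itself is used as the identifier */
> >>     opal_dstore.store(opal_dstore_internal,
> >>                       (opal_identifier_t*)&proc, &kv);
> >>     /* after: the proc name is used, as in every other store/fetch call */
> >>     opal_dstore.store(opal_dstore_internal,
> >>                       (opal_identifier_t*)&proc->proc_name, &kv);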
> >>
> >> Thanks in advance for your feedback,
> >>
> >> Gilles
> >> <proc.patch>
> >
>
>
Index: ompi/proc/proc.c
===================================================================
--- ompi/proc/proc.c	(revision 31899)
+++ ompi/proc/proc.c	(working copy)
@@ -13,6 +13,8 @@
  * Copyright (c) 2012      Los Alamos National Security, LLC.  All rights
  *                         reserved. 
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -155,8 +157,12 @@
     if (OMPI_SUCCESS != (ret = opal_dstore.fetch(opal_dstore_nonpeer,
                                                  (opal_identifier_t*)&proc->proc_name,
                                                  OMPI_RTE_NODE_ID, &myvals))) {
-        OPAL_LIST_DESTRUCT(&myvals);
-        return ret;
+        if (OMPI_SUCCESS != (ret = opal_dstore.fetch(opal_dstore_peer,
+                                                     (opal_identifier_t*)&proc->proc_name,
+                                                     OMPI_RTE_NODE_ID, &myvals))) {
+            OPAL_LIST_DESTRUCT(&myvals);
+            return ret;
+        }
     }
     kv = (opal_value_t*)opal_list_get_first(&myvals);
     vpid = kv->data.uint32;
@@ -198,9 +204,13 @@
                                                          (opal_identifier_t*)&proc->proc_name,
                                                          OPAL_DSTORE_CPUSET, &myvals))) {
                 /* check the nonpeer data in case of comm_spawn */
-                ret = opal_dstore.fetch(opal_dstore_nonpeer,
-                                        (opal_identifier_t*)&proc->proc_name,
-                                        OPAL_DSTORE_CPUSET, &myvals);
+                if (OMPI_SUCCESS != ( ret = opal_dstore.fetch(opal_dstore_nonpeer,
+                                                              (opal_identifier_t*)&proc->proc_name,
+                                                              OPAL_DSTORE_CPUSET, &myvals))) {
+                    ret = opal_dstore.fetch(opal_dstore_peer,
+                                            (opal_identifier_t*)&proc->proc_name,
+                                            OPAL_DSTORE_CPUSET, &myvals);
+                }
             }
             if (OMPI_SUCCESS != ret) {
                 /* we don't know their cpuset, so nothing more we can say */
