Thanks George,

fwiw, note the current behavior is a bit more "twisted" than that.

OPAL_MODEX_RECV_VALUE() returns successfully (i.e. err == OPAL_SUCCESS) but the OPAL_PMIX_NODEID value (i.e. val) is -1.

that means orted did "push" OPAL_PMIX_NODEID, but with an uninitialized value of -1 (the default set in the constructor).

fortunately, you used the same special value of -1 for the case where OPAL_MODEX_RECV_VALUE() fails (e.g. with OPAL_ERR_NOT_FOUND),

so bottom line, your commit does fix the crash.
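
To make the two cases concrete, here is a minimal sketch in C of a check that treats "lookup failed" and "lookup succeeded but returned the constructor default" the same way. All names below (the SKETCH_* macros and sketch_nodeid_is_usable) are hypothetical stand-ins for illustration, not the actual OPAL definitions, and the signed 32-bit type is an assumption.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
 * real OPAL return codes or types. */
#define SKETCH_SUCCESS       0
#define SKETCH_NOT_FOUND     (-13)
#define SKETCH_NODEID_UNSET  ((int32_t)-1)

/* Returns 1 only when the lookup succeeded AND the value is not the
 * "unset" default; returns 0 in every other case. */
static int sketch_nodeid_is_usable(int err, int32_t nodeid)
{
    if (err != SKETCH_SUCCESS) {
        return 0;   /* lookup failed, e.g. key not found */
    }
    if (nodeid == SKETCH_NODEID_UNSET) {
        return 0;   /* key was pushed, but never given a real value */
    }
    return 1;
}
```

With a check of this shape, a successful receive that carries -1 ends up on the same fallback path as a failed receive.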


Cheers,

Gilles

On 8/12/2016 2:09 AM, George Bosilca wrote:
I just pushed a solution to this problem in 8d0baf140f. If we are unable to extract the expected information from the RTE, we simply build a non-reordered communicator and gracefully return.

That being said, not being able to correctly retrieve OPAL_PMIX_NODEID has the potential to drastically decrease performance, since no specialized hierarchies can be built without the RTE information.
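
The fallback described above can be sketched roughly as follows. This is not the code from the commit; it is an illustration in C under stated assumptions: node ids are signed 32-bit values, -1 means "unknown", and the helper names (sketch_count_distinct_nodes, sketch_should_reorder) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "unknown node id" marker, assumed to be -1. */
#define SKETCH_NODEID_UNSET ((int32_t)-1)

/* Counts the distinct node ids among n ranks.
 * Returns -1 if any rank has an unknown node id. */
static int sketch_count_distinct_nodes(const int32_t *ids, size_t n)
{
    int distinct = 0;
    for (size_t i = 0; i < n; i++) {
        if (ids[i] == SKETCH_NODEID_UNSET) {
            return -1;              /* RTE information is incomplete */
        }
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (ids[j] == ids[i]) { /* already counted this node */
                seen = 1;
                break;
            }
        }
        if (!seen) {
            distinct++;
        }
    }
    return distinct;
}

/* Returns 1 if a reordered communicator can be attempted,
 * 0 if the caller should keep the original rank order. */
static int sketch_should_reorder(const int32_t *ids, size_t n)
{
    return sketch_count_distinct_nodes(ids, n) > 0;
}
```

When every rank reports -1, sketch_should_reorder returns 0 and the caller keeps the non-reordered communicator instead of crashing.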

  George.


On Wed, Aug 10, 2016 at 3:57 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Ralph,


    i noticed dist-graph/distgraph_test_4 from the ibm test suite
    fails when using a hostfile and running no task on the host
    running mpirun.

    n0$ mpirun --host n1:1,n2:1 -np 2 ./dist-graph/distgraph_test_4


    the root cause is that OPAL_PMIX_NODEID is correctly set (0, 1, 2) by
    mpirun, but for some reason, orted sets it to -1 everywhere.

    an indirect consequence is a crash of the test (it believes tasks
    run on zero distinct nodes instead of 2)


    this occurs only on master; v2.x is fine.


    Could you please have a look?


    Cheers,


    Gilles

    _______________________________________________
    devel mailing list
    devel@lists.open-mpi.org <mailto:devel@lists.open-mpi.org>
    https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
    <https://rfd.newmexicoconsortium.org/mailman/listinfo/devel>



