Paul,

Generally speaking, that is a good point.
Another option could be to write a script that detects symbols defined more than once (see the rough sketch below).

In this case, the mca_coll_hcoll module is linked against the proprietary libhcoll.so, and the ml symbols are defined in both mca_coll_ml.so and libhcoll.so.
I am not sure (I blame my poor understanding of linkers) whether this would be an error if Open MPI were configure'd with --disable-dlopen.
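
Something like the following could do it. This is only a rough sketch: it assumes GNU nm is available, it takes the shared objects to inspect as command-line arguments, and the file names in the usage comment are just examples.

#!/usr/bin/env python
# Report dynamic symbols that are defined by more than one shared object.
# Sketch only: relies on GNU nm ("nm -D --defined-only").
import subprocess
import sys
from collections import defaultdict

def defined_symbols(lib):
    # "nm -D --defined-only" prints one "address type name" line per defined symbol
    out = subprocess.check_output(["nm", "-D", "--defined-only", lib],
                                  universal_newlines=True)
    return set(line.split()[-1] for line in out.splitlines() if line.strip())

def main(libs):
    owners = defaultdict(list)   # symbol -> libraries that define it
    for lib in libs:
        for sym in defined_symbols(lib):
            owners[sym].append(lib)
    for sym, defining in sorted(owners.items()):
        if len(defining) > 1:
            print("%s is defined in: %s" % (sym, ", ".join(defining)))

if __name__ == "__main__":
    # e.g.: python dupsyms.py <prefix>/lib/openmpi/mca_coll_ml.so /path/to/libhcoll.so
    main(sys.argv[1:])

Running it over the installed mca_*.so components plus libhcoll.so should flag the clashing ml_* symbols.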

Cheers,

Gilles

On 6/26/2015 8:12 AM, Paul Hargrove wrote:
I can see cloning an existing component's source as a starting point for a new one being a common occurrence (at least relative to creating new components from scratch).
So this is probably not the last time this issue will occur.

Would a build with --disable-dlopen have detected this problem (by failing to build libmpi due to multiply defined symbols)? If so, then maybe Jenkins should apply this test (which would NOT depend on the dlopen load order).

-Paul


On Thu, Jun 25, 2015 at 3:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

    Devendar literally just reproduced this here at the developer meeting, too.

    Sweet -- ok, so we understand what is going on.

    Devendar/Mellanox is going to talk about this internally and get
    back to us.


    > On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
    >
    > Jeff,
    >
    > This is exactly what happens.
    >
    > I will send a stack trace later.
    >
    > Cheers,
    >
    > Gilles
    >
    > On Thursday, June 25, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
    > Gilles --
    >
    > Can you send a stack trace from one of these crashes?
    >
    > I am *guessing* that the following is happening:
    >
    > 1. coll selection begins
    > 2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
    > 3. coll hcoll is queried, which ends up calling down into libhcoll.  libhcoll calls a coll_ml_* symbol (which is apparently in a different .o file in that library), but the linker has already resolved that coll_ml_* symbol to the one in the coll ml DSO.  So execution transfers back up into the coll ml DSO, and ... kaboom.
    >
    > A simple stack trace will confirm this -- it should show execution going down into libhcoll and then back up into coll ml.
    >
    >
    >
    >
    > > On Jun 25, 2015, at 1:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
    > >
    > > Folks,
    > >
    > > This is a follow-up on an issue reported by Daniel on the users mailing list:
    > > Open MPI is built with hcoll from Mellanox, and the coll ml module has default priority zero.
    > >
    > > On my cluster, it works just fine; on Daniel's cluster, it crashes.
    > >
    > > I was able to reproduce the crash by tweaking mca_base_component_path to ensure the coll ml module is loaded first.
    > >
    > > Basically, I found two issues:
    > > 1) libhcoll.so (the vendor library provided by Mellanox; I tested hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own coll ml, since it exports some *public* symbols that are common to this module (ml_open, ml_coll_hier_barrier_setup, ...)
    > > 2) coll ml priority is zero, and even though the library is dlclose'd, the dlclose seems to be ineffective (nothing changed in /proc/xxx/maps before and after dlclose)
    > >
    > >
    > > There are two workarounds:
    > > mpirun --mca coll ^ml
    > > or
    > > mpirun --mca coll ^hcoll ... (probably not what is wanted, though)
    > >
    > > Is it expected that the library is not unloaded after dlclose?
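    > >
    > > As a self-contained illustration of that check (a sketch only: ./libfoo.so is just a placeholder library, and _ctypes.dlclose is a CPython-specific way to reach dlclose from Python):
    > >
    > > import ctypes
    > > import _ctypes
    > >
    > > def mapped(name):
    > >     # Is a mapping whose path contains `name` present in our address space?
    > >     with open("/proc/self/maps") as maps:
    > >         return any(name in line for line in maps)
    > >
    > > lib = ctypes.CDLL("./libfoo.so")              # dlopen
    > > print("after dlopen : %s" % mapped("libfoo.so"))
    > > _ctypes.dlclose(lib._handle)                  # dlclose the same handle
    > > print("after dlclose: %s" % mapped("libfoo.so"))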
    > >
    > > Mellanox folks,
    > > can you please double-check how libhcoll is built?
    > > I guess it would work if the ml_ symbols were private to the library.
    > > If not, the only workaround is to mpirun --mca coll ^ml; otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is really system dependent).
    > >
    > > Cheers,
    > >
    > > Gilles
    > > On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
    > >> Daniel,
    > >>
    > >> thanks for the logs.
    > >>
    > >> Another workaround is to
    > >> mpirun --mca coll ^hcoll ...
    > >>
    > >> I was able to reproduce the issue, and it surprisingly occurs only if the coll_ml module is loaded *before* the hcoll module.
    > >> /* This is not the case on my system, so I had to hack my mca_base_component_path in order to reproduce the issue. */
    > >>
    > >> As far as I understand, libhcoll is proprietary software, so I cannot dig into it.
    > >> That being said, I noticed libhcoll defines some symbols (such as ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so it is likely that hcoll's coll_ml and Open MPI's coll_ml are not binary compatible, hence the error.
    > >>
    > >> I will dig a bit more and see whether this is even supposed to happen (since coll_ml_priority is zero, why is the module still loaded?).
    > >>
    > >> As far as I am concerned, you *have to* mpirun --mca coll ^ml, or update your user/system-wide config file to blacklist the coll_ml module, to ensure this works.
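    > >> For example, I believe adding the line
    > >>     coll = ^ml
    > >> to $HOME/.openmpi/mca-params.conf (or to the system-wide openmpi-mca-params.conf) blacklists the module for every run.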
    > >>
    > >> Mike and Mellanox folks, could you please comment on that?
    > >>
    > >> Cheers,
    > >>
    > >> Gilles
    > >>
    > >>
    > >>
    > >> On 6/24/2015 5:23 PM, Daniel Letai wrote:
    > >>> Gilles,
    > >>>
    > >>> Attached are the two output logs.
    > >>>
    > >>> Thanks,
    > >>> Daniel
    > >>>
    > >>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
    > >>>> Daniel,
    > >>>>
    > >>>> I double-checked this and I cannot make any sense of these logs.
    > >>>>
    > >>>> If coll_ml_priority is zero, then I do not see any way ml_coll_hier_barrier_setup can be invoked.
    > >>>>
    > >>>> Could you please run again with --mca coll_base_verbose 100,
    > >>>> with and without --mca coll ^ml?
    > >>>>
    > >>>> Cheers,
    > >>>>
    > >>>> Gilles
    > >>>>
    > >>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
    > >>>>> Daniel,
    > >>>>>
    > >>>>> OK, thanks.
    > >>>>>
    > >>>>> It seems that even if the priority is zero, some code gets executed.
    > >>>>> I will confirm this tomorrow and send you a patch to work around the issue if my guess proves right.
    > >>>>>
    > >>>>> Cheers,
    > >>>>>
    > >>>>> Gilles
    > >>>>>
    > >>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
    > >>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
    > >>>>>
    > >>>>> Not sure how to read this, but for any n>1, mpirun only works with --mca coll ^ml.
    > >>>>>
    > >>>>> Thanks for helping
    > >>>>>
    > >>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
    > >>>>>> This is really odd...
    > >>>>>>
    > >>>>>> You can run
    > >>>>>> ompi_info --all
    > >>>>>> and search for coll_ml_priority.
    > >>>>>>
    > >>>>>> It will display the current value and its origin
    > >>>>>> (e.g. default, system-wide config, user config, command line, environment variable).
    > >>>>>>
    > >>>>>> Cheers,
    > >>>>>>
    > >>>>>> Gilles
    > >>>>>>
    > >>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
    > >>>>>> No, that's the issue.
    > >>>>>> I had to disable it to get things working.
    > >>>>>>
    > >>>>>> That's why I included my config settings - I couldn't figure out which option enabled it, so I could remove it from the configuration...
    > >>>>>>
    > >>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
    > >>>>>>> Daniel,
    > >>>>>>>
    > >>>>>>> The ML module is not ready for production and is disabled by default.
    > >>>>>>>
    > >>>>>>> Did you explicitly enable this module?
    > >>>>>>> If yes, I encourage you to disable it.
    > >>>>>>>
    > >>>>>>> Cheers,
    > >>>>>>>
    > >>>>>>> Gilles
    > >>>>>>>
    > >>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
    > >>>>>>> Given a simple hello.c:
    > >>>>>>>
    > >>>>>>> #include <stdio.h>
    > >>>>>>> #include <mpi.h>
    > >>>>>>>
    > >>>>>>> int main(int argc, char* argv[])
    > >>>>>>> {
    > >>>>>>>     int size, rank, len;
    > >>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
    > >>>>>>>
    > >>>>>>>     MPI_Init(&argc, &argv);
    > >>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
    > >>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    > >>>>>>>     MPI_Get_processor_name(name, &len);
    > >>>>>>>
    > >>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
    > >>>>>>>
    > >>>>>>>     MPI_Finalize();
    > >>>>>>>     return 0;
    > >>>>>>> }
    > >>>>>>>
    > >>>>>>> For n=1,
    > >>>>>>> mpirun -n 1 ./hello
    > >>>>>>> works correctly.
    > >>>>>>>
    > >>>>>>> For n>1 it segfaults with signal 11.
    > >>>>>>> I used gdb to trace the problem to the ml coll module:
    > >>>>>>>
    > >>>>>>> Program received signal SIGSEGV, Segmentation fault.
    > >>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
    > >>>>>>>     from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
    > >>>>>>>
    > >>>>>>> Running with
    > >>>>>>> mpirun -n 2 --mca coll ^ml ./hello
    > >>>>>>> works correctly.
    > >>>>>>>
    > >>>>>>> I am using Mellanox OFED 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
    > >>>>>>> Open MPI 1.8.5 was built with the following options:
    > >>>>>>> rpmbuild --rebuild --define 'configure_options
    > >>>>>>> --with-verbs=/usr --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran
    > >>>>>>> CFLAGS="-g -O3" --enable-mpirun-prefix-by-default
    > >>>>>>> --enable-orterun-prefix-by-default --disable-debug
    > >>>>>>> --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized
    > >>>>>>> --without-mpi-param-check
    > >>>>>>> --with-contrib-vt-flags=--disable-iotrace
    > >>>>>>> --enable-builtin-atomics --enable-cxx-exceptions
    > >>>>>>> --enable-sparse-groups --enable-mpi-thread-multiple
    > >>>>>>> --enable-memchecker --enable-btl-openib-failover
    > >>>>>>> --with-hwloc=internal --with-verbs --with-x --with-slurm
    > >>>>>>> --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca
    > >>>>>>> --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll'
    > >>>>>>> openmpi-1.8.5-1.src.rpm
    > >>>>>>>
    > >>>>>>> gcc version 5.1.1
    > >>>>>>>
    > >>>>>>> Thanks in advance
    >
    >
    > --
    > Jeff Squyres
    > jsquy...@cisco.com
    > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
    >


    --
    Jeff Squyres
    jsquy...@cisco.com
    For corporate legal information go to:
    http://www.cisco.com/web/about/doing_business/legal/cri/





--
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: http://www.open-mpi.org/community/lists/devel/2015/06/17539.php
