I can see cloning an existing component's source as a starting point for a new one being a common occurrence (at least relative to creating new components from scratch).
So this is probably not the last time this will happen.
Would a build with --disable-dlopen have detected this problem (by
failing to build libmpi due to multiply defined symbols)?
If so, then maybe Jenkins should apply this test (which would NOT
depend on the dlopen load order).
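As a minimal illustration of the failure mode such a build should expose (hypothetical file names; the symbol name mirrors the clash reported below), consider two translation units that both export the same public symbol:

    /* dup1.c -- stands in for the coll ml component's copy of the symbol */
    int ml_open(void) { return 1; }

    /* dup2.c -- stands in for libhcoll's copy of the same public symbol */
    int ml_open(void) { return 2; }

    /* main.c -- any caller that forces both objects into the same link */
    int ml_open(void);
    int main(void) { return ml_open(); }

Built as two separate DSOs and dlopen'ed, the duplicate only bites at run time and only for one load order; linked together into a single image (cc main.o dup1.o dup2.o), the link stops immediately with a "multiple definition of `ml_open'" error, independent of any load order.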
-Paul
On Thu, Jun 25, 2015 at 3:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
Devendar literally just reproduced this here at the developer meeting, too.
Sweet -- ok, so we understand what is going on.
Devendar/Mellanox is going to talk about this internally and get back to us.
> On Jun 25, 2015, at 2:59 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Jeff,
>
> This is exactly what happens.
>
> I will send a stack trace later.
>
> Cheers,
>
> Gilles
>
> On Thursday, June 25, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Gilles --
>
> Can you send a stack trace from one of these crashes?
>
> I am *guessing* that the following is happening:
>
> 1. coll selection begins
> 2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
> 3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll calls a coll_ml_* symbol (which is apparently in a different .o file in the library), but the runtime linker has already resolved that coll_ml_* symbol to the coll ml DSO. So execution transfers back up into the coll ml DSO, and ... kaboom.
>
> A simple stack trace will confirm this -- it should show execution going down into libhcoll and then back up into coll ml.
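A minimal, self-contained sketch of that interposition (hypothetical liba.so / libb.so stand-ins, not Open MPI or Mellanox code), assuming the first DSO is loaded with RTLD_GLOBAL so its symbols enter the global lookup scope:

    /* liba.c -- stands in for the coll ml DSO; exports the clashing symbol.
     * Build: cc -shared -fPIC -o liba.so liba.c */
    #include <stdio.h>
    void ml_open(void) { printf("liba (coll ml stand-in): ml_open\n"); }

    /* libb.c -- stands in for libhcoll; defines its *own* ml_open and calls
     * it from another function, i.e. a cross-object call through the PLT.
     * Build: cc -shared -fPIC -o libb.so libb.c */
    #include <stdio.h>
    void ml_open(void) { printf("libb (hcoll stand-in): ml_open\n"); }
    void hcoll_init(void) { ml_open(); }

    /* demo.c -- loads the "coll ml" stand-in first, then the "hcoll" one.
     * Build: cc -o demo demo.c -ldl */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *a = dlopen("./liba.so", RTLD_NOW | RTLD_GLOBAL);
        void *b = dlopen("./libb.so", RTLD_NOW | RTLD_GLOBAL);
        if (!a || !b) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        void (*init)(void) = (void (*)(void)) dlsym(b, "hcoll_init");
        if (!init) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        /* libb's internal call to ml_open resolves to liba's definition,
         * because liba is already in the global scope: this prints the
         * "liba" line, i.e. execution goes down into the second library
         * and back up into the first, as described above. */
        init();

        dlclose(b);
        dlclose(a);
        return 0;
    }

Reversing the two dlopen calls makes init() print the libb line instead, which matches the observation that the crash only appears when coll ml happens to be loaded before hcoll.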
>
>
>
>
> > On Jun 25, 2015, at 1:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> > Folks,
> >
> > This is a follow-up on an issue reported by Daniel on the users mailing list:
> > Open MPI is built with hcoll from Mellanox.
> > The coll ml module has default priority zero.
> >
> > On my cluster, it works just fine.
> > On Daniel's cluster, it crashes.
> >
> > I was able to reproduce the crash by tweaking mca_base_component_path to ensure the coll ml module is loaded first.
> >
> > Basically, I found two issues:
> > 1) libhcoll.so (the vendor library provided by Mellanox; I tested hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own coll ml, since some *public* symbols are common to this module (ml_open, ml_coll_hier_barrier_setup, ...)
> > 2) coll ml priority is zero, and even if the library is dlclose'd, that seems to be ineffective (nothing changed in /proc/xxx/maps before and after dlclose)
> >
> >
> > There are two workarounds:
> > mpirun --mca coll ^ml
> > or
> > mpirun --mca coll ^hcoll ... (probably not what is needed though ...)
> >
> > Is it expected that the library is not unloaded after dlclose?
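One way to check that from code is a small sketch like the following (hypothetical ./liba.so path; RTLD_NOLOAD is a glibc extension that reports whether a library is still mapped without loading it). Note that dlclose only unmaps an object once its reference count drops to zero and nothing else in the process still depends on it, so a library staying resident is not necessarily a bug:

    #define _GNU_SOURCE   /* for RTLD_NOLOAD */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *h = dlopen("./liba.so", RTLD_NOW | RTLD_GLOBAL);
        if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
        dlclose(h);

        /* RTLD_NOLOAD: do not load the library, just return a handle if it
         * is still resident in the process; NULL means it was unmapped. */
        void *still = dlopen("./liba.so", RTLD_NOLOAD | RTLD_NOW);
        printf("still loaded after dlclose: %s\n", still ? "yes" : "no");
        if (still) dlclose(still);
        return 0;
    }

Comparing /proc/<pid>/maps before and after the dlclose, as described above, is the equivalent check from outside the process.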
> >
> > Mellanox folks,
> > can you please double-check how libhcoll is built?
> > I guess it would work if the ml_* symbols were private to the library.
> > If not, the only workaround is to mpirun --mca coll ^ml; otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is really system dependent).
> >
> > Cheers,
> >
> > Gilles
> > On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
> >> Daniel,
> >>
> >> thanks for the logs.
> >>
> >> Another workaround is to
> >> mpirun --mca coll ^hcoll ...
> >>
> >> I was able to reproduce the issue, and it surprisingly occurs only if the coll_ml module is loaded *before* the hcoll module.
> >> /* This is not the case on my system, so I had to hack my mca_base_component_path in order to reproduce the issue. */
> >>
> >> As far as I understand, libhcoll is proprietary software, so I cannot dig into it.
> >> That being said, I noticed libhcoll defines some symbols (such as ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so it is likely that hcoll's coll_ml and Open MPI's coll_ml are not binary compatible, hence the error.
> >>
> >> I will dig a bit more and see if this is even supposed to happen (since coll_ml_priority is zero, why is the module still loaded?).
> >>
> >> As far as I am concerned, you *have to* mpirun --mca coll ^ml, or update your user/system-wide config file to blacklist the coll_ml module, to ensure this works.
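For the config-file route, a minimal sketch of the per-user blacklist, assuming the default per-user location $HOME/.openmpi/mca-params.conf (the system-wide equivalent is <prefix>/etc/openmpi-mca-params.conf):

    # $HOME/.openmpi/mca-params.conf
    # exclude the coll ml component so it is never opened
    coll = ^ml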
> >>
> >> Mike and Mellanox folks, could you please comment on that?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >>
> >>
> >> On 6/24/2015 5:23 PM, Daniel Letai wrote:
> >>> Gilles,
> >>>
> >>> Attached the two output logs.
> >>>
> >>> Thanks,
> >>> Daniel
> >>>
> >>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
> >>>> Daniel,
> >>>>
> >>>> I double-checked this and I cannot make any sense of these logs.
> >>>>
> >>>> If coll_ml_priority is zero, then I do not see any way that ml_coll_hier_barrier_setup can be invoked.
> >>>>
> >>>> Could you please run again with --mca coll_base_verbose 100,
> >>>> with and without --mca coll ^ml?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
> >>>>> Daniel,
> >>>>>
> >>>>> OK, thanks.
> >>>>>
> >>>>> It seems that even if the priority is zero, some code gets executed.
> >>>>> I will confirm this tomorrow and send you a patch to work around the issue if my guess is proven right.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Gilles
> >>>>>
> >>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
> >>>>>
> >>>>> Not sure how to read this, but for any n>1, mpirun only works with --mca coll ^ml.
> >>>>>
> >>>>> Thanks for helping
> >>>>>
> >>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
> >>>>>> This is really odd...
> >>>>>>
> >>>>>> You can run
> >>>>>> ompi_info --all
> >>>>>> and search for coll_ml_priority.
> >>>>>>
> >>>>>> It will display the current value and its origin
> >>>>>> (e.g. default, system-wide config, user config, command line, environment variable).
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>> No, that's the issue.
> >>>>>> I had to disable it to get things working.
> >>>>>>
> >>>>>> That's why I included my config settings - I couldn't figure out which option enabled it, so I could remove it from the configuration...
> >>>>>>
> >>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
> >>>>>>> Daniel,
> >>>>>>>
> >>>>>>> The ML module is not ready for production and is disabled by default.
> >>>>>>>
> >>>>>>> Did you explicitly enable this module?
> >>>>>>> If yes, I encourage you to disable it.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>> Given a simple hello.c:
> >>>>>>>
> >>>>>>> #include <stdio.h>
> >>>>>>> #include <mpi.h>
> >>>>>>>
> >>>>>>> int main(int argc, char* argv[])
> >>>>>>> {
> >>>>>>>     int size, rank, len;
> >>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
> >>>>>>>
> >>>>>>>     MPI_Init(&argc, &argv);
> >>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>>>     MPI_Get_processor_name(name, &len);
> >>>>>>>
> >>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
> >>>>>>>
> >>>>>>>     MPI_Finalize();
> >>>>>>>     return 0;
> >>>>>>> }
> >>>>>>>
> >>>>>>> For n=1:
> >>>>>>>     mpirun -n 1 ./hello
> >>>>>>> it works correctly.
> >>>>>>>
> >>>>>>> For n>1 it segfaults with signal 11.
> >>>>>>> I used gdb to trace the problem to the ml coll component:
> >>>>>>>
> >>>>>>> Program received signal SIGSEGV, Segmentation fault.
> >>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
> >>>>>>> from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
> >>>>>>>
> >>>>>>> Running with
> >>>>>>>     mpirun -n 2 --mca coll ^ml ./hello
> >>>>>>> works correctly.
> >>>>>>>
> >>>>>>> Using Mellanox OFED 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
> >>>>>>> Open MPI 1.8.5 was built with the following options:
> >>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3" --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default --disable-debug --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace --enable-builtin-atomics --enable-cxx-exceptions --enable-sparse-groups --enable-mpi-thread-multiple --enable-memchecker --enable-btl-openib-failover --with-hwloc=internal --with-verbs --with-x --with-slurm --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
> >>>>>>>
> >>>>>>> gcc version 5.1.1
> >>>>>>>
> >>>>>>> Thanks in advance
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
--
Paul H. Hargrove phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900