Jeff, this is exactly what happens.
I will send a stack trace later.

Cheers,

Gilles

On Thursday, June 25, 2015, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Gilles --
>
> Can you send a stack trace from one of these crashes?
>
> I am *guessing* that the following is happening:
>
> 1. coll selection begins
> 2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
> 3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll calls a coll_ml_* symbol (which is apparently in a different .o file in the library), but the linker has already resolved that coll_ml_* symbol in the coll ml DSO. So the execution transfers back up into the coll ml DSO, and ... kaboom.
>
> A simple stack trace will confirm this -- it should show execution going down into libhcoll and then back up into coll ml.
>
> On Jun 25, 2015, at 1:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> > Folks,
> >
> > this is a followup on an issue reported by Daniel on the users mailing list:
> > OpenMPI is built with hcoll from Mellanox.
> > the coll ml module has default priority zero.
> >
> > on my cluster, it works just fine
> > on Daniel's cluster, it crashes.
> >
> > i was able to reproduce the crash by tweaking mca_base_component_path and ensuring the coll ml module is loaded first.
> >
> > basically, i found two issues:
> > 1) libhcoll.so (vendor lib provided by Mellanox, i tested hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own coll ml, since there are some *public* symbols that are common to this module (ml_open, ml_coll_hier_barrier_setup, ...)
> > 2) coll ml priority is zero, and even if the library is dlclose'd, it seems this is ineffective (nothing changed in /proc/xxx/maps before and after dlclose)
> >
> > there are two workarounds:
> > mpirun --mca coll ^ml
> > or
> > mpirun --mca coll ^hcoll ... (probably not what is needed though ...)
> >
> > is it expected that the library is not unloaded after dlclose?
> >
> > Mellanox folks,
> > can you please double check how libhcoll is built?
> > i guess it would work if the ml_ symbols were private to the library.
> > if not, the only workaround is to mpirun --mca coll ^ml
> > otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is really system dependent)
> >
> > Cheers,
> >
> > Gilles
> >
> > On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
> >> Daniel,
> >>
> >> thanks for the logs.
> >>
> >> another workaround is to
> >> mpirun --mca coll ^hcoll ...
> >>
> >> i was able to reproduce the issue, and it surprisingly occurs only if the coll_ml module is loaded *before* the hcoll module.
> >> /* this is not the case on my system, so i had to hack my mca_base_component_path in order to reproduce the issue */
> >>
> >> as far as i understand, libhcoll is proprietary software, so i cannot dig into it.
> >> that being said, i noticed libhcoll defines some symbols (such as ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so it is likely hcoll coll_ml and openmpi coll_ml are not binary compatible, hence the error.
> >>
> >> i will dig a bit more and see if this is even supposed to happen (since coll_ml_priority is zero, why is the module still loaded?)
> >>
> >> as far as i am concerned, you *have to* mpirun --mca coll ^ml or update your user/system wide config file to blacklist the coll_ml module to ensure this is working.
> >>
> >> Mike and Mellanox folks, could you please comment on that?
> >>
> >> Cheers,
> >>
> >> Gilles
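To make Jeff's scenario concrete, below is a minimal, self-contained sketch of ELF symbol preemption between two shared objects that export the same function. The file names and helper functions are made up for illustration, and whether Open MPI actually opens its components with RTLD_GLOBAL is an assumption here, not something established in this thread -- but the mechanism is the same: the definition that enters the symbol search scope first wins, even for calls made from inside the other library.

/* preempt_demo.c -- toy illustration only, NOT Open MPI or hcoll code.
 *
 * Build the two stand-in libraries first (each exports the same symbol):
 *
 *   // fake_coll_ml.c  (stands in for mca_coll_ml.so)
 *   #include <stdio.h>
 *   void ml_coll_hier_barrier_setup(void) { puts("open-mpi coll ml version"); }
 *
 *   // fake_hcoll.c    (stands in for libhcoll.so, which also exports ml_* symbols)
 *   #include <stdio.h>
 *   void ml_coll_hier_barrier_setup(void) { puts("hcoll internal version"); }
 *   void fake_hcoll_barrier(void)         { ml_coll_hier_barrier_setup(); }
 *
 *   gcc -shared -fPIC fake_coll_ml.c -o fake_coll_ml.so
 *   gcc -shared -fPIC fake_hcoll.c   -o fake_hcoll.so
 *   gcc preempt_demo.c -ldl -o preempt_demo
 */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the "coll ml" stand-in first and put its symbols in the global
     * scope, mimicking the case where the coll ml DSO is opened before hcoll. */
    void *ml = dlopen("./fake_coll_ml.so", RTLD_NOW | RTLD_GLOBAL);
    if (ml == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    void *hc = dlopen("./fake_hcoll.so", RTLD_NOW | RTLD_GLOBAL);
    if (hc == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* fake_hcoll_barrier() lives in fake_hcoll.so, but because
     * ml_coll_hier_barrier_setup() is an exported, default-visibility symbol
     * in *both* libraries, the definition loaded first preempts the other:
     * this prints "open-mpi coll ml version", i.e. execution transfers back
     * up into the other library, the pattern a stack trace should show. */
    void (*barrier)(void) = (void (*)(void)) dlsym(hc, "fake_hcoll_barrier");
    if (barrier == NULL) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    barrier();

    dlclose(hc);
    dlclose(ml);
    return 0;
}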
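Gilles' suggested fix -- making the ml_ symbols private to the library -- corresponds in generic ELF terms to giving those symbols hidden visibility, so intra-library calls bind locally and cannot be preempted by a same-named export in another DSO. A minimal sketch, again with hypothetical file and function names:

/* fake_hcoll_fixed.c -- the same toy "hcoll" stand-in, but with the internal
 * ml_* symbol hidden so it is private to this shared object.  The whole
 * library could equally be compiled with -fvisibility=hidden, or linked with
 * a version script or -Wl,-Bsymbolic.
 *
 *   gcc -shared -fPIC fake_hcoll_fixed.c -o fake_hcoll_fixed.so
 */
#include <stdio.h>

__attribute__((visibility("hidden")))
void ml_coll_hier_barrier_setup(void)
{
    puts("hcoll internal version");
}

void fake_hcoll_barrier(void)
{
    /* The reference now binds inside this DSO, so a coll ml module loaded
     * earlier can no longer hijack the call. */
    ml_coll_hier_barrier_setup();
}

After this change, nm -D fake_hcoll_fixed.so no longer lists ml_coll_hier_barrier_setup among the exported (dynamic) symbols, which is also a quick way to check how a vendor library was built.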
> >> On 6/24/2015 5:23 PM, Daniel Letai wrote:
> >>> Gilles,
> >>>
> >>> Attached the two output logs.
> >>>
> >>> Thanks,
> >>> Daniel
> >>>
> >>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
> >>>> Daniel,
> >>>>
> >>>> i double checked this and i cannot make any sense of these logs.
> >>>>
> >>>> if coll_ml_priority is zero, then i do not see any way ml_coll_hier_barrier_setup can be invoked.
> >>>>
> >>>> could you please run again with --mca coll_base_verbose 100
> >>>> with and without --mca coll ^ml
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
> >>>>> Daniel,
> >>>>>
> >>>>> ok, thanks
> >>>>>
> >>>>> it seems that even if priority is zero, some code gets executed
> >>>>> I will confirm this tomorrow and send you a patch to work around the issue if my guess is proven right
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Gilles
> >>>>>
> >>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
> >>>>>
> >>>>> Not sure how to read this, but for any n>1 mpirun only works with --mca coll ^ml
> >>>>>
> >>>>> Thanks for helping
> >>>>>
> >>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
> >>>>>> This is really odd...
> >>>>>>
> >>>>>> you can run
> >>>>>> ompi_info --all
> >>>>>> and search coll_ml_priority
> >>>>>>
> >>>>>> it will display the current value and the origin
> >>>>>> (e.g. default, system wide config, user config, cli, environment variable)
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>> No, that's the issue.
> >>>>>> I had to disable it to get things working.
> >>>>>>
> >>>>>> That's why I included my config settings - I couldn't figure out which option enabled it, so I could remove it from the configuration...
> >>>>>>
> >>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
> >>>>>>> Daniel,
> >>>>>>>
> >>>>>>> ML module is not ready for production and is disabled by default.
> >>>>>>>
> >>>>>>> Did you explicitly enable this module?
> >>>>>>> If yes, I encourage you to disable it
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>> given a simple hello.c:
> >>>>>>>
> >>>>>>> #include <stdio.h>
> >>>>>>> #include <mpi.h>
> >>>>>>>
> >>>>>>> int main(int argc, char* argv[])
> >>>>>>> {
> >>>>>>>     int size, rank, len;
> >>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
> >>>>>>>
> >>>>>>>     MPI_Init(&argc, &argv);
> >>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>>>     MPI_Get_processor_name(name, &len);
> >>>>>>>
> >>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
> >>>>>>>
> >>>>>>>     MPI_Finalize();
> >>>>>>> }
> >>>>>>>
> >>>>>>> for n=1
> >>>>>>> mpirun -n 1 ./hello
> >>>>>>> it works correctly.
> >>>>>>>
> >>>>>>> for n>1 it segfaults with signal 11
> >>>>>>> used gdb to trace the problem to ml coll:
> >>>>>>>
> >>>>>>> Program received signal SIGSEGV, Segmentation fault.
> >>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
> >>>>>>>    from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
> >>>>>>>
> >>>>>>> running with
> >>>>>>> mpirun -n 2 --mca coll ^ml ./hello
> >>>>>>> works correctly
> >>>>>>>
> >>>>>>> using mellanox ofed 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
> >>>>>>> openmpi 1.8.5 was built with the following options:
> >>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3" --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default --disable-debug --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace --enable-builtin-atomics --enable-cxx-exceptions --enable-sparse-groups --enable-mpi-thread-multiple --enable-memchecker --enable-btl-openib-failover --with-hwloc=internal --with-verbs --with-x --with-slurm --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
> >>>>>>>
> >>>>>>> gcc version 5.1.1
> >>>>>>>
> >>>>>>> Thanks in advance
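For completeness: the --mca coll ^ml workaround can be made persistent through Open MPI's MCA parameter file instead of being passed on every command line. A minimal sketch, assuming the default per-user location:

# $HOME/.openmpi/mca-params.conf
# Exclude the coll ml component from collective selection,
# equivalent to passing "--mca coll ^ml" to mpirun.
coll = ^ml

The system-wide equivalent lives in <prefix>/etc/openmpi-mca-params.conf, which is presumably the "user/system wide config file" Gilles refers to above.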
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/06/17533.php