That appears to be correct.
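
For anyone who wants to double check this on their own install, listing the dynamic symbol tables of the two libraries should show the overlapping public ml_* entry points Gilles mentions below. The paths here are only examples taken from this thread (the --with-hcoll=/opt/mellanox/hcoll configure option and the gdb output); the lib subdirectory under the hcoll prefix is a guess on my part, so adjust both paths to your install:

nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' T ml_'
nm -D <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so | grep ' T ml_'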
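
If the fix ends up being on the libhcoll side, the usual way to keep those symbols out of a shared library's dynamic symbol table is a linker version script (or compiling with -fvisibility=hidden and explicitly exporting the public API). This is only a rough sketch, and it assumes hcoll's public entry points all share the hcoll_ prefix, which I have not verified against the actual library:

/* hypothetical libhcoll.map: export only the public hcoll_* API and keep
   everything else, including the internal ml_* functions, local to the library */
{
    global: hcoll_*;
    local:  *;
};

passed at link time with something like -Wl,--version-script=libhcoll.map.
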
On Thu, Jun 25, 2015 at 9:51 AM, Shamis, Pavel <sham...@ornl.gov> wrote:
> As I read this thread, this issue is not related to the ML bootstrap itself,
> but to the naming conflict between public functions in HCOLL and ML.
>
> Did I get it right?
>
> If this is the case, we can work with the Mellanox folks to resolve this conflict.
>
> Best,
>
> Pavel (Pasha) Shamis
> ---
> Computer Science Research Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
> On Jun 25, 2015, at 10:34 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> > Gilles --
> >
> > Can you send a stack trace from one of these crashes?
> >
> > I am *guessing* that the following is happening:
> >
> > 1. coll selection begins
> > 2. coll ml is queried, and disqualifies itself (but is not dlclosed yet)
> > 3. coll hcoll is queried, which ends up calling down into libhcoll. libhcoll calls a coll_ml_* symbol (which is apparently in a different .o file in the library), but the linker has already resolved that coll_ml_* symbol in the coll ml DSO. So execution transfers back up into the coll ml DSO, and ... kaboom.
> >
> > A simple stack trace will confirm this -- it should show execution going down into libhcoll and then back up into coll ml.
> >
> >
> >> On Jun 25, 2015, at 1:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >>
> >> Folks,
> >>
> >> This is a follow-up on an issue reported by Daniel on the users mailing list:
> >> OpenMPI is built with hcoll from Mellanox, and the coll ml module has default priority zero.
> >>
> >> On my cluster it works just fine; on Daniel's cluster it crashes.
> >>
> >> I was able to reproduce the crash by tweaking mca_base_component_path to ensure the coll ml module is loaded first.
> >>
> >> Basically, I found two issues:
> >> 1) libhcoll.so (the vendor lib provided by Mellanox; I tested hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own coll ml, since there are some *public* symbols in common with this module (ml_open, ml_coll_hier_barrier_setup, ...)
> >> 2) coll ml priority is zero, and even if the library is dlclose'd, that seems to be ineffective (nothing changed in /proc/xxx/maps before and after dlclose)
> >>
> >> There are two workarounds:
> >> mpirun --mca coll ^ml
> >> or
> >> mpirun --mca coll ^hcoll ... (probably not what is needed, though ...)
> >>
> >> Is it expected that the library is not unloaded after dlclose?
> >>
> >> Mellanox folks,
> >> can you please double check how libhcoll is built?
> >> I guess it would work if the ml_ symbols were private to the library.
> >> If not, the only workaround is to mpirun --mca coll ^ml;
> >> otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is really system dependent).
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
> >>> Daniel,
> >>>
> >>> Thanks for the logs.
> >>>
> >>> Another workaround is to
> >>> mpirun --mca coll ^hcoll ...
> >>>
> >>> I was able to reproduce the issue, and it surprisingly occurs only if the coll_ml module is loaded *before* the hcoll module.
> >>> /* this is not the case on my system, so I had to hack my mca_base_component_path in order to reproduce the issue */
> >>>
> >>> As far as I understand, libhcoll is proprietary software, so I cannot dig into it.
> >>> That being said, I noticed libhcoll defines some symbols (such as ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so it is likely that hcoll's coll_ml and OpenMPI's coll_ml are not binary compatible, hence the error.
> >>>
> >>> I will dig a bit more and see whether this is even supposed to happen (since coll_ml_priority is zero, why is the module still loaded?).
> >>>
> >>> As far as I am concerned, you *have to* mpirun --mca coll ^ml, or update your user/system wide config file to blacklist the coll_ml module, to ensure this is working.
> >>>
> >>> Mike and Mellanox folks, could you please comment on that?
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 6/24/2015 5:23 PM, Daniel Letai wrote:
> >>>> Gilles,
> >>>>
> >>>> Attached are the two output logs.
> >>>>
> >>>> Thanks,
> >>>> Daniel
> >>>>
> >>>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
> >>>>> Daniel,
> >>>>>
> >>>>> I double checked this and I cannot make any sense of these logs.
> >>>>>
> >>>>> If coll_ml_priority is zero, then I do not see any way ml_coll_hier_barrier_setup could be invoked.
> >>>>>
> >>>>> Could you please run again with --mca coll_base_verbose 100, with and without --mca coll ^ml?
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Gilles
> >>>>>
> >>>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
> >>>>>> Daniel,
> >>>>>>
> >>>>>> ok, thanks
> >>>>>>
> >>>>>> It seems that even if the priority is zero, some code gets executed.
> >>>>>> I will confirm this tomorrow and send you a patch to work around the issue if my guess is proven right.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
> >>>>>>
> >>>>>> Not sure how to read this, but for any n>1 mpirun only works with --mca coll ^ml.
> >>>>>>
> >>>>>> Thanks for helping
> >>>>>>
> >>>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
> >>>>>>> This is really odd...
> >>>>>>>
> >>>>>>> You can run
> >>>>>>> ompi_info --all
> >>>>>>> and search for coll_ml_priority.
> >>>>>>>
> >>>>>>> It will display the current value and its origin
> >>>>>>> (e.g. default, system wide config, user config, cli, environment variable).
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>> No, that's the issue.
> >>>>>>> I had to disable it to get things working.
> >>>>>>>
> >>>>>>> That's why I included my config settings - I couldn't figure out which option enabled it, so I could remove it from the configuration...
> >>>>>>>
> >>>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
> >>>>>>>> Daniel,
> >>>>>>>>
> >>>>>>>> The ML module is not ready for production and is disabled by default.
> >>>>>>>>
> >>>>>>>> Did you explicitly enable this module?
> >>>>>>>> If yes, I encourage you to disable it.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> Gilles
> >>>>>>>>
> >>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>>> Given a simple hello.c:
> >>>>>>>>
> >>>>>>>> #include <stdio.h>
> >>>>>>>> #include <mpi.h>
> >>>>>>>>
> >>>>>>>> int main(int argc, char* argv[])
> >>>>>>>> {
> >>>>>>>>     int size, rank, len;
> >>>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
> >>>>>>>>
> >>>>>>>>     MPI_Init(&argc, &argv);
> >>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>>>>     MPI_Get_processor_name(name, &len);
> >>>>>>>>
> >>>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
> >>>>>>>>
> >>>>>>>>     MPI_Finalize();
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> For n=1,
> >>>>>>>> mpirun -n 1 ./hello
> >>>>>>>> works correctly.
> >>>>>>>>
> >>>>>>>> For n>1 it segfaults with signal 11.
> >>>>>>>> I used gdb to trace the problem to the coll ml module:
> >>>>>>>>
> >>>>>>>> Program received signal SIGSEGV, Segmentation fault.
> >>>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
> >>>>>>>>    from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
> >>>>>>>>
> >>>>>>>> Running with
> >>>>>>>> mpirun -n 2 --mca coll ^ml ./hello
> >>>>>>>> works correctly.
> >>>>>>>>
> >>>>>>>> I am using Mellanox OFED 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
> >>>>>>>> openmpi 1.8.5 was built with the following options:
> >>>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3" --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default --disable-debug --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace --enable-builtin-atomics --enable-cxx-exceptions --enable-sparse-groups --enable-mpi-thread-multiple --enable-memchecker --enable-btl-openib-failover --with-hwloc=internal --with-verbs --with-x --with-slurm --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
> >>>>>>>>
> >>>>>>>> gcc version 5.1.1
> >>>>>>>>
> >>>>>>>> Thanks in advance
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
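
P.S. For anyone hitting this in the meantime, the "blacklist the coll_ml module in your config file" workaround Gilles mentions above amounts to adding the line below either to the per-user file $HOME/.openmpi/mca-params.conf or to the system-wide etc/openmpi-mca-params.conf under the install prefix (it is equivalent to passing --mca coll ^ml on every mpirun):

coll = ^ml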