Should we .ompi_ignore ml?
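If we do, it should just be a matter of dropping the sentinel file into the component directory and re-running autogen; a rough sketch, assuming a checkout of the source tree and your usual configure options:

    # tell autogen to skip the coll/ml component entirely
    touch ompi/mca/coll/ml/.ompi_ignore
    # regenerate the build system and rebuild
    ./autogen.pl && ./configure <usual options> && make install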
> On Jun 25, 2015, at 4:41 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
> Thanks, Gilles.
>
> We are addressing this.
>
> Josh
>
> Sent from my iPhone
>
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>> Folks,
>>
>> this is a followup on an issue reported by Daniel on the users mailing list:
>> OpenMPI is built with hcoll from Mellanox.
>> the coll ml module has default priority zero.
>>
>> on my cluster, it works just fine.
>> on Daniel's cluster, it crashes.
>>
>> i was able to reproduce the crash by tweaking mca_base_component_path and ensuring
>> the coll ml module is loaded first.
>>
>> basically, i found two issues:
>> 1) libhcoll.so (vendor lib provided by Mellanox, i tested
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own
>> coll ml, since there are some *public* symbols that are common to this
>> module (ml_open, ml_coll_hier_barrier_setup, ...)
>> 2) coll ml priority is zero, and even if the library is dlclose'd, it seems
>> this is ineffective
>> (nothing changed in /proc/xxx/maps before and after dlclose)
>>
>> there are two workarounds:
>> mpirun --mca coll ^ml
>> or
>> mpirun --mca coll ^hcoll ... (probably not what is needed though ...)
>>
>> is it expected that the library is not unloaded after dlclose?
>>
>> Mellanox folks,
>> can you please double check how libhcoll is built?
>> i guess it would work if the ml_ symbols were private to the library.
>> if not, the only workaround is to mpirun --mca coll ^ml;
>> otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is
>> really system dependent).
>>
>> Cheers,
>>
>> Gilles
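A quick way to double-check both points from a shell, for anyone who wants to reproduce this -- the libhcoll path below just follows the --with-hcoll prefix used later in this thread, and <ompi-prefix> / <pid> are placeholders:

    # dynamic symbols exported by libhcoll -- look for the coll/ml names
    nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' ml_'
    # compare with the symbols our own component exports
    nm -D <ompi-prefix>/lib/openmpi/mca_coll_ml.so | grep ' ml_'
    # with a job running, check whether mca_coll_ml.so is still mapped after it has been dlclose'd
    grep coll_ml /proc/<pid>/maps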
>> On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
>>> Daniel,
>>>
>>> thanks for the logs.
>>>
>>> another workaround is to
>>> mpirun --mca coll ^hcoll ...
>>>
>>> i was able to reproduce the issue, and it surprisingly occurs only if the
>>> coll_ml module is loaded *before* the hcoll module.
>>> /* this is not the case on my system, so i had to hack my
>>> mca_base_component_path in order to reproduce the issue */
>>>
>>> as far as i understand, libhcoll is proprietary software, so i cannot dig
>>> into it.
>>> that being said, i noticed libhcoll defines some symbols (such as
>>> ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so
>>> it is likely hcoll's coll_ml and openmpi's coll_ml are not binary compatible,
>>> hence the error.
>>>
>>> i will dig a bit more and see if this is even supposed to happen (since
>>> coll_ml_priority is zero, why is the module still loaded?)
>>>
>>> as far as i am concerned, you *have to* mpirun --mca coll ^ml or update
>>> your user/system wide config file to blacklist the coll_ml module to ensure
>>> this is working.
>>>
>>> Mike and Mellanox folks, could you please comment on that?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 6/24/2015 5:23 PM, Daniel Letai wrote:
>>>> Gilles,
>>>>
>>>> Attached the two output logs.
>>>>
>>>> Thanks,
>>>> Daniel
>>>>
>>>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
>>>>> Daniel,
>>>>>
>>>>> i double checked this and i cannot make any sense of these logs.
>>>>>
>>>>> if coll_ml_priority is zero, then i do not see any way
>>>>> ml_coll_hier_barrier_setup can be invoked.
>>>>>
>>>>> could you please run again with --mca coll_base_verbose 100,
>>>>> with and without --mca coll ^ml?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
>>>>>> Daniel,
>>>>>>
>>>>>> ok, thanks.
>>>>>>
>>>>>> it seems that even if priority is zero, some code gets executed.
>>>>>> I will confirm this tomorrow and send you a patch to work around the
>>>>>> issue if my guess is proven right.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
>>>>>>
>>>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source:
>>>>>> default, level: 9 dev/all, type: int)
>>>>>>
>>>>>> Not sure how to read this, but for any n>1 mpirun only works with --mca
>>>>>> coll ^ml.
>>>>>>
>>>>>> Thanks for helping
>>>>>>
>>>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
>>>>>>> This is really odd...
>>>>>>>
>>>>>>> you can run
>>>>>>> ompi_info --all
>>>>>>> and search for coll_ml_priority
>>>>>>>
>>>>>>> it will display the current value and its origin
>>>>>>> (e.g. default, system wide config, user config, cli, environment
>>>>>>> variable).
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Gilles
>>>>>>>
>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>>> No, that's the issue.
>>>>>>> I had to disable it to get things working.
>>>>>>>
>>>>>>> That's why I included my config settings - I couldn't figure out which
>>>>>>> option enabled it, so I could remove it from the configuration...
>>>>>>>
>>>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
>>>>>>>> Daniel,
>>>>>>>>
>>>>>>>> the ML module is not ready for production and is disabled by default.
>>>>>>>>
>>>>>>>> Did you explicitly enable this module?
>>>>>>>> If yes, I encourage you to disable it.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Gilles
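Concretely, the check Gilles suggests above is something like the following; the two config files are just the usual default locations, and <ompi-prefix> is a placeholder for the install prefix:

    # show the coll/ml parameters, their current values, and where each value comes from
    ompi_info --param coll ml --level 9
    # or grep the full dump
    ompi_info --all | grep coll_ml_priority
    # config files that could be overriding the default priority
    grep -s coll ~/.openmpi/mca-params.conf <ompi-prefix>/etc/openmpi-mca-params.conf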
>>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>>>> given a simple hello.c:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> int main(int argc, char* argv[])
>>>>>>>> {
>>>>>>>>     int size, rank, len;
>>>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
>>>>>>>>
>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>     MPI_Get_processor_name(name, &len);
>>>>>>>>
>>>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
>>>>>>>>
>>>>>>>>     MPI_Finalize();
>>>>>>>> }
>>>>>>>>
>>>>>>>> for n=1,
>>>>>>>> mpirun -n 1 ./hello
>>>>>>>> works correctly.
>>>>>>>>
>>>>>>>> for n>1 it segfaults with signal 11.
>>>>>>>> I used gdb to trace the problem to ml coll:
>>>>>>>>
>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup ()
>>>>>>>>    from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
>>>>>>>>
>>>>>>>> running with
>>>>>>>> mpirun -n 2 --mca coll ^ml ./hello
>>>>>>>> works correctly.
>>>>>>>>
>>>>>>>> using mellanox ofed 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
>>>>>>>>
>>>>>>>> openmpi 1.8.5 was built with the following options:
>>>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr
>>>>>>>> --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3"
>>>>>>>> --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default
>>>>>>>> --disable-debug --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized
>>>>>>>> --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace
>>>>>>>> --enable-builtin-atomics --enable-cxx-exceptions --enable-sparse-groups
>>>>>>>> --enable-mpi-thread-multiple --enable-memchecker --enable-btl-openib-failover
>>>>>>>> --with-hwloc=internal --with-verbs --with-x --with-slurm
>>>>>>>> --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm
>>>>>>>> --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
>>>>>>>>
>>>>>>>> gcc version 5.1.1
>>>>>>>>
>>>>>>>> Thanks in advance
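Until this is sorted out, blacklisting the component as Gilles describes can be done in any of the usual ways; a sketch, assuming the default MCA parameter file locations (<ompi-prefix> is a placeholder):

    # one-off, on the command line
    mpirun -n 2 --mca coll ^ml ./hello
    # per-shell, via the environment
    export OMPI_MCA_coll=^ml
    # or persistently, per user or system wide
    echo 'coll = ^ml' >> ~/.openmpi/mca-params.conf
    echo 'coll = ^ml' >> <ompi-prefix>/etc/openmpi-mca-params.conf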
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/