At a minimum, we need to release an immediate 1.8.7 to rectify the situation, either by "rm -rf"-ing ml or by ompi_ignore-ing it. I'll ompi_ignore it in the 1.10 branch for now, since that hasn't been released yet - if we can get a fix in the next week or two, we can "unignore" it for the release. I'm still angling for a 1.10 release in the first half of July.
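If we go the ompi_ignore route, that is just a marker file in the component's source directory; a rough sketch, assuming the usual autogen convention and source-tree layout:

    # in an Open MPI source checkout
    touch ompi/mca/coll/ml/.ompi_ignore    # autogen will now skip coll/ml
    ./autogen.pl                           # regenerate the build system
    ./configure ... && make                # coll/ml is no longer built

    # (if I recall the convention correctly, a .ompi_unignore file listing
    # user names can selectively re-enable the component for developers)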
On Thu, Jun 25, 2015 at 5:27 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Should we .ompi_ignore ml?
>
> On Jun 25, 2015, at 4:41 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> >
> > Thanks, Gilles.
> >
> > We are addressing this.
> >
> > Josh
> >
> > Sent from my iPhone
> >
> > On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> >
> >> Folks,
> >>
> >> this is a follow-up on an issue reported by Daniel on the users mailing list:
> >> OpenMPI is built with hcoll from Mellanox.
> >> the coll ml module has default priority zero.
> >>
> >> on my cluster, it works just fine.
> >> on Daniel's cluster, it crashes.
> >>
> >> i was able to reproduce the crash by tweaking mca_base_component_path and ensuring the coll ml module is loaded first.
> >>
> >> basically, i found two issues:
> >> 1) libhcoll.so (vendor lib provided by Mellanox, i tested hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own coll ml, since there are some *public* symbols that are common to this module (ml_open, ml_coll_hier_barrier_setup, ...)
> >> 2) coll ml priority is zero, and even if the library is dlclose'd, this seems ineffective (nothing changed in /proc/xxx/maps before and after dlclose)
> >>
> >> there are two workarounds:
> >> mpirun --mca coll ^ml
> >> or
> >> mpirun --mca coll ^hcoll ... (probably not what is needed though ...)
> >>
> >> is it expected that the library is not unloaded after dlclose?
> >>
> >> Mellanox folks,
> >> can you please double check how libhcoll is built?
> >> i guess it would work if the ml_ symbols were private to the library.
> >> if not, the only workaround is to mpirun --mca coll ^ml; otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is really system dependent).
> >>
> >> Cheers,
> >>
> >> Gilles
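For reference, Gilles' point about the ml_* symbols can be checked quickly on an affected system by comparing the dynamic symbol tables of libhcoll and of Open MPI's coll_ml component. The sketch below assumes the install paths used elsewhere in this thread (/opt/mellanox/hcoll and the Open MPI 1.8.5 prefix); the version-script file name and export pattern are only illustrative of how a vendor library can keep such symbols private:

    # are the clashing names exported by libhcoll?
    nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep -E 'ml_open|ml_coll_hier_barrier_setup'

    # ... and by Open MPI's own coll_ml plugin?
    nm -D <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so | grep ml_coll_hier_barrier_setup

    # one common way for the library side to hide these names is a linker
    # version script (hypothetical file libhcoll.map):
    #     { global: hcoll_*; local: *; };
    # linked with:  -Wl,--version-script=libhcoll.map
    # or, alternatively, building with -fvisibility=hidden and exporting
    # only the intended public API.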
> >> On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
> >>> Daniel,
> >>>
> >>> thanks for the logs.
> >>>
> >>> another workaround is to
> >>> mpirun --mca coll ^hcoll ...
> >>>
> >>> i was able to reproduce the issue, and it surprisingly occurs only if the coll_ml module is loaded *before* the hcoll module.
> >>> /* this is not the case on my system, so i had to hack my mca_base_component_path in order to reproduce the issue */
> >>>
> >>> as far as i understand, libhcoll is proprietary software, so i cannot dig into it.
> >>> that being said, i noticed libhcoll defines some symbols (such as ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so it is likely hcoll's coll_ml and openmpi's coll_ml are not binary compatible, hence the error.
> >>>
> >>> i will dig a bit more and see if this is even supposed to happen (since coll_ml_priority is zero, why is the module still loaded?)
> >>>
> >>> as far as i am concerned, you *have to* mpirun --mca coll ^ml or update your user/system wide config file to blacklist the coll_ml module to ensure this is working.
> >>>
> >>> Mike and Mellanox folks, could you please comment on that?
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
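The config-file form of that blacklist is a one-line MCA parameter; a minimal sketch, using the standard parameter-file locations (adjust the prefix to your install):

    # per-user:      $HOME/.openmpi/mca-params.conf
    # system-wide:   <openmpi prefix>/etc/openmpi-mca-params.conf
    #
    # exclude the coll/ml component:
    coll = ^ml

    # equivalent to the command-line or environment forms:
    #   mpirun --mca coll ^ml ...
    #   export OMPI_MCA_coll=^ml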
> >>> On 6/24/2015 5:23 PM, Daniel Letai wrote:
> >>>> Gilles,
> >>>>
> >>>> Attached the two output logs.
> >>>>
> >>>> Thanks,
> >>>> Daniel
> >>>>
> >>>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
> >>>>> Daniel,
> >>>>>
> >>>>> i double checked this and i cannot make any sense of these logs.
> >>>>> if coll_ml_priority is zero, then i do not see any way ml_coll_hier_barrier_setup can be invoked.
> >>>>>
> >>>>> could you please run again with --mca coll_base_verbose 100,
> >>>>> with and without --mca coll ^ml?
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Gilles
> >>>>>
> >>>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
> >>>>>> Daniel,
> >>>>>>
> >>>>>> ok, thanks
> >>>>>>
> >>>>>> it seems that even if priority is zero, some code gets executed.
> >>>>>> I will confirm this tomorrow and send you a patch to work around the issue if my guess is proven right.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >>>>>>
> >>>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: default, level: 9 dev/all, type: int)
> >>>>>>
> >>>>>> Not sure how to read this, but for any n>1 mpirun only works with --mca coll ^ml.
> >>>>>>
> >>>>>> Thanks for helping
> >>>>>>
> >>>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
> >>>>>>> This is really odd...
> >>>>>>>
> >>>>>>> you can run
> >>>>>>> ompi_info --all
> >>>>>>> and search for coll_ml_priority.
> >>>>>>>
> >>>>>>> it will display the current value and its origin
> >>>>>>> (e.g. default, system wide config, user config, cli, environment variable).
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>>
> >>>>>>> Gilles
> >>>>>>>
> >>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>> No, that's the issue.
> >>>>>>> I had to disable it to get things working.
> >>>>>>>
> >>>>>>> That's why I included my config settings - I couldn't figure out which option enabled it, so I could remove it from the configuration...
> >>>>>>>
> >>>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
> >>>>>>>> Daniel,
> >>>>>>>>
> >>>>>>>> The ML module is not ready for production and is disabled by default.
> >>>>>>>>
> >>>>>>>> Did you explicitly enable this module?
> >>>>>>>> If yes, I encourage you to disable it.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>>
> >>>>>>>> Gilles
> >>>>>>>>
> >>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
> >>>>>>>> given a simple hello.c:
> >>>>>>>>
> >>>>>>>> #include <stdio.h>
> >>>>>>>> #include <mpi.h>
> >>>>>>>>
> >>>>>>>> int main(int argc, char* argv[])
> >>>>>>>> {
> >>>>>>>>     int size, rank, len;
> >>>>>>>>     char name[MPI_MAX_PROCESSOR_NAME];
> >>>>>>>>
> >>>>>>>>     MPI_Init(&argc, &argv);
> >>>>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>>>>     MPI_Get_processor_name(name, &len);
> >>>>>>>>
> >>>>>>>>     printf("%s: Process %d out of %d\n", name, rank, size);
> >>>>>>>>
> >>>>>>>>     MPI_Finalize();
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> for n=1
> >>>>>>>> mpirun -n 1 ./hello
> >>>>>>>> it works correctly.
> >>>>>>>>
> >>>>>>>> for n>1 it segfaults with signal 11.
> >>>>>>>> used gdb to trace the problem to ml coll:
> >>>>>>>>
> >>>>>>>> Program received signal SIGSEGV, Segmentation fault.
> >>>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
> >>>>>>>>    from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
> >>>>>>>>
> >>>>>>>> running with
> >>>>>>>> mpirun -n 2 --mca coll ^ml ./hello
> >>>>>>>> works correctly.
> >>>>>>>>
> >>>>>>>> using mellanox ofed 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
> >>>>>>>>
> >>>>>>>> openmpi 1.8.5 was built with the following options:
> >>>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g -O3" --enable-mpirun-prefix-by-default --enable-orterun-prefix-by-default --disable-debug --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace --enable-builtin-atomics --enable-cxx-exceptions --enable-sparse-groups --enable-mpi-thread-multiple --enable-memchecker --enable-btl-openib-failover --with-hwloc=internal --with-verbs --with-x --with-slurm --with-pmi=/opt/slurm --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
> >>>>>>>>
> >>>>>>>> gcc version 5.1.1
> >>>>>>>>
> >>>>>>>> Thanks in advance
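As a side note on the coll_ml_priority exchange above, the ompi_info checks look roughly like this (a sketch; the option spelling is for the 1.8 series):

    # which coll components were built into this install?
    ompi_info | grep "MCA coll"

    # current value and data source of coll_ml_priority
    # (Daniel's output above shows value "0", data source "default"):
    ompi_info --all | grep coll_ml_priority

    # or show only the coll/ml parameters at full verbosity:
    ompi_info --param coll ml --level 9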
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/