Should we .ompi_ignore ml?

> On Jun 25, 2015, at 4:41 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> 
> Thanks, Gilles.
> 
> We are addressing this.
> 
> Josh
> 
> Sent from my iPhone
> 
> On Jun 25, 2015, at 11:03 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
>> Folks,
>> 
>> this is a followup on an issue reported by Daniel on the users mailing list:
>> OpenMPI is built with hcoll from Mellanox.
>> the coll ml module has default priority zero.
>> 
>> on my cluster, it works just fine
>> on Daniel's cluster, it crashes.
>> 
>> i was able to reproduce the crash by tweaking mca_base_component_path and 
>> ensuring the coll ml module is loaded first.
>> 
>> basically, i found two issues :
>> 1) libhcoll.so (vendor lib provided by Mellanox, i tested 
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.2-x86_64) seems to include its own 
>> coll ml, since there are some *public* symbols that are common to this 
>> module (ml_open, ml_coll_hier_barrier_setup, ...)
>> 2) coll ml priority is zero, and even if the library is dlclose'd, it seems 
>> this is ineffective
>> (nothing changed in /proc/xxx/maps before and after dlclose)
>> 
>> 
>> there are two workarounds :
>> mpirun --mca coll ^ml
>> or
>> mpirun --mca coll ^hcoll ... (probably not what is needed though ...)
>> 
>> is it expected that the library is not unloaded after dlclose() ?
>> 
>> Mellanox folks,
>> can you please double check how libhcoll is built ?
>> i guess it would work if the ml_ symbols were private to the library.
>> if not, the only workaround is to mpirun --mca coll ^ml
>> otherwise, it might crash (if coll_ml is loaded before coll_hcoll, which is 
>> really system dependent)
>> 
>> Cheers,
>> 
>> Gilles
>> On 6/25/2015 10:46 AM, Gilles Gouaillardet wrote:
>>> Daniel,
>>> 
>>> thanks for the logs.
>>> 
>>> another workaround is to
>>> mpirun --mca coll ^hcoll ...
>>> 
>>> i was able to reproduce the issue, and it surprisingly occurs only if the 
>>> coll_ml module is loaded *before* the hcoll module.
>>> /* this is not the case on my system, so i had to hack my 
>>> mca_base_component_path in order to reproduce the issue */
>>> 
>>> as far as i understand, libhcoll is proprietary software, so i cannot dig 
>>> into it.
>>> that being said, i noticed libhcoll defines some symbols (such as 
>>> ml_coll_hier_barrier_setup) that are also defined by the coll_ml module, so 
>>> it is likely hcoll coll_ml and openmpi coll_ml are not binary compatible 
>>> hence the error.
>>> 
>>> i will dig a bit more and see if this is even supposed to happen (since 
>>> coll_ml_priority is zero, why is the module still loaded ?)
>>> 
>>> as far as i am concerned, you *have to* run mpirun --mca coll ^ml or update 
>>> your user/system-wide config file to blacklist the coll_ml module to ensure 
>>> this works.
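[editor's note] the config-file blacklist mentioned here goes in Open MPI's MCA parameter file; a sketch (the per-user path is shown, the system-wide file is $prefix/etc/openmpi-mca-params.conf):

```conf
# $HOME/.openmpi/mca-params.conf
# equivalent to passing --mca coll ^ml on every mpirun invocation:
# exclude the coll_ml component from selection
coll = ^ml
```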
>>> 
>>> Mike and Mellanox folks, could you please comment on that ?
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> 
>>> 
>>> On 6/24/2015 5:23 PM, Daniel Letai wrote:
>>>> Gilles,
>>>> 
>>>> Attached the two output logs.
>>>> 
>>>> Thanks,
>>>> Daniel
>>>> 
>>>> On 06/22/2015 08:08 AM, Gilles Gouaillardet wrote:
>>>>> Daniel,
>>>>> 
>>>>> i double checked this and i cannot make any sense of these logs.
>>>>> 
>>>>> if coll_ml_priority is zero, then i do not see how 
>>>>> ml_coll_hier_barrier_setup can be invoked.
>>>>> 
>>>>> could you please run again with --mca coll_base_verbose 100
>>>>> with and without --mca coll ^ml
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> On 6/22/2015 12:08 AM, Gilles Gouaillardet wrote:
>>>>>> Daniel,
>>>>>> 
>>>>>> ok, thanks
>>>>>> 
>>>>>> it seems that even if priority is zero, some code gets executed.
>>>>>> I will confirm this tomorrow and send you a patch to work around the 
>>>>>> issue if my guess is proven right.
>>>>>> 
>>>>>> Cheers,
>>>>>> 
>>>>>> Gilles
>>>>>> 
>>>>>> On Sunday, June 21, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>> MCA coll: parameter "coll_ml_priority" (current value: "0", data source: 
>>>>>> default, level: 9 dev/all, type: int)
>>>>>> 
>>>>>> Not sure how to read this, but for any n>1 mpirun only works with --mca 
>>>>>> coll ^ml
>>>>>> 
>>>>>> Thanks for helping
>>>>>> 
>>>>>> On 06/18/2015 04:36 PM, Gilles Gouaillardet wrote:
>>>>>>> This is really odd...
>>>>>>> 
>>>>>>> you can run
>>>>>>> ompi_info --all 
>>>>>>> and search coll_ml_priority
>>>>>>> 
>>>>>>> it will display the current value and the origin
>>>>>>> (e.g. default, system wide config, user config, cli, environment 
>>>>>>> variable)
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> 
>>>>>>> Gilles
>>>>>>> 
>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>>> No, that's the issue.
>>>>>>> I had to disable it to get things working.
>>>>>>> 
>>>>>>> That's why I included my config settings - I couldn't figure out which 
>>>>>>> option enabled it, so I could remove it from the configuration...
>>>>>>> 
>>>>>>> On 06/18/2015 02:43 PM, Gilles Gouaillardet wrote:
>>>>>>>> Daniel,
>>>>>>>> 
>>>>>>>> ML module is not ready for production and is disabled by default.
>>>>>>>> 
>>>>>>>> Did you explicitly enable this module ?
>>>>>>>> If yes, I encourage you to disable it
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> 
>>>>>>>> Gilles
>>>>>>>> 
>>>>>>>> On Thursday, June 18, 2015, Daniel Letai <d...@letai.org.il> wrote:
>>>>>>>> given a simple hello.c:
>>>>>>>> 
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <mpi.h>
>>>>>>>> 
>>>>>>>> int main(int argc, char* argv[])
>>>>>>>> {
>>>>>>>>         int size, rank, len;
>>>>>>>>         char name[MPI_MAX_PROCESSOR_NAME];
>>>>>>>> 
>>>>>>>>         MPI_Init(&argc, &argv);
>>>>>>>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>         MPI_Get_processor_name(name, &len);
>>>>>>>> 
>>>>>>>>         printf("%s: Process %d out of %d\n", name, rank, size);
>>>>>>>> 
>>>>>>>>         MPI_Finalize();
>>>>>>>>         return 0;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> for n=1
>>>>>>>> mpirun -n 1 ./hello
>>>>>>>> it works correctly.
>>>>>>>> 
>>>>>>>> for n>1 it segfaults with signal 11.
>>>>>>>> I used gdb to trace the problem to the ml coll module:
>>>>>>>> 
>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>> 0x00007ffff6750845 in ml_coll_hier_barrier_setup()
>>>>>>>>     from <path to openmpi 1.8.5>/lib/openmpi/mca_coll_ml.so
>>>>>>>> 
>>>>>>>> running with
>>>>>>>> mpirun -n 2 --mca coll ^ml ./hello
>>>>>>>> works correctly
>>>>>>>> 
>>>>>>>> using mellanox ofed 2.3-2.0.5-rhel6.4-x86_64, if it's at all relevant.
>>>>>>>> openmpi 1.8.5 was built with following options:
>>>>>>>> rpmbuild --rebuild --define 'configure_options --with-verbs=/usr 
>>>>>>>> --with-verbs-libdir=/usr/lib64 CC=gcc CXX=g++ FC=gfortran CFLAGS="-g 
>>>>>>>> -O3" --enable-mpirun-prefix-by-default 
>>>>>>>> --enable-orterun-prefix-by-default --disable-debug 
>>>>>>>> --with-knem=/opt/knem-1.1.1.90mlnx --with-platform=optimized 
>>>>>>>> --without-mpi-param-check --with-contrib-vt-flags=--disable-iotrace 
>>>>>>>> --enable-builtin-atomics --enable-cxx-exceptions 
>>>>>>>> --enable-sparse-groups --enable-mpi-thread-multiple 
>>>>>>>> --enable-memchecker --enable-btl-openib-failover --with-hwloc=internal 
>>>>>>>> --with-verbs --with-x --with-slurm --with-pmi=/opt/slurm 
>>>>>>>> --with-fca=/opt/mellanox/fca --with-mxm=/opt/mellanox/mxm 
>>>>>>>> --with-hcoll=/opt/mellanox/hcoll' openmpi-1.8.5-1.src.rpm
>>>>>>>> 
>>>>>>>> gcc version 5.1.1
>>>>>>>> 
>>>>>>>> Thanks in advance
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> Link to this post: 
>>>>>>>> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/06/17528.php
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
