do you have "-disable-dlopen" in your configure option? This might force
coll_ml to be loaded first even with -mca coll ^ml.
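
A quick way to check is to look for the coll component DSOs in the Open MPI
install tree (a sketch; substitute your actual install prefix):

ls <ompi-install-prefix>/lib/openmpi/mca_coll_*.so

If the build was configured with --disable-dlopen, those per-component .so
files will not be there because the components are linked directly into
libmpi, so "-mca coll ^ml" only stops coll_ml from being selected; its
symbols are still present in the process.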

The next HPCX release is expected by the end of August.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:

> I remember seeing those, but forgot about them. I am curious, though, why
> using '-mca coll ^ml' wouldn't work for me.
>
> We'll watch for the next HPCX release. Is there an ETA on when that
> release may happen? Thank you for the help!
> David
>
>
> On 08/12/2015 04:04 PM, Deva wrote:
>
> David,
>
> This is because hcoll's symbols conflict with the ml coll module inside OMPI;
> HCOLL is derived from the ml module. This issue has been fixed in the hcoll
> library, and the fix will be available in the next HPCX release.
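>
> A rough way to see the overlap (a sketch; the paths are only examples for
> this setup) is to compare the dynamic symbol tables of the two libraries:
>
> nm -D /usr/lib64/libhcoll.so | grep -i coll_ml
> nm -D <ompi-install-prefix>/lib/openmpi/mca_coll_ml.so | grep -i coll_ml
>
> Any symbol defined in both is a candidate for being resolved from the wrong
> library at run time.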
>
> Some earlier discussion on this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Interesting... the seg faults went away:
>>
>> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439416182.732720] [zo-fe1:14690:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> [1439416182.733640] [zo-fe1:14689:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>> 0: Running on host zo-fe1.lanl.gov
>> 0: We have 2 processors
>> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>>
>> This implies to me that some other library is being used instead of
>> /usr/lib64/libhcoll.so, but I am not sure how that could be...
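>>
>> One way to confirm which copy the loader actually picks up (a sketch using
>> the standard glibc loader tracing, nothing Open MPI specific):
>>
>> LD_DEBUG=libs mpirun -n 2 ./a.out 2>&1 | grep hcoll
>>
>> That should print the path of every libhcoll the dynamic linker resolves
>> for each process.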
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 03:30 PM, Deva wrote:
>>
>> Hi David,
>>
>> I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the
>> issue. Can you do one more quick test with LD_PRELOAD set to the hcoll lib?
>>
>> $ LD_PRELOAD=<path/to/hcoll/lib/libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
>>
>> -Devendar
>>
>> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> The admin who rolled the hcoll rpm that we're using (and got it into
>>> system space) said that she got it from
>>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> On 08/12/2015 10:51 AM, Deva wrote:
>>>
>>> Where did you grab this HCOLL lib from? MOFED or HPCX? What version?
>>>
>>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>
>>>> Hey Devendar,
>>>>
>>>> It looks like I still get the error:
>>>>
>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>> [1439397957.351764] [zo-fe1:14678:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [1439397957.352704] [zo-fe1:14677:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>>> ==== backtrace ====
>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
>>>> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
>>>> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
>>>> 13 0x0000000000092ea0 PMPI_Init()  ??:0
>>>> 14 0x00000000004009b6 main()  ??:0
>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>> 16 0x00000000004008c9 _start()  ??:0
>>>> ===================
>>>> ==== backtrace ====
>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>> 5 0x00000000000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>> 10 0x00000000000f9706 mca_coll_hcoll_comm_query()  ??:0
>>>> 11 0x00000000000f684e mca_coll_base_comm_select()  ??:0
>>>> 12 0x0000000000073fc4 ompi_mpi_init()  ??:0
>>>> 13 0x0000000000092ea0 PMPI_Init()  ??:0
>>>> 14 0x00000000004009b6 main()  ??:0
>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>> 16 0x00000000004008c9 _start()  ??:0
>>>> ===================
>>>> --------------------------------------------------------------------------
>>>>
>>>> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited
>>>> on signal 11 (Segmentation fault).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On 08/12/2015 10:42 AM, Deva wrote:
>>>>
>>>> Hi David,
>>>>
>>>> This issue is in the hcoll library. It could be because of a symbol
>>>> conflict with the ml module. This was fixed recently in HCOLL. Can you try
>>>> "-mca coll ^ml" and see if that workaround works in your setup?
>>>>
>>>> -Devendar
>>>>
>>>> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>>>>
>>>>> Hello Gilles,
>>>>>
>>>>> Thank you very much for the patch! It is much more complete than mine.
>>>>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>>>>> with './configure --with-hcoll' without errors.
>>>>>
>>>>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>>>>> however. In my quick sanity test of running a basic parallel hello world C
>>>>> program, I get the following:
>>>>>
>>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>>> [1439390789.039197] [zo-fe1:31354:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>> [1439390789.040265] [zo-fe1:31353:0]         shm.c:65   MXM  WARN  Could not open the KNEM device file at /dev/knem : No such file or directory. Won't use knem.
>>>>> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
>>>>> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>>>>> ==== backtrace ====
>>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>>>>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>>>>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>>>>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>>>>> 14 0x00000000004009b6 main()  ??:0
>>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>>> 16 0x00000000004008c9 _start()  ??:0
>>>>> ===================
>>>>> ==== backtrace ====
>>>>> 2 0x0000000000056cdc mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>>> 3 0x0000000000056e4c mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>>> 4 0x00000000000326a0 killpg()  ??:0
>>>>> 5 0x00000000000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>>> 6 0x00000000000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>> 7 0x0000000000032ee3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>> 8 0x000000000002fda2 hmca_coll_ml_comm_query()  ??:0
>>>>> 9 0x000000000006ace9 hcoll_create_context()  ??:0
>>>>> 10 0x00000000000fa626 mca_coll_hcoll_comm_query()  ??:0
>>>>> 11 0x00000000000f776e mca_coll_base_comm_select()  ??:0
>>>>> 12 0x0000000000074ee4 ompi_mpi_init()  ??:0
>>>>> 13 0x0000000000093dc0 PMPI_Init()  ??:0
>>>>> 14 0x00000000004009b6 main()  ??:0
>>>>> 15 0x000000000001ed5d __libc_start_main()  ??:0
>>>>> 16 0x00000000004008c9 _start()  ??:0
>>>>> ===================
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1
>>>>> exited on signal 11 (Segmentation fault).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> I do not get this message with only 1 process.
>>>>>
>>>>> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
>>>>> something with my ompi build?
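>>>>>
>>>>> For reference, the sanity test is essentially the following (a minimal
>>>>> sketch reconstructed from the output above; the exact code and prints
>>>>> are assumed):
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <string.h>
>>>>> #include <unistd.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     char host[256], msg[512];
>>>>>     int rank, size, i;
>>>>>
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>     gethostname(host, sizeof(host));
>>>>>
>>>>>     if (rank == 0) {
>>>>>         /* rank 0 reports the host and size, then collects greetings */
>>>>>         printf("%d: Running on host %s\n", rank, host);
>>>>>         printf("%d: We have %d processors\n", rank, size);
>>>>>         for (i = 1; i < size; i++) {
>>>>>             MPI_Recv(msg, sizeof(msg), MPI_CHAR, i, 0, MPI_COMM_WORLD,
>>>>>                      MPI_STATUS_IGNORE);
>>>>>             printf("%d: %s\n", rank, msg);
>>>>>         }
>>>>>     } else {
>>>>>         /* every other rank sends a short greeting to rank 0 */
>>>>>         snprintf(msg, sizeof(msg),
>>>>>                  "Hello %d! Processor %d on host %s reporting for duty",
>>>>>                  rank, rank, host);
>>>>>         MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
>>>>>     }
>>>>>
>>>>>     MPI_Finalize();
>>>>>     return 0;
>>>>> }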
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>>>>>
>>>>> Thanks David,
>>>>>
>>>>> I made a PR for the v1.8 branch at
>>>>> https://github.com/open-mpi/ompi-release/pull/492
>>>>>
>>>>> The patch is attached (it required some back-porting).
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>> On 8/12/2015 4:01 AM, David Shrader wrote:
>>>>>
>>>>> I have cloned Gilles' topic/hcoll_config branch and, after running
>>>>> autogen.pl, have found that './configure --with-hcoll' does indeed
>>>>> work now. I used Gilles' branch as I wasn't sure how best to get the pull
>>>>> request changes into my own clone of master. It looks like the proper
>>>>> checks are happening, too:
>>>>>
>>>>> --- MCA component coll:hcoll (m4 configuration macro)
>>>>> checking for MCA component coll:hcoll compile mode... dso
>>>>> checking --with-hcoll value... simple ok (unspecified)
>>>>> checking hcoll/api/hcoll_api.h usability... yes
>>>>> checking hcoll/api/hcoll_api.h presence... yes
>>>>> checking for hcoll/api/hcoll_api.h... yes
>>>>> looking for library without search path
>>>>> checking for library containing hcoll_get_version... -lhcoll
>>>>> checking if MCA component coll:hcoll can compile... yes
>>>>>
>>>>> I haven't checked whether or not Open MPI builds successfully as I
>>>>> don't have much experience running off of the latest source. For now, I
>>>>> think I will try to generate a patch to the 1.8.8 configure script and see
>>>>> if that works as expected.
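>>>>>
>>>>> Once a build with hcoll finishes, something like
>>>>>
>>>>> ompi_info | grep hcoll
>>>>>
>>>>> should list the coll hcoll component if it really got built in (a quick
>>>>> check; this assumes the new install's ompi_info is first in PATH).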
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>> On 08/11/2015 06:34 AM, Jeff Squyres (jsquyres) wrote:
>>>>>
>>>>> On Aug 11, 2015, at 1:39 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
>>>>>
>>>>> Please fix the hcoll test (and code) to be correct.
>>>>>
>>>>> Any configure test that adds /usr/lib and/or /usr/include to any compile 
>>>>> flags is broken.
>>>>>
>>>>> +1
>>>>>
>>>>> Gilles filed https://github.com/open-mpi/ompi/pull/796; I just added some 
>>>>> comments to it.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> David Shrader
>>>>> HPC-3 High Performance Computer Systems
>>>>> Los Alamos National Lab
>>>>> Email: dshrader <at> lanl.gov
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2015/08/27432.php
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/users/2015/08/27434.php
>>>>>
>>>>>
>>>>> --
>>>>> David Shrader
>>>>> HPC-3 High Performance Computer Systems
>>>>> Los Alamos National Lab
>>>>> Email: dshrader <at> lanl.gov
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2015/08/27438.php
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> -Devendar
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/users/2015/08/27439.php
>>>>
>>>>
>>>> --
>>>> David Shrader
>>>> HPC-3 High Performance Computer Systems
>>>> Los Alamos National Lab
>>>> Email: dshrader <at> lanl.gov
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/08/27440.php
>>>>
>>>
>>>
>>>
>>> --
>>>
>>>
>>> -Devendar
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/users/2015/08/27441.php
>>>
>>>
>>> --
>>> David Shrader
>>> HPC-3 High Performance Computer Systems
>>> Los Alamos National Lab
>>> Email: dshrader <at> lanl.gov
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/08/27445.php
>>>
>>
>>
>>
>> --
>>
>>
>> -Devendar
>>
>>
>> --
>> David Shrader
>> HPC-3 High Performance Computer Systems
>> Los Alamos National Lab
>> Email: dshrader <at> lanl.gov
>>
>>
>
>
> --
>
>
> -Devendar
>
>
> --
> David Shrader
> HPC-3 High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader <at> lanl.gov
>
>


-- 


-Devendar
