Gilles,

This is on Mellanox's own system, where /opt/mellanox/hcoll was updated Aug
2.
This problem also did not occur unless I built libmpi statically.
A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
So, I really don't know whether this is the same issue, but I suspect that it
is not.

-Paul

On Sat, Aug 22, 2015 at 6:00 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Paul,
>
> Isn't this an issue that was already discussed?
> Mellanox's proprietary hcoll library includes its own coll ml module that
> conflicts with the ompi one.
> Mellanox folks fixed this internally, but I am not sure the fix has been
> released.
> You can run
> nm libhcoll.so
> and if there are any symbols starting with coll_ml, then the issue is still
> there.
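> For example (the path /opt/mellanox/hcoll/lib is an assumption; point it at
> wherever libhcoll.so actually lives), something along the lines of
>    nm -D /opt/mellanox/hcoll/lib/libhcoll.so | grep ' coll_ml'
> should print nothing if the conflicting symbols have been removed; any output
> means symbols whose names begin with coll_ml are still exported.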
> If you have time and recent autotools, you can
> touch ompi/mca/coll/ml/.ompi_ignore
> ./autogen.pl
> make ...
> and that should be fine.
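> A rough sketch of that rebuild (the configure options are a placeholder;
> reuse whatever you passed originally):
>    touch ompi/mca/coll/ml/.ompi_ignore
>    ./autogen.pl
>    ./configure <your original options>
>    make all install
> Re-running configure is needed because autogen.pl regenerates the configure
> script.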
>
> If you configured with dynamic libraries and without --disable-dlopen, then
> mpirun --mca coll ^ml ...
> is enough to work around the issue.
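> Concretely, for the ring example from this report that would be
>    mpirun --mca coll ^ml -np 2 examples/ring_c
> though, as above, this only helps with a dynamic (dlopen-enabled) build.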
>
> Cheers,
>
> Gilles
>
> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Having seen problems with mtl:ofi with "--enable-static
>> --disable-shared", I tried mtl:psm and mtl:mxm with those options as well.
>>
>> The good news is that mtl:psm was fine; the bad news is that when testing
>> mtl:mxm I encountered a new problem involving coll:hcoll.
>> Ralph probably wants to strangle me right now...
>>
>>
>> I am configuring the 1.10.0rc4 tarball with
>>    --prefix=[...] --enable-debug --with-verbs
>> --enable-openib-connectx-xrc \
>>    --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>>    --enable-static --disable-shared
>>
>> Everything was fine without those last two arguments.
>> When I add them, the build succeeds and I can compile the examples.
>> However, I get a SEGV when running an example:
>>
>> $ mpirun -np 2 examples/ring_c
>> [mir13:12444:0] Caught signal 11 (Segmentation fault)
>> [mir13:12445:0] Caught signal 11 (Segmentation fault)
>> ==== backtrace ====
>> ==== backtrace ====
>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>> 21 0x000000000040aae9 _start()  ??:0
>> ===================
>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>> 21 0x000000000040aae9 _start()  ??:0
>> ===================
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited on
>> signal 13 (Broken pipe).
>> --------------------------------------------------------------------------
>>
>> This is reproducible.
>> A run with "-np 1" is fine.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/08/17795.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
