Ralph,

Indeed, configuring with --enable-mca-no-build=coll-ml resolved my problem.
So, this *is* the same problem that was already known.
Sorry for the false alarm.
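
For the record, the working configure line was essentially the one from my
earlier message with that single flag added, i.e. roughly:

   ./configure --prefix=[...] --enable-debug --with-verbs \
   --enable-openib-connectx-xrc \
   --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
   --enable-static --disable-shared \
   --enable-mca-no-build=coll-ml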

-Paul

On Sun, Aug 23, 2015 at 9:43 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I think that's true - this looks like the hcoll symbol issue. I'd suggest
> configuring with --enable-mca-no-build=coll-ml to resolve the problem in
> static builds, or following Gilles' suggestion about .ompi_ignore.
>
>
>
>
> On Aug 22, 2015, at 10:14 PM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Paul,
>
> If OMPI is built statically or with --disable-dlopen, I do not think --mca
> coll ^ml can prevent the crash (assuming this is the same issue we
> discussed before).
> Note that if you build dynamically and without --disable-dlopen, it might or
> might not crash, depending on how the modules are enumerated, and that is
> specific to each system.
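> (A quick way to tell the two cases apart, assuming a default install layout,
> is to look for the component as a standalone DSO, e.g.
> ls <prefix>/lib/openmpi/mca_coll_ml.so
> it is typically present in a dynamic, dlopen-enabled build, while a static or
> --disable-dlopen build folds the component into the library instead.)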
>
> So at this stage I cannot tell whether this is a different issue or not.
> If the crash still occurs with .ompi_ignore in coll/ml, then I could
> conclude that this is a different issue.
>
> Cheers,
>
> Gilles
>
> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Gilles,
>>
>> This is on Mellanox's own system, where /opt/mellanox/hcoll was updated
>> Aug 2.
>> This problem also did not occur unless I built libmpi statically.
>> A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
>> So, I really don't know whether this is the same issue, but I suspect that
>> it is not.
>>
>> -Paul
>>
>> On Sat, Aug 22, 2015 at 6:00 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Paul,
>>>
>>> Isn't this an issue that was already discussed?
>>> Mellanox's proprietary hcoll library includes its own coll/ml module that
>>> conflicts with the OMPI one.
>>> The Mellanox folks fixed this internally, but I am not sure the fix has
>>> been released.
>>> You can run
>>> nm libhcoll.so
>>> and if there are symbols starting with coll_ml, then the issue is still
>>> there.
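>>> For example, something along the lines of
>>> nm /opt/mellanox/hcoll/lib/libhcoll.so | grep ' coll_ml'
>>> (assuming the library sits under the --with-hcoll prefix; nm -D may be
>>> needed if the library is stripped) should print nothing once the fix is in.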
>>> If you have time and recent autotools, you can
>>> touch ompi/mca/coll/ml/.ompi_ignore
>>> ./autogen.pl
>>> make ...
>>> and that should be fine.
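>>> (Spelled out, from the top of the source tree that is roughly
>>> touch ompi/mca/coll/ml/.ompi_ignore
>>> ./autogen.pl
>>> ./configure <same options as before>
>>> make && make install
>>> since autogen.pl regenerates the build system, configure has to be re-run
>>> before make.)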
>>>
>>> If you configured with dynamic libraries and without --disable-dlopen, then
>>> mpirun --mca coll ^ml ...
>>> is enough to work around the issue.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>> Having seen problems with mtl:ofi with "--enable-static
>>>> --disable-shared", I tried mtl:psm and mtl:mxm with those options as well.
>>>>
>>>> The good news is that mtl:psm was fine, but the bad news is that when
>>>> testing mtl:mxm I encountered a new problem involving coll:hcoll.
>>>> Ralph probably wants to strangle me right now...
>>>>
>>>>
>>>> I am configuring the 1.10.0rc4 tarball with
>>>>    --prefix=[...] --enable-debug --with-verbs --enable-openib-connectx-xrc \
>>>>    --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>>>>    --enable-static --disable-shared
>>>>
>>>> Everything was fine without those last two arguments.
>>>> When I add them, the build is fine and I can compile the examples.
>>>> However, I get a SEGV when running an example:
>>>>
>>>> $ mpirun -np 2 examples/ring_c
>>>> [mir13:12444:0] Caught signal 11 (Segmentation fault)
>>>> [mir13:12445:0] Caught signal 11 (Segmentation fault)
>>>> ==== backtrace ====
>>>> ==== backtrace ====
>>>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>>>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>>> 21 0x000000000040aae9 _start()  ??:0
>>>> ===================
>>>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>>>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>>> 21 0x000000000040aae9 _start()  ??:0
>>>> ===================
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited
>>>> on signal 13 (Broken pipe).
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> This is reproducible.
>>>> A run with "-np 1" is fine.
>>>>
>>>> -Paul
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/08/17795.php
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/08/17797.php
>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/08/17799.php
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
