Ralph,

Indeed, configuring with --enable-mca-no-build=coll-ml resolved my problem.
So, this *is* the same problem that was already known.
Sorry for the false alarm.
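For reference, the flag simply gets appended to the configure line from the original report quoted below. Roughly (install prefix elided as in the report; this is an illustration, not the exact command that was run):

    ./configure --prefix=[...] --enable-debug --with-verbs \
        --enable-openib-connectx-xrc \
        --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
        --enable-static --disable-shared \
        --enable-mca-no-build=coll-ml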
-Paul

On Sun, Aug 23, 2015 at 9:43 AM, Ralph Castain <r...@open-mpi.org> wrote:

> I think that's true - this looks like the hcoll symbol issue. I'd suggest
> configuring with --enable-mca-no-build=coll-ml to resolve the problem in
> static builds, or follow Gilles's suggestion about .ompi_ignore.
>
> On Aug 22, 2015, at 10:14 PM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Paul,
>
> If ompi is built statically or with --disable-dlopen, I do not think
> --mca coll ^ml can prevent the crash (assuming this is the same issue we
> discussed before).
> Note that if you build dynamically and without --disable-dlopen, it might
> or might not crash, depending on how modules are enumerated, and that is
> specific to each system.
>
> So at this stage I cannot say whether this is a different issue or not.
> If the crash still occurs with .ompi_ignore in coll ml, then I could
> conclude this is a different issue.
>
> Cheers,
>
> Gilles
>
> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> Gilles,
>>
>> This is on Mellanox's own system, where /opt/mellanox/hcoll was updated
>> Aug 2.
>> This problem also did not occur unless I built libmpi statically.
>> A run of "mpirun -mca coll ^ml -np 2 examples/ring_c" still crashes.
>> So, I really don't know if this is the same issue, but suspect that it
>> is not.
>>
>> -Paul
>>
>> On Sat, Aug 22, 2015 at 6:00 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> Paul,
>>>
>>> Isn't this an issue that was already discussed?
>>> The Mellanox proprietary hcoll library includes its own coll ml module,
>>> which conflicts with the ompi one.
>>> The Mellanox folks fixed this internally, but I am not sure the fix has
>>> been released.
>>> You can run
>>>     nm libhcoll.so
>>> and if there are some symbols starting with coll_ml, then the issue is
>>> still there.
>>> If you have time and recent autotools, you can
>>>     touch ompi/mca/coll/ml/.ompi_ignore
>>>     ./autogen.pl
>>>     make ...
>>> and that should be fine.
>>>
>>> If you configure'd with dynamic libraries and no --disable-dlopen, then
>>>     mpirun --mca coll ^ml ...
>>> is enough to work around the issue.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Sunday, August 23, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>
>>>> Having seen problems with mtl:ofi with "--enable-static
>>>> --disable-shared", I tried mtl:psm and mtl:mxm with those options as
>>>> well.
>>>>
>>>> The good news is that mtl:psm was fine, but the bad news is that when
>>>> testing mtl:mxm I encountered a new problem involving coll:hcoll.
>>>> Ralph probably wants to strangle me right now...
>>>>
>>>> I am configuring the 1.10.0rc4 tarball with
>>>>     --prefix=[...] --enable-debug --with-verbs --enable-openib-connectx-xrc \
>>>>     --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll \
>>>>     --enable-static --disable-shared
>>>>
>>>> Everything was fine without those last two arguments.
>>>> When I add them the build is fine, and I can compile the examples.
>>>> However, I get a SEGV when running an example:
>>>>
>>>> $ mpirun -np 2 examples/ring_c
>>>> [mir13:12444:0] Caught signal 11 (Segmentation fault)
>>>> [mir13:12445:0] Caught signal 11 (Segmentation fault)
>>>> ==== backtrace ====
>>>>  2 0x0000000000059d9c mxm_handle_error()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:641
>>>>  3 0x0000000000059f0c mxm_error_signal_handler()  /hpc/local/benchmarks/hpc-stack-gcc-Saturday/src/install/mxm-master/src/mxm/util/debug/debug.c:616
>>>>  4 0x0000003c2e0329a0 killpg()  ??:0
>>>>  5 0x0000000000528b51 opal_list_remove_last()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/opal/class/opal_list.h:721
>>>>  6 0x0000000000529872 base_bcol_basesmuma_setup_library_buffers()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/bcol/basesmuma/bcol_basesmuma_setup.c:537
>>>>  7 0x000000000009e983 hmca_bcol_basesmuma_comm_query()  ??:0
>>>>  8 0x00000000000348e3 hmca_coll_ml_tree_hierarchy_discovery()  coll_ml_module.c:0
>>>>  9 0x00000000000317a2 hmca_coll_ml_comm_query()  ??:0
>>>> 10 0x000000000006c929 hcoll_create_context()  ??:0
>>>> 11 0x00000000004a248f mca_coll_hcoll_comm_query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/hcoll/coll_hcoll_module.c:290
>>>> 12 0x000000000047c82f query_2_0_0()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:392
>>>> 13 0x000000000047c7ee query()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:375
>>>> 14 0x000000000047c704 check_one_component()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:337
>>>> 15 0x000000000047c567 check_components()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:301
>>>> 16 0x000000000047552a mca_coll_base_comm_select()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/mca/coll/base/coll_base_comm_select.c:131
>>>> 17 0x0000000000428476 ompi_mpi_init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/openmpi-1.10.0rc4/ompi/runtime/ompi_mpi_init.c:894
>>>> 18 0x0000000000431ba5 PMPI_Init()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/ompi/mpi/c/profile/pinit.c:84
>>>> 19 0x000000000040abce main()  /hpc/home/USERS/phhargrove/SCRATCH/OMPI/openmpi-1.10.0rc4-linux-x86_64-mxm-static/BLD/examples/ring_c.c:19
>>>> 20 0x0000003c2e01ed1d __libc_start_main()  ??:0
>>>> 21 0x000000000040aae9 _start()  ??:0
>>>> ===================
>>>> [the second rank printed an identical backtrace; omitted here]
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that process rank 1 with PID 12445 on node mir13 exited
>>>> on signal 13 (Broken pipe).
>>>> --------------------------------------------------------------------------
>>>>
>>>> This is reproducible.
>>>> A run with "-np 1" is fine.
>>>>
>>>> -Paul
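A sketch of the symbol check Gilles describes above, assuming hcoll's shared library sits under the /opt/mellanox/hcoll prefix from the configure line (the lib/ subdirectory and the exact grep pattern are guesses, adjust for the actual install layout):

    nm /opt/mellanox/hcoll/lib/libhcoll.so | grep ' coll_ml'

Any output would mean libhcoll still carries its own coll_ml symbols, i.e. the conflict with Open MPI's coll/ml component is still present.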
--
Paul H. Hargrove                     phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department          Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory   Fax: +1-510-486-6900