Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
I tried the patch, but I get the same result:

    error obtaining device attributes for mlx4_0 errno says Success

I'm getting (what I think are) good transfer rates using "--mca btl self,tcp" on the osu_bw test (~7000 MB/s). It seems to me that the only way that could be happening is if the InfiniBand interfaces are being used over TCP, correct? Would such an arrangement preclude the ability to do RDMA or openib? Perhaps the network is set up in such a way that the IB hardware is not discoverable by openib? (I'm not a network admin, and I wasn't involved in the setup of the network. Unfortunately, the person who knows the most has recently left the organization.)

Greg

From: Pritchard Jr., Howard
Sent: Thursday, October 14, 2021 5:45 PM
To: Fischer, Greg A.; Open MPI Users
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

Oh yes, that's not good about rdmacm. Yes, the OFED looks pretty old. Did you by any chance apply that patch? I generated it for a sysadmin here who needed to maintain Open MPI 3.1.6 but also had to upgrade to some newer RHEL release, and Open MPI wasn't compiling after the RHEL upgrade.

Howard

From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, Open MPI Users <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.) A conspicuous warning I see in my UCX configure output is:

    checking for rdma_establish in -lrdmacm... no
    configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old?

Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A. <fisch...@westinghouse.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

I think the UCX PML may be discomfited by the lack of thread safety. Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt. That's what stands out in your configure output compared to the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0. Here's the output from one of my UCX configs:

    configure: =========
    configure: UCX build configuration:
    configure:         Build prefix:   /ucx_testing/ucx/test_install
    configure:    Configuration dir:   ${prefix}/etc/ucx
    configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
    configure:           C compiler:   /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
    configure:         C++ compiler:   /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
    configure:         Multi-thread:   enabled
    configure:         NUMA support:   disabled
    configure:            MPI tests:   disabled
    configure:          VFS support:   no
    configure:        Devel headers:   no
    configure: io_demo CUDA support:   no
    configure:             Bindings:   < >
    configure:          UCS modules:   < >
    configure:          UCT modules:   < ib cma knem >
    configure:         CUDA modules:   < >
    configure:         ROCM modules:   < >
    configure:           IB modules:   < >
    configure:          UCM modules:   < >
    configure:         Perf modules:   < >
    configure: =========

Howard

From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., How
Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

    --------------------------------------------------------------------------
    No components were able to be opened in the pml framework.
    This typically means that either no components of this type were installed, or none of the installed components can be loaded. Sometimes this means that shared libraries required by these components are unable to be found/loaded.
      Host:      bl1311
      Framework: pml
    --------------------------------------------------------------------------
    [bl1311:20168] PML ucx cannot be selected
    [bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the UCX configuration information. I'm not sure I follow everything on the UCX FAQ page, but it seems like everything is being routed over TCP, which is probably not what I want. Any thoughts as to what I might be doing wrong?

Thanks,
Greg

From: Pritchard Jr., Howard
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users
Cc: Fischer, Greg A.
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

It's the aging of the openib btl. You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported. You may want to try using the 4.1.1 release, in which case you'll want to use UCX.

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of "Fischer, Greg A. via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Wednesday, October 13, 2021 at 10:06 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello,

I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

    WARNING: There was an error initializing an OpenFabrics device.
      Local host:   bl1308
      Local device: mlx4_0
    --------------------------------------------------------------------------
    [bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've attached the detailed output of ofed_info and ompi_info. This issue seems similar to Issue #7461 (https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution for. Does anyone know what the likely explanation is? Is the version of OFED on the system badly out-of-sync with contemporary OpenMPI?

Thanks,
Greg

This e-mail may contain proprietary information of the sending organization. Any unauthorized or improper disclosure, copying, distribution, or use of the contents of this e-mail and attached document(s) is prohibited. The information contained in this e-mail and attached document(s) is intended only for the personal and private use of the recipient(s) named above. If you have received this communication in error, please notify the sender immediately by email and delete the original e-mail and attached document(s).
    #
    # Memory domain: posix
    #     Component: posix
    #      allocate: unlimited
    #    remote key: 24 bytes
    #    rkey_ptr is supported
    #
    #      Transport: posix
    #         Device: memory
    #  System device:
    #
    #   capabilities:
    #ba
[OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"
Hello,

I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

    WARNING: There was an error initializing an OpenFabrics device.
      Local host:   bl1308
      Local device: mlx4_0
    --------------------------------------------------------------------------
    [bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've attached the detailed output of ofed_info and ompi_info. This issue seems similar to Issue #7461 (https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution for. Does anyone know what the likely explanation is? Is the version of OFED on the system badly out-of-sync with contemporary OpenMPI?

Thanks,
Greg
MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2):
ar_mgr: osm_plugins/ar_mgr/ar_mgr-1.0-0.34.g9bd7c9a.tar.gz
cc_mgr: osm_plugins/cc_mgr/cc_mgr-1.0-0.33.g9bd7c9a.tar.gz
dapl: dapl.git mlnx_ofed_4_0 commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d
dump_pr: osm_plugins/dump_pr//dump_pr-1.0-0.29.g9bd7c9a.tar.gz
fabric-collector: fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz
hcoll: mlnx_ofed_hcol/hcoll-3.8.1649-1.src.rpm
ibacm: mlnx_ofed/ibacm.git mlnx_ofed_4_1 commit b0d53cf13358eb0c14665765b0170a37768463ff
ibacm_ssa: mlnx_ofed_ssa/acm/ibacm_ssa-0.0.9.3.MLNX20151203.50eb579.tar.gz
ibdump: sniffer/sniffer-5.0.0-1/ibdump/linux/ibdump-5.0.0-1.tgz
ibsim: mlnx_ofed_ibsim/ibsim-0.6mlnx1-0.8.g9d76581.tar.gz
ibssa: mlnx_ofed_ssa/distrib/ibssa-0.0.9.3.MLNX20151203.50eb579.tar.gz
ibutils: ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz
ibutils2: ibutils2/ibutils2-2.1.1-0.91.MLNX20170612.g2e0d52a.tar.gz
infiniband-diags: mlnx_ofed_infiniband_diags/infiniband-diags-1.6.7.MLNX20170511.7595646.tar.gz
iser: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
isert: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
kernel-mft: mlnx_ofed_mft/kernel-mft-4.7.0-41.src.rpm
knem: knem.git mellanox-master commit 4faa2978ad0339c50dd6df336d0a4182647b624b
libibcm: mlnx_ofed/libibcm.git mlnx_ofed_4_1 commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e
libibmad: mlnx_ofed_libibmad/libibmad-1.3.13.MLNX20170511.267a441.tar.gz
libibprof: mlnx_ofed_libibprof/libibprof-1.1.41-1.src.rpm
libibumad: mlnx_ofed_libibumad/libibumad-13.10.2.MLNX20170511.dcc9f7a.tar.gz
libibverbs: mlnx_ofed/libibverbs.git mlnx_ofed_4_1 commit a23bf787eff96af4c05d6e5f0e201dba80db114e
libmlx4: mlnx_ofed/libmlx4.git mlnx_ofed_4_1 commit d945a7eeb52e319b210e6602a9fee0646371
libmlx5: mlnx_ofed/libmlx5.git mlnx_ofed_4_1 commit 71822e375014c7f81dec3e4eca06f366846eaf1a
libopensmssa: mlnx_ofed_ssa/plugin/libopensmssa-0.0.9.3.MLNX20151203.50eb579.tar.gz
librdmacm: mlnx_ofed/librdmacm.git mlnx_ofed_4_1 commit 1297178df9b07030d84a042d417cb61fa65e62a1
librxe: mlnx_ofed/librxe.git master commit 607460456c717c3b65428367676cacb5495ac005
libvma: vma/source_rpms//libvma-8.3.7-0.src.rpm
mlnx-en: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
mlnx-ethtool: upstream/ethtool.git for-upstream commit ac0cf295abe0c0832f0711fed66ab9601c8b2513
mlnx-nfsrdma: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
mlnx-nvme: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
mlnx-ofa_kernel: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
mlnx-rdma-rxe: mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1 commit c22af8878c71966728f6ac38d963190f5222b2ec
mpi-selector: ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm
mpitests: mlnx_ofed_mpitest/mpitests-3.2.19-acade41.src.rpm
mstflint: mlnx_ofed_mstflint/mstflint-4.7.0-1.6.g26037b7.tar.gz
multiperf: mlnx_ofed_multiperf/multiperf-3.0-0.10.gda89e8c.tar.gz
mxm: mlnx_ofed_mxm/mxm-3.6.3102-1.src.rpm
ofed-docs: docs.git mlnx_ofed-4.0 commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc
openmpi: mlnx_ofed_ompi_1.8/openmpi-2.1.2a1-1.src.rpm
opensm:
[OMPI users] disappearance of the memory registration error in 1.8.x?
Hello,

I'm trying to run the "connectivity_c" test on a variety of systems using OpenMPI 1.8.4. The test returns segmentation faults when running across nodes on one particular type of system, and only when using the openib BTL. (The test runs without error if I stipulate "--mca btl tcp,self".) Here's the output:

    1033 fischega@bl1415[~/tmp/openmpi/1.8.4_test_examples_SLES11_SP2/error]> mpirun -np 16 connectivity_c
    [bl1415:29526] *** Process received signal ***
    [bl1415:29526] Signal: Segmentation fault (11)
    [bl1415:29526] Signal code: (128)
    [bl1415:29526] Failing at address: (nil)
    [bl1415:29526] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab1e72915d0]
    [bl1415:29526] [ 1] /data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x29e)[0x2ab1e7c550be]
    [bl1415:29526] [ 2] /data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_memalign+0x69)[0x2ab1e7c58829]
    [bl1415:29526] [ 3] /data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x6f)[0x2ab1e7c583ff]
    [bl1415:29526] [ 4] /data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/openmpi/mca_btl_openib.so(+0x2867b)[0x2ab1eac8a67b]
    [bl1415:29526] [ 5] /data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/openmpi/mca_btl_openib.so(+0x1f712)[0x2ab1eac81712]
    [bl1415:29526] [ 6] /lib64/libpthread.so.0(+0x75f0)[0x2ab1e72895f0]
    [bl1415:29526] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2ab1e757484d]
    [bl1415:29526] *** End of error message ***

When I run the same test using a previous build of OpenMPI 1.6.5 on this system, it returns a memory registration warning, but otherwise executes normally:

    --------------------------------------------------------------------------
    WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash.

OpenMPI 1.8.4 does not seem to be reporting a memory registration warning in situations where previous versions would report such a warning. Is this because OpenMPI 1.8.4 is no longer vulnerable to this type of condition?

Thanks,
Greg
Re: [OMPI users] poor performance using the openib btl
I looked through my configure log, and that option is not enabled. Thanks for the suggestion.

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Wednesday, June 25, 2014 10:51 AM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

Hi,

I recovered the name of the option that caused problems for us. It is --enable-mpi-thread-multiple. This option enables threading within OPAL, which was bugged (at least in the 1.6.x series). I don't know if it has been fixed in the 1.8 series. I do not see your configure line in the attached file, to see if it was enabled or not.

Maxime

On 2014-06-25 10:46, Fischer, Greg A. wrote:

Attached are the results of "grep thread" on my configure output. There appears to be some amount of threading, but is there anything I should look for in particular? I see Mike Dubman's questions on the mailing list website, but his message didn't appear to make it to my inbox. The answers to his questions are:

    [binford:fischega] $ rpm -qa | grep ofed
    ofed-doc-1.5.4.1-0.11.5
    ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
    ofed-1.5.4.1-0.11.5

Distro: SLES11 SP3

HCA:

    [binf102:fischega] $ /usr/sbin/ibstat
    CA 'mlx4_0'
            CA type: MT26428

Command line (path and LD_LIBRARY_PATH are set correctly):

    mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca btl_openib_verbose 1 -np 31 $CTF_EXEC

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Tuesday, June 24, 2014 6:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

What are your threading options for OpenMPI (when it was built)? I have seen the OpenIB BTL completely lock up before when some level of threading is enabled.

Maxime Boissonneault

On 2014-06-24 18:18, Fischer, Greg A. wrote:

Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved. However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right. I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.)

Thanks,
Greg

users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24697.php

--
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24700.php

--
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique
Re: [OMPI users] poor performance using the openib btl
Attached are the results of "grep thread" on my configure output. There appears to be some amount of threading, but is there anything I should look for in particular? I see Mike Dubman's questions on the mailing list website, but his message didn't appear to make it to my inbox. The answers to his questions are:

    [binford:fischega] $ rpm -qa | grep ofed
    ofed-doc-1.5.4.1-0.11.5
    ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
    ofed-1.5.4.1-0.11.5

Distro: SLES11 SP3

HCA:

    [binf102:fischega] $ /usr/sbin/ibstat
    CA 'mlx4_0'
            CA type: MT26428

Command line (path and LD_LIBRARY_PATH are set correctly):

    mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca btl_openib_verbose 1 -np 31 $CTF_EXEC

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Tuesday, June 24, 2014 6:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

What are your threading options for OpenMPI (when it was built)? I have seen the OpenIB BTL completely lock up before when some level of threading is enabled.

Maxime Boissonneault

On 2014-06-24 18:18, Fischer, Greg A. wrote:

Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved. However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right. I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.)

Thanks,
Greg

users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/06/24697.php

--
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[Attachment: "grep thread" on the configure output]

    checking for a thread-safe mkdir -p... /bin/mkdir -p
    checking pthread.h usability... yes
    checking pthread.h presence... yes
    checking for pthread.h... yes
    checking if C compiler and POSIX threads work as is... no
    checking if C++ compiler and POSIX threads work as is... no
    checking if Fortran compiler and POSIX threads work as is... no
    checking if C compiler and POSIX threads work with -Kthread... no
    checking if C compiler and POSIX threads work with -kthread... no
    checking if C compiler and POSIX threads work with -pthread... yes
    checking if C++ compiler and POSIX threads work with -Kthread... no
    checking if C++ compiler and POSIX threads work with -kthread... no
    checking if C++ compiler and POSIX threads work with -pthread... yes
    checking if Fortran compiler and POSIX threads work with -Kthread... no
    checking if Fortran compiler and POSIX threads work with -kthread... no
    checking if Fortran compiler and POSIX threads work with -pthread... yes
    checking for pthread_mutexattr_setpshared... yes
    checking for pthread_condattr_setpshared... yes
    checking for working POSIX threads package... yes
    checking for type of thread support... posix
    checking if threads have different pids (pthreads on linux)... no
    checking for pthread_t... yes
    checking pthread_np.h usability... no
    checking pthread_np.h presence... no
    checking for pthread_np.h... no
    checking whether pthread_setaffinity_np is declared... yes
    checking whether pthread_getaffinity_np is declared... yes
    checking for library containing pthread_getthrds_np... no
    checking for pthread_mutex_lock... yes
    checking libevent configuration args... --disable-dns --disable-http --disable-rpc --disable-openssl --enable-thread-support --disable-evport
    configure: running /bin/sh '../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/configure' --disable-dns --disable-http --disable-rpc --disable-openssl --enable-thread-support --disable-evport '--prefix=/casl/vera_ib/gcc-4.8.3/toolset/openmpi-1.8.1' --cache-file=/dev/null --srcdir=../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent --disable-option-checking
    checking for a thread-safe mkdir -p... /bin/mkdir -p
    checking for the pthreads library -lpthreads... no
    checking whether pthreads work without any flags... yes
    checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
    checking if more special flags are required for pthreads... no
    checking size of pthread_t... 8
    config.status: creating libevent_pthreads.pc
    checking for thread support (needed for rdmacm/udcm)... posix
    configure: running /bin/sh '../../../../../../openmpi-1.8
[OMPI users] poor performance using the openib btl
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). The issues were related to Torque imposing restrictive limits on locked memory, and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing abysmal performance over the openib layer. Applications run with the tcp btl execute about 10x faster than with the openib btl. Clearly something still isn't quite right. I tried running with "-mca btl_openib_verbose 1", but didn't see anything resembling a smoking gun. How should I go about determining the source of the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.)

Thanks,
Greg
Re: [OMPI users] openib segfaults with Torque
This sounds credible. When I login via Torque, I see the following:

    [binf316:fischega] $ ulimit -l
    64

but when I login via ssh, I see:

    [binf316:fischega] $ ulimit -l
    unlimited

I'll have my administrator make the changes and give that a shot. Thanks, everyone!

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 11, 2014 7:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

If that could help, Greg, on the compute nodes I normally add this to /etc/security/limits.conf:

    * - memlock -1
    * - stack -1
    * - nofile 32768

and

    ulimit -n 32768
    ulimit -l unlimited
    ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which should be sourced by the former). Other values are possible, of course. My recollection is that the boilerplate init scripts that come with Torque don't change those limits. I suppose this makes the pbs_mom child processes, including the user job script and whatever processes it starts (mpiexec, etc.), inherit those limits. Or not?

Gus Correa

On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:
> +1
>
> On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yeah, I think we've seen that somewhere before too...
>>
>> On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Agreed. The problem is not with UDCM. I don't think something is wrong with the system. I think his Torque is imposing major constraints on the maximum size that can be locked into memory.
>>>
>>> Josh
>>>
>>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>> Probably won't help to use RDMACM though, as you will just see the resource failure somewhere else. UDCM is not the problem. Something is wrong with the system. Allocating a 512-entry CQ should not fail.
>>>
>>> -Nathan
>>>
>>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>>>> I'm guessing it's a resource limitation issue coming from Torque.
>>>>
>>>> Hmm...I found something interesting on the interwebs that looks awfully similar:
>>>>
>>>> http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>>>>
>>>> Greg, if the suggestion from the Torque users ("...adding the following line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't resolve your issue, try using the RDMACM CPC (instead of UDCM, which is a pretty recent addition to the openib BTL) by setting:
>>>>
>>>> -mca btl_openib_cpc_include rdmacm
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>
>>>> Mellanox --
>>>>
>>>> What would cause a CQ to fail to be created?
>>>>
>>>> On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." <fisch...@westinghouse.com> wrote:
>>>>
>>>> > Is there any other work-around that I might try? Something that avoids UDCM?
>>>> >
>>>> > -----Original Message-----
>>>> > From: Fischer, Greg A.
>>>> > Sent: Tuesday, June 10, 2014 2:59 PM
>>>> > To: Nathan Hjelm
>>>> > Cc: Open MPI Users; Fischer, Greg A.
>>>> > Subject: RE: [OMPI users] openib segfaults with Torque
>>>> >
>>>> > [binf316:fischega] $ ulimit -m
>>>> > unlimited
>>>> >
>>>> > Greg
>>>> >
>>>> > -----Original Message-----
>>>> > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>> > Sent: Tuesday, June 10, 2014 2:58 PM
>>>> > To: Fischer, Greg A.
>>>> > Cc: Open MPI Users
>>>> > Subject: Re: [OMPI users] openib segfaults with Torque
>>>> >
>>>> > Out of curiosity what is the
Re: [OMPI users] openib segfaults with Torque
Is there any other work around that I might try? Something that avoids UDCM? -Original Message- From: Fischer, Greg A. Sent: Tuesday, June 10, 2014 2:59 PM To: Nathan Hjelm Cc: Open MPI Users; Fischer, Greg A. Subject: RE: [OMPI users] openib segfaults with Torque [binf316:fischega] $ ulimit -m unlimited Greg -Original Message- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:58 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Out of curiosity what is the mlock limit on your system? If it is too low that can cause ibv_create_cq to fail. To check run ulimit -m. -Nathan Hjelm Application Readiness, HPC-5, LANL On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote: > Yes, this fails on all nodes on the system, except for the head node. > > The uptime of the system isn't significant. Maybe 1 week, and it's received > basically no use. > > -Original Message- > From: Nathan Hjelm [mailto:hje...@lanl.gov] > Sent: Tuesday, June 10, 2014 2:49 PM > To: Fischer, Greg A. > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > > Well, thats interesting. The output shows that ibv_create_cq is failing. > Strange since an identical call had just succeeded (udcm creates two > completion queues). Some questions that might indicate where the failure > might be: > > Does this fail on any other node in your system? > > How long has the node been up? > > -Nathan Hjelm > Application Readiness, HPC-5, LANL > > On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote: > > Jeff/Nathan, > > > > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a > > terminal on a compute node with "qsub -l nodes 2 -I": > > > > mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 > > ring_c &> output.txt > > > > Output and backtrace are attached. Let me know if I can provide anything > > else. 
> > > > Thanks for looking into this, > > Greg > > > > -Original Message- > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff > > Squyres (jsquyres) > > Sent: Tuesday, June 10, 2014 10:31 AM > > To: Nathan Hjelm > > Cc: Open MPI Users > > Subject: Re: [OMPI users] openib segfaults with Torque > > > > Greg: > > > > Can you run with "--mca btl_base_verbose 100" on your debug build so that > > we can get some additional output to see why UDCM is failing to setup > > properly? > > > > > > > > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote: > > > > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: > > >> I seem to recall that you have an IB-based cluster, right? > > >> > > >> From a *very quick* glance at the code, it looks like this might be a > > >> simple incorrect-finalization issue. That is: > > >> > > >> - you run the job on a single server > > >> - openib disqualifies itself because you're running on a single > > >> server > > >> - openib then goes to finalize/close itself > > >> - but openib didn't fully initialize itself (because it > > >> disqualified itself early in the initialization process), and > > >> something in the finalization process didn't take that into > > >> account > > >> > > >> Nathan -- is that anywhere close to correct? > > > > > > Nope. udcm_module_finalize is being called because there was an > > > error setting up the udcm state. See btl_openib_connect_udcm.c:476. > > > The opal_list_t destructor is getting an assert failure. Probably > > > because the constructor wasn't called. I can rearrange the > > > constructors to be called first but there appears to be a deeper > > > issue with the user's > > > system: udcm_module_init should not be failing! It creates a > > > couple of CQs, allocates a small number of registered bufferes and > > > starts monitoring the fd for the completion channel. All these > > > things are also done in the setup of the openib btl itself. 
Keep
> > > in mind that the openib btl will not disqualify itself when running
> > > single server. Openib may be used to communicate on node and is
> > > needed for the dynamics case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor the user mailing list. If something needs
> > > my attention forward it to me directly.
Re: [OMPI users] openib segfaults with Torque
[binf316:fischega] $ ulimit -m unlimited Greg -Original Message- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:58 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Out of curiosity what is the mlock limit on your system? If it is too low that can cause ibv_create_cq to fail. To check run ulimit -m. -Nathan Hjelm Application Readiness, HPC-5, LANL On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote: > Yes, this fails on all nodes on the system, except for the head node. > > The uptime of the system isn't significant. Maybe 1 week, and it's received > basically no use. > > -Original Message- > From: Nathan Hjelm [mailto:hje...@lanl.gov] > Sent: Tuesday, June 10, 2014 2:49 PM > To: Fischer, Greg A. > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > > Well, thats interesting. The output shows that ibv_create_cq is failing. > Strange since an identical call had just succeeded (udcm creates two > completion queues). Some questions that might indicate where the failure > might be: > > Does this fail on any other node in your system? > > How long has the node been up? > > -Nathan Hjelm > Application Readiness, HPC-5, LANL > > On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote: > > Jeff/Nathan, > > > > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a > > terminal on a compute node with "qsub -l nodes 2 -I": > > > > mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 > > ring_c &> output.txt > > > > Output and backtrace are attached. Let me know if I can provide anything > > else. 
> > > > Thanks for looking into this, > > Greg > > > > -Original Message- > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff > > Squyres (jsquyres) > > Sent: Tuesday, June 10, 2014 10:31 AM > > To: Nathan Hjelm > > Cc: Open MPI Users > > Subject: Re: [OMPI users] openib segfaults with Torque > > > > Greg: > > > > Can you run with "--mca btl_base_verbose 100" on your debug build so that > > we can get some additional output to see why UDCM is failing to setup > > properly? > > > > > > > > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote: > > > > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: > > >> I seem to recall that you have an IB-based cluster, right? > > >> > > >> From a *very quick* glance at the code, it looks like this might be a > > >> simple incorrect-finalization issue. That is: > > >> > > >> - you run the job on a single server > > >> - openib disqualifies itself because you're running on a single > > >> server > > >> - openib then goes to finalize/close itself > > >> - but openib didn't fully initialize itself (because it > > >> disqualified itself early in the initialization process), and > > >> something in the finalization process didn't take that into > > >> account > > >> > > >> Nathan -- is that anywhere close to correct? > > > > > > Nope. udcm_module_finalize is being called because there was an > > > error setting up the udcm state. See btl_openib_connect_udcm.c:476. > > > The opal_list_t destructor is getting an assert failure. Probably > > > because the constructor wasn't called. I can rearrange the > > > constructors to be called first but there appears to be a deeper > > > issue with the user's > > > system: udcm_module_init should not be failing! It creates a > > > couple of CQs, allocates a small number of registered bufferes and > > > starts monitoring the fd for the completion channel. All these > > > things are also done in the setup of the openib btl itself. 
Keep > > > in mind that the openib btl will not disqualify itself when running > > > single server. > > > Openib may be used to communicate on node and is needed for the dynamics > > > case. > > > > > > The user might try adding -mca btl_base_verbose 100 to shed some > > > light on what the real issue is. > > > > > > BTW, I no longer monitor the user mailing list. If something needs > > > my attention forward it to me directly. > > > > > > -Nathan > > > > > > -- > > Jeff Squyres > > jsquy...@cisco.com > > For corporate legal information go to: > > h
Re: [OMPI users] openib segfaults with Torque
Yes, this fails on all nodes on the system, except for the head node. The uptime of the system isn't significant. Maybe 1 week, and it's received basically no use. -Original Message- From: Nathan Hjelm [mailto:hje...@lanl.gov] Sent: Tuesday, June 10, 2014 2:49 PM To: Fischer, Greg A. Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Well, thats interesting. The output shows that ibv_create_cq is failing. Strange since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failure might be: Does this fail on any other node in your system? How long has the node been up? -Nathan Hjelm Application Readiness, HPC-5, LANL On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote: > Jeff/Nathan, > > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a > terminal on a compute node with "qsub -l nodes 2 -I": > > mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> > output.txt > > Output and backtrace are attached. Let me know if I can provide anything else. > > Thanks for looking into this, > Greg > > -Original Message- > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff > Squyres (jsquyres) > Sent: Tuesday, June 10, 2014 10:31 AM > To: Nathan Hjelm > Cc: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Greg: > > Can you run with "--mca btl_base_verbose 100" on your debug build so that we > can get some additional output to see why UDCM is failing to setup properly? > > > > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote: > > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: > >> I seem to recall that you have an IB-based cluster, right? > >> > >> From a *very quick* glance at the code, it looks like this might be a > >> simple incorrect-finalization issue. 
That is: > >> > >> - you run the job on a single server > >> - openib disqualifies itself because you're running on a single > >> server > >> - openib then goes to finalize/close itself > >> - but openib didn't fully initialize itself (because it > >> disqualified itself early in the initialization process), and > >> something in the finalization process didn't take that into account > >> > >> Nathan -- is that anywhere close to correct? > > > > Nope. udcm_module_finalize is being called because there was an > > error setting up the udcm state. See btl_openib_connect_udcm.c:476. > > The opal_list_t destructor is getting an assert failure. Probably > > because the constructor wasn't called. I can rearrange the > > constructors to be called first but there appears to be a deeper > > issue with the user's > > system: udcm_module_init should not be failing! It creates a couple > > of CQs, allocates a small number of registered bufferes and starts > > monitoring the fd for the completion channel. All these things are > > also done in the setup of the openib btl itself. Keep in mind that > > the openib btl will not disqualify itself when running single server. > > Openib may be used to communicate on node and is needed for the dynamics > > case. > > > > The user might try adding -mca btl_base_verbose 100 to shed some > > light on what the real issue is. > > > > BTW, I no longer monitor the user mailing list. If something needs > > my attention forward it to me directly. > > > > -Nathan > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > Core was generated by `ring_c'. > Program terminated with signal 6, Aborted. 
> #0 0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6 > #0 0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6 > #1 0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6 > #2 0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6 > #3 0x7f8b664b684b in udcm_module_finalize (btl=0x717060, > cpc=0x7190c0) at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co > nnect_udcm.c:734 > #4 0x7f8b664b5474 in udcm_component_query (btl=0x717060, > cpc=0x718a48) at > ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co > nnect_udcm.c:476 > #5 0x7f8b664ae316 in > ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at > ..
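One caveat for anyone replaying the check suggested above: `ulimit -m` reports the maximum resident set size (RLIMIT_RSS), which modern Linux kernels do not actually enforce; the limit that gates `ibv_create_cq` and other verbs memory registration is the locked-memory limit, `ulimit -l` (RLIMIT_MEMLOCK). A quick side-by-side comparison:

```shell
# The two limits are easy to conflate; only the second one matters
# for InfiniBand memory registration.
echo "max resident set (ulimit -m): $(ulimit -m)"
echo "locked memory    (ulimit -l): $(ulimit -l)"
```

This is why `ulimit -m` can report "unlimited" on a node where CQ creation still fails: the later messages in this thread show `ulimit -l` returning 64 inside the Torque session.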
Re: [OMPI users] openib segfaults with Torque
Jeff/Nathan, I ran the following with my debug build of OpenMPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes 2 -I": mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt Output and backtrace are attached. Let me know if I can provide anything else. Thanks for looking into this, Greg -----Original Message----- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres) Sent: Tuesday, June 10, 2014 10:31 AM To: Nathan Hjelm Cc: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Greg: Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to setup properly? On Jun 10, 2014, at 10:25 AM, Nathan Hjelm wrote: > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote: >> I seem to recall that you have an IB-based cluster, right? >> >> From a *very quick* glance at the code, it looks like this might be a simple >> incorrect-finalization issue. That is: >> >> - you run the job on a single server >> - openib disqualifies itself because you're running on a single >> server >> - openib then goes to finalize/close itself >> - but openib didn't fully initialize itself (because it disqualified >> itself early in the initialization process), and something in the >> finalization process didn't take that into account >> >> Nathan -- is that anywhere close to correct? > > Nope. udcm_module_finalize is being called because there was an error > setting up the udcm state. See btl_openib_connect_udcm.c:476. The > opal_list_t destructor is getting an assert failure. Probably because > the constructor wasn't called. I can rearrange the constructors to be > called first but there appears to be a deeper issue with the user's > system: udcm_module_init should not be failing! It creates a couple of > CQs, allocates a small number of registered bufferes and starts > monitoring the fd for the completion channel. 
All these things are > also done in the setup of the openib btl itself. Keep in mind that the > openib btl will not disqualify itself when running single server. > Openib may be used to communicate on node and is needed for the dynamics case. > > The user might try adding -mca btl_base_verbose 100 to shed some light > on what the real issue is. > > BTW, I no longer monitor the user mailing list. If something needs my > attention forward it to me directly. > > -Nathan -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users Core was generated by `ring_c'. Program terminated with signal 6, Aborted. #0 0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6 #0 0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6 #1 0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6 #2 0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6 #3 0x7f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734 #4 0x7f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476 #5 0x7f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273 #6 0x7f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703 #7 0x7f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108 #8 0x7f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, 
enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88 #9 0x7f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69 #10 0x7f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271 #11 0x7f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128 #12 0x7f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604 #13
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
Yes, it should be possible for me to get an upgraded Intel compiler on that system. However, as you suggest, I'm more focused on getting it working with GCC right now. -Original Message- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres) Sent: Monday, June 09, 2014 8:24 PM To: Open MPI Users Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c I'm digging out from mail backlog from being at the MPI Forum last week... Yes, from looking at the stack traces, it's segv'ing inside the memory allocator, which typically means some other memory error occurred before this. I.e., this particular segv is a symptom of the problem, not the actual problem. Are you able to upgrade your Intel compiler to avoid this issue? (I'm guessing the UDCM issues you reported later were with the gcc-compiled Open MPI) On Jun 4, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote: > Aha!! I found this in our users mailing list archives: > > http://www.open-mpi.org/community/lists/users/2012/01/18091.php > > Looks like this is a known compiler vectorization issue. > > > On Jun 4, 2014, at 1:52 PM, Fischer, Greg A. <fisch...@westinghouse.com> > wrote: > >> Ralph, >> >> Thanks for looking. Let me know if there's any other testing that I can do. >> >> I recompiled with GCC and it works fine, so that lends credence to your >> theory that it has something to do with the Intel compilers, and possibly >> their interplay with SUSE. >> >> Greg >> >> -Original Message- >> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph >> Castain >> Sent: Wednesday, June 04, 2014 4:48 PM >> To: Open MPI Users >> Subject: Re: [OMPI users] intermittent segfaults with openib on >> ring_c.c >> >> Urggg...unfortunately, the people who know the most about that code are >> all at the MPI Forum this week, so we may not be able to fully address it >> until their return. 
It looks like you are still going down into that malloc >> interceptor, so I'm not correctly blocking it for you. >> >> This run segfaulted in a completely different call in a different part of >> the startup procedure - but in the same part of the interceptor, which makes >> me suspicious. Don't know how much testing we've seen on SLES... >> >> >> On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fisch...@westinghouse.com> >> wrote: >> >>> Ralph, >>> >>> It segfaults. Here's the backtrace: >>> >>> Core was generated by `ring_c'. >>> Program terminated with signal 11, Segmentation fault. >>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, >>> bytes=47840385564856) at >>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098 >>> 4098 bck->fd = unsorted_chunks(av); >>> (gdb) bt >>> #0 opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, >>> bytes=47840385564856) at >>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098 >>> #1 0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc >>> (bytes=47840385564704) at >>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433 >>> #2 0x2b82b1a47b36 in opal_memory_linux_malloc_hook >>> (sz=47840385564704, caller=0x2b82b53000b8) at >>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691 >>> #3 0x2b82b19e7b18 in opal_malloc (size=47840385564704, >>> file=0x2b82b53000b8 "", line=12) at >>> ../../../openmpi-1.8.1/opal/util/malloc.c:101 >>> #4 0x2b82b199c017 in opal_hash_table_set_value_uint64 >>> (ht=0x2b82b5300020, key=47840385564856, value=0xc) at >>> ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283 >>> #5 0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at >>> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348 >>> #6 0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440, >>> args=184, cbdata=0xc) at >>> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296 >>> #7 0x2b82b19fba1c in event_process_active_single_queue >>> (base=0x655480, 
activeq=0x654920) at >>> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent >>> /e >>> vent.c:1367 >>> #8 0x2b82b19fbcd9 in event_process_active (base=0x655480) at >>> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent >>> /e >>> vent.c:1437 >>> #9 0x2b82b19fc4c3 in opal_libevent2021_event_base_loop >>> (base=0x655480, f
Re: [OMPI users] openib segfaults with Torque
Yep, TCP works fine when launched via Torque/qsub: [binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Process 1 exiting From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Friday, June 06, 2014 10:34 AM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this is failing when the openib BTL is trying to create the connection, which comes way after the launch is complete. Are you able to run this with btl tcp,sm,self? If so, that would confirm that everything else is correct, and the problem truly is limited to the udcm itself...which shouldn't have anything to do with how the proc was launched. On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Here are the results when logging in to the compute node via ssh and running as you suggest: [binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c Process 0 sending 10 to 1, tag 201 (2 processes in ring) Process 0 sent to 1 Process 0 decremented value: 9 Process 0 decremented value: 8 Process 0 decremented value: 7 Process 0 decremented value: 6 Process 0 decremented value: 5 Process 0 decremented value: 4 Process 0 decremented value: 3 Process 0 decremented value: 2 Process 0 decremented value: 1 Process 0 decremented value: 0 Process 0 exiting Process 1 exiting Here are the results when executing over Torque (launch the shell with "qsub -l nodes=2 -I"): [binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. [binf316:21584] *** Process received signal *** [binf316:21584] Signal: Aborted (6) [binf316:21584] Signal code: (-6) ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. 
[binf316:21583] *** Process received signal *** [binf316:21583] Signal: Aborted (6) [binf316:21583] Signal code: (-6) [binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0] [binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55] [binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131] [binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10] [binf316:21584] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b] [binf316:21584] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474] [binf316:21584] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316] [binf316:21584] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817] [binf316:21584] [ 8] [binf316:21583] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0] [binf316:21583] [ 1] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e] [binf316:21584] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42] [binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55] [binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131] [binf316:21583] [ 3] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fe33a531d1b] [binf316:21584] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fe3344e7739] [binf316:21584] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f3b5830ea10] [binf316:21583] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f3b539af84b] [binf316:21583] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f3b539ae474] [binf316:21583] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f3b539a7316] [binf316:21583] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_open
Re: [OMPI users] openib segfaults with Torque
** //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c] [binf316:21583] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386] [binf316:21583] [15] ring_c[0x40096f] [binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] [binf316:21583] [17] ring_c[0x400889] [binf316:21583] *** End of error message *** -- mpirun noticed that process rank 0 with PID 21583 on node 316 exited on signal 6 (Aborted). -- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain Sent: Thursday, June 05, 2014 7:57 PM To: Open MPI Users Subject: Re: [OMPI users] openib segfaults with Torque Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote: Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.) [binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (>cm_recv_msg_queue))->obj_magic_id' failed. 
[binf316:04549] *** Process received signal *** [binf316:04549] Signal: Aborted (6) [binf316:04549] Signal code: (-6) [binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0] [binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55] [binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131] [binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10] [binf316:04549] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b] [binf316:04549] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474] [binf316:04549] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316] [binf316:04549] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817] [binf316:04549] [ 8] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e] [binf316:04549] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42] [binf316:04549] [10] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b] [binf316:04549] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739] [binf316:04549] [12] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2] [binf316:04549] [13] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c] [binf316:04549] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386] [binf316:04549] [15] ring_c[0x40096f] [binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36] [binf316:04549] [17] ring_c[0x400889] [binf316:04549] *** End of error message *** -- mpirun 
noticed that process rank 0 with PID 4549 on node 316 exited on signal 6 (Aborted). ------ From: Fischer, Greg A. Sent: Thursday, June 05, 2014 5:10 PM To: us...@open-mpi.org<mailto:us...@open-mpi.org> Cc: Fischer, Greg A. Subject: openib segfaults with Torque OpenMPI Users, After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies. After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue? Core was generated by `ring_c'. Program termin
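For reference, a debug build along the lines Greg describes would look roughly like this (`--enable-debug` and `--enable-mem-debug` are standard Open MPI configure switches; the install prefix and compiler choice here are assumptions, not taken from the thread):

```shell
# Rebuild Open MPI 1.8.1 with debugging enabled, so assertions fire
# and backtraces carry symbols; install to a separate prefix so the
# production build stays untouched.
cd openmpi-1.8.1
./configure --prefix=$HOME/toolset/openmpi-1.8.1_debug \
            --enable-debug --enable-mem-debug CC=gcc CXX=g++
make -j4 && make install
```

The `udcm_module_finalize` assertion shown above only trips in such a debug build; an optimized build would typically just segfault later with less context.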
Re: [OMPI users] openib segfaults with Torque
Here's the command I'm invoking and the terminal output. (Some of this information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code: (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] //_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 4549 on node 316 exited on signal 6 (Aborted).
--

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7
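The assertion that fires above compares the header of `cm_recv_msg_queue` against OPAL's object magic-id constant, so the failure means udcm_module_finalize is tearing down a queue object whose header was never constructed (or was already destructed). The constant itself is easy to reproduce; a small shell sketch of the arithmetic from the assertion text:

```shell
# Compute the obj_magic_id constant from the failed assertion:
# (0xdeafbeedULL << 32) + 0xdeafbeedULL
printf '0x%x\n' $(( (0xdeafbeed << 32) + 0xdeafbeed ))
```

Any object whose header does not hold this value when the destructor runs trips the debug-build assertion, which is why the error only shows up in the `openmpi-1.8.1_debug` build.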
[OMPI users] openib segfaults with Torque
OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL in a typical BASH environment. Everything appeared to work fine, so I went on my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing segfaults when submitting the applications through Torque. I recompiled OpenMPI with debug options, ran "ring_c" over the openib BTL in an interactive Torque session ("qsub -I"), and got the backtrace below. All other system settings described in the previous thread are the same. Any thoughts on how to resolve this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x716680) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init (num_btl_modules=0x7fff906aa420, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, requested=0, provided=0x7fff906aa7d8) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) at pinit.c:84
#14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19

Greg
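One hypothesis worth checking for failures that appear only under Torque (and not in a plain login shell) is the locked-memory limit: the pbs_mom daemon can impose its own ulimit settings on the processes it spawns, and openib needs a large memlock limit to register memory. This is a hedged diagnostic sketch, not a confirmed diagnosis; it is cheap to run from inside the "qsub -I" session:

```shell
# Check the memlock limit as seen by processes launched inside the
# Torque session; openib wants this to be "unlimited" (or very large).
limit=$(ulimit -l)
if [ "$limit" = "unlimited" ]; then
  echo "memlock ok: $limit"
else
  echo "memlock limited to ${limit} KB; openib memory registration may fail"
fi
```

If the limit differs between the interactive shell and the Torque session, the fix is usually in the pbs_mom init script or /etc/security/limits.conf on the compute nodes, not in Open MPI itself.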
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
Ralph,

Thanks for looking. Let me know if there's any other testing that I can do. I recompiled with GCC and it works fine, so that lends credence to your theory that it has something to do with the Intel compilers, and possibly their interplay with SUSE.

Greg

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 4:48 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Urggg...unfortunately, the people who know the most about that code are all at the MPI Forum this week, so we may not be able to fully address it until their return.

It looks like you are still going down into that malloc interceptor, so I'm not correctly blocking it for you. This run segfaulted in a completely different call in a different part of the startup procedure - but in the same part of the interceptor, which makes me suspicious. Don't know how much testing we've seen on SLES...

On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Ralph,
>
> It segfaults. Here's the backtrace:
>
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098          bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1  0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc (bytes=47840385564704) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
> #2  0x2b82b1a47b36 in opal_memory_linux_malloc_hook (sz=47840385564704, caller=0x2b82b53000b8) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
> #3  0x2b82b19e7b18 in opal_malloc (size=47840385564704, file=0x2b82b53000b8 "", line=12) at ../../../openmpi-1.8.1/opal/util/malloc.c:101
> #4  0x2b82b199c017 in opal_hash_table_set_value_uint64 (ht=0x2b82b5300020, key=47840385564856, value=0xc) at ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
> #5  0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
> #6  0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440, args=184, cbdata=0xc) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
> #7  0x2b82b19fba1c in event_process_active_single_queue (base=0x655480, activeq=0x654920) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1367
> #8  0x2b82b19fbcd9 in event_process_active (base=0x655480) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1437
> #9  0x2b82b19fc4c3 in opal_libevent2021_event_base_loop (base=0x655480, flags=1) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1645
> #10 0x2b82b16f8763 in orte_progress_thread_engine (obj=0x2b82b5300020) at ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456
> #11 0x2b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0
> #12 0x2b82b1410d6d in clone () from /lib64/libc.so.6
> #13 0x in ?? ()
>
> Greg
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 04, 2014 3:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c
>
> Sorry for delay - digging my way out of the backlog. This is very strange as you are failing in a simple asprintf call. We check that all the players are non-NULL, and it appears that you are failing to allocate the memory for the resulting (rather short) string.
>
> I'm wondering if this is some strange interaction between SLES, the Intel compiler, and our malloc interceptor - or if there is some difference between the malloc libraries on the two machines. Let's try running it without the malloc interceptor and see if that helps.
>
> Try running with "-mca memory ^linux" on your cmd line
>
> On Jun 4, 2014, at 9:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> He isn't getting that far - he's failing in MPI_Init when the RTE attempts to connect to the local daemon
>>
>> On Jun 4, 2014, at 9:53 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Greg
>>>
>>> From your original email:
>>>
>>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,s
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
Ralph,

It segfaults. Here's the backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098          bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc (bytes=47840385564704) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
#2  0x2b82b1a47b36 in opal_memory_linux_malloc_hook (sz=47840385564704, caller=0x2b82b53000b8) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
#3  0x2b82b19e7b18 in opal_malloc (size=47840385564704, file=0x2b82b53000b8 "", line=12) at ../../../openmpi-1.8.1/opal/util/malloc.c:101
#4  0x2b82b199c017 in opal_hash_table_set_value_uint64 (ht=0x2b82b5300020, key=47840385564856, value=0xc) at ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
#5  0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
#6  0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440, args=184, cbdata=0xc) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
#7  0x2b82b19fba1c in event_process_active_single_queue (base=0x655480, activeq=0x654920) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1367
#8  0x2b82b19fbcd9 in event_process_active (base=0x655480) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1437
#9  0x2b82b19fc4c3 in opal_libevent2021_event_base_loop (base=0x655480, flags=1) at ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1645
#10 0x2b82b16f8763 in orte_progress_thread_engine (obj=0x2b82b5300020) at ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456
#11 0x2b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0
#12 0x2b82b1410d6d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

Greg

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 3:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sorry for delay - digging my way out of the backlog. This is very strange as you are failing in a simple asprintf call. We check that all the players are non-NULL, and it appears that you are failing to allocate the memory for the resulting (rather short) string.

I'm wondering if this is some strange interaction between SLES, the Intel compiler, and our malloc interceptor - or if there is some difference between the malloc libraries on the two machines. Let's try running it without the malloc interceptor and see if that helps.

Try running with "-mca memory ^linux" on your cmd line

On Jun 4, 2014, at 9:58 AM, Ralph Castain <r...@open-mpi.org> wrote:

> He isn't getting that far - he's failing in MPI_Init when the RTE attempts to connect to the local daemon
>
> On Jun 4, 2014, at 9:53 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Greg
>>
>> From your original email:
>>
>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>>
>> This may not fix the problem, but have you tried to add the shared memory btl to your mca parameter?
>>
>> mpirun -np 2 --mca btl openib,sm,self ring_c
>>
>> As far as I know, sm is the preferred transport layer for intra-node communication.
>>
>> Gus Correa
>>
>> On 06/04/2014 11:13 AM, Ralph Castain wrote:
>>> Thanks!! Really appreciate your help - I'll try to figure out what went wrong and get back to you
>>>
>>> On Jun 4, 2014, at 8:07 AM, Fischer, Greg A. <fisch...@westinghouse.com <mailto:fisch...@westinghouse.com>> wrote:
>>>
>>>> I re-ran with 1 processor and got more information. How about this?
>>>>
>>>> Core was generated by `ring_c'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>> 4098          bck->fd = unsorted_chunks(av);
>>>> (gdb) bt
>>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>> #1  0x2b48f2a15e38 in opal_memory_ptmalloc2_malloc >
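Ralph's suggestion above ("-mca memory ^linux") disables Open MPI's ptmalloc2 malloc interceptor so the run falls back to the system allocator. A sketch of how to apply it and confirm the component exists in the build (binary name and process count taken from the thread; output depends on the local installation):

```shell
# Rerun the failing case with the Linux memory interceptor excluded,
# per the suggestion in the reply above:
mpirun -np 2 -mca memory ^linux --mca btl openib,self ring_c

# Confirm whether this Open MPI build actually contains the "linux"
# memory component that is being excluded:
ompi_info | grep -i "MCA memory"
```

If the segfault disappears with the interceptor excluded, that narrows the fault to the ptmalloc2 hooks rather than the openib BTL itself.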
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
I re-ran with 1 processor and got more information. How about this?

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098          bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x2b48f2a15e38 in opal_memory_ptmalloc2_malloc (bytes=47592367980576) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
#2  0x2b48f2a15b36 in opal_memory_linux_malloc_hook (sz=47592367980576, caller=0x2b48f63000b8) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
#3  0x2b48f2374b90 in vasprintf () from /lib64/libc.so.6
#4  0x2b48f2354148 in asprintf () from /lib64/libc.so.6
#5  0x2b48f26dc7d1 in orte_oob_base_get_addr (uri=0x2b48f6300020) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:234
#6  0x2b48f53e7d4a in orte_rml_oob_get_uri () at ../../../../../openmpi-1.8.1/orte/mca/rml/oob/rml_oob_contact.c:36
#7  0x2b48f26fa181 in orte_routed_base_register_sync (setup=32 ' ') at ../../../../openmpi-1.8.1/orte/mca/routed/base/routed_base_fns.c:301
#8  0x2b48f4bbcccf in init_routes (job=4130340896, ndat=0x2b48f63000b8) at ../../../../../openmpi-1.8.1/orte/mca/routed/binomial/routed_binomial.c:705
#9  0x2b48f26c615d in orte_ess_base_app_setup (db_restrict_local=32 ' ') at ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:245
#10 0x2b48f45b069f in rte_init () at ../../../../../openmpi-1.8.1/orte/mca/ess/env/ess_env_module.c:146
#11 0x2b48f26935ab in orte_init (pargc=0x2b48f6300020, pargv=0x2b48f63000b8, flags=8) at ../../openmpi-1.8.1/orte/runtime/orte_init.c:148
#12 0x2b48f1739d38 in ompi_mpi_init (argc=1, argv=0x7fffebf0d1f8, requested=8, provided=0x0) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:464
#13 0x2b48f1760a37 in PMPI_Init (argc=0x2b48f6300020, argv=0x2b48f63000b8) at pinit.c:84
#14 0x004024ef in main (argc=1, argv=0x7fffebf0d1f8) at ring_c.c:19

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 11:00 AM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Does the trace go any further back? Your prior trace seemed to indicate an error in our OOB framework, but in a very basic place. Looks like it could be an uninitialized variable, and having the line number down as deep as possible might help identify the source

On Jun 4, 2014, at 7:55 AM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:

Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and ran a backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098          bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x in ?? ()

Is that helpful?

Greg

From: Fischer, Greg A.
Sent: Wednesday, June 04, 2014 10:17 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c

I recompiled with "--enable-debug" but it doesn't seem to be providing any more information or a core dump.

I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and ran a backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098          bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x in ?? ()

Is that helpful?

Greg

From: Fischer, Greg A.
Sent: Wednesday, June 04, 2014 10:17 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c

I recompiled with "--enable-debug" but it doesn't seem to be providing any more information or a core dump.

I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
[binf112:05845] [ 8] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
[binf112:05845] [ 9] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
[binf112:05845] [10] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
[binf112:05845] [11] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
[binf112:05845] [12] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
[binf112:05845] [13] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
[binf112:05845] [14] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
[binf112:05845] [15] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
[binf112:05845] [16] ring_c[0x4024ef]
[binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
[binf112:05845] [18] ring_c[0x4023f9]
[binf112:05845] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 5845 on node 112 exited on signal 11 (Segmentation fault).
--

Does any of that help?

Greg

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, June 03, 2014 11:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sounds odd - can you configure OMPI --enable-debug and run it again? If it fails and you can get a core dump, could you tell us the line number where it is failing?

On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. wrote:

Apologies - I forgot to add some of the information requested by the FAQ:

1. OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5

2. Linux Distro / Kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux 102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux

3. Not sure which subnet manager is being used - I think OpenSM, but I'll need to check with my administrators.

4. Output of ibv_devinfo is attached.

5. Ifconfig output is attached.

6. Ul
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
I recompiled with "--enable-debug" but it doesn't seem to be providing any more information or a core dump.

I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
[binf112:05845] [ 8] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
[binf112:05845] [ 9] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
[binf112:05845] [10] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
[binf112:05845] [11] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
[binf112:05845] [12] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
[binf112:05845] [13] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
[binf112:05845] [14] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
[binf112:05845] [15] //_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
[binf112:05845] [16] ring_c[0x4024ef]
[binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
[binf112:05845] [18] ring_c[0x4023f9]
[binf112:05845] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 5845 on node 112 exited on signal 11 (Segmentation fault).
--

Does any of that help?

Greg

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, June 03, 2014 11:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sounds odd - can you configure OMPI --enable-debug and run it again? If it fails and you can get a core dump, could you tell us the line number where it is failing?

On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. <fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:

Apologies - I forgot to add some of the information requested by the FAQ:

1. OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5

2. Linux Distro / Kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux 102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux

3. Not sure which subnet manager is being used - I think OpenSM, but I'll need to check with my administrators.

4. Output of ibv_devinfo is attached.

5. Ifconfig output is attached.

6. Ulimit -l output:

[binf102:fischega] $ ulimit -l
unlimited

Greg

From: Fischer, Greg A.
Sent: Tuesday, June 03, 2014 12:38 PM
To: Open MPI Users
Cc: Fischer, Greg A.
Subject: intermittent segfaults with openib on ring_c.c

Hello openmpi-users,

I'm running into a perplexing problem on a new system, whereby I'm experiencing intermittent segmentation faults when I run the ring_c.c example and use the openib BTL. See an example below. Approximately 50% of the time it provides the expected output, but the other 50% of the time, it segfaults. LD_LIBRARY_PATH is set correctly, and the version of "mpirun" being invoked is correct. The output of ompi_info -all is attached.

One potential problem may be that the system that OpenMPI was compiled on i
Re: [OMPI users] intermittent segfaults with openib on ring_c.c
Apologies - I forgot to add some of the information requested by the FAQ: 1. OpenFabrics is provided by the Linux distribution: [binf102:fischega] $ rpm -qa | grep ofed ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5 ofed-1.5.4.1-0.11.5 ofed-doc-1.5.4.1-0.11.5 2. Linux Distro / Kernel: [binf102:fischega] $ cat /etc/SuSE-release SUSE Linux Enterprise Server 11 (x86_64) VERSION = 11 PATCHLEVEL = 3 [binf102:fischega] $ uname -a Linux casl102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux 3. Not sure which subnet manger is being used - I think OpenSM, but I'll need to check with my administrators. 4. Output of ibv_devinfo is attached. 5. Ifconfig output is attached. 6. Ulimit -l output: [binf102:fischega] $ ulimit -l unlimited Greg From: Fischer, Greg A. Sent: Tuesday, June 03, 2014 12:38 PM To: Open MPI Users Cc: Fischer, Greg A. Subject: intermittent segfaults with openib on ring_c.c Hello openmpi-users, I'm running into a perplexing problem on a new system, whereby I'm experiencing intermittent segmentation faults when I run the ring_c.c example and use the openib BTL. See an example below. Approximately 50% of the time it provides the expected output, but the other 50% of the time, it segfaults. LD_LIBRARY_PATH is set correctly, and the version of "mpirun" being invoked is correct. The output of ompi_info -all is attached. One potential problem may be that the system that OpenMPI was compiled on is mostly the same as the system where it is being executed, but there are some differences in the installed packages. I've checked the critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same. Can anyone suggest how I might start tracking this problem down? 
Thanks, Greg [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c [binf102:31268] *** Process received signal *** [binf102:31268] Signal: Segmentation fault (11) [binf102:31268] Signal code: Address not mapped (1) [binf102:31268] Failing at address: 0x10 [binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0] [binf102:31268] [ 1] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3] [binf102:31268] [ 2] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b] [binf102:31268] [ 3] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef] [binf102:31268] [ 4] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876] [binf102:31268] [ 5] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c] [binf102:31268] [ 6] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a] [binf102:31268] [ 7] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f] [binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6] [binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d] [binf102:31268] *** End of error message *** -- mpirun noticed that process rank 0 with PID 31268 on node 102 exited on signal 11 (Segmentation fault). 
--
hca_id: mlx4_0
        transport:       InfiniBand (0)
        fw_ver:          2.8.000
        node_guid:       0002:c903:0010:371e
        sys_image_guid:  0002:c903:0010:3721
        vendor_id:       0x02c9
        vendor_part_id:  26428
        hw_ver:          0xB0
        board_id:        HP_016009
        phys_port_cnt:   2
        port: 1
                state:       PORT_ACTIVE (4)
                max_mtu:     4096 (5)
                active_mtu:  4096 (5)
                sm_lid:      1
                port_lid:    21
                port_lmc:    0x00
                link_layer:  IB
        port: 2
                state:       PORT_DOWN (1)
                max_mtu:     4096 (5)
                active_mtu:  4096 (5)
                sm_lid:      0
                port_lid:    0
                port_lmc:    0x00
                link_layer:  IB

eth0    Link encap:Ethernet  HWaddr 3C:4A:92:F5:2F:B0
        inet addr:10.179.32.21  Bcast:10.179.32.255  Mask:255
[OMPI users] intermittent segfaults with openib on ring_c.c
Hello openmpi-users,

I'm running into a perplexing problem on a new system: intermittent segmentation faults when I run the ring_c.c example with the openib BTL. See the example below. Approximately 50% of the time it produces the expected output; the other 50% of the time, it segfaults. LD_LIBRARY_PATH is set correctly, and the correct version of "mpirun" is being invoked. The output of ompi_info -all is attached.

One potential problem may be that the system that OpenMPI was compiled on is mostly, but not entirely, the same as the system where it is being executed - there are some differences in the installed packages. I've checked the critical ones (libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same. Can anyone suggest how I might start tracking this problem down?

Thanks,
Greg

[binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
[binf102:31268] *** Process received signal ***
[binf102:31268] Signal: Segmentation fault (11)
[binf102:31268] Signal code: Address not mapped (1)
[binf102:31268] Failing at address: 0x10
[binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
[binf102:31268] [ 1] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3) [0x2b42203fd7e3]
[binf102:31268] [ 2] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b) [0x2b4220400d3b]
[binf102:31268] [ 3] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f) [0x2b42204008ef]
[binf102:31268] [ 4] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) [0x2b4220400876]
[binf102:31268] [ 5] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c) [0x2b422572334c]
[binf102:31268] [ 6] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa) [0x2b422041d64a]
[binf102:31268] [ 7] //_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f) [0x2b422573612f]
[binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
[binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
[binf102:31268] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 31268 on node 102 exited on signal 11 (Segmentation fault).
--

Package: Open MPI fischega@binford Distribution
Open MPI: 1.6.5
Open MPI SVN revision: r28673
Open MPI release date: Jun 26, 2013
Open RTE: 1.6.5
Open RTE SVN revision: r28673
Open RTE release date: Jun 26, 2013
OPAL: 1.6.5
OPAL SVN revision: r28673
OPAL release date: Jun 26, 2013
MPI API: 2.1
Ident string: 1.6.5
MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6.5)
MCA memory: linux (MCA v2.0, API v2.0, Component v1.6.5)
MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6.5)
MCA carto: file (MCA v2.0, API v2.0, Component v1.6.5)
MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6.5)
MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6.5)
MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6.5)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.5)
MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
MCA timer: linux (MCA v2.0, API v2.0, Component v1.6.5)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6.5)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6.5)
MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6.5)
MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6.5)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6.5)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6.5)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6.5)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: self (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6.5)
MCA io: romio (MCA v2.0, API v2.0, Component v1.6.5)
MCA mpool: fake (MCA v2.0, API v2.0,
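[Editor's note: the message above worries that package differences between the build host and the run host could be the culprit. A minimal sketch of how one might diff two `rpm -qa` captures to surface such differences; the function name and the sample package lists are hypothetical, not taken from the thread.]

```python
def missing_packages(build_list: str, run_list: str) -> list:
    """Given `rpm -qa` output captured on the build host and on the run
    host, return the packages present at build time but absent at run time."""
    build = set(build_list.split())
    run = set(run_list.split())
    return sorted(build - run)

# Hypothetical captures from the two hosts:
build_host = "libibverbs-1.1.6 librdmacm-1.0.17 libmlx4-rdmav2-1.0.5"
run_host = "libibverbs-1.1.6 libmlx4-rdmav2-1.0.5"

print(missing_packages(build_host, run_host))  # → ['librdmacm-1.0.17']
```

In practice one would capture `rpm -qa | sort` on both machines and feed the files to a comparison like this (or plain `diff`); the point is to check the verbs/RDMA stack packages named in the message first.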
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
Yep. That was the problem. It works beautifully now. Thanks for prodding me to take another look.

With regards to openmpi-1.6.5: the system that I'm compiling and running on, SLES10, contains some pretty dated software (e.g. Linux 2.6.x, python 2.4, gcc 4.1.2). Is it possible there's simply an incompatibility lurking in there somewhere that would trip up openmpi-1.6.5 but not openmpi-1.4.3?

Greg

>-Original Message-
>From: Fischer, Greg A.
>Sent: Friday, January 24, 2014 11:41 AM
>To: 'Open MPI Users'
>Cc: Fischer, Greg A.
>Subject: RE: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Hmm... It looks like CMAKE was somehow finding openmpi-1.6.5 instead of
>openmpi-1.4.3, despite the environment variables being set otherwise. This
>is likely the explanation. I'll try to chase that down.
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>Squyres (jsquyres)
>>Sent: Friday, January 24, 2014 11:39 AM
>>To: Open MPI Users
>>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>>consumes all system resources
>>
>>Ok. I only mention this because the "mca_paffinity_linux.so: undefined
>>symbol: mca_base_param_reg_int" type of message is almost always an
>>indicator of two different versions being installed into the same tree.
>>
>>On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A."
>><fisch...@westinghouse.com> wrote:
>>
>>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>>
>>> 1003 fischega@lxlogin2[~]> ls
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>>> bin etc include lib share
>>>
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>>> bin etc include lib share
>>>
>>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was
>>> set correctly, but I'll check again.
>>>
>>>> -Original Message-
>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>> Squyres (jsquyres)
>>>> Sent: Friday, January 24, 2014 11:07 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>>> and consumes all system resources
>>>>
>>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
>>>> <fisch...@westinghouse.com> wrote:
>>>>
>>>>> The reason for deleting the openmpi-1.6.5 installation was that I
>>>>> went back and installed openmpi-1.4.3 and the problem (mostly) went
>>>>> away. Openmpi-1.4.3 can run the simple tests without issue, but on
>>>>> my "real" program, I'm getting symbol lookup errors:
>>>>>
>>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>>
>>>> This sounds like you are mixing 1.6.x and 1.4.x in the same
>>>> installation tree. This can definitely lead to sadness.
>>>>
>>>> More specifically: installing 1.6 over an existing 1.4 installation
>>>> (and vice versa) is definitely NOT supported. The set of plugins that
>>>> the two install are different, and can lead to all manner of
>>>> weird/undefined behavior.
>>>>
>>>> FWIW: I typically install Open MPI into a tree by itself. And if I
>>>> later want to remove that installation, I just "rm -rf" that tree.
>>>> Then I can install a different version of OMPI into that same tree
>>>> (because the prior tree is completely gone).
>>>>
>>>> However, if you can't install OMPI into a tree by itself, you can
>>>> "make uninstall" from the source tree, and that should surgically
>>>> completely remove OMPI from the installation tree. Then it is safe
>>>> to install a different version of OMPI into that same tree.
>>>>
>>>> Can you verify that you had installed OMPI into completely clean
>>>> trees? If you didn't, I can imagine that causing the kinds of
>>>> errors that you described.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>--
>>Jeff Squyres
>>jsquy...@cisco.com
>>For corporate legal information go to:
>>http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>___
>>users mailing list
>>us...@open-mpi.org
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
Hmm... It looks like CMAKE was somehow finding openmpi-1.6.5 instead of openmpi-1.4.3, despite the environment variables being set otherwise. This is likely the explanation. I'll try to chase that down.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Friday, January 24, 2014 11:39 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Ok. I only mention this because the "mca_paffinity_linux.so: undefined
>symbol: mca_base_param_reg_int" type of message is almost always an
>indicator of two different versions being installed into the same tree.
>
>On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A."
><fisch...@westinghouse.com> wrote:
>
>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>
>> 1003 fischega@lxlogin2[~]> ls
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>> bin etc include lib share
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>> bin etc include lib share
>>
>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set
>> correctly, but I'll check again.
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>> Squyres (jsquyres)
>>> Sent: Friday, January 24, 2014 11:07 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>> and consumes all system resources
>>>
>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
>>> <fisch...@westinghouse.com> wrote:
>>>
>>>> The reason for deleting the openmpi-1.6.5 installation was that I
>>>> went back and installed openmpi-1.4.3 and the problem (mostly) went
>>>> away. Openmpi-1.4.3 can run the simple tests without issue, but on my
>>>> "real" program, I'm getting symbol lookup errors:
>>>>
>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>
>>> This sounds like you are mixing 1.6.x and 1.4.x in the same installation
>>> tree. This can definitely lead to sadness.
>>>
>>> More specifically: installing 1.6 over an existing 1.4 installation
>>> (and vice versa) is definitely NOT supported. The set of plugins that
>>> the two install are different, and can lead to all manner of
>>> weird/undefined behavior.
>>>
>>> FWIW: I typically install Open MPI into a tree by itself. And if I
>>> later want to remove that installation, I just "rm -rf" that tree.
>>> Then I can install a different version of OMPI into that same tree
>>> (because the prior tree is completely gone).
>>>
>>> However, if you can't install OMPI into a tree by itself, you can
>>> "make uninstall" from the source tree, and that should surgically
>>> completely remove OMPI from the installation tree. Then it is safe
>>> to install a different version of OMPI into that same tree.
>>>
>>> Can you verify that you had installed OMPI into completely clean
>>> trees? If you didn't, I can imagine that causing the kinds of errors
>>> that you described.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
Version 1.4.3 and 1.6.5 were and are installed in separate trees:

1003 fischega@lxlogin2[~]> ls /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
bin etc include lib share

/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
bin etc include lib share

I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set correctly, but I'll check again.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Friday, January 24, 2014 11:07 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
><fisch...@westinghouse.com> wrote:
>
>> The reason for deleting the openmpi-1.6.5 installation was that I went back
>> and installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-
>> 1.4.3 can run the simple tests without issue, but on my "real" program, I'm
>> getting symbol lookup errors:
>>
>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>
>This sounds like you are mixing 1.6.x and 1.4.x in the same installation tree.
>This can definitely lead to sadness.
>
>More specifically: installing 1.6 over an existing 1.4 installation (and vice
>versa) is definitely NOT supported. The set of plugins that the two install are
>different, and can lead to all manner of weird/undefined behavior.
>
>FWIW: I typically install Open MPI into a tree by itself. And if I later want to
>remove that installation, I just "rm -rf" that tree. Then I can install a different
>version of OMPI into that same tree (because the prior tree is completely
>gone).
>
>However, if you can't install OMPI into a tree by itself, you can "make
>uninstall" from the source tree, and that should surgically completely
>remove OMPI from the installation tree. Then it is safe to install a different
>version of OMPI into that same tree.
>
>Can you verify that you had installed OMPI into completely clean trees? If
>you didn't, I can imagine that causing the kinds of errors that you described.
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>
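[Editor's note: the mixed-installation symptom discussed above usually comes down to `mpirun` (found via PATH) and the OMPI libraries (found via LD_LIBRARY_PATH) resolving to different installation prefixes. A small sketch of that consistency check; the helper name is hypothetical, and the paths are modeled on the ones quoted in this thread.]

```python
import os

def same_ompi_prefix(mpirun_path: str, ld_library_path: str) -> bool:
    """Return True if the first LD_LIBRARY_PATH entry containing 'openmpi'
    shares its installation prefix with the given mpirun binary."""
    # Strip the trailing bin/mpirun to recover the install prefix.
    prefix = os.path.dirname(os.path.dirname(mpirun_path))
    for entry in ld_library_path.split(":"):
        if "openmpi" in entry:
            # Strip the trailing /lib and compare prefixes.
            return os.path.dirname(entry) == prefix
    return False

tools = "/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset"

# Consistent: mpirun and the lib dir come from the same 1.6.5 tree.
print(same_ompi_prefix(tools + "/openmpi-1.6.5/bin/mpirun",
                       tools + "/openmpi-1.6.5/lib:/usr/lib64"))  # → True

# Mixed: PATH finds 1.6.5 while LD_LIBRARY_PATH points at 1.4.3.
print(same_ompi_prefix(tools + "/openmpi-1.6.5/bin/mpirun",
                       tools + "/openmpi-1.4.3/lib:/usr/lib64"))  # → False
```

The equivalent manual check is simply comparing the output of `which mpirun` against the OMPI entry in `echo $LD_LIBRARY_PATH`, as the posters do elsewhere in this thread.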
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
Well, this is a little strange. The hanging behavior is gone, but I'm getting a segfault now. The output of "hello_c.c" and "ring_c.c" is attached. I'm getting a segfault with the Fortran test, also.

I'm afraid I may have polluted the experiment by removing the target openmpi-1.6.5 installation directory yesterday. To produce the attached outputs, I just went back and did "make install" in the openmpi-1.6.5 build directory. I've re-set the environment variables as they were a few days ago by sourcing the same bash script. Perhaps I forgot something, or something on the system changed? Regardless, LD_LIBRARY_PATH and PATH are set correctly, and the aberrant behavior persists.

The reason for deleting the openmpi-1.6.5 installation was that I went back and installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-1.4.3 can run the simple tests without issue, but on my "real" program, I'm getting symbol lookup errors:

mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

Perhaps that's a separate thread.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Tuesday, January 21, 2014 3:57 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Just for giggles, can you repeat the same test but with hello_c.c and ring_c.c?
>I.e., let's get the Fortran out of the way and use just the base C bindings, and
>see what happens.
>
>On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." <fisch...@westinghouse.com>
>wrote:
>
>> I just tried running "hello_f90.f90" and see the same behavior: 100% CPU
>> usage, gradually increasing memory consumption, and failure to get past
>> mpi_finalize. LD_LIBRARY_PATH is set as:
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib
>>
>> The installation target for this version of OpenMPI is:
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5
>>
>> 1045 fischega@lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]>
>> which mpirun
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpirun
>>
>> Perhaps something strange is happening with GCC? I've tried simple hello
>> world C and Fortran programs, and they work normally.
>>
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Sunday, January 19, 2014 11:36 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>> and consumes all system resources
>>
>> The OFED warning about registration is something OMPI added at one point
>> when we isolated the cause of jobs occasionally hanging, so you won't see
>> that warning from other MPIs or earlier versions of OMPI (I forget exactly
>> when we added it).
>>
>> The problem you describe doesn't sound like an OMPI issue - it sounds like
>> you've got a memory corruption problem in the code. Have you tried running
>> the examples in our example directory to confirm that the installation is
>> good?
>>
>> Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pick up
>> the OMPI libs you installed - most Linux distros come with an older version,
>> and that can cause problems if you inadvertently pick them up.
>>
>> On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fisch...@westinghouse.com>
>> wrote:
>>
>> Hello,
>>
>> I have a simple, 1-process test case that gets stuck on the mpi_finalize
>> call. The test case is a dead-simple calculation of pi - 50 lines of Fortran.
>> The process gradually consumes more and more memory until the system
>> becomes unresponsive and needs to be rebooted, unless the job is killed
>> first.
>>
>> In the output, attached, I see the warning message about OpenFabrics
>> being configured to only allow registering part of physical memory. I've
>> tried to chase this down with my administrator to no avail yet. (I am aware
>> of the relevant FAQ entry.) A different installation of MPI on the same
>> system, made with a different compiler, does not produce the OpenFabrics
>> memory registration warning - which seems strange because I thought it was
>> a system configuration issue independent of MPI. Also curious in the output
>> is that LSF seems to think there are 7 processes and 11 threads associated
>> with this job.
>>
>> The particulars of my configuration are attached and detailed below. Does
>> anyone see anything potentially problematic?
>>
>> Thanks,
>> Greg
>>
>> OpenMPI Version: 1.6.5
>
Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
I just tried running "hello_f90.f90" and see the same behavior: 100% CPU usage, gradually increasing memory consumption, and failure to get past mpi_finalize. LD_LIBRARY_PATH is set as:

/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib

The installation target for this version of OpenMPI is:

/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5

1045 fischega@lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]> which mpirun
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpirun

Perhaps something strange is happening with GCC? I've tried simple hello world C and Fortran programs, and they work normally.

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, January 19, 2014 11:36 AM
To: Open MPI Users
Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

The OFED warning about registration is something OMPI added at one point when we isolated the cause of jobs occasionally hanging, so you won't see that warning from other MPIs or earlier versions of OMPI (I forget exactly when we added it).

The problem you describe doesn't sound like an OMPI issue - it sounds like you've got a memory corruption problem in the code. Have you tried running the examples in our example directory to confirm that the installation is good?

Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pick up the OMPI libs you installed - most Linux distros come with an older version, and that can cause problems if you inadvertently pick them up.

On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

Hello,

I have a simple, 1-process test case that gets stuck on the mpi_finalize call. The test case is a dead-simple calculation of pi - 50 lines of Fortran. The process gradually consumes more and more memory until the system becomes unresponsive and needs to be rebooted, unless the job is killed first.

In the output, attached, I see the warning message about OpenFabrics being configured to only allow registering part of physical memory. I've tried to chase this down with my administrator to no avail yet. (I am aware of the relevant FAQ entry.) A different installation of MPI on the same system, made with a different compiler, does not produce the OpenFabrics memory registration warning - which seems strange because I thought it was a system configuration issue independent of MPI. Also curious in the output is that LSF seems to think there are 7 processes and 11 threads associated with this job.

The particulars of my configuration are attached and detailed below. Does anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2
uname -a: Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts
Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
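[Editor's note: the "dead-simple calculation of pi" Fortran program discussed above is not reproduced in the archive. A minimal Python sketch of the usual midpoint-rule pi test, for readers who want an equivalent single-process workload; it is an assumed reconstruction of the kind of program described, not the poster's code.]

```python
import math

def estimate_pi(n: int) -> float:
    """Midpoint-rule integration of 4/(1+x^2) over [0,1], a common
    'dead-simple' pi test; error shrinks like O(1/n^2)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += 4.0 / (1.0 + x * x)
    return total * h

est = estimate_pi(50000)
print(est, abs(est - math.pi))
```

In the MPI version each rank would integrate a contiguous slice of the subintervals and the partial sums would be combined with MPI_Reduce before mpi_finalize, which is where the poster's job hangs.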
[OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources
Hello,

I have a simple, 1-process test case that gets stuck on the mpi_finalize call. The test case is a dead-simple calculation of pi - 50 lines of Fortran. The process gradually consumes more and more memory until the system becomes unresponsive and needs to be rebooted, unless the job is killed first.

In the output, attached, I see the warning message about OpenFabrics being configured to only allow registering part of physical memory. I've tried to chase this down with my administrator to no avail yet. (I am aware of the relevant FAQ entry.) A different installation of MPI on the same system, made with a different compiler, does not produce the OpenFabrics memory registration warning - which seems strange because I thought it was a system configuration issue independent of MPI. Also curious in the output is that LSF seems to think there are 7 processes and 11 threads associated with this job.

The particulars of my configuration are attached and detailed below. Does anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2
uname -a: Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 x86_64 x86_64 x86_64 GNU/Linux
LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib
PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts
Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")

Sender: LSF System
Subject: Job 900527: Exited

Job was submitted from host by user in cluster .
Job was executed on host(s) , in queue , as user in cluster .
 was used as the home directory.
 was used as the working directory.
Started at Sat Jan 18 21:47:47 2014
Results reported at Sat Jan 18 21:48:33 2014

Your job looked like:

# LSBATCH: User input
mpirun.lsf pi

TERM_OWNER: job killed by owner.
Exited with exit code 1.

Resource usage summary:
CPU time      : 41.56 sec.
Max Memory    : 12075 MB
Max Swap      : 12213 MB
Max Processes : 7
Max Threads   : 11

The output (if any) follows:

--
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash. This may be caused by your
OpenFabrics vendor limiting the amount of physical memory that can be
registered. You should investigate the relevant Linux kernel module
parameters that control how much physical memory can be registered, and
increase them to allow registering all physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:          bl1211
Registerable memory: 32768 MiB
Total memory:        64618 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--

MPI process 0 running on node bl1211.
Running 5000 samples over 1 proc(s).
pi is 3.1415926535895617
Error is 2.31370478331882623E-013
THIS IS THE END.

--
mpirun has exited due to process rank 0 with PID 29294 on node bl1211
exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did.
This can cause a job to hang indefinitely while it waits for all processes to
call "init". By rule, if one process calls "init", then ALL processes must
call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule,
all processes that call "init" MUST call "finalize" prior to exiting or it
will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by
signals sent by mpirun (as reported here).
--

Job /tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper pi
TID HOST_NAME COMMAND_LINE STATUS
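[Editor's note: for mlx4 HCAs like the mlx4_0 in this thread, the Open MPI FAQ entry linked in the warning gives the registerable-memory limit as (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size, where log_num_mtt and log_mtts_per_seg are mlx4_core module parameters. A small sketch of that arithmetic; the parameter values shown are assumptions chosen to reproduce the 32768 MiB limit reported above, not values read from the poster's system.]

```python
def max_reg_mem_mib(log_num_mtt: int, log_mtts_per_seg: int,
                    page_size: int = 4096) -> int:
    """Approximate max registerable memory in MiB for mlx4 HCAs, per the
    Open MPI FAQ: (2**log_num_mtt) * (2**log_mtts_per_seg) * page_size."""
    total_bytes = (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size
    return total_bytes // (1024 * 1024)

# Assumed module parameters that would yield the 32768 MiB limit above:
print(max_reg_mem_mib(log_num_mtt=20, log_mtts_per_seg=3))  # → 32768

# Bumping log_num_mtt by one doubles the limit, covering the 64618 MiB host:
print(max_reg_mem_mib(log_num_mtt=21, log_mtts_per_seg=3))  # → 65536
```

The actual values in effect can be read from /sys/module/mlx4_core/parameters/ on the compute node, and raised via the mlx4_core module options so that the limit covers all physical memory.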