Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-15 Thread Fischer, Greg A. via users
I tried the patch, but I get the same result:

error obtaining device attributes for mlx4_0 errno says Success

I'm getting (what I think are) good transfer rates using "--mca btl self,tcp" 
on the osu_bw test (~7000 MB/s). It seems to me that the only way that could be 
happening is if the InfiniBand interfaces are being used over TCP (IPoIB), 
correct? Would such an arrangement preclude the ability to do RDMA or openib? 
Perhaps the network is set up in such a way that the IB hardware is not 
discoverable by openib?

(I'm not a network admin, and I wasn't involved in the setup of the network. 
Unfortunately, the person who knows the most has recently left the 
organization.)

Greg

From: Pritchard Jr., Howard 
Sent: Thursday, October 14, 2021 5:45 PM
To: Fischer, Greg A. ; Open MPI Users 

Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

Oh yes that's not good about rdmacm.

Yes the OFED looks pretty old.

Did you by any chance apply that patch? I generated it for a sysadmin here who 
needed to keep Open MPI 3.1.6 but also had to upgrade to a newer RHEL release, 
after which Open MPI wasn't compiling.

Howard


From: "Fischer, Greg A." 
mailto:fisch...@westinghouse.com>>
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard" mailto:howa...@lanl.gov>>, Open 
MPI Users mailto:users@lists.open-mpi.org>>
Cc: "Fischer, Greg A." 
mailto:fisch...@westinghouse.com>>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

I added -enable-mt and re-installed UCX. Same result. (I didn't re-compile 
OpenMPI.)

A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not 
provide rdma_establish() API

The version of librdmacm we have comes from 
librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from 
mid-2017. I wonder if that's too old?
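
One way to check whether that library even has the symbol UCX wants (library 
path is illustrative; it may live under the MLNX_OFED prefix instead):

rpm -q librdmacm librdmacm-devel
nm -D /usr/lib64/librdmacm.so.1 | grep rdma_establish

If rdma_establish doesn't show up, UCX will keep dropping RDMACM support no 
matter what configure options are used.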

Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A. <fisch...@westinghouse.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

I think the UCX PML may be discomfited by the lack of thread safety.

Could you try using the contrib/configure-release-mt script in your ucx folder? 
You want to add -enable-mt.
That's what stands out when I compare your configure output with the one I 
usually get when building on a MLNX ConnectX-5 cluster with
MLNX_OFED_LINUX-4.5-1.0.1.0.
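
In case it helps, a minimal rebuild along those lines would look something like 
this (the install prefix is just an example):

cd ucx-1.11.2
./contrib/configure-release-mt --prefix=$HOME/ucx-mt
make -j8 && make install

You would then want to reconfigure Open MPI with --with-ucx=$HOME/ucx-mt so it 
picks up the multi-threaded UCX rather than the old one.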

Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure: Build prefix:   /ucx_testing/ucx/test_install
configure:Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src 
-I${abs_top_builddir} -I${abs_top_builddir}/src
configure:   C compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch 
-Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length 
-Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: Multi-thread:   enabled
configure: NUMA support:   disabled
configure:MPI tests:   disabled
configure:  VFS support:   no
configure:Devel headers:   no
configure: io_demo CUDA support:   no
configure: Bindings:   < >
configure:  UCS modules:   < >
configure:  UCT modules:   < ib cma knem >
configure: CUDA modules:   < >
configure: ROCM modules:   < >
configure:   IB modules:   < >
configure:      UCM modules:   < >
configure: Perf modules:   < >
configure: =


Howard

From: "Fischer, Greg A." 
mailto:fisch...@westinghouse.com>>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., How

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Fischer, Greg A. via users
Thanks, Howard.

I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 
4.1.1. When I try to specify the "-mca pml ucx" option for a simple, 2-process 
benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  bl1311
  Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected


I've attached my ucx_info -d output, as well as the ucx configuration 
information. I'm not sure I follow everything on the UCX FAQ page, but it seems 
like everything is being routed over TCP, which is probably not what I want. 
Any thoughts as to what I might be doing wrong?
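
For reference, a couple of checks that might narrow this down (the transport 
and device names below are illustrative; what ucx_info reports on your system 
is what matters):

ompi_info | grep -i ucx
ucx_info -d | grep -i transport
UCX_TLS=rc,ud,sm,self UCX_NET_DEVICES=mlx4_0:1 mpirun -np 2 -mca pml ucx ./osu_bw

The first line confirms the ucx PML was actually built into this Open MPI 
install; the last one forces UCX onto the IB device instead of letting it fall 
back to TCP.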

Thanks,
Greg

From: Pritchard Jr., Howard 
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users 
Cc: Fischer, Greg A. 
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

It's the aging of the openib btl.

You may be able to apply the attached patch.  Note the 3.1.x release stream is 
no longer supported.

You may want to try using the 4.1.1 release, in which case you'll want to use 
UCX.

Howard


From: users <users-boun...@lists.open-mpi.org> on behalf of "Fischer, Greg A. via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Wednesday, October 13, 2021 at 10:06 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno 
says Success"


Hello,



I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the 
following errors when I try to use the openib btl:



WARNING: There was an error initializing an OpenFabrics device.



  Local host:   bl1308

  Local device: mlx4_0

--

[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device]
 error obtaining device attributes for mlx4_0 errno says Success



I have disabled UCX ("--without-ucx") because the UCX installation we have 
seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've 
attached the detailed output of ofed_info and ompi_info.
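
For anyone trying to reproduce this, a configure line of roughly this shape 
exercises the same code path (the prefix is illustrative and not necessarily 
the exact line used here):

./configure --prefix=/opt/openmpi-3.1.6 --without-ucx --with-verbs
make -j8 && make install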



This issue seems similar to Issue #7461 
(https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution 
for.



Does anyone know what the likely explanation is? Is the version of OFED on the 
system badly out-of-sync with contemporary OpenMPI?
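
For comparison against other systems, the relevant versions can be pulled with 
commands like these (the device name is whatever ibv_devinfo lists, mlx4_0 
here):

ofed_info -s
ibv_devinfo -d mlx4_0
ompi_info --version

ofed_info -s just prints the MLNX_OFED version string, and ibv_devinfo shows 
the verbs-level view of the HCA, including firmware.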



Thanks,

Greg




This e-mail may contain proprietary information of the sending organization. 
Any unauthorized or improper disclosure, copying, distribution, or use of the 
contents of this e-mail and attached document(s) is prohibited. The information 
contained in this e-mail and attached document(s) is intended only for the 
personal and private use of the recipient(s) named above. If you have received 
this communication in error, please notify the sender immediately by email and 
delete the original e-mail and attached document(s).



#
# Memory domain: posix
# Component: posix
# allocate: unlimited
#   remote key: 24 bytes
#   rkey_ptr is supported
#
#  Transport: posix
# Device: memory
#  System device: 
#
#  capabilities:
#ba

Re: [OMPI users] [EXTERNAL] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-14 Thread Fischer, Greg A. via users
I added -enable-mt and re-installed UCX. Same result. (I didn't re-compile 
OpenMPI.)

A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm... no
configure: WARNING: RDMACM requested but librdmacm is not found or does not 
provide rdma_establish() API

The version of librdmacm we have comes from 
librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from 
mid-2017. I wonder if that's too old?

Greg

From: Pritchard Jr., Howard 
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A. ; Open MPI Users 

Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

I think the UCX PML may be discomfited by the lack of thread safety.

Could you try using the contrib/configure-release-mt script in your ucx folder? 
You want to add -enable-mt.
That's what stands out when I compare your configure output with the one I 
usually get when building on a MLNX ConnectX-5 cluster with
MLNX_OFED_LINUX-4.5-1.0.1.0.

Here's the output from one of my UCX configs:

configure: =
configure: UCX build configuration:
configure: Build prefix:   /ucx_testing/ucx/test_install
configure:Configuration dir:   ${prefix}/etc/ucx
configure:   Preprocessor flags:   -DCPU_FLAGS="" -I${abs_top_srcdir}/src 
-I${abs_top_builddir} -I${abs_top_builddir}/src
configure:   C compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch 
-Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length 
-Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler:   
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++
 -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers 
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels 
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: Multi-thread:   enabled
configure: NUMA support:   disabled
configure:MPI tests:   disabled
configure:  VFS support:   no
configure:Devel headers:   no
configure: io_demo CUDA support:   no
configure: Bindings:   < >
configure:  UCS modules:   < >
configure:  UCT modules:   < ib cma knem >
configure: CUDA modules:   < >
configure: ROCM modules:   < >
configure:   IB modules:   < >
configure:  UCM modules:   < >
configure: Perf modules:   < >
configure: =========


Howard

From: "Fischer, Greg A." 
mailto:fisch...@westinghouse.com>>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., Howard" mailto:howa...@lanl.gov>>, Open 
MPI Users mailto:users@lists.open-mpi.org>>
Cc: "Fischer, Greg A." 
mailto:fisch...@westinghouse.com>>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

Thanks, Howard.

I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 
4.1.1. When I try to specify the "-mca pml ucx" option for a simple, 2-process 
benchmark problem, I get:

--
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:  bl1311
  Framework: pml
--
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected


I've attached my ucx_info -d output, as well as the ucx configuration 
information. I'm not sure I follow everything on the UCX FAQ page, but it seems 
like everything is being routed over TCP, which is probably not what I want. 
Any thoughts as to what I might be doing wrong?

Thanks,
Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Fischer, Greg A. <fisch...@westinghouse.com>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 
errno says Success"

[External Email]
Hi Greg,

It's the aging of the openib btl.

You may be able to apply the attached patch.

[OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

2021-10-13 Thread Fischer, Greg A. via users
Hello,



I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the 
following errors when I try to use the openib btl:



WARNING: There was an error initializing an OpenFabrics device.



  Local host:   bl1308

  Local device: mlx4_0

--

[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device]
 error obtaining device attributes for mlx4_0 errno says Success



I have disabled UCX ("--without-ucx") because the UCX installation we have 
seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've 
attached the detailed output of ofed_info and ompi_info.



This issue seems similar to Issue #7461 
(https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution 
for.



Does anyone know what the likely explanation is? Is the version of OFED on the 
system badly out-of-sync with contemporary OpenMPI?



Thanks,

Greg




This e-mail may contain proprietary information of the sending organization. 
Any unauthorized or improper disclosure, copying, distribution, or use of the 
contents of this e-mail and attached document(s) is prohibited. The information 
contained in this e-mail and attached document(s) is intended only for the 
personal and private use of the recipient(s) named above. If you have received 
this communication in error, please notify the sender immediately by email and 
delete the original e-mail and attached document(s).
MLNX_OFED_LINUX-4.1-1.0.2.0 (OFED-4.1-1.0.2):

ar_mgr:
osm_plugins/ar_mgr/ar_mgr-1.0-0.34.g9bd7c9a.tar.gz

cc_mgr:
osm_plugins/cc_mgr/cc_mgr-1.0-0.33.g9bd7c9a.tar.gz

dapl:
dapl.git mlnx_ofed_4_0
commit bdb055900059d1b8d5ee8cdfb457ca653eb9dd2d
dump_pr:
osm_plugins/dump_pr//dump_pr-1.0-0.29.g9bd7c9a.tar.gz

fabric-collector:
fabric_collector//fabric-collector-1.1.0.MLNX20170103.89bb2aa.tar.gz

hcoll:
mlnx_ofed_hcol/hcoll-3.8.1649-1.src.rpm

ibacm:
mlnx_ofed/ibacm.git mlnx_ofed_4_1
commit b0d53cf13358eb0c14665765b0170a37768463ff
ibacm_ssa:
mlnx_ofed_ssa/acm/ibacm_ssa-0.0.9.3.MLNX20151203.50eb579.tar.gz

ibdump:
sniffer/sniffer-5.0.0-1/ibdump/linux/ibdump-5.0.0-1.tgz

ibsim:
mlnx_ofed_ibsim/ibsim-0.6mlnx1-0.8.g9d76581.tar.gz

ibssa:
mlnx_ofed_ssa/distrib/ibssa-0.0.9.3.MLNX20151203.50eb579.tar.gz

ibutils:
ofed-1.5.3-rpms/ibutils/ibutils-1.5.7.1-0.12.gdcaeae2.tar.gz

ibutils2:
ibutils2/ibutils2-2.1.1-0.91.MLNX20170612.g2e0d52a.tar.gz

infiniband-diags:
mlnx_ofed_infiniband_diags/infiniband-diags-1.6.7.MLNX20170511.7595646.tar.gz

iser:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

isert:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

kernel-mft:
mlnx_ofed_mft/kernel-mft-4.7.0-41.src.rpm

knem:
knem.git mellanox-master
commit 4faa2978ad0339c50dd6df336d0a4182647b624b
libibcm:
mlnx_ofed/libibcm.git mlnx_ofed_4_1
commit e3e9fffe4d2d2f730110a7bdeb7da7b8ea97e51e
libibmad:
mlnx_ofed_libibmad/libibmad-1.3.13.MLNX20170511.267a441.tar.gz

libibprof:
mlnx_ofed_libibprof/libibprof-1.1.41-1.src.rpm

libibumad:
mlnx_ofed_libibumad/libibumad-13.10.2.MLNX20170511.dcc9f7a.tar.gz

libibverbs:
mlnx_ofed/libibverbs.git mlnx_ofed_4_1
commit a23bf787eff96af4c05d6e5f0e201dba80db114e
libmlx4:
mlnx_ofed/libmlx4.git mlnx_ofed_4_1
commit d945a7eeb52e319b210e6602a9fee0646371
libmlx5:
mlnx_ofed/libmlx5.git mlnx_ofed_4_1
commit 71822e375014c7f81dec3e4eca06f366846eaf1a
libopensmssa:
mlnx_ofed_ssa/plugin/libopensmssa-0.0.9.3.MLNX20151203.50eb579.tar.gz

librdmacm:
mlnx_ofed/librdmacm.git mlnx_ofed_4_1
commit 1297178df9b07030d84a042d417cb61fa65e62a1
librxe:
mlnx_ofed/librxe.git master
commit 607460456c717c3b65428367676cacb5495ac005
libvma:
vma/source_rpms//libvma-8.3.7-0.src.rpm

mlnx-en:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

mlnx-ethtool:
upstream/ethtool.git for-upstream
commit ac0cf295abe0c0832f0711fed66ab9601c8b2513
mlnx-nfsrdma:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

mlnx-nvme:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

mlnx-ofa_kernel:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

mlnx-rdma-rxe:
mlnx_ofed/mlnx-ofa_kernel-4.0.git mlnx_ofed_4_1
commit c22af8878c71966728f6ac38d963190f5222b2ec

mpi-selector:
ofed-1.5.3-rpms/mpi-selector/mpi-selector-1.0.3-1.src.rpm

mpitests:
mlnx_ofed_mpitest/mpitests-3.2.19-acade41.src.rpm

mstflint:
mlnx_ofed_mstflint/mstflint-4.7.0-1.6.g26037b7.tar.gz

multiperf:
mlnx_ofed_multiperf/multiperf-3.0-0.10.gda89e8c.tar.gz

mxm:
mlnx_ofed_mxm/mxm-3.6.3102-1.src.rpm

ofed-docs:
docs.git mlnx_ofed-4.0
commit 3d1b0afb7bc190ae5f362223043f76b2b45971cc

openmpi:
mlnx_ofed_ompi_1.8/openmpi-2.1.2a1-1.src.rpm

opensm:

[OMPI users] disappearance of the memory registration error in 1.8.x?

2015-03-10 Thread Fischer, Greg A.
Hello,

I'm trying to run the "connectivity_c" test on a variety of systems using 
OpenMPI 1.8.4. The test returns segmentation faults when running across nodes 
on one particular type of system, and only when using the openib BTL. (The test 
runs without error if I stipulate "--mca btl tcp,self".) Here's the output:

1033 fischega@bl1415[~/tmp/openmpi/1.8.4_test_examples_SLES11_SP2/error]> 
mpirun -np 16 connectivity_c
[bl1415:29526] *** Process received signal ***
[bl1415:29526] Signal: Segmentation fault (11)
[bl1415:29526] Signal code:  (128)
[bl1415:29526] Failing at address: (nil)
[bl1415:29526] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab1e72915d0]
[bl1415:29526] [ 1] 
/data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x29e)[0x2ab1e7c550be]
[bl1415:29526] [ 2] 
/data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_memalign+0x69)[0x2ab1e7c58829]
[bl1415:29526] [ 3] 
/data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/libopen-pal.so.6(opal_memory_ptmalloc2_memalign+0x6f)[0x2ab1e7c583ff]
[bl1415:29526] [ 4] 
/data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/openmpi/mca_btl_openib.so(+0x2867b)[0x2ab1eac8a67b]
[bl1415:29526] [ 5] 
/data/pgrlf/openmpi-1.8.4/SLES10_SP2_lib/lib/openmpi/mca_btl_openib.so(+0x1f712)[0x2ab1eac81712]
[bl1415:29526] [ 6] /lib64/libpthread.so.0(+0x75f0)[0x2ab1e72895f0]
[bl1415:29526] [ 7] /lib64/libc.so.6(clone+0x6d)[0x2ab1e757484d]
[bl1415:29526] *** End of error message ***

When I run the same test using a previous build of OpenMPI 1.6.5 on this 
system, it returns a memory registration warning, but otherwise executes 
normally:

--
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

OpenMPI 1.8.4 does not seem to be reporting a memory registration warning in 
situations where previous versions would report such a warning. Is this because 
OpenMPI 1.8.4 is no longer vulnerable to this type of condition?
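
For what it's worth, one rough way to check how much memory the mlx4 driver 
will let a job register, independent of whether a warning is printed (these 
module parameters are specific to mlx4 and may not exist for other HCAs):

ulimit -l
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

Registerable memory is roughly (2^log_num_mtt) * (2^log_mtts_per_seg) * page size.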

Thanks,
Greg


This e-mail may contain proprietary information of the sending organization. 
Any unauthorized or improper disclosure, copying, distribution, or use of the 
contents of this e-mail and attached document(s) is prohibited. The information 
contained in this e-mail and attached document(s) is intended only for the 
personal and private use of the recipient(s) named above. If you have received 
this communication in error, please notify the sender immediately by email and 
delete the original e-mail and attached document(s).


Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Fischer, Greg A.
I looked through my configure log, and that option is not enabled. Thanks for 
the suggestion.

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime 
Boissonneault
Sent: Wednesday, June 25, 2014 10:51 AM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

Hi,
I recovered the name of the option that caused problems for us. It is 
--enable-mpi-thread-multiple

This option enables threading within OPAL, which was buggy (at least in the 
1.6.x series). I don't know if it has been fixed in the 1.8 series.

I do not see your configure line in the attached file, so I cannot tell whether 
it was enabled or not.

Maxime

On 2014-06-25 10:46, Fischer, Greg A. wrote:
Attached are the results of "grep thread" on my configure output. There appears 
to be some amount of threading, but is there anything I should look for in 
particular?

I see Mike Dubman's questions on the mailing list website, but his message 
didn't appear to make it to my inbox. The answers to his questions are:

[binford:fischega] $ rpm -qa | grep ofed
ofed-doc-1.5.4.1-0.11.5
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5

Distro: SLES11 SP3

HCA:
[binf102:fischega] $ /usr/sbin/ibstat
CA 'mlx4_0'
CA type: MT26428

Command line (path and LD_LIBRARY_PATH are set correctly):
mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca btl_openib_verbose 1 -np 
31 $CTF_EXEC

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime 
Boissonneault
Sent: Tuesday, June 24, 2014 6:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

What are your threading options for OpenMPI (when it was built)?

I have seen the OpenIB BTL completely lock up before when some level of 
threading is enabled.

Maxime Boissonneault


On 2014-06-24 18:18, Fischer, Greg A. wrote:
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting 
openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). 
The issues were related to Torque imposing restrictive limits on locked memory, 
and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing 
abysmal performance over the openib layer. Applications run with the tcp btl 
execute about 10x faster than with the openib btl. Clearly something still 
isn't quite right.

I tried running with "-mca btl_openib_verbose 1", but didn't see anything 
resembling a smoking gun. How should I go about determining the source of the 
problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 
setup discussed previously.)

Thanks,
Greg





___

users mailing list

us...@open-mpi.org

Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24697.php





--

-

Maxime Boissonneault

Analyste de calcul - Calcul Québec, Université Laval

Ph. D. en physique




___

users mailing list

us...@open-mpi.org

Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24700.php




--

-

Maxime Boissonneault

Analyste de calcul - Calcul Québec, Université Laval

Ph. D. en physique


Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Fischer, Greg A.
Attached are the results of "grep thread" on my configure output. There appears 
to be some amount of threading, but is there anything I should look for in 
particular?
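
For reference, the installed build also gives a quick summary without grepping 
the configure log (the exact wording of the output varies by version):

ompi_info | grep -i thread

which reports whether MPI_THREAD_MULTIPLE and progress-thread support were 
enabled.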

I see Mike Dubman's questions on the mailing list website, but his message 
didn't appear to make it to my inbox. The answers to his questions are:

[binford:fischega] $ rpm -qa | grep ofed
ofed-doc-1.5.4.1-0.11.5
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5

Distro: SLES11 SP3

HCA:
[binf102:fischega] $ /usr/sbin/ibstat
CA 'mlx4_0'
CA type: MT26428

Command line (path and LD_LIBRARY_PATH are set correctly):
mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca btl_openib_verbose 1 -np 
31 $CTF_EXEC

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime 
Boissonneault
Sent: Tuesday, June 24, 2014 6:41 PM
To: Open MPI Users
Subject: Re: [OMPI users] poor performance using the openib btl

What are your threading options for OpenMPI (when it was built)?

I have seen the OpenIB BTL completely lock up before when some level of 
threading is enabled.

Maxime Boissonneault


On 2014-06-24 18:18, Fischer, Greg A. wrote:
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting 
openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). 
The issues were related to Torque imposing restrictive limits on locked memory, 
and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing 
abysmal performance over the openib layer. Applications run with the tcp btl 
execute about 10x faster than with the openib btl. Clearly something still 
isn't quite right.

I tried running with "-mca btl_openib_verbose 1", but didn't see anything 
resembling a smoking gun. How should I go about determining the source of the 
problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 
setup discussed previously.)

Thanks,
Greg




___

users mailing list

us...@open-mpi.org

Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24697.php




--

-

Maxime Boissonneault

Analyste de calcul - Calcul Québec, Université Laval

Ph. D. en physique
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking pthread.h usability... yes
checking pthread.h presence... yes
checking for pthread.h... yes
checking if C compiler and POSIX threads work as is... no
checking if C++ compiler and POSIX threads work as is... no
checking if Fortran compiler and POSIX threads work as is... no
checking if C compiler and POSIX threads work with -Kthread... no
checking if C compiler and POSIX threads work with -kthread... no
checking if C compiler and POSIX threads work with -pthread... yes
checking if C++ compiler and POSIX threads work with -Kthread... no
checking if C++ compiler and POSIX threads work with -kthread... no
checking if C++ compiler and POSIX threads work with -pthread... yes
checking if Fortran compiler and POSIX threads work with -Kthread... no
checking if Fortran compiler and POSIX threads work with -kthread... no
checking if Fortran compiler and POSIX threads work with -pthread... yes
checking for pthread_mutexattr_setpshared... yes
checking for pthread_condattr_setpshared... yes
checking for working POSIX threads package... yes
checking for type of thread support... posix
checking if threads have different pids (pthreads on linux)... no
checking for pthread_t... yes
checking pthread_np.h usability... no
checking pthread_np.h presence... no
checking for pthread_np.h... no
checking whether pthread_setaffinity_np is declared... yes
checking whether pthread_getaffinity_np is declared... yes
checking for library containing pthread_getthrds_np... no
checking for pthread_mutex_lock... yes
checking libevent configuration args... --disable-dns --disable-http 
--disable-rpc --disable-openssl --enable-thread-support --disable-evport
configure: running /bin/sh 
'../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/configure'
 --disable-dns --disable-http --disable-rpc --disable-openssl 
--enable-thread-support --disable-evport  
'--prefix=/casl/vera_ib/gcc-4.8.3/toolset/openmpi-1.8.1' --cache-file=/dev/null 
--srcdir=../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent 
--disable-option-checking
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for the pthreads library -lpthreads... no
checking whether pthreads work without any flags... yes
checking for joinable pthread attribute... PTHREAD_CREATE_JOINABLE
checking if more special flags are required for pthreads... no
checking size of pthread_t... 8
config.status: creating libevent_pthreads.pc
checking for thread support (needed for rdmacm/udcm)... posix
configure: running /bin/sh 
'../../../../../../openmpi-1.8

[OMPI users] poor performance using the openib btl

2014-06-24 Thread Fischer, Greg A.
Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having getting 
openib to work with Torque (see "openib segfaults with Torque", June 6, 2014). 
The issues were related to Torque imposing restrictive limits on locked memory, 
and have since been resolved.

However, now that I've had some time to test the applications, I'm seeing 
abysmal performance over the openib layer. Applications run with the tcp btl 
execute about 10x faster than with the openib btl. Clearly something still 
isn't quite right.

I tried running with "-mca btl_openib_verbose 1", but didn't see anything 
resembling a smoking gun. How should I go about determining the source of the 
problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 
setup discussed previously.)
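
For what it's worth, the gap is easy to quantify with a point-to-point 
benchmark run once per BTL (osu_bw is just an example; any bandwidth/latency 
test will do):

mpirun -np 2 -mca btl openib,sm,self ./osu_bw
mpirun -np 2 -mca btl tcp,sm,self ./osu_bw

If the openib numbers aren't well above the tcp numbers, the transport, rather 
than the application, is the place to look.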

Thanks,
Greg


Re: [OMPI users] openib segfaults with Torque

2014-06-13 Thread Fischer, Greg A.
This sounds credible. When I log in via Torque, I see the following:

[binf316:fischega] $ ulimit -l
64

but when I log in via ssh, I see:

[binf316:fischega] $ ulimit -l
unlimited

I'll have my administrator make the changes and give that a shot.  Thanks, 
everyone!
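
Once the pbs_mom change is in, an easy sanity check is to re-run the same thing 
from inside an interactive job (assuming interactive jobs are allowed here):

qsub -I -l nodes=1
ulimit -l

which should then report "unlimited" instead of 64.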

_
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 11, 2014 7:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque


If it helps, Greg,
on the compute nodes I normally add this to /etc/security/limits.conf:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
should be sourced by the former).
Other values are possible, of course.

My recollection is that the boilerplate init scripts that
come with Torque don't change those limits.

I suppose this causes the pbs_mom child processes,
including the user job script and whatever processes it starts
(mpiexec, etc.), to inherit those limits.
Or not?
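
One way to check rather than guess: look at the limits of the running pbs_mom 
itself, which the children inherit. Something like (pidof may need to be 
swapped for the appropriate pgrep on your distro):

grep "Max locked memory" /proc/$(pidof pbs_mom)/limits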

Gus Correa


On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:
> +1
>
> On Jun 11, 2014, at 6:01 PM, Ralph Castain 
> <r...@open-mpi.org>
>   wrote:
>
>> Yeah, I think we've seen that somewhere before too...
>>
>>
>> On Jun 11, 2014, at 2:59 PM, Joshua Ladd 
>> <jladd.m...@gmail.com> wrote:
>>
>>> Agreed. The problem is not with UDCM. I don't think something is wrong with 
>>> the system. I think his Torque is imposing major constraints on the maximum 
>>> size that can be locked into memory.
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm 
>>> <hje...@lanl.gov> wrote:
>>> Probably won't help to use RDMACM though as you will just see the
>>> resource failure somewhere else. UDCM is not the problem. Something is
>>> wrong with the system. Allocating a 512 entry CQ should not fail.
>>>
>>> -Nathan
>>>
>>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>>>> I'm guessing it's a resource limitation issue coming from Torque.
>>>>
>>>> H...I found something interesting on the interwebs that looks 
>>>> awfully
>>>> similar:
>>>> 
>>>> http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>>>>
>>>> Greg, if the suggestion from the Torque users ("...adding the following
>>>> line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't
>>>> resolve your issue, try using the RDMACM CPC (instead of UDCM, which is a
>>>> pretty recent addition to the openIB BTL) by setting:
>>>>
>>>> -mca btl_openib_cpc_include rdmacm
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
>>>> <jsquy...@cisco.com> wrote:
>>>>
>>>>   Mellanox --
>>>>
>>>>   What would cause a CQ to fail to be created?
>>>>
>>>>   On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
>>>>   <fisch...@westinghouse.com> wrote:
>>>>
>>>>   > Is there any other work around that I might try?  Something that
>>>>   avoids UDCM?
>>>>   >
>>>>   > -Original Message-
>>>>   > From: Fischer, Greg A.
>>>>   > Sent: Tuesday, June 10, 2014 2:59 PM
>>>>   > To: Nathan Hjelm
>>>>   > Cc: Open MPI Users; Fischer, Greg A.
>>>>   > Subject: RE: [OMPI users] openib segfaults with Torque
>>>>   >
>>>>   > [binf316:fischega] $ ulimit -m
>>>>   > unlimited
>>>>   >
>>>>   > Greg
>>>>   >
>>>>   > -Original Message-
>>>>   > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>>   > Sent: Tuesday, June 10, 2014 2:58 PM
>>>>   > To: Fischer, Greg A.
>>>>   > Cc: Open MPI Users
>>>>   > Subject: Re: [OMPI users] openib segfaults with Torque
>>>>   >
>>>>   > Out of curiosity what is the

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Fischer, Greg A.
Is there any other workaround that I might try? Something that avoids UDCM?

-Original Message-
From: Fischer, Greg A.
Sent: Tuesday, June 10, 2014 2:59 PM
To: Nathan Hjelm
Cc: Open MPI Users; Fischer, Greg A.
Subject: RE: [OMPI users] openib segfaults with Torque

[binf316:fischega] $ ulimit -m
unlimited

Greg

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:58 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Out of curiosity what is the mlock limit on your system? If it is too low that 
can cause ibv_create_cq to fail. To check run ulimit -m.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system, except for the head node.
>
> The uptime of the system isn't significant. Maybe 1 week, and it's received 
> basically no use.
>
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:49 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
>
> Well, that's interesting. The output shows that ibv_create_cq is failing. 
> Strange since an identical call had just succeeded (udcm creates two 
> completion queues). Some questions that might indicate where the failure 
> might be:
>
> Does this fail on any other node in your system?
>
> How long has the node been up?
>
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
>
> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> > Jeff/Nathan,
> >
> > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> > terminal on a compute node with "qsub -l nodes 2 -I":
> >
> >   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> > ring_c &> output.txt
> >
> > Output and backtrace are attached. Let me know if I can provide anything 
> > else.
> >
> > Thanks for looking into this,
> > Greg
> >
> > -Original Message-
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Tuesday, June 10, 2014 10:31 AM
> > To: Nathan Hjelm
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Greg:
> >
> > Can you run with "--mca btl_base_verbose 100" on your debug build so that 
> > we can get some additional output to see why UDCM is failing to setup 
> > properly?
> >
> >
> >
> > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> > >> I seem to recall that you have an IB-based cluster, right?
> > >>
> > >> From a *very quick* glance at the code, it looks like this might be a 
> > >> simple incorrect-finalization issue.  That is:
> > >>
> > >> - you run the job on a single server
> > >> - openib disqualifies itself because you're running on a single
> > >> server
> > >> - openib then goes to finalize/close itself
> > >> - but openib didn't fully initialize itself (because it
> > >> disqualified itself early in the initialization process), and
> > >> something in the finalization process didn't take that into
> > >> account
> > >>
> > >> Nathan -- is that anywhere close to correct?
> > >
> > > Nope. udcm_module_finalize is being called because there was an
> > > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > > The opal_list_t destructor is getting an assert failure. Probably
> > > because the constructor wasn't called. I can rearrange the
> > > constructors to be called first but there appears to be a deeper
> > > issue with the user's
> > > system: udcm_module_init should not be failing! It creates a
> > > couple of CQs, allocates a small number of registered bufferes and
> > > starts monitoring the fd for the completion channel. All these
> > > things are also done in the setup of the openib btl itself. Keep
> > > in mind that the openib btl will not disqualify itself when running 
> > > single server.
> > > Openib may be used to communicate on node and is needed for the dynamics 
> > > case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor the user mailing list. If something needs
> > > my attention forward it to me directly.

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
[binf316:fischega] $ ulimit -m
unlimited

Greg

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:58 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Out of curiosity what is the mlock limit on your system? If it is too low that 
can cause ibv_create_cq to fail. To check run ulimit -m.

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
> Yes, this fails on all nodes on the system, except for the head node.
>
> The uptime of the system isn't significant. Maybe 1 week, and it's received 
> basically no use.
>
> -Original Message-
> From: Nathan Hjelm [mailto:hje...@lanl.gov]
> Sent: Tuesday, June 10, 2014 2:49 PM
> To: Fischer, Greg A.
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
>
> Well, that's interesting. The output shows that ibv_create_cq is failing. 
> Strange since an identical call had just succeeded (udcm creates two 
> completion queues). Some questions that might indicate where the failure 
> might be:
>
> Does this fail on any other node in your system?
>
> How long has the node been up?
>
> -Nathan Hjelm
> Application Readiness, HPC-5, LANL
>
> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> > Jeff/Nathan,
> >
> > I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> > terminal on a compute node with "qsub -l nodes 2 -I":
> >
> >   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2
> > ring_c &> output.txt
> >
> > Output and backtrace are attached. Let me know if I can provide anything 
> > else.
> >
> > Thanks for looking into this,
> > Greg
> >
> > -Original Message-
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> > Squyres (jsquyres)
> > Sent: Tuesday, June 10, 2014 10:31 AM
> > To: Nathan Hjelm
> > Cc: Open MPI Users
> > Subject: Re: [OMPI users] openib segfaults with Torque
> >
> > Greg:
> >
> > Can you run with "--mca btl_base_verbose 100" on your debug build so that 
> > we can get some additional output to see why UDCM is failing to setup 
> > properly?
> >
> >
> >
> > On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> > >> I seem to recall that you have an IB-based cluster, right?
> > >>
> > >> From a *very quick* glance at the code, it looks like this might be a 
> > >> simple incorrect-finalization issue.  That is:
> > >>
> > >> - you run the job on a single server
> > >> - openib disqualifies itself because you're running on a single
> > >> server
> > >> - openib then goes to finalize/close itself
> > >> - but openib didn't fully initialize itself (because it
> > >> disqualified itself early in the initialization process), and
> > >> something in the finalization process didn't take that into
> > >> account
> > >>
> > >> Nathan -- is that anywhere close to correct?
> > >
> > > Nope. udcm_module_finalize is being called because there was an
> > > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > > The opal_list_t destructor is getting an assert failure. Probably
> > > because the constructor wasn't called. I can rearrange the
> > > constructors to be called first but there appears to be a deeper
> > > issue with the user's
> > > system: udcm_module_init should not be failing! It creates a
> > > couple of CQs, allocates a small number of registered bufferes and
> > > starts monitoring the fd for the completion channel. All these
> > > things are also done in the setup of the openib btl itself. Keep
> > > in mind that the openib btl will not disqualify itself when running 
> > > single server.
> > > Openib may be used to communicate on node and is needed for the dynamics 
> > > case.
> > >
> > > The user might try adding -mca btl_base_verbose 100 to shed some
> > > light on what the real issue is.
> > >
> > > BTW, I no longer monitor the user mailing list. If something needs
> > > my attention forward it to me directly.
> > >
> > > -Nathan
> >
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Yes, this fails on all nodes on the system, except for the head node.

The uptime of the system isn't significant. Maybe 1 week, and it's received 
basically no use.

-Original Message-
From: Nathan Hjelm [mailto:hje...@lanl.gov]
Sent: Tuesday, June 10, 2014 2:49 PM
To: Fischer, Greg A.
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque


Well, that's interesting. The output shows that ibv_create_cq is failing. 
Strange since an identical call had just succeeded (udcm creates two completion 
queues). Some questions that might indicate where the failure might be:

Does this fail on any other node in your system?

How long has the node been up?
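
For what it's worth, the device's advertised CQ limits can be compared against 
what udcm asks for (the grep pattern is just a convenience):

ibv_devinfo -v | grep -E "max_cqe|max_cq:"

A 512-entry CQ is far below any sane max_cqe, which is why a failure there 
usually points at resource limits rather than the HCA itself.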

-Nathan Hjelm
Application Readiness, HPC-5, LANL

On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
> Jeff/Nathan,
>
> I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
> terminal on a compute node with "qsub -l nodes 2 -I":
>
>   mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &>
> output.txt
>
> Output and backtrace are attached. Let me know if I can provide anything else.
>
> Thanks for looking into this,
> Greg
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
> Squyres (jsquyres)
> Sent: Tuesday, June 10, 2014 10:31 AM
> To: Nathan Hjelm
> Cc: Open MPI Users
> Subject: Re: [OMPI users] openib segfaults with Torque
>
> Greg:
>
> Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
> can get some additional output to see why UDCM is failing to setup properly?
>
>
>
> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
> >> I seem to recall that you have an IB-based cluster, right?
> >>
> >> From a *very quick* glance at the code, it looks like this might be a 
> >> simple incorrect-finalization issue.  That is:
> >>
> >> - you run the job on a single server
> >> - openib disqualifies itself because you're running on a single
> >> server
> >> - openib then goes to finalize/close itself
> >> - but openib didn't fully initialize itself (because it
> >> disqualified itself early in the initialization process), and
> >> something in the finalization process didn't take that into account
> >>
> >> Nathan -- is that anywhere close to correct?
> >
> > Nope. udcm_module_finalize is being called because there was an
> > error setting up the udcm state. See btl_openib_connect_udcm.c:476.
> > The opal_list_t destructor is getting an assert failure. Probably
> > because the constructor wasn't called. I can rearrange the
> > constructors to be called first but there appears to be a deeper
> > issue with the user's
> > system: udcm_module_init should not be failing! It creates a couple
> > of CQs, allocates a small number of registered bufferes and starts
> > monitoring the fd for the completion channel. All these things are
> > also done in the setup of the openib btl itself. Keep in mind that
> > the openib btl will not disqualify itself when running single server.
> > Openib may be used to communicate on node and is needed for the dynamics 
> > case.
> >
> > The user might try adding -mca btl_base_verbose 100 to shed some
> > light on what the real issue is.
> >
> > BTW, I no longer monitor the user mailing list. If something needs
> > my attention forward it to me directly.
> >
> > -Nathan
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>

> Core was generated by `ring_c'.
> Program terminated with signal 6, Aborted.
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
> #1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
> #2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
> #3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060,
> cpc=0x7190c0) at
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co
> nnect_udcm.c:734
> #4  0x7f8b664b5474 in udcm_component_query (btl=0x717060,
> cpc=0x718a48) at
> ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_co
> nnect_udcm.c:476
> #5  0x7f8b664ae316 in
> ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at
> ..

Re: [OMPI users] openib segfaults with Torque

2014-06-10 Thread Fischer, Greg A.
Jeff/Nathan,

I ran the following with my debug build of OpenMPI 1.8.1 - after opening a 
terminal on a compute node with "qsub -l nodes 2 -I":

mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> 
output.txt

Output and backtrace are attached. Let me know if I can provide anything else.

Thanks for looking into this,
Greg

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Tuesday, June 10, 2014 10:31 AM
To: Nathan Hjelm
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Greg: 

Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
can get some additional output to see why UDCM is failing to setup properly?



On Jun 10, 2014, at 10:25 AM, Nathan Hjelm  wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> 
>> From a *very quick* glance at the code, it looks like this might be a simple 
>> incorrect-finalization issue.  That is:
>> 
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single 
>> server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified 
>> itself early in the initialization process), and something in the 
>> finalization process didn't take that into account
>> 
>> Nathan -- is that anywhere close to correct?
> 
> Nope. udcm_module_finalize is being called because there was an error 
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The 
> opal_list_t destructor is getting an assert failure. Probably because 
> the constructor wasn't called. I can rearrange the constructors to be 
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of 
> CQs, allocates a small number of registered bufferes and starts 
> monitoring the fd for the completion channel. All these things are 
> also done in the setup of the openib btl itself. Keep in mind that the 
> openib btl will not disqualify itself when running single server. 
> Openib may be used to communicate on node and is needed for the dynamics case.
> 
> The user might try adding -mca btl_base_verbose 100 to shed some light 
> on what the real issue is.
> 
> BTW, I no longer monitor the user mailing list. If something needs my 
> attention forward it to me directly.
> 
> -Nathan


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
#0  0x7f8b6ae1cb55 in raise () from /lib64/libc.so.6
#1  0x7f8b6ae1e0c5 in abort () from /lib64/libc.so.6
#2  0x7f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x717060) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f8b66497817 in btl_openib_component_init 
(num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x7f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, 
requested=0, provided=0x7fffe34cef98) at 
../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-10 Thread Fischer, Greg A.
Yes, it should be possible for me to get an upgraded Intel compiler on that 
system. However, as you suggest, I'm more focused on getting it working with 
GCC right now.

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Monday, June 09, 2014 8:24 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

I'm digging out from mail backlog from being at the MPI Forum last week...

Yes, from looking at the stack traces, it's segv'ing inside the memory 
allocator, which typically means some other memory error occurred before this.  
I.e., this particular segv is a symptom of the problem, not the actual problem.

Are you able to upgrade your Intel compiler to avoid this issue?

(I'm guessing the UDCM issues you reported later were with the gcc-compiled 
Open MPI)



On Jun 4, 2014, at 5:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Aha!! I found this in our users mailing list archives:
>
> http://www.open-mpi.org/community/lists/users/2012/01/18091.php
>
> Looks like this is a known compiler vectorization issue.
>
>
> On Jun 4, 2014, at 1:52 PM, Fischer, Greg A. <fisch...@westinghouse.com> 
> wrote:
>
>> Ralph,
>>
>> Thanks for looking. Let me know if there's any other testing that I can do.
>>
>> I recompiled with GCC and it works fine, so that lends credence to your 
>> theory that it has something to do with the Intel compilers, and possibly 
>> their interplay with SUSE.
>>
>> Greg
>>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Wednesday, June 04, 2014 4:48 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] intermittent segfaults with openib on
>> ring_c.c
>>
>> Urggg...unfortunately, the people who know the most about that code are 
>> all at the MPI Forum this week, so we may not be able to fully address it 
>> until their return. It looks like you are still going down into that malloc 
>> interceptor, so I'm not correctly blocking it for you.
>>
>> This run segfaulted in a completely different call in a different part of 
>> the startup procedure - but in the same part of the interceptor, which makes 
>> me suspicious. Don't know how much testing we've seen on SLES...
>>
>>
>> On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fisch...@westinghouse.com> 
>> wrote:
>>
>>> Ralph,
>>>
>>> It segfaults. Here's the backtrace:
>>>
>>> Core was generated by `ring_c'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, 
>>> bytes=47840385564856) at 
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> 4098  bck->fd = unsorted_chunks(av);
>>> (gdb) bt
>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020,
>>> bytes=47840385564856) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>> #1  0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc
>>> (bytes=47840385564704) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
>>> #2  0x2b82b1a47b36 in opal_memory_linux_malloc_hook
>>> (sz=47840385564704, caller=0x2b82b53000b8) at
>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
>>> #3  0x2b82b19e7b18 in opal_malloc (size=47840385564704,
>>> file=0x2b82b53000b8 "", line=12) at
>>> ../../../openmpi-1.8.1/opal/util/malloc.c:101
>>> #4  0x2b82b199c017 in opal_hash_table_set_value_uint64
>>> (ht=0x2b82b5300020, key=47840385564856, value=0xc) at
>>> ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
>>> #5  0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at
>>> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
>>> #6  0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440,
>>> args=184, cbdata=0xc) at
>>> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
>>> #7  0x2b82b19fba1c in event_process_active_single_queue
>>> (base=0x655480, activeq=0x654920) at
>>> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent
>>> /e
>>> vent.c:1367
>>> #8  0x2b82b19fbcd9 in event_process_active (base=0x655480) at
>>> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent
>>> /e
>>> vent.c:1437
>>> #9  0x2b82b19fc4c3 in opal_libevent2021_event_base_loop
>>> (base=0x655480, f

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
Yep, TCP works fine when launched via Torque/qsub:

[binf315:fischega] $ mpirun -np 2 -mca btl tcp,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 06, 2014 10:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Huh - how strange. I can't imagine what it has to do with Torque vs rsh - this 
is failing when the openib BTL is trying to create the connection, which comes 
way after the launch is complete.

Are you able to run this with btl tcp,sm,self? If so, that would confirm that 
everything else is correct, and the problem truly is limited to the udcm 
itself...which shouldn't have anything to do with how the proc was launched.


On Jun 6, 2014, at 6:47 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Here are the results when logging in to the compute node via ssh and running as 
you suggest:

[binf102:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

Here are the results when executing over Torque (launch the shell with "qsub -l 
nodes=2 -I"):

[binf316:fischega] $ mpirun -np 2 -mca btl openib,sm,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21584] *** Process received signal ***
[binf316:21584] Signal: Aborted (6)
[binf316:21584] Signal code:  (-6)
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:21583] *** Process received signal ***
[binf316:21583] Signal: Aborted (6)
[binf316:21583] Signal code:  (-6)
[binf316:21584] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fe33a2637c0]
[binf316:21584] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fe339f0fb55]
[binf316:21584] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fe339f11131]
[binf316:21584] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fe339f08a10]
[binf316:21584] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fe3355a984b]
[binf316:21584] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fe3355a8474]
[binf316:21584] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fe3355a1316]
[binf316:21584] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fe33558a817]
[binf316:21584] [ 8] [binf316:21583] [ 0] 
/lib64/libpthread.so.0(+0xf7c0)[0x7f3b586697c0]
[binf316:21583] [ 1] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fe33a532a5e]
[binf316:21584] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fe3357ccd42]
[binf316:21584] [10] /lib64/libc.so.6(gsignal+0x35)[0x7f3b58315b55]
[binf316:21583] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f3b58317131]
[binf316:21583] [ 3] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fe33a531d1b]
[binf316:21584] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fe3344e7739]
[binf316:21584] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f3b5830ea10]
[binf316:21583] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f3b539af84b]
[binf316:21583] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f3b539ae474]
[binf316:21583] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f3b539a7316]
[binf316:21583] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_open

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Fischer, Greg A.
**
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f3b588cb33c]
[binf316:21583] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f3b58900386]
[binf316:21583] [15] ring_c[0x40096f]
[binf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36]
[binf316:21583] [17] ring_c[0x400889]
[binf316:21583] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 21583 on node 316 exited on 
signal 6 (Aborted).
--

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, June 05, 2014 7:57 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Hmmm...I'm not sure how that is going to run with only one proc (I don't know 
if the program is protected against that scenario). If you run with -np 2 -mca 
btl openib,sm,self, is it happy?


On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Here's the command I'm invoking and the terminal output.  (Some of this 
information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code:  (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 4549 on node 316 exited on 
signal 6 (Aborted).
------

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program termin

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
Here's the command I'm invoking and the terminal output.  (Some of this 
information doesn't appear to be captured in the backtrace.)

[binf316:fischega] $ mpirun -np 1 -mca btl openib,self ring_c
ring_c: 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734:
 udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == 
((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:04549] *** Process received signal ***
[binf316:04549] Signal: Aborted (6)
[binf316:04549] Signal code:  (-6)
[binf316:04549] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f7f5955e7c0]
[binf316:04549] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7f7f5920ab55]
[binf316:04549] [ 2] /lib64/libc.so.6(abort+0x181)[0x7f7f5920c131]
[binf316:04549] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f7f59203a10]
[binf316:04549] [ 4] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f7f548a484b]
[binf316:04549] [ 5] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f7f548a3474]
[binf316:04549] [ 6] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f7f5489c316]
[binf316:04549] [ 7] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f7f54885817]
[binf316:04549] [ 8] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f7f5982da5e]
[binf316:04549] [ 9] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f7f54ac7d42]
[binf316:04549] [10] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f7f5982cd1b]
[binf316:04549] [11] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f7f539ed739]
[binf316:04549] [12] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f7f598539b2]
[binf316:04549] [13] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f7f597c033c]
[binf316:04549] [14] 
//_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f7f597f5386]
[binf316:04549] [15] ring_c[0x40096f]
[binf316:04549] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f7f591f6c36]
[binf316:04549] [17] ring_c[0x400889]
[binf316:04549] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 4549 on node 316 exited on 
signal 6 (Aborted).
--

From: Fischer, Greg A.
Sent: Thursday, June 05, 2014 5:10 PM
To: us...@open-mpi.org
Cc: Fischer, Greg A.
Subject: openib segfaults with Torque

OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x716680) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init 
(num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7

[OMPI users] openib segfaults with Torque

2014-06-05 Thread Fischer, Greg A.
OpenMPI Users,

After encountering difficulty with the Intel compilers (see the "intermittent 
segfaults with openib on ring_c.c" thread), I installed GCC-4.8.3 and 
recompiled OpenMPI. I ran the simple examples (ring, etc.) with the openib BTL 
in a typical BASH environment. Everything appeared to work fine, so I went on 
my merry way compiling the rest of my dependencies.

After getting my dependencies and applications compiled, I began observing 
segfaults when submitting the applications through Torque. I recompiled OpenMPI 
with debug options, ran "ring_c" over the openib BTL in an interactive Torque 
session ("qsub -I"), and got the backtrace below. All other system settings 
described in the previous thread are the same. Any thoughts on how to resolve 
this issue?

Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7f7f5920ab55 in raise () from /lib64/libc.so.6
#1  0x7f7f5920c0c5 in abort () from /lib64/libc.so.6
#2  0x7f7f59203a10 in __assert_fail () from /lib64/libc.so.6
#3  0x7f7f548a484b in udcm_module_finalize (btl=0x716680, cpc=0x718c40) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x7f7f548a3474 in udcm_component_query (btl=0x716680, cpc=0x717be8) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x7f7f5489c316 in ompi_btl_openib_connect_base_select_for_local_port 
(btl=0x716680) at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x7f7f54885817 in btl_openib_component_init 
(num_btl_modules=0x7fff906aa420, enable_progress_threads=false, 
enable_mpi_threads=false)
at 
../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x7f7f5982da5e in mca_btl_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x7f7f54ac7d42 in mca_bml_r2_component_init (priority=0x7fff906aa4f4, 
enable_progress_threads=false, enable_mpi_threads=false) at 
../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x7f7f5982cd1b in mca_bml_base_init (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x7f7f539ed739 in mca_pml_ob1_component_init (priority=0x7fff906aa630, 
enable_progress_threads=false, enable_mpi_threads=false)
at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x7f7f598539b2 in mca_pml_base_select (enable_progress_threads=false, 
enable_mpi_threads=false) at 
../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x7f7f597c033c in ompi_mpi_init (argc=1, argv=0x7fff906aa928, 
requested=0, provided=0x7fff906aa7d8) at 
../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x7f7f597f5386 in PMPI_Init (argc=0x7fff906aa82c, argv=0x7fff906aa820) 
at pinit.c:84
#14 0x0040096f in main (argc=1, argv=0x7fff906aa928) at ring_c.c:19

Greg


Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Ralph,

Thanks for looking. Let me know if there's any other testing that I can do.

I recompiled with GCC and it works fine, so that lends credence to your theory 
that it has something to do with the Intel compilers, and possibly their 
interplay with SUSE.

Greg

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 4:48 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Urggg...unfortunately, the people who know the most about that code are all 
at the MPI Forum this week, so we may not be able to fully address it until 
their return. It looks like you are still going down into that malloc 
interceptor, so I'm not correctly blocking it for you.

This run segfaulted in a completely different call in a different part of the 
startup procedure - but in the same part of the interceptor, which makes me 
suspicious. Don't know how much testing we've seen on SLES...


On Jun 4, 2014, at 1:18 PM, Fischer, Greg A. <fisch...@westinghouse.com> wrote:

> Ralph,
>
> It segfaults. Here's the backtrace:
>
> Core was generated by `ring_c'.
> Program terminated with signal 11, Segmentation fault.
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, 
> bytes=47840385564856) at 
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> 4098  bck->fd = unsorted_chunks(av);
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020,
> bytes=47840385564856) at
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
> #1  0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc
> (bytes=47840385564704) at
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
> #2  0x2b82b1a47b36 in opal_memory_linux_malloc_hook
> (sz=47840385564704, caller=0x2b82b53000b8) at
> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
> #3  0x2b82b19e7b18 in opal_malloc (size=47840385564704,
> file=0x2b82b53000b8 "", line=12) at
> ../../../openmpi-1.8.1/opal/util/malloc.c:101
> #4  0x2b82b199c017 in opal_hash_table_set_value_uint64
> (ht=0x2b82b5300020, key=47840385564856, value=0xc) at
> ../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
> #5  0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at
> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
> #6  0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440,
> args=184, cbdata=0xc) at
> ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
> #7  0x2b82b19fba1c in event_process_active_single_queue
> (base=0x655480, activeq=0x654920) at
> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/e
> vent.c:1367
> #8  0x2b82b19fbcd9 in event_process_active (base=0x655480) at
> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/e
> vent.c:1437
> #9  0x2b82b19fc4c3 in opal_libevent2021_event_base_loop
> (base=0x655480, flags=1) at
> ../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/e
> vent.c:1645
> #10 0x2b82b16f8763 in orte_progress_thread_engine
> (obj=0x2b82b5300020) at
> ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456
> #11 0x2b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0
> #12 0x2b82b1410d6d in clone () from /lib64/libc.so.6
> #13 0x in ?? ()
>
> Greg
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
> Castain
> Sent: Wednesday, June 04, 2014 3:49 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] intermittent segfaults with openib on
> ring_c.c
>
> Sorry for delay - digging my way out of the backlog. This is very strange as 
> you are failing in a simple asprintf call. We check that all the players are 
> non-NULL, and it appears that you are failing to allocate the memory for the 
> resulting (rather short) string.
>
> I'm wondering if this is some strange interaction between SLES, the Intel 
> compiler, and our malloc interceptor - or if there is some difference between 
> the malloc libraries on the two machines. Let's try running it without the 
> malloc interceptor and see if that helps.
>
> Try running with "-mca memory ^linux" on your cmd line
>
>
> On Jun 4, 2014, at 9:58 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> He isn't getting that far - he's failing in MPI_Init when the RTE
>> attempts to connect to the local daemon
>>
>>
>> On Jun 4, 2014, at 9:53 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Greg
>>>
>>> From your original email:
>>>
>>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,s

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Ralph,

It segfaults. Here's the backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098  bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b82b5300020, bytes=47840385564856) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x2b82b1a47e38 in opal_memory_ptmalloc2_malloc (bytes=47840385564704) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
#2  0x2b82b1a47b36 in opal_memory_linux_malloc_hook (sz=47840385564704, 
caller=0x2b82b53000b8) at 
../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
#3  0x2b82b19e7b18 in opal_malloc (size=47840385564704, file=0x2b82b53000b8 
"", line=12) at ../../../openmpi-1.8.1/opal/util/malloc.c:101
#4  0x2b82b199c017 in opal_hash_table_set_value_uint64 (ht=0x2b82b5300020, 
key=47840385564856, value=0xc) at 
../../openmpi-1.8.1/opal/class/opal_hash_table.c:283
#5  0x2b82b170e4ca in process_uri (uri=0x2b82b5300020 "\001") at 
../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:348
#6  0x2b82b170e941 in orte_oob_base_set_addr (fd=-1255145440, args=184, 
cbdata=0xc) at ../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:296
#7  0x2b82b19fba1c in event_process_active_single_queue (base=0x655480, 
activeq=0x654920) at 
../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1367
#8  0x2b82b19fbcd9 in event_process_active (base=0x655480) at 
../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1437
#9  0x2b82b19fc4c3 in opal_libevent2021_event_base_loop (base=0x655480, 
flags=1) at 
../../../../../../openmpi-1.8.1/opal/mca/event/libevent2021/libevent/event.c:1645
#10 0x2b82b16f8763 in orte_progress_thread_engine (obj=0x2b82b5300020) at 
../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456
#11 0x2b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0
#12 0x2b82b1410d6d in clone () from /lib64/libc.so.6
#13 0x in ?? ()

Greg

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 3:49 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sorry for delay - digging my way out of the backlog. This is very strange as 
you are failing in a simple asprintf call. We check that all the players are 
non-NULL, and it appears that you are failing to allocate the memory for the 
resulting (rather short) string.

I'm wondering if this is some strange interaction between SLES, the Intel 
compiler, and our malloc interceptor - or if there is some difference between 
the malloc libraries on the two machines. Let's try running it without the 
malloc interceptor and see if that helps.

Try running with "-mca memory ^linux" on your cmd line
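
For example, with the same test case used earlier in this thread (excluding the
"linux" memory component keeps the ptmalloc2 hooks from being loaded):

  mpirun -np 2 -mca memory ^linux --mca btl openib,self ring_c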


On Jun 4, 2014, at 9:58 AM, Ralph Castain <r...@open-mpi.org> wrote:

> He isn't getting that far - he's failing in MPI_Init when the RTE
> attempts to connect to the local daemon
>
>
> On Jun 4, 2014, at 9:53 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Greg
>>
>> From your original email:
>>
>>>> [binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
>>
>> This may not fix the problem,
>> but have you tried to add the shared memory btl to your mca parameter?
>>
>> mpirun -np 2 --mca btl openib,sm,self ring_c
>>
>> As far as I know, sm is the preferred transport layer for intra-node
>> communication.
>>
>> Gus Correa
>>
>>
>> On 06/04/2014 11:13 AM, Ralph Castain wrote:
>>> Thanks!! Really appreciate your help - I'll try to figure out what
>>> went wrong and get back to you
>>>
>>> On Jun 4, 2014, at 8:07 AM, Fischer, Greg A.
>>> <fisch...@westinghouse.com <mailto:fisch...@westinghouse.com>> wrote:
>>>
>>>> I re-ran with 1 processor and got more information. How about this?
>>>> Core was generated by `ring_c'.
>>>> Program terminated with signal 11, Segmentation fault.
>>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>>> bytes=47592367980728) at
>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>> 4098  bck->fd = unsorted_chunks(av);
>>>> (gdb) bt
>>>> #0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020,
>>>> bytes=47592367980728) at
>>>> ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
>>>> #1  0x2b48f2a15e38 in opal_memory_ptmalloc2_malloc
>

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
I re-ran with 1 processor and got more information. How about this?

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098  bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b48f6300020, bytes=47592367980728) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x2b48f2a15e38 in opal_memory_ptmalloc2_malloc (bytes=47592367980576) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:3433
#2  0x2b48f2a15b36 in opal_memory_linux_malloc_hook (sz=47592367980576, 
caller=0x2b48f63000b8) at 
../../../../../openmpi-1.8.1/opal/mca/memory/linux/hooks.c:691
#3  0x2b48f2374b90 in vasprintf () from /lib64/libc.so.6
#4  0x2b48f2354148 in asprintf () from /lib64/libc.so.6
#5  0x2b48f26dc7d1 in orte_oob_base_get_addr (uri=0x2b48f6300020) at 
../../../../openmpi-1.8.1/orte/mca/oob/base/oob_base_stubs.c:234
#6  0x2b48f53e7d4a in orte_rml_oob_get_uri () at 
../../../../../openmpi-1.8.1/orte/mca/rml/oob/rml_oob_contact.c:36
#7  0x2b48f26fa181 in orte_routed_base_register_sync (setup=32 ' ') at 
../../../../openmpi-1.8.1/orte/mca/routed/base/routed_base_fns.c:301
#8  0x2b48f4bbcccf in init_routes (job=4130340896, ndat=0x2b48f63000b8) at 
../../../../../openmpi-1.8.1/orte/mca/routed/binomial/routed_binomial.c:705
#9  0x2b48f26c615d in orte_ess_base_app_setup (db_restrict_local=32 ' ') at 
../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:245
#10 0x2b48f45b069f in rte_init () at 
../../../../../openmpi-1.8.1/orte/mca/ess/env/ess_env_module.c:146
#11 0x2b48f26935ab in orte_init (pargc=0x2b48f6300020, 
pargv=0x2b48f63000b8, flags=8) at 
../../openmpi-1.8.1/orte/runtime/orte_init.c:148
#12 0x2b48f1739d38 in ompi_mpi_init (argc=1, argv=0x7fffebf0d1f8, 
requested=8, provided=0x0) at 
../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:464
#13 0x2b48f1760a37 in PMPI_Init (argc=0x2b48f6300020, argv=0x2b48f63000b8) 
at pinit.c:84
#14 0x004024ef in main (argc=1, argv=0x7fffebf0d1f8) at ring_c.c:19

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 04, 2014 11:00 AM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Does the trace go any further back? Your prior trace seemed to indicate an 
error in our OOB framework, but in a very basic place. Looks like it could be 
an uninitialized variable, and having the line number down as deep as possible 
might help identify the source


On Jun 4, 2014, at 7:55 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and 
ran a backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098  bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x in ?? ()

Is that helpful?

Greg

From: Fischer, Greg A.
Sent: Wednesday, June 04, 2014 10:17 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c

I recompiled with "-enable-debug" but it doesn't seem to be providing any more 
information or a core dump. I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
Oops, ulimit was set improperly. I generated a core file, loaded it in GDB, and 
ran a backtrace:

Core was generated by `ring_c'.
Program terminated with signal 11, Segmentation fault.
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
4098  bck->fd = unsorted_chunks(av);
(gdb) bt
#0  opal_memory_ptmalloc2_int_malloc (av=0x2b8e4fd00020, bytes=47890224382136) 
at ../../../../../openmpi-1.8.1/opal/mca/memory/linux/malloc.c:4098
#1  0x in ?? ()
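
(For reference, the core was captured roughly as follows - the exact core file
name depends on the kernel's core_pattern setting:)

  ulimit -c unlimited
  mpirun -np 4 --mca btl openib,self ring_c
  gdb ring_c <core file>     # then "bt" at the (gdb) prompt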

Is that helpful?

Greg

From: Fischer, Greg A.
Sent: Wednesday, June 04, 2014 10:17 AM
To: 'Open MPI Users'
Cc: Fischer, Greg A.
Subject: RE: [OMPI users] intermittent segfaults with openib on ring_c.c

I recompiled with "-enable-debug" but it doesn't seem to be providing any more 
information or a core dump. I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
[binf112:05845] [ 8] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
[binf112:05845] [ 9] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
[binf112:05845] [10] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
[binf112:05845] [11] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
[binf112:05845] [12] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
[binf112:05845] [13] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
[binf112:05845] [14] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
[binf112:05845] [15] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
[binf112:05845] [16] ring_c[0x4024ef]
[binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
[binf112:05845] [18] ring_c[0x4023f9]
[binf112:05845] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 5845 on node 112 exited on 
signal 11 (Segmentation fault).
--

Does any of that help?

Greg

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, June 03, 2014 11:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sounds odd - can you configure OMPI --enable-debug and run it again? If it 
fails and you can get a core dump, could you tell us the line number where it 
is failing?


On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:

Apologies - I forgot to add some of the information requested by the FAQ:

1.   OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5

2.   Linux Distro / Kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux 102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) 
x86_64 x86_64 x86_64 GNU/Linux

3.   Not sure which subnet manger is being used - I think OpenSM, but I'll 
need to check with my administrators.

4.   Output of ibv_devinfo is attached.

5.   Ifconfig output is attached.

6.   Ul

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Fischer, Greg A.
I recompiled with "-enable-debug" but it doesn't seem to be providing any more 
information or a core dump. I'm compiling ring.c with:

mpicc ring_c.c -g -traceback -o ring_c

and running with:

mpirun -np 4 --mca btl openib,self ring_c

and I'm getting:

[binf112:05845] *** Process received signal ***
[binf112:05845] Signal: Segmentation fault (11)
[binf112:05845] Signal code: Address not mapped (1)
[binf112:05845] Failing at address: 0x10
[binf112:05845] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x2b2fa44d57c0]
[binf112:05845] [ 1] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_int_malloc+0x4b3)[0x2b2fa4ff2b03]
[binf112:05845] [ 2] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(opal_memory_ptmalloc2_malloc+0x58)[0x2b2fa4ff5288]
[binf112:05845] [ 3] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-pal.so.6(+0xd1f86)[0x2b2fa4ff4f86]
[binf112:05845] [ 4] /lib64/libc.so.6(vasprintf+0x3e)[0x2b2fa4957a7e]
[binf112:05845] [ 5] /lib64/libc.so.6(asprintf+0x88)[0x2b2fa4937148]
[binf112:05845] [ 6] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_util_convert_process_name_to_string+0xe2)[0x2b2fa4c873e2]
[binf112:05845] [ 7] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_oob_base_get_addr+0x25)[0x2b2fa4cbdb15]
[binf112:05845] [ 8] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_rml_oob.so(orte_rml_oob_get_uri+0xa)[0x2b2fa79c5d2a]
[binf112:05845] [ 9] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_routed_base_register_sync+0x1fd)[0x2b2fa4cdae7d]
[binf112:05845] [10] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_routed_binomial.so(+0x3c7b)[0x2b2fa719bc7b]
[binf112:05845] [11] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_ess_base_app_setup+0x3ad)[0x2b2fa4ca7c8d]
[binf112:05845] [12] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/openmpi/mca_ess_env.so(+0x169f)[0x2b2fa6b8f69f]
[binf112:05845] [13] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libopen-rte.so.7(orte_init+0x17b)[0x2b2fa4c764bb]
[binf112:05845] [14] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(ompi_mpi_init+0x438)[0x2b2fa3d1e198]
[binf112:05845] [15] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.8.1/lib/libmpi.so.1(MPI_Init+0xf7)[0x2b2fa3d44947]
[binf112:05845] [16] ring_c[0x4024ef]
[binf112:05845] [17] /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b2fa4906c36]
[binf112:05845] [18] ring_c[0x4023f9]
[binf112:05845] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 5845 on node 112 exited on 
signal 11 (Segmentation fault).
--

Does any of that help?

Greg
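
(For reference, the debug rebuild mentioned above amounts to reconfiguring and
reinstalling with --enable-debug - roughly the following, reusing whatever
compiler settings and prefix were used for the original build:)

  ./configure --enable-debug --prefix=<install prefix> [other options as before]
  make all install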

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, June 03, 2014 11:54 PM
To: Open MPI Users
Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c

Sounds odd - can you configure OMPI --enable-debug and run it again? If it 
fails and you can get a core dump, could you tell us the line number where it 
is failing?


On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Apologies - I forgot to add some of the information requested by the FAQ:

1.   OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5


2.   Linux Distro / Kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux 102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) 
x86_64 x86_64 x86_64 GNU/Linux


3.   Not sure which subnet manger is being used - I think OpenSM, but I'll 
need to check with my administrators.


4.   Output of ibv_devinfo is attached.


5.   Ifconfig output is attached.


6.   Ulimit -l output:

[binf102:fischega] $ ulimit -l
unlimited

Greg


From: Fischer, Greg A.
Sent: Tuesday, June 03, 2014 12:38 PM
To: Open MPI Users
Cc: Fischer, Greg A.
Subject: intermittent segfaults with openib on ring_c.c

Hello openmpi-users,

I'm running into a perplexing problem on a new system, whereby I'm experiencing 
intermittent segmentation faults when I run the ring_c.c example and use the 
openib BTL. See an example below. Approximately 50% of the time it provides the 
expected output, but the other 50% of the time, it segfaults. LD_LIBRARY_PATH 
is set correctly, and the version of "mpirun" being invoked is correct. The 
output of ompi_info -all is attached.

One potential problem may be that the system that OpenMPI was compiled on i

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-03 Thread Fischer, Greg A.
Apologies - I forgot to add some of the information requested by the FAQ:


1.   OpenFabrics is provided by the Linux distribution:

[binf102:fischega] $ rpm -qa | grep ofed
ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5
ofed-1.5.4.1-0.11.5
ofed-doc-1.5.4.1-0.11.5


2.   Linux Distro / Kernel:

[binf102:fischega] $ cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

[binf102:fischega] $ uname -a
Linux casl102 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) 
x86_64 x86_64 x86_64 GNU/Linux


3.   Not sure which subnet manger is being used - I think OpenSM, but I'll 
need to check with my administrators.


4.   Output of ibv_devinfo is attached.


5.   Ifconfig output is attached.


6.   Ulimit -l output:

[binf102:fischega] $ ulimit -l
unlimited

Greg

From: Fischer, Greg A.
Sent: Tuesday, June 03, 2014 12:38 PM
To: Open MPI Users
Cc: Fischer, Greg A.
Subject: intermittent segfaults with openib on ring_c.c

Hello openmpi-users,

I'm running into a perplexing problem on a new system, whereby I'm experiencing 
intermittent segmentation faults when I run the ring_c.c example and use the 
openib BTL. See an example below. Approximately 50% of the time it provides the 
expected output, but the other 50% of the time, it segfaults. LD_LIBRARY_PATH 
is set correctly, and the version of "mpirun" being invoked is correct. The 
output of ompi_info -all is attached.

One potential problem may be that the system that OpenMPI was compiled on is 
mostly the same as the system where it is being executed, but there are some 
differences in the installed packages. I've checked the critical ones 
(libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.

Can anyone suggest how I might start tracking this problem down?

Thanks,
Greg

[binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
[binf102:31268] *** Process received signal ***
[binf102:31268] Signal: Segmentation fault (11)
[binf102:31268] Signal code: Address not mapped (1)
[binf102:31268] Failing at address: 0x10
[binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
[binf102:31268] [ 1] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
 [0x2b42203fd7e3]
[binf102:31268] [ 2] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b)
 [0x2b4220400d3b]
[binf102:31268] [ 3] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f)
 [0x2b42204008ef]
[binf102:31268] [ 4] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) 
[0x2b4220400876]
[binf102:31268] [ 5] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c)
 [0x2b422572334c]
[binf102:31268] [ 6] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa)
 [0x2b422041d64a]
[binf102:31268] [ 7] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f)
 [0x2b422573612f]
[binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
[binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
[binf102:31268] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 31268 on node 102 exited on 
signal 11 (Segmentation fault).
--
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.8.000
node_guid:  0002:c903:0010:371e
sys_image_guid: 0002:c903:0010:3721
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   HP_016009
phys_port_cnt:  2
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid:   21
port_lmc:   0x00
link_layer: IB

port:   2
state:  PORT_DOWN (1)
max_mtu:4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid:   0
port_lmc:   0x00
link_layer: IB

eth0  Link encap:Ethernet  HWaddr 3C:4A:92:F5:2F:B0  
  inet addr:10.179.32.21  Bcast:10.179.32.255  Mask:255

[OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-03 Thread Fischer, Greg A.
Hello openmpi-users,

I'm running into a perplexing problem on a new system, whereby I'm experiencing 
intermittent segmentation faults when I run the ring_c.c example and use the 
openib BTL. See an example below. Approximately 50% of the time it provides the 
expected output, but the other 50% of the time, it segfaults. LD_LIBRARY_PATH 
is set correctly, and the version of "mpirun" being invoked is correct. The 
output of ompi_info -all is attached.

One potential problem may be that the system that OpenMPI was compiled on is 
mostly the same as the system where it is being executed, but there are some 
differences in the installed packages. I've checked the critical ones 
(libibverbs, librdmacm, libmlx4-rdmav2, etc.), and they appear to be the same.

Can anyone suggest how I might start tracking this problem down?

Thanks,
Greg

[binf102:fischega] $ mpirun -np 2 --mca btl openib,self ring_c
[binf102:31268] *** Process received signal ***
[binf102:31268] Signal: Segmentation fault (11)
[binf102:31268] Signal code: Address not mapped (1)
[binf102:31268] Failing at address: 0x10
[binf102:31268] [ 0] /lib64/libpthread.so.0(+0xf7c0) [0x2b42213f57c0]
[binf102:31268] [ 1] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_malloc+0x4b3)
 [0x2b42203fd7e3]
[binf102:31268] [ 2] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_int_memalign+0x8b)
 [0x2b4220400d3b]
[binf102:31268] [ 3] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_memory_ptmalloc2_memalign+0x6f)
 [0x2b42204008ef]
[binf102:31268] [ 4] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(+0x117876) 
[0x2b4220400876]
[binf102:31268] [ 5] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0xc34c)
 [0x2b422572334c]
[binf102:31268] [ 6] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/libmpi.so.1(opal_class_initialize+0xaa)
 [0x2b422041d64a]
[binf102:31268] [ 7] 
//_ib/intel-12.1.0.233/toolset/openmpi-1.6.5/lib/openmpi/mca_btl_openib.so(+0x1f12f)
 [0x2b422573612f]
[binf102:31268] [ 8] /lib64/libpthread.so.0(+0x77b6) [0x2b42213ed7b6]
[binf102:31268] [ 9] /lib64/libc.so.6(clone+0x6d) [0x2b42216dcd6d]
[binf102:31268] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 31268 on node 102 exited on 
signal 11 (Segmentation fault).
--
 Package: Open MPI fischega@binford Distribution
Open MPI: 1.6.5
   Open MPI SVN revision: r28673
   Open MPI release date: Jun 26, 2013
Open RTE: 1.6.5
   Open RTE SVN revision: r28673
   Open RTE release date: Jun 26, 2013
OPAL: 1.6.5
   OPAL SVN revision: r28673
   OPAL release date: Jun 26, 2013
 MPI API: 2.1
Ident string: 1.6.5
   MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.6.5)
  MCA memory: linux (MCA v2.0, API v2.0, Component v1.6.5)
   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
   MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.6.5)
   MCA carto: file (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: mmap (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: posix (MCA v2.0, API v2.0, Component v1.6.5)
   MCA shmem: sysv (MCA v2.0, API v2.0, Component v1.6.5)
   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.5)
   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.5)
   MCA timer: linux (MCA v2.0, API v2.0, Component v1.6.5)
 MCA installdirs: env (MCA v2.0, API v2.0, Component v1.6.5)
 MCA installdirs: config (MCA v2.0, API v2.0, Component v1.6.5)
 MCA sysinfo: linux (MCA v2.0, API v2.0, Component v1.6.5)
   MCA hwloc: hwloc132 (MCA v2.0, API v2.0, Component v1.6.5)
 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.6.5)
  MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.6.5)
   MCA allocator: basic (MCA v2.0, API v2.0, Component v1.6.5)
   MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: inter (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: self (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.6.5)
MCA coll: tuned (MCA v2.0, API v2.0, Component v1.6.5)
  MCA io: romio (MCA v2.0, API v2.0, Component v1.6.5)
   MCA mpool: fake (MCA v2.0, API v2.0, 

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
Yep. That was the problem. It works beautifully now.

Thanks for prodding me to take another look.

With regards to openmpi-1.6.5, the system that I'm compiling and running on, 
SLES10, contains some pretty dated software (e.g. Linux 2.6.x, python 2.4, gcc 
4.1.2). Is it possible there's simply an incompatibility lurking in there 
somewhere that would trip openmpi-1.6.5 but not openmpi-1.4.3?

Greg

>-Original Message-
>From: Fischer, Greg A.
>Sent: Friday, January 24, 2014 11:41 AM
>To: 'Open MPI Users'
>Cc: Fischer, Greg A.
>Subject: RE: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Hmm... It looks like CMAKE was somehow finding openmpi-1.6.5 instead of
>openmpi-1.4.3, despite the environment variables being set otherwise. This
>is likely the explanation. I'll try to chase that down.
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>Squyres (jsquyres)
>>Sent: Friday, January 24, 2014 11:39 AM
>>To: Open MPI Users
>>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>>consumes all system resources
>>
>>Ok.  I only mention this because the "mca_paffinity_linux.so: undefined
>>symbol: mca_base_param_reg_int" type of message is almost always an
>>indicator of two different versions being installed into the same tree.
>>
>>
>>On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A."
>><fisch...@westinghouse.com> wrote:
>>
>>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>>
>>> 1003 fischega@lxlogin2[~]> ls
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>>> bin  etc  include  lib  share
>>>
>>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>>> bin  etc  include  lib  share
>>>
>>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was
>>> set
>>correctly, but I'll check again.
>>>
>>>> -Original Message-
>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>> Squyres (jsquyres)
>>>> Sent: Friday, January 24, 2014 11:07 AM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>>> and consumes all system resources
>>>>
>>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
>>>> <fisch...@westinghouse.com> wrote:
>>>>
>>>>> The reason for deleting the openmpi-1.6.5 installation was that I
>>>>> went back
>>>> and installed openmpi-1.4.3 and the problem (mostly) went away.
>>>> Openmpi-
>>>> 1.4.3 can run the simple tests without issue, but on my "real"
>>>> program, I'm getting symbol lookup errors:
>>>>>
>>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>>
>>>> This sounds like you are mixing 1.6.x and 1.4.x in the same
>>>> installation
>>tree.
>>>> This can definitely lead to sadness.
>>>>
>>>> More specifically: installing 1.6 over an existing 1.4 installation
>>>> (and vice
>>>> versa) is definitely NOT supported.  The set of plugins that the two
>>>> install are different, and can lead to all manner of weird/undefined
>>behavior.
>>>>
>>>> FWIW: I typically install Open MPI into a tree by itself.  And if I
>>>> later want to remove that installation, I just "rm -rf" that tree.
>>>> Then I can install a different version of OMPI into that same tree
>>>> (because the prior tree is completely gone).
>>>>
>>>> However, if you can't install OMPI into a tree by itself, you can
>>>> "make uninstall" from the source tree, and that should surgically
>>>> completely remove OMPI from the installation tree.  Then it is safe
>>>> to install a different version of OMPI into that same tree.
>>>>
>>>> Can you verify that you had installed OMPI into completely clean
>>>> trees?  If you didn't, I can imagine that causing the kinds of
>>>> errors that you
>>described.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>--
>>Jeff Squyres
>>jsquy...@cisco.com
>>For corporate legal information go to:
>>http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>___
>>users mailing list
>>us...@open-mpi.org
>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>




Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
Hmm... It looks like CMAKE was somehow finding openmpi-1.6.5 instead of 
openmpi-1.4.3, despite the environment variables being set otherwise. This is 
likely the explanation. I'll try to chase that down.
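
One way to pin the build to a specific Open MPI is to hand the wrapper compilers
to CMake explicitly (a sketch - variable names per CMake's stock FindMPI module;
clear the build tree's CMakeCache.txt first so stale results aren't reused):

  cmake \
    -DMPI_C_COMPILER=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3/bin/mpicc \
    -DMPI_CXX_COMPILER=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3/bin/mpicxx \
    -DMPI_Fortran_COMPILER=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3/bin/mpif90 \
    <path to source>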

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Friday, January 24, 2014 11:39 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Ok.  I only mention this because the "mca_paffinity_linux.so: undefined
>symbol: mca_base_param_reg_int" type of message is almost always an
>indicator of two different versions being installed into the same tree.
>
>
>On Jan 24, 2014, at 11:26 AM, "Fischer, Greg A."
><fisch...@westinghouse.com> wrote:
>
>> Version 1.4.3 and 1.6.5 were and are installed in separate trees:
>>
>> 1003 fischega@lxlogin2[~]> ls
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
>> bin  etc  include  lib  share
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
>> bin  etc  include  lib  share
>>
>> I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set
>correctly, but I'll check again.
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>> Squyres (jsquyres)
>>> Sent: Friday, January 24, 2014 11:07 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>>> and consumes all system resources
>>>
>>> On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
>>> <fisch...@westinghouse.com> wrote:
>>>
>>>> The reason for deleting the openmpi-1.6.5 installation was that I
>>>> went back
>>> and installed openmpi-1.4.3 and the problem (mostly) went away.
>>> Openmpi-
>>> 1.4.3 can run the simple tests without issue, but on my "real"
>>> program, I'm getting symbol lookup errors:
>>>>
>>>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>>>
>>> This sounds like you are mixing 1.6.x and 1.4.x in the same installation
>tree.
>>> This can definitely lead to sadness.
>>>
>>> More specifically: installing 1.6 over an existing 1.4 installation
>>> (and vice
>>> versa) is definitely NOT supported.  The set of plugins that the two
>>> install are different, and can lead to all manner of weird/undefined
>>> behavior.
>>>
>>> FWIW: I typically install Open MPI into a tree by itself.  And if I
>>> later want to remove that installation, I just "rm -rf" that tree.
>>> Then I can install a different version of OMPI into that same tree
>>> (because the prior tree is completely gone).
>>>
>>> However, if you can't install OMPI into a tree by itself, you can
>>> "make uninstall" from the source tree, and that should surgically
>>> completely remove OMPI from the installation tree.  Then it is safe
>>> to install a different version of OMPI into that same tree.
>>>
>>> Can you verify that you had installed OMPI into completely clean
>>> trees?  If you didn't, I can imagine that causing the kinds of errors that 
>>> you described.
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>




Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-24 Thread Fischer, Greg A.
Version 1.4.3 and 1.6.5 were and are installed in separate trees:

1003 fischega@lxlogin2[~]> ls 
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.*
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3:
bin  etc  include  lib  share

/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5:
bin  etc  include  lib  share

I'm fairly sure I was careful to check that the LD_LIBRARY_PATH was set 
correctly, but I'll check again.
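
As a quick sanity check, something along these lines should show which installation
actually gets picked up at run time (standard tools only; "test_program" stands in
for the real executable):

which mpirun
mpirun --version
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep openmpi
# confirm which libmpi the binary resolves when the dynamic linker runs:
ldd ./test_program | grep -i libmpi
# ompi_info reports the prefix and version of the installation it belongs to:
ompi_info | grep -E 'Open MPI:|Prefix:'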

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Friday, January 24, 2014 11:07 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>On Jan 22, 2014, at 10:21 AM, "Fischer, Greg A."
><fisch...@westinghouse.com> wrote:
>
>> The reason for deleting the openmpi-1.6.5 installation was that I went back
>> and installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-
>> 1.4.3 can run the simple tests without issue, but on my "real" program, I'm
>> getting symbol lookup errors:
>>
>> mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int
>
>This sounds like you are mixing 1.6.x and 1.4.x in the same installation tree.
>This can definitely lead to sadness.
>
>More specifically: installing 1.6 over an existing 1.4 installation (and vice
>versa) is definitely NOT supported.  The set of plugins that the two install
>are different, and can lead to all manner of weird/undefined behavior.
>
>FWIW: I typically install Open MPI into a tree by itself.  And if I later want
>to remove that installation, I just "rm -rf" that tree.  Then I can install a
>different version of OMPI into that same tree (because the prior tree is
>completely gone).
>
>However, if you can't install OMPI into a tree by itself, you can "make
>uninstall" from the source tree, and that should surgically completely
>remove OMPI from the installation tree.  Then it is safe to install a different
>version of OMPI into that same tree.
>
>Can you verify that you had installed OMPI into completely clean trees?  If
>you didn't, I can imagine that causing the kinds of errors that you described.
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>
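
To make the "clean tree" suggestion above concrete, a minimal sketch of keeping
each version in its own prefix (build directory names are illustrative; the
prefixes mirror the ones listed earlier in the thread):

# build and install each version into its own, previously empty prefix
cd openmpi-1.4.3                 # unpacked source tree
./configure --prefix=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3
make -j4 all && make install

# removing a version later: either wipe the whole prefix ...
rm -rf /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3
# ... or, from the same build tree, remove exactly what was installed:
make uninstall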




Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-22 Thread Fischer, Greg A.
Well, this is a little strange. The hanging behavior is gone, but I'm getting a 
segfault now. The outputs of "hello_c.c" and "ring_c.c" are attached. 

I'm getting a segfault with the Fortran test, also. I'm afraid I may have 
polluted the experiment by removing the target openmpi-1.6.5 installation 
directory yesterday. To produce the attached outputs, I just went back and did 
"make install" in the openmpi-1.6.5 build directory. I've re-set the 
environment variables as they were a few days ago by sourcing the same bash 
script. Perhaps I forgot something, or something on the system changed? 
Regardless, LD_LIBRARY_PATH and PATH are set correctly, and the aberrant 
behavior persists.

The reason for deleting the openmpi-1.6.5 installation was that I went back and 
installed openmpi-1.4.3 and the problem (mostly) went away. Openmpi-1.4.3 can 
run the simple tests without issue, but on my "real" program, I'm getting 
symbol lookup errors: 

mca_paffinity_linux.so: undefined symbol: mca_base_param_reg_int

Perhaps that's a separate thread.
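
A hypothetical way to check whether the 1.4.3 tree picked up leftovers from another
build (nothing here is specific to paffinity; the idea is just to compare install
timestamps and the component versions ompi_info reports):

OMPI143=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.4.3
# plugins installed by a single build normally share one install timestamp;
# an outlier usually means a leftover from a different version:
ls -lt $OMPI143/lib/openmpi/ | head
ls -lt $OMPI143/lib/openmpi/ | tail
# ompi_info prints the version each component was built against, which makes
# a stray 1.6.x plugin in a 1.4.3 tree easy to spot:
$OMPI143/bin/ompi_info | grep -i paffinity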

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>Squyres (jsquyres)
>Sent: Tuesday, January 21, 2014 3:57 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and
>consumes all system resources
>
>Just for giggles, can you repeat the same test but with hello_c.c and ring_c.c?
>I.e., let's get the Fortran out of the way and use just the base C bindings,
>and see what happens.
>
>
>On Jan 19, 2014, at 6:18 PM, "Fischer, Greg A." <fisch...@westinghouse.com>
>wrote:
>
>> I just tried running "hello_f90.f90" and see the same behavior: 100% CPU
>> usage, gradually increasing memory consumption, and failure to get past
>> mpi_finalize. LD_LIBRARY_PATH is set as:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib
>>
>> The installation target for this version of OpenMPI is:
>>
>>
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5
>>
>> 1045 fischega@lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]>
>> which mpirun
>> /tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpirun
>>
>> Perhaps something strange is happening with GCC? I've tried simple hello
>> world C and Fortran programs, and they work normally.
>>
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Sunday, January 19, 2014 11:36 AM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize
>> and consumes all system resources
>>
>> The OFED warning about registration is something OMPI added at one point
>when we isolated the cause of jobs occasionally hanging, so you won't see
>that warning from other MPIs or earlier versions of OMPI (I forget exactly
>when we added it).
>>
>> The problem you describe doesn't sound like an OMPI issue - it sounds like
>you've got a memory corruption problem in the code. Have you tried running
>the examples in our example directory to confirm that the installation is
>good?
>>
>> Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pick up
>the OMPI libs you installed - most Linux distros come with an older version,
>and that can cause problems if you inadvertently pick them up.
>>
>>
>> On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. <fisch...@westinghouse.com>
>wrote:
>>
>>
>> Hello,
>>
>> I have a simple, 1-process test case that gets stuck on the mpi_finalize 
>> call.
>The test case is a dead-simple calculation of pi - 50 lines of Fortran. The
>process gradually consumes more and more memory until the system
>becomes unresponsive and needs to be rebooted, unless the job is killed
>first.
>>
>> In the output, attached, I see the warning message about OpenFabrics
>being configured to only allow registering part of physical memory. I've tried
>to chase this down with my administrator to no avail yet. (I am aware of the
>relevant FAQ entry.)  A different installation of MPI on the same system,
>made with a different compiler, does not produce the OpenFabrics memory
>registration warning - which seems strange because I thought it was a system
>configuration issue independent of MPI. Also curious in the output is that LSF
>seems to think there are 7 processes and 11 threads associated with this job.
>>
>> The particulars of my configuration are attached and detailed below. Does
>anyone see anything potentially problematic?
>>
>> Thanks,
>> Greg
>>
>> OpenMPI Version: 1.6.5
>

Re: [OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-19 Thread Fischer, Greg A.
I just tried running "hello_f90.f90" and see the same behavior: 100% CPU usage, 
gradually increasing memory consumption, and failure to get past mpi_finalize. 
LD_LIBRARY_PATH is set as:


/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib

The installation target for this version of OpenMPI is:

/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5

1045 fischega@lxlogin2[/data/fischega/petsc_configure/mpi_test/simple]> which 
mpirun
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin/mpirun

Perhaps something strange is happening with GCC? I've tried simple hello world 
C and Fortran programs, and they work normally.
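
If it helps, a stack trace from the spinning process should show where mpi_finalize
is stuck; a minimal sketch, assuming gdb is available on the node and that 12345
stands in for the real PID:

# find the PID of the process burning CPU (process name is illustrative):
ps -u $USER -o pid,pcpu,rss,comm | grep hello_f90
# attach, dump every thread's backtrace, and detach:
gdb -p 12345 -batch -ex 'thread apply all bt'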

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Sunday, January 19, 2014 11:36 AM
To: Open MPI Users
Subject: Re: [OMPI users] simple test problem hangs on mpi_finalize and 
consumes all system resources

The OFED warning about registration is something OMPI added at one point when 
we isolated the cause of jobs occasionally hanging, so you won't see that 
warning from other MPIs or earlier versions of OMPI (I forget exactly when we 
added it).

The problem you describe doesn't sound like an OMPI issue - it sounds like 
you've got a memory corruption problem in the code. Have you tried running the 
examples in our example directory to confirm that the installation is good?

Also, check to ensure that your LD_LIBRARY_PATH is correctly set to pick up the 
OMPI libs you installed - most Linux distros come with an older version, and 
that can cause problems if you inadvertently pick them up.


On Jan 19, 2014, at 5:51 AM, Fischer, Greg A. 
<fisch...@westinghouse.com<mailto:fisch...@westinghouse.com>> wrote:


Hello,

I have a simple, 1-process test case that gets stuck on the mpi_finalize call. 
The test case is a dead-simple calculation of pi - 50 lines of Fortran. The 
process gradually consumes more and more memory until the system becomes 
unresponsive and needs to be rebooted, unless the job is killed first.

In the output, attached, I see the warning message about OpenFabrics being 
configured to only allow registering part of physical memory. I've tried to 
chase this down with my administrator to no avail yet. (I am aware of the 
relevant FAQ entry.)  A different installation of MPI on the same system, made 
with a different compiler, does not produce the OpenFabrics memory registration 
warning - which seems strange because I thought it was a system configuration 
issue independent of MPI. Also curious in the output is that LSF seems to think 
there are 7 processes and 11 threads associated with this job.

The particulars of my configuration are attached and detailed below. Does 
anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2

uname -a : Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 
x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib

PATH= 
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts

Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")
___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] simple test problem hangs on mpi_finalize and consumes all system resources

2014-01-19 Thread Fischer, Greg A.
Hello,

I have a simple, 1-process test case that gets stuck on the mpi_finalize call. 
The test case is a dead-simple calculation of pi - 50 lines of Fortran. The 
process gradually consumes more and more memory until the system becomes 
unresponsive and needs to be rebooted, unless the job is killed first.

In the output, attached, I see the warning message about OpenFabrics being 
configured to only allow registering part of physical memory. I've tried to 
chase this down with my administrator to no avail yet. (I am aware of the 
relevant FAQ entry.)  A different installation of MPI on the same system, made 
with a different compiler, does not produce the OpenFabrics memory registration 
warning - which seems strange because I thought it was a system configuration 
issue independent of MPI. Also curious in the output is that LSF seems to think 
there are 7 processes and 11 threads associated with this job.
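
For reference, the checks that FAQ entry describes come down to something like the
following on mlx4-based hardware (parameter names assume the mlx4_core driver, the
modprobe.d file name is illustrative, and changing the values needs root plus a
driver reload or reboot):

# registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page size
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
getconf PAGE_SIZE
# with 4 KiB pages, log_num_mtt=20 and log_mtts_per_seg=3 would be consistent
# with the 32 GiB limit reported in the attached output; log_num_mtt=21 would
# cover the full 64 GiB of RAM
echo "options mlx4_core log_num_mtt=21 log_mtts_per_seg=3" >> /etc/modprobe.d/mlx4_core.conf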

The particulars of my configuration are attached and detailed below. Does 
anyone see anything potentially problematic?

Thanks,
Greg

OpenMPI Version: 1.6.5
Compiler: GCC 4.6.1
OS: SuSE Linux Enterprise Server 10, Patchlevel 2

uname -a : Linux lxlogin2 2.6.16.60-0.21-smp #1 SMP Tue May 6 12:41:02 UTC 2008 
x86_64 x86_64 x86_64 GNU/Linux

LD_LIBRARY_PATH=/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/lib:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/lib64:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/lib

PATH= 
/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/python-2.7.6/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/openmpi-1.6.5/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/gcc-4.6.1/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/git-1.7.0.4/bin:/tools/casl_sles10/vera_clean/gcc-4.6.1/toolset/cmake-2.8.11.2/bin:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/etc:/tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin:/usr/bin:.:/bin:/usr/scripts

Execution command: (executed via LSF - effectively "mpirun -np 1 test_program")
Sender: LSF System 
Subject: Job 900527:  Exited

Job  was submitted from host  by user  in 
cluster .
Job was executed on host(s) , in queue , as user  in 
cluster .
 was used as the home directory.
 was used as the working directory.
Started at Sat Jan 18 21:47:47 2014
Results reported at Sat Jan 18 21:48:33 2014

Your job looked like:


# LSBATCH: User input
mpirun.lsf pi


TERM_OWNER: job killed by owner.
Exited with exit code 1.

Resource usage summary:

CPU time   : 41.56 sec.
Max Memory : 12075 MB
Max Swap   : 12213 MB

Max Processes  : 7
Max Threads    : 11

The output (if any) follows:

--
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:  bl1211
  Registerable memory: 32768 MiB
  Total memory:64618 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--
 MPI process0  running on node bl1211.
 Running 5000  samples over1  proc(s).
 pi is3.1415926535895617   Error is  2.31370478331882623E-013
 THIS IS THE END.
--
mpirun has exited due to process rank 0 with PID 29294 on
node bl1211 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
Job  /tools/lsf/7.0.6.EC/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper pi

TID   HOST_NAME   COMMAND_LINE   STATUS