Hi,
I've encountered strange issues when trying to run a simple MPI job
on a single host that has an IB HCA.
The complete error output:
-> mpirun -n 1 hello
--------------------------------------------------------------------------
WARNING: Failed to open "ofa-v2-mlx4_0-1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[53031,1],0]: A high-performance Open MPI point-to-point messaging
module was unable to find any relevant network interfaces:
Module: uDAPL
Host: n01
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.
See this Open MPI FAQ item for more information on these Linux kernel
module parameters:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: n01
Registerable memory: 32768 MiB
Total memory: 65503 MiB
Your MPI job will continue, but may behave poorly and/or hang.
--------------------------------------------------------------------------
Process 0 on n01 out of 1
[n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
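For what it's worth, I assume the uDAPL warnings could be silenced by keeping the udapl BTL out of the selection, e.g. (the exact BTL list is my assumption about what was built):
-> mpirun --mca btl openib,sm,self -n 1 hello
or by excluding udapl explicitly:
-> mpirun --mca btl ^udapl -n 1 hello
but I'd still like to understand why the ofa-v2-mlx4_0-1 entry from dat.conf fails to open.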
Following is my setup and other relevant info:
OS: CentOS 6.3 x86_64
installed OFED 3.5 from source (./install.pl --all)
installed Open MPI 1.6.4 with the following build parameters:
rpmbuild --rebuild openmpi-1.6.4-1.src.rpm \
  --define '_prefix /opt/openmpi/1.6.4/gcc' \
  --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' \
  --define '_mandir %{_prefix}/share/man' \
  --define '_datadir %{_prefix}/share' \
  --define 'configure_options --with-openib=/usr --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities --with-udapl --with-sge --disable-vt' \
  --define 'use_default_rpm_opt_flags 1' \
  --define '_name openmpi-1.6.4_gcc' \
  --define 'install_shell_scripts 1' \
  --define 'shell_scripts_basename mpivars' \
  --define '_usr /usr' \
  --define 'ofed 0' 2>&1 | tee openmpi.build.sge
(--disable-vt was used because CUDA is present on the build host; VampirTrace
links against it automatically, which turns CUDA into an rpm dependency with
no matching rpm.)
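For reference, I assume ompi_info can confirm which BTL components were actually built into this installation:
-> ompi_info | grep -i btl
I expect both openib and udapl to show up there, given the configure options above, but I have not pasted its output here.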
max locked memory is unlimited:
->ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515028
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
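Since max locked memory is already unlimited, I assume the "Registerable memory: 32768 MiB" line points at the mlx4_core MTT module parameters described in the FAQ item quoted above: registerable memory should be roughly 2^log_num_mtt * 2^log_mtts_per_seg * page_size, and 2^20 * 2^3 * 4096 bytes is exactly 32 GiB. If I read the FAQ correctly, raising log_num_mtt and reloading the driver should fix that part; the file name and values below are my guess, aiming for about 2x the 65503 MiB of RAM:
-> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=22 log_mtts_per_seg=3
(2^22 * 2^3 * 4096 bytes = 128 GiB)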
IB devices are present:
->ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.9.1000
node_guid: 0002:c903:004d:b0e2
sys_image_guid: 0002:c903:004d:b0e5
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: MT_0D90110009
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 53
port_lmc: 0x00
link_layer: InfiniBand
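Since the HCA itself looks fine, I assume the relevant uDAPL registry entry can be checked directly (the dat.conf path is my guess for this OFED version):
-> grep mlx4_0 /etc/dat.conf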
the hello program source:
->cat hello.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    MPI_Finalize();
    return 0;
}
simply compiled as:
mpicc hello.c -o hello
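If it helps, I assume the BTL that actually gets selected at run time can be seen by raising the BTL framework verbosity, e.g.:
-> mpirun --mca btl_base_verbose 30 -n 1 hello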
the IB modules seem to be present:
->service openibd status
HCA driver loaded
Configured IPoIB devices:
ib0
Currently active IPoIB devices:
ib0
The following OFED modules are loaded:
rdma_ucm
rdma_cm
ib_addr
ib_ipoib
mlx4_core
mlx4_ib
mlx4_en
ib_mthca
ib_uverbs
ib_umad
ib_sa
ib_cm
ib_mad
ib_core
iw_cxgb3
iw_cxgb4
iw_nes
ib_qib
Can anyone help?