Hi,

I've encountered strange issues when trying to run a simple MPI job on a single host that has IB.
The complete error output:

-> mpirun -n 1 hello
--------------------------------------------------------------------------
WARNING: Failed to open "ofa-v2-mlx4_0-1" [DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[53031,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: uDAPL
  Host: n01

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              n01
  Registerable memory:     32768 MiB
  Total memory:            65503 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
Process 0 on n01 out of 1
[n01:13534] 7 more processes have sent help message help-mpi-btl-udapl.txt / dat_ia_open fail
[n01:13534] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
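As a workaround (not a fix), I'm considering excluding the udapl BTL so that Open MPI only tries openib; a sketch of what I have in mind:

-> mpirun --mca btl ^udapl -n 1 hello

or, to make it permanent, adding "btl = ^udapl" to etc/openmpi-mca-params.conf under my install prefix. That should silence the uDAPL warnings, but I'd still like to understand why dat_ia_open fails in the first place.
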
Following is my setup and other info:
OS: CentOS 6.3 x86_64
installed OFED 3.5 from source (./install.pl --all)
installed Open MPI 1.6.4 with the following build parameters:
rpmbuild --rebuild openmpi-1.6.4-1.src.rpm --define '_prefix /opt/openmpi/1.6.4/gcc' --define '_defaultdocdir /opt/openmpi/1.6.4/gcc' --define '_mandir %{_prefix}/share/man' --define '_datadir %{_prefix}/share' --define 'configure_options --with-openib=/usr --with-openib-libdir=/usr/lib64 CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpirun-prefix-by-default --target=x86_64-unknown-linux-gnu --with-hwloc=/usr/local --with-libltdl --enable-branch-probabilities --with-udapl --with-sge --disable-vt' --define 'use_default_rpm_opt_flags 1' --define '_name openmpi-1.6.4_gcc' --define 'install_shell_scripts 1' --define 'shell_scripts_basename mpivars' --define '_usr /usr' --define 'ofed 0' 2>&1 | tee openmpi.build.sge
(--disable-vt was used because CUDA is present: VampirTrace links it automatically, making it a dependency with no matching RPM.)
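
To confirm which BTL components actually got built into this install, I assume ompi_info can list them:

-> ompi_info | grep -i btl

and I would expect both openib and udapl to show up there, given the configure options above.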

max locked memory is unlimited:
->ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 515028
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
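
Since max locked memory is already unlimited, I suspect the registerable-memory warning is about the mlx4_core MTT module parameters that the FAQ item above describes: registerable memory should be 2^log_num_mtt * 2^log_mtts_per_seg * page size. With 4 KiB pages, the reported 32768 MiB (2^35 bytes) corresponds to log_num_mtt + log_mtts_per_seg = 23, and covering twice the 65503 MiB of RAM (~2^37 bytes) would need 25. A sketch of what I'd try (the modprobe.d file name is my guess):

-> cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core log_num_mtt=22 log_mtts_per_seg=3

followed by a driver reload (service openibd restart). Does that calculation look right?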
IB devices are present:
->ibv_devinfo
hca_id:    mlx4_0
    transport:            InfiniBand (0)
    fw_ver:                2.9.1000
    node_guid:            0002:c903:004d:b0e2
    sys_image_guid:            0002:c903:004d:b0e5
    vendor_id:            0x02c9
    vendor_part_id:            26428
    hw_ver:                0xB0
    board_id:            MT_0D90110009
    phys_port_cnt:            1
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            2
            port_lid:        53
            port_lmc:        0x00
            link_layer:        InfiniBand

the hello program source:
->cat hello.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  MPI_Finalize();
  return 0;   /* main is declared int, so return a status */
}
simply compiled as:
mpicc hello.c -o hello
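
To check whether the openib BTL is usable at all (rather than silently falling back to tcp/sm), I was going to force it; a sketch, with self included for loopback:

-> mpirun --mca btl openib,self -n 2 hello

If that run aborts, it would at least narrow the problem down to the openib side rather than uDAPL.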

the IB modules seem to be present:
->service openibd status

  HCA driver loaded

Configured IPoIB devices:
ib0

Currently active IPoIB devices:
ib0

The following OFED modules are loaded:

  rdma_ucm
  rdma_cm
  ib_addr
  ib_ipoib
  mlx4_core
  mlx4_ib
  mlx4_en
  ib_mthca
  ib_uverbs
  ib_umad
  ib_sa
  ib_cm
  ib_mad
  ib_core
  iw_cxgb3
  iw_cxgb4
  iw_nes
  ib_qib

Can anyone help?
