Any hints on my previous mail?
Does Open MPI 1.3.3 support only certain versions of OFED, or is any version OK?
On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B <forum....@gmail.com>
wrote:
Hi,
A Fortran application is installed with Intel Fortran 10.1, MKL 10,
and Open MPI 1.3.3 on a Rocks-5.1 HPC Linux cluster. The jobs do not
scale when more than one node is used. Each node has two Intel
quad-core Xeon (E5472) processors @ 3.00 GHz (8 cores and 16 GB RAM
per node), and the nodes are connected with InfiniBand.
Here are some of the timings:
12 cores (Node 1: 8 cores, Node 2: 4 cores)      -- no progress in the job
 8 cores (Node 1: 8 cores)                       -- 21 hours (38 CG move steps)
 4 cores (Node 1: 4 cores)                       -- 25 hours
12 cores (Nodes 1, 2, 3: 4 cores each)           -- no progress
Later, to check whether Open MPI was using IB or not, I ran with
--mca btl openib. The job failed with the following error message:
# cat /home1/g03/apps_test/amber/test16/err.352.job16
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[23671,1],12]) is on host: compute-0-12.local
Process 2 ([[23671,1],12]) is on host: compute-0-12.local
BTLs attempted: openib
Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-12.local:5496] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
mpirun has exited due to process rank 12 with PID 5496 on
node compute-0-12.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[compute-0-5.local:06910] 15 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[compute-0-5.local:06910] 15 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Process 1 ([[23958,1],2]) is on host: compute-0-5.local
Process 2 ([[23958,1],2]) is on host: compute-0-5.local
BTLs attempted: openib
I then added 'self', i.e. --mca btl openib,self. With this the job
started running, but I can confirm it is not using IB, as observed
with the netstat -i command:
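For reference, the openib BTL only carries traffic between distinct processes over IB; the 'self' BTL (process loopback) is always required, and 'sm' is normally added for peers on the same node. A hedged sketch of such an invocation (the process count and hostfile name here are placeholders, not from the original run):

```shell
# Hedged sketch: request IB explicitly while keeping the loopback and
# shared-memory paths. "hosts" and "-np 12" are placeholders.
mpirun --mca btl openib,self,sm -np 12 --hostfile hosts ./siesta_mpi
```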
1st snap:

Every 2.0s: netstat -i                               Sun Oct 11 15:29:29 2009

Kernel Interface table
Iface   MTU    Met   RX-OK    RX-ERR RX-DRP RX-OVR   TX-OK    TX-ERR TX-DRP TX-OVR Flg
eth0    1500   0     1847619  0      0      0        2073010  0      0      0      BMRU
ib0     65520  0     708      0      0      0        509      0      5      0      BMRU
lo      16436  0     5731     0      0      0        5731     0      0      0      LRU
2nd snap:

Every 2.0s: netstat -i                               Sun Oct 11 15:29:57 2009

Kernel Interface table
Iface   MTU    Met   RX-OK    RX-ERR RX-DRP RX-OVR   TX-OK    TX-ERR TX-DRP TX-OVR Flg
eth0    1500   0     1847647  0      0      0        2073073  0      0      0      BMRU
ib0     65520  0     708      0      0      0        509      0      5      0      BMRU
lo      16436  0     5731     0      0      0        5731     0      0      0      LRU
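The two snaps can also be compared mechanically instead of by eye. A minimal sketch, where the snapshot file names snap1.txt/snap2.txt are my own and the ib0 lines are copied from the snaps above (in practice they would come from "netstat -i > snapN.txt" taken some seconds apart):

```shell
# Write the ib0 rows from the two snapshots above to files, as if
# captured with: netstat -i > snap1.txt; sleep 30; netstat -i > snap2.txt
cat > snap1.txt <<'EOF'
ib0 65520 0 708 0 0 0 509 0 5 0 BMRU
EOF
cat > snap2.txt <<'EOF'
ib0 65520 0 708 0 0 0 509 0 5 0 BMRU
EOF

# TX-OK is the 8th column of the netstat -i interface table.
tx1=$(awk '$1 == "ib0" { print $8 }' snap1.txt)
tx2=$(awk '$1 == "ib0" { print $8 }' snap2.txt)
if [ "$tx1" = "$tx2" ]; then
  echo "ib0 idle: TX-OK unchanged at $tx1 (MPI traffic is not on IB)"
else
  echo "ib0 active: TX-OK went from $tx1 to $tx2"
fi
```

Here the counter is unchanged at 509, which is what shows the job is not using ib0.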
Why is Open MPI not able to use IB?
Running ldd on the executable shows that no IB libraries are linked.
Is this the reason?
ldd /opt/apps/siesta/siesta_mpi
    /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_lp64.so (0x00002aaaaaaad000)
    /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_thread.so (0x00002aaaaadc2000)
    /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_core.so (0x00002aaaab2ad000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00000034a6200000)
    libmpi_f90.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f90.so.0 (0x00002aaaab4a0000)
    libmpi_f77.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f77.so.0 (0x00002aaaab6a3000)
    libmpi.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi.so.0 (0x00002aaaab8db000)
    libopen-rte.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-rte.so.0 (0x00002aaaabbaa000)
    libopen-pal.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-pal.so.0 (0x00002aaaabe07000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00000034a5e00000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00000034a8200000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00000034a6600000)
    libifport.so.5 => /opt/intel/fce/10.1.008/lib/libifport.so.5 (0x00002aaaac09a000)
    libifcoremt.so.5 => /opt/intel/fce/10.1.008/lib/libifcoremt.so.5 (0x00002aaaac1d0000)
    libimf.so => /opt/intel/cce/10.1.018/lib/libimf.so (0x00002aaaac401000)
    libsvml.so => /opt/intel/cce/10.1.018/lib/libsvml.so (0x00002aaaac766000)
    libm.so.6 => /lib64/libm.so.6 (0x00000034a6e00000)
    libguide.so => /opt/intel/mkl/10.0.5.025/lib/em64t/libguide.so (0x00002aaaac8f1000)
    libintlc.so.5 => /opt/intel/cce/10.1.018/lib/libintlc.so.5 (0x00002aaaaca65000)
    libc.so.6 => /lib64/libc.so.6 (0x00000034a5a00000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000034a7e00000)
    /lib64/ld-linux-x86-64.so.2 (0x00000034a5600000)
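One caveat about the ldd check: Open MPI normally builds its BTLs, including openib, as dlopen'ed plugins, so libibverbs would be a dependency of the plugin rather than of the application binary. Its absence from the ldd output above therefore does not by itself prove that IB support is missing. A hedged way to inspect the plugin directly (the paths are assumptions based on the install prefix /opt/mpi/openmpi/1.3.3/intel used elsewhere in this mail):

```shell
# The openib BTL lives in a separately loaded plugin; check that it
# exists and that it links against the verbs library.
ls /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so
ldd /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so | grep -i verbs
```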
With the help of the Open MPI FAQ:
# /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
    MCA btl: parameter "btl_base_verbose" (current value: "0", data source: default value)
             Verbosity level of the BTL framework
    MCA btl: parameter "btl" (current value: <none>, data source: default value)
             Default selection set of components for the btl framework (<none> means use all components that can be found)
    MCA btl: parameter "btl_openib_verbose" (current value: "0", data source: default value)
             Output some verbose OpenIB BTL information (0 = no output, nonzero = output)
    MCA btl: parameter "btl_openib_warn_no_device_params_found" (current value: "1", data source: default value, synonyms: btl_openib_warn_no_hca_params_found)
             Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter (0 = do not warn; any other value = warn)
    MCA btl: parameter "btl_openib_warn_no_hca_params_found" (current value: "1", data source: default value, deprecated, synonym of: btl_openib_warn_no_device_params_found)
             Warn when no device-specific parameters are found in the INI file specified by the btl_openib_device_param_files MCA parameter (0 = do not warn; any other value = warn)
    MCA btl: parameter "btl_openib_warn_default_gid_prefix" (current value: "1", data source: default value)
             Warn when there is more than one active ports and at least one of them connected to the network with only default GID prefix configured (0 = do not warn; any other value = warn)
    MCA btl: parameter "btl_openib_warn_nonexistent_if" (current value: "1", data source: default value)
             Warn if non-existent devices and/or ports are specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn; any other value = warn)
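The btl_base_verbose parameter listed at the top of this output can also be raised at run time to see which BTLs each process actually selects during startup; a hedged sketch (the verbosity level 30 is an arbitrary choice):

```shell
# Hedged sketch: print BTL selection decisions while the job starts up.
mpirun --mca btl openib,self,sm --mca btl_base_verbose 30 ./siesta_mpi
```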
During the Open MPI install I used --with-openib=/usr, so I believe
it is compiled with IB support.
The IB utilities such as ibv_rc_pingpong work fine.
I don't understand why Open MPI is not using IB. Please help me
resolve this issue.
Thanks
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users