Any hint for the previous mail? Does Open MPI 1.3.3 support only limited versions of OFED, or is any version OK?

On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B <forum....@gmail.com> wrote:
> Hi,
>
> A Fortran application is installed with Intel Fortran 10.1, MKL-10, and
> Open MPI 1.3.3 on a Rocks-5.1 HPC Linux cluster. The jobs are not scaling
> when more than one node is used. The cluster has dual quad-core Intel Xeon
> (E5472) @ 3.00GHz processors (8 cores and 16GB RAM per node) and an
> InfiniBand interconnect.
>
> Here are some of the timings:
>
>   12 cores (Node 1: 8 cores, Node 2: 4 cores)  -- no progress in the job
>    8 cores (Node 1: 8 cores)                   -- 21 hours (38 CG move steps)
>    4 cores (Node 1: 4 cores)                   -- 25 hours
>   12 cores (Nodes 1, 2, 3: 4 cores each)       -- no progress
>
> Later, to check whether Open MPI is using IB or not, I used --mca btl
> openib. But the job failed with the following error message:
>
> # cat /home1/g03/apps_test/amber/test16/err.352.job16
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[23671,1],12]) is on host: compute-0-12.local
>   Process 2 ([[23671,1],12]) is on host: compute-0-12.local
>   BTLs attempted: openib
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-12.local:5496] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> [compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> mpirun has exited due to process rank 12 with PID 5496 on
> node compute-0-12.local exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[23958,1],2]) is on host: compute-0-5.local
>   Process 2 ([[23958,1],2]) is on host: compute-0-5.local
>   BTLs attempted: openib
>
> I then added 'self', i.e. --mca btl openib,self. With this it started
> running, but I can tell it is not using IB, as observed with the
> netstat -i command.
>
> 1st snap:
>
> Every 2.0s: netstat -i                          Sun Oct 11 15:29:29 2009
>
> Kernel Interface table
> Iface  MTU    Met  RX-OK    RX-ERR RX-DRP RX-OVR  TX-OK    TX-ERR TX-DRP TX-OVR  Flg
> eth0   1500   0    1847619  0      0      0       2073010  0      0      0       BMRU
> ib0    65520  0    708      0      0      0       509      0      5      0       BMRU
> lo     16436  0    5731     0      0      0       5731     0      0      0       LRU
>
> 2nd snap:
>
> Every 2.0s: netstat -i                          Sun Oct 11 15:29:57 2009
>
> Kernel Interface table
> Iface  MTU    Met  RX-OK    RX-ERR RX-DRP RX-OVR  TX-OK    TX-ERR TX-DRP TX-OVR  Flg
> eth0   1500   0    1847647  0      0      0       2073073  0      0      0       BMRU
> ib0    65520  0    708      0      0      0       509      0      5      0       BMRU
> lo     16436  0    5731     0      0      0       5731     0      0      0       LRU
>
> Why is Open MPI not able to use IB?
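For reference, the failing and the working invocations would look roughly like this (the host file name is a placeholder; the process count matches the 12-core run above, and siesta_mpi is the executable shown in the ldd output below):

    # fails: without "self", a rank has no BTL with which to reach itself
    mpirun -np 12 -hostfile ./hosts --mca btl openib ./siesta_mpi

    # runs: "self" restores the loopback path; "sm" handles on-node traffic
    mpirun -np 12 -hostfile ./hosts --mca btl openib,self,sm ./siesta_mpi

One caveat on the netstat check: the openib BTL sends its traffic over native IB verbs, which bypasses the ib0 IPoIB interface entirely, so flat ib0 counters in netstat -i do not by themselves prove that IB is unused.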
> ldd on the executable shows that no IB libraries are linked. Is this
> the reason?
>
> ldd /opt/apps/siesta/siesta_mpi
>         /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_lp64.so (0x00002aaaaaaad000)
>         /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_thread.so (0x00002aaaaadc2000)
>         /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_core.so (0x00002aaaab2ad000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x00000034a6200000)
>         libmpi_f90.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f90.so.0 (0x00002aaaab4a0000)
>         libmpi_f77.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f77.so.0 (0x00002aaaab6a3000)
>         libmpi.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi.so.0 (0x00002aaaab8db000)
>         libopen-rte.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-rte.so.0 (0x00002aaaabbaa000)
>         libopen-pal.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-pal.so.0 (0x00002aaaabe07000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x00000034a5e00000)
>         libnsl.so.1 => /lib64/libnsl.so.1 (0x00000034a8200000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x00000034a6600000)
>         libifport.so.5 => /opt/intel/fce/10.1.008/lib/libifport.so.5 (0x00002aaaac09a000)
>         libifcoremt.so.5 => /opt/intel/fce/10.1.008/lib/libifcoremt.so.5 (0x00002aaaac1d0000)
>         libimf.so => /opt/intel/cce/10.1.018/lib/libimf.so (0x00002aaaac401000)
>         libsvml.so => /opt/intel/cce/10.1.018/lib/libsvml.so (0x00002aaaac766000)
>         libm.so.6 => /lib64/libm.so.6 (0x00000034a6e00000)
>         libguide.so => /opt/intel/mkl/10.0.5.025/lib/em64t/libguide.so (0x00002aaaac8f1000)
>         libintlc.so.5 => /opt/intel/cce/10.1.018/lib/libintlc.so.5 (0x00002aaaaca65000)
>         libc.so.6 => /lib64/libc.so.6 (0x00000034a5a00000)
>         libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000034a7e00000)
>         /lib64/ld-linux-x86-64.so.2 (0x00000034a5600000)
>
> With the help of the Open MPI FAQ:
>
> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
>     MCA btl: parameter "btl_base_verbose" (current value: "0", data source: default value)
>              Verbosity level of the BTL framework
>     MCA btl: parameter "btl" (current value: <none>, data source: default value)
>              Default selection set of components for the btl framework
>              (<none> means use all components that can be found)
>     MCA btl: parameter "btl_openib_verbose" (current value: "0", data source: default value)
>              Output some verbose OpenIB BTL information (0 = no output,
>              nonzero = output)
>     MCA btl: parameter "btl_openib_warn_no_device_params_found" (current value: "1",
>              data source: default value, synonyms: btl_openib_warn_no_hca_params_found)
>              Warn when no device-specific parameters are found in the INI
>              file specified by the btl_openib_device_param_files MCA
>              parameter (0 = do not warn; any other value = warn)
>     MCA btl: parameter "btl_openib_warn_no_hca_params_found" (current value: "1",
>              data source: default value, deprecated, synonym of:
>              btl_openib_warn_no_device_params_found)
>              Warn when no device-specific parameters are found in the INI
>              file specified by the btl_openib_device_param_files MCA
>              parameter (0 = do not warn; any other value = warn)
>     MCA btl: parameter "btl_openib_warn_default_gid_prefix" (current value: "1",
>              data source: default value)
>              Warn when there is more than one active ports and at least one
>              of them connected to the network with only default GID prefix
>              configured (0 = do not warn; any other value = warn)
>     MCA btl: parameter "btl_openib_warn_nonexistent_if" (current value: "1",
>              data source: default value)
>              Warn if non-existent devices and/or ports are specified in the
>              btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn;
>              any other value = warn)
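Note that a plain ldd on the application binary is expected to show no IB libraries even when IB support is present: Open MPI 1.3.x normally loads the openib BTL as a plugin at run time, so libibverbs is linked against the component, not the executable. A quick check against the component itself, assuming the default plugin directory under this install prefix:

    # the component, not the binary, should pull in libibverbs
    ldd /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so | grep ibverbs

    # list installed components; look for an "MCA btl: openib" line
    /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info | grep openib

If the second command prints an "MCA btl: openib" component line, IB support was compiled in.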
> During the Open MPI install I used --with-openib=/usr, so I believe it
> is compiled with IB support.
>
> The IB utilities such as ibv_rc_pingpong are working fine.
>
> I don't understand why Open MPI is not using IB. Please help me
> resolve this issue.
>
> Thanks
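To see which BTL each process actually selects at run time, the verbose parameters listed in the ompi_info output above can be raised on the command line; a minimal sketch, again with a placeholder host file spanning two nodes:

    mpirun -np 2 -hostfile ./hosts --mca btl openib,self \
           --mca btl_base_verbose 30 --mca btl_openib_verbose 1 ./siesta_mpi

Because only openib and self are allowed here, the job can start across nodes only if the openib BTL actually initializes; any silent fallback to TCP is ruled out, and the verbose output reports why openib is selected or skipped on each host.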