Any hints on my previous mail?

Does Open MPI 1.3.3 support only a limited set of OFED versions, or is any
version OK?
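
(For reference, the OFED release installed on the nodes can be checked with
the ofed_info utility, if it is available from the OFED installation:)

# ofed_info | head -1
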
On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B <forum....@gmail.com> wrote:

> Hi,
>
> A Fortran application is installed with Intel Fortran 10.1, MKL 10, and
> Open MPI 1.3.3 on a Rocks 5.1 HPC Linux cluster. The jobs are not scaling
> when more than one node is used. Each node has two Intel quad-core Xeon
> (E5472) @ 3.00 GHz processors (8 cores per node, 16 GB RAM), and the nodes
> are connected with InfiniBand.
>
> Here are some of the timings:
>
> 12 cores (Node 1: 8 cores, Node 2: 4 cores)        -- No progress in the job
>  8 cores (Node 1: 8 cores)                         -- 21 hours (38 CG move steps)
>  4 cores (Node 1: 4 cores)                         -- 25 hours
> 12 cores (Node 1, Node 2, Node 3: 4 cores each)    -- No progress
>
>
> Later, to check whether Open MPI was using IB or not, I ran with --mca btl
> openib. But the job failed.
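>
> (The invocation was roughly of this form; the process count, host file, and
> input/output redirection are assumed, not exact:)
>
> # mpirun -np 12 -hostfile ./hostfile --mca btl openib /opt/apps/siesta/siesta_mpi < siesta.in > siesta.out
>
> The job's error file contained the following: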
> # cat /home1/g03/apps_test/amber/test16/err.352.job16
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[23671,1],12]) is on host: compute-0-12.local
>   Process 2 ([[23671,1],12]) is on host: compute-0-12.local
>   BTLs attempted: openib
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-12.local:5496] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> [compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> mpirun has exited due to process rank 12 with PID 5496 on
> node compute-0-12.local exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[23958,1],2]) is on host: compute-0-5.local
>   Process 2 ([[23958,1],2]) is on host: compute-0-5.local
>   BTLs attempted: openib
>
> Then I added 'self', i.e. --mca btl openib,self. With this it started
> running, but I can see that it is not using IB, judging from the output of
> the netstat -i command (snapshots below).
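>
> (The corrected run and the monitoring were roughly like this; the host file
> and redirection are assumed, and the 2-second interval matches the watch
> headers below:)
>
> # mpirun -np 12 -hostfile ./hostfile --mca btl openib,self /opt/apps/siesta/siesta_mpi < siesta.in > siesta.out
> # watch -n 2 netstat -i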
>
> 1st Snap:
>
> Every 2.0s: netstat -i                              Sun Oct 11 15:29:29 2009
>
> Kernel Interface table
> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0       1500   0  1847619      0      0      0  2073010      0      0      0 BMRU
> ib0       65520   0      708      0      0      0      509      0      5      0 BMRU
> lo        16436   0     5731      0      0      0     5731      0      0      0 LRU
>
> 2nd Snap:
>
> Every 2.0s: netstat -i                              Sun Oct 11 15:29:57 2009
>
> Kernel Interface table
> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0       1500   0  1847647      0      0      0  2073073      0      0      0 BMRU
> ib0       65520   0      708      0      0      0      509      0      5      0 BMRU
> lo        16436   0     5731      0      0      0     5731      0      0      0 LRU
>
> Why is Open MPI not able to use IB?
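>
> (One way to get more detail might be to rerun with the btl_base_verbose
> parameter that ompi_info lists further below, e.g.:)
>
> # mpirun -np 12 -hostfile ./hostfile --mca btl openib,self --mca btl_base_verbose 30 /opt/apps/siesta/siesta_mpi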
>
> Running ldd on the executable shows that no IB libraries are linked. Is
> this the reason? (A check of the openib plugin itself is sketched after the
> ldd output below.)
> ldd /opt/apps/siesta/siesta_mpi
>     /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_lp64.so (0x00002aaaaaaad000)
>     /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_thread.so (0x00002aaaaadc2000)
>     /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_core.so (0x00002aaaab2ad000)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00000034a6200000)
>     libmpi_f90.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f90.so.0 (0x00002aaaab4a0000)
>     libmpi_f77.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f77.so.0 (0x00002aaaab6a3000)
>     libmpi.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi.so.0 (0x00002aaaab8db000)
>     libopen-rte.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-rte.so.0 (0x00002aaaabbaa000)
>     libopen-pal.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-pal.so.0 (0x00002aaaabe07000)
>     libdl.so.2 => /lib64/libdl.so.2 (0x00000034a5e00000)
>     libnsl.so.1 => /lib64/libnsl.so.1 (0x00000034a8200000)
>     libutil.so.1 => /lib64/libutil.so.1 (0x00000034a6600000)
>     libifport.so.5 => /opt/intel/fce/10.1.008/lib/libifport.so.5 (0x00002aaaac09a000)
>     libifcoremt.so.5 => /opt/intel/fce/10.1.008/lib/libifcoremt.so.5 (0x00002aaaac1d0000)
>     libimf.so => /opt/intel/cce/10.1.018/lib/libimf.so (0x00002aaaac401000)
>     libsvml.so => /opt/intel/cce/10.1.018/lib/libsvml.so (0x00002aaaac766000)
>     libm.so.6 => /lib64/libm.so.6 (0x00000034a6e00000)
>     libguide.so => /opt/intel/mkl/10.0.5.025/lib/em64t/libguide.so (0x00002aaaac8f1000)
>     libintlc.so.5 => /opt/intel/cce/10.1.018/lib/libintlc.so.5 (0x00002aaaaca65000)
>     libc.so.6 => /lib64/libc.so.6 (0x00000034a5a00000)
>     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000034a7e00000)
>     /lib64/ld-linux-x86-64.so.2 (0x00000034a5600000)
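>
> (To check whether the openib BTL plugin itself was built and links against
> the IB libraries, something like the following could be run; the plugin
> path is assumed from the install prefix shown above:)
>
> # ls /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so
> # ldd /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so | grep ibverbs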
>
> Following the Open MPI FAQ, I checked the openib BTL parameters:
>
> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
>                  MCA btl: parameter "btl_base_verbose" (current value: "0",
> data source: default value)
>                           Verbosity level of the BTL framework
>                  MCA btl: parameter "btl" (current value: <none>, data
> source: default value)
>                           Default selection set of components for the btl
> framework (<none> means use all components that can be found)
>                  MCA btl: parameter "btl_openib_verbose" (current value:
> "0", data source: default value)
>                           Output some verbose OpenIB BTL information (0 =
> no output, nonzero = output)
>                  MCA btl: parameter
> "btl_openib_warn_no_device_params_found" (current value: "1", data source:
> default value, synonyms:
>                           btl_openib_warn_no_hca_params_found)
>                           Warn when no device-specific parameters are found
> in the INI file specified by the btl_openib_device_param_files MCA parameter
> (0 =
>                           do not warn; any other value = warn)
>                  MCA btl: parameter "btl_openib_warn_no_hca_params_found"
> (current value: "1", data source: default value, deprecated, synonym of:
>                           btl_openib_warn_no_device_params_found)
>                           Warn when no device-specific parameters are found
> in the INI file specified by the btl_openib_device_param_files MCA parameter
> (0 =
>                           do not warn; any other value = warn)
>                  MCA btl: parameter "btl_openib_warn_default_gid_prefix"
> (current value: "1", data source: default value)
>                           Warn when there is more than one active ports and
> at least one of them connected to the network with only default GID prefix
>                           configured (0 = do not warn; any other value =
> warn)
>                  MCA btl: parameter "btl_openib_warn_nonexistent_if"
> (current value: "1", data source: default value)
>                           Warn if non-existent devices and/or ports are
> specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn;
> any
>                           other value = warn)
>
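> (A quick check of which BTL components were actually built might be the
> following; the exact output format is assumed:)
>
> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info | grep btl
>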
> During the Open MPI installation I used --with-openib=/usr, so I believe it
> is compiled with IB support.
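>
> (The configure line was roughly of this form; the prefix is taken from the
> paths above, and the compiler settings are assumed:)
>
> # ./configure --prefix=/opt/mpi/openmpi/1.3.3/intel --with-openib=/usr CC=icc CXX=icpc F77=ifort FC=ifort
> # make all install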
>
> IB utilities such as ibv_rc_pingpong are working fine.
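>
> (Roughly tested like this between two of the nodes; the host names are
> taken from the messages above, but the exact test is assumed:)
>
> [on compute-0-5]  # ibv_rc_pingpong
> [on compute-0-12] # ibv_rc_pingpong compute-0-5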
>
> I do not understand why Open MPI is not using IB. Please help me resolve
> this issue.
>
> Thanks
>
