Any hints on my previous mail (quoted below)?

Does Open MPI 1.3.3 support only certain versions of OFED,
or will any version work?
On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B wrote:

> Hi,
> A Fortran application is installed with Intel Fortran 10.1, MKL 10 and
> Open MPI 1.3.3 on a Rocks 5.1 HPC Linux cluster. The jobs do not scale
> when more than one node is used. Each node has two quad-core Intel Xeon
> E5472 processors @ 3.00GHz (8 cores and 16GB RAM per node), and the nodes
> are connected by InfiniBand.
> Here are some of the timings:
> 12 cores (Node 1: 8 cores, Node 2: 4 cores)      -- No progress in the job
>  8 cores (Node 1: 8 cores)                       -- 21 hours (38 CG move steps)
>  4 cores (Node 1: 4 cores)                       -- 25 hours
> 12 cores (Node 1, Node 2, Node 3: 4 cores each)  -- No progress
> Later, to check whether Open MPI was using IB, I ran with --mca btl
> openib, but the job failed with the following error message:
> # cat /home1/g03/apps_test/amber/test16/err.352.job16
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[23671,1],12]) is on host: compute-0-12.local
>   Process 2 ([[23671,1],12]) is on host: compute-0-12.local
>   BTLs attempted: openib
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>   PML add procs failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-12.local:5496] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> [compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> mpirun has exited due to process rank 12 with PID 5496 on
> node compute-0-12.local exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [compute-0-5.local:06910] 15 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:internal-failure
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[23958,1],2]) is on host: compute-0-5.local
>   Process 2 ([[23958,1],2]) is on host: compute-0-5.local
>   BTLs attempted: openib
> Then I added 'self' to the --mca btl list alongside openib. With this the
> job started running, but I can tell it is not using IB, based on what I
> observed with the netstat -i command.
> 1st Snap:
> Every 2.0s: netstat -i                         Sun Oct 11 15:29:29 2009
> Kernel Interface table
> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0       1500   0  1847619      0     0    0  2073010    0      0      0
> ib0      65520   0     708      0     0    0      509    0      5      0
> lo        16436   0     5731      0     0    0     5731    0      0      0
> 2nd Snap:
> Every 2.0s: netstat -i                         Sun Oct 11 15:29:57 2009
> Kernel Interface table
> Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0       1500   0  1847647      0     0    0  2073073    0      0      0
> ib0      65520   0     708      0     0    0      509    0      5      0
> lo        16436   0     5731      0     0    0     5731    0      0      0
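
A note on reading these two snapshots: netstat -i reports packets that pass
through the kernel IP stack, so on ib0 it counts IPoIB traffic only. The
openib BTL talks to the HCA through verbs and bypasses that stack, so flat
ib0 counters do not by themselves show that IB is unused. What the snapshots
do show can be computed with a small sketch like the one below. It assumes
the usual net-tools column order (Iface, MTU, Met, RX-OK, RX-ERR, RX-DRP,
RX-OVR, TX-OK, ...), and counter_delta is a made-up helper name.

```shell
#!/bin/sh
# The two `netstat -i` samples from the snapshots, whitespace collapsed.
# Assumed columns: Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR
snap1='eth0 1500 0 1847619 0 0 0 2073010 0 0 0
ib0 65520 0 708 0 0 0 509 0 5 0
lo 16436 0 5731 0 0 0 5731 0 0 0'
snap2='eth0 1500 0 1847647 0 0 0 2073073 0 0 0
ib0 65520 0 708 0 0 0 509 0 5 0
lo 16436 0 5731 0 0 0 5731 0 0 0'

# Hypothetical helper: print per-interface RX-OK / TX-OK deltas
# between a first sample ($1) and a second sample ($2).
counter_delta() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        $1 == "---" { second = 1; next }                  # separator between samples
        !second     { rx[$1] = $4; tx[$1] = $8; next }    # remember first sample
                    { printf "%s rx+%d tx+%d\n", $1, $4 - rx[$1], $8 - tx[$1] }'
}

counter_delta "$snap1" "$snap2"
```

With the values above this prints "eth0 rx+28 tx+63" while ib0 and lo stay
at +0 over the 28 seconds between the samples. To see verbs traffic, the HCA
port counters (for example via perfquery from infiniband-diags) would be the
place to look.
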
> Why is Open MPI not able to use IB?
> Running ldd on the executable shows that no IB libraries are linked. Is
> this the reason?
> ldd /opt/apps/siesta/siesta_mpi
> /opt/intel/mkl/
> /opt/intel/mkl/
>     /opt/intel/mkl/
> => /lib64/ (0x00000034a6200000)
> => /opt/mpi/openmpi/1.3.3/intel/lib/
> (0x00002aaaab4a0000)
> => /opt/mpi/openmpi/1.3.3/intel/lib/
> (0x00002aaaab6a3000)
> => /opt/mpi/openmpi/1.3.3/intel/lib/
> (0x00002aaaab8db000)
> => /opt/mpi/openmpi/1.3.3/intel/lib/
> (0x00002aaaabbaa000)
> => /opt/mpi/openmpi/1.3.3/intel/lib/
> (0x00002aaaabe07000)
> => /lib64/ (0x00000034a5e00000)
> => /lib64/ (0x00000034a8200000)
> => /lib64/ (0x00000034a6600000)
> => /opt/intel/fce/10.1.008/lib/
> (0x00002aaaac09a000)
> => /opt/intel/fce/10.1.008/lib/
> (0x00002aaaac1d0000)
> => /opt/intel/cce/10.1.018/lib/ (0x00002aaaac401000)
> => /opt/intel/cce/10.1.018/lib/
> (0x00002aaaac766000)
> => /lib64/ (0x00000034a6e00000)
> => 
> /opt/intel/mkl/
> => /opt/intel/cce/10.1.018/lib/
> (0x00002aaaaca65000)
> => /lib64/ (0x00000034a5a00000)
> => /lib64/ (0x00000034a7e00000)
>     /lib64/ (0x00000034a5600000)
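
On the ldd question: in a default Open MPI build the BTLs are plugins that
the MPI library loads at run time, so libibverbs would normally show up as a
dependency of the openib plugin and not of the application binary. A minimal
sketch of that check follows. It assumes the plugins live under
<prefix>/lib/openmpi, which is the usual layout for a default build but is
not confirmed for this install, and check_openib_plugin is a hypothetical
helper name.

```shell
# Hypothetical helper: given an Open MPI install prefix, look for the
# openib BTL plugin and list its libibverbs dependency.
# Assumption: plugins are installed as <prefix>/lib/openmpi/mca_btl_*.so.
check_openib_plugin() {
    plugin="$1/lib/openmpi/mca_btl_openib.so"
    if [ ! -f "$plugin" ]; then
        # Either openib was not built, or the build is static / laid out differently.
        echo "no openib plugin at $plugin"
        return 1
    fi
    # Succeeds only if the plugin itself links against libibverbs.
    ldd "$plugin" | grep -i ibverbs
}

# Example call with the prefix from the ldd output above:
#   check_openib_plugin /opt/mpi/openmpi/1.3.3/intel
```

If the plugin file is missing, the openib support was probably not built
despite --with-openib; if it is present and links libibverbs, the missing IB
libraries in the application's own ldd output are expected.
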
> Following the Open MPI FAQ, I also checked the openib BTL parameters:
> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
>                  MCA btl: parameter "btl_base_verbose" (current value: "0",
> data source: default value)
>                           Verbosity level of the BTL framework
>                  MCA btl: parameter "btl" (current value: <none>, data
> source: default value)
>                           Default selection set of components for the btl
> framework (<none> means use all components that can be found)
>                  MCA btl: parameter "btl_openib_verbose" (current value:
> "0", data source: default value)
>                           Output some verbose OpenIB BTL information (0 =
> no output, nonzero = output)
>                  MCA btl: parameter
> "btl_openib_warn_no_device_params_found" (current value: "1", data source:
> default value, synonyms:
>                           btl_openib_warn_no_hca_params_found)
>                           Warn when no device-specific parameters are found
> in the INI file specified by the btl_openib_device_param_files MCA parameter
> (0 =
>                           do not warn; any other value = warn)
>                  MCA btl: parameter "btl_openib_warn_no_hca_params_found"
> (current value: "1", data source: default value, deprecated, synonym of:
>                           btl_openib_warn_no_device_params_found)
>                           Warn when no device-specific parameters are found
> in the INI file specified by the btl_openib_device_param_files MCA parameter
> (0 =
>                           do not warn; any other value = warn)
>                  MCA btl: parameter "btl_openib_warn_default_gid_prefix"
> (current value: "1", data source: default value)
>                           Warn when there is more than one active ports and
> at least one of them connected to the network with only default GID prefix
>                           configured (0 = do not warn; any other value =
> warn)
>                  MCA btl: parameter "btl_openib_warn_nonexistent_if"
> (current value: "1", data source: default value)
>                           Warn if non-existent devices and/or ports are
> specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn;
> any
>                           other value = warn)
> During the Open MPI install I used --with-openib=/usr, so I believe it was
> compiled with IB support.
> The IB utilities such as ibv_rc_pingpong are working fine.
> I don't understand why Open MPI is not using IB. Please help me resolve
> this issue.
> Thanks
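
For reference, a sketch of a BTL selection that names all three transports a
multi-node job typically needs: openib between nodes, sm between processes
on the same node, and self for a process sending to itself. It uses the
OMPI_MCA_<name> environment form of the MCA parameters. The verbosity level
of 30 is only a commonly used value, and the process count and hostfile name
are placeholders, not taken from the original job.

```shell
# Same selection as "--mca btl openib,self,sm" on the mpirun command line.
export OMPI_MCA_btl=openib,self,sm
# Raise BTL framework verbosity so component selection gets logged.
export OMPI_MCA_btl_base_verbose=30

# Placeholders: "-np 16" and "./hosts" are illustrative only.
launch_cmd="mpirun -np 16 -hostfile ./hosts /opt/apps/siesta/siesta_mpi"
echo "$launch_cmd"
```

If the job aborts with this list, openib could not be used on at least one
node, and the verbose output should indicate which component was rejected.
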
