Hi George,

     I've run the IB pingpong tests, and they are working fine.
Are there any additional tests available that would establish that "there
is no problem with the IB software or Open MPI; the problem is with the
application or the IB hardware"?
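
So far, apart from ibv_rc_pingpong, the only other low-level checks I know of
are roughly the following (assuming the infiniband-diags and perftest
packages are installed):

    ibstat                  # link state and rate of each HCA port
    ibv_devinfo             # verbs-level view of the HCA
    ib_write_bw <server>    # RDMA write bandwidth test (perftest suite)

but I'm not sure these are enough to separate an application problem from an
IB hardware problem.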

    I ask because we've faced some critical problems:

http://www.open-mpi.org/community/lists/users/2009/10/10843.php

and

http://www.open-mpi.org/community/lists/users/2009/09/10700.php

Thanks,
Sangamesh

On Wed, Oct 14, 2009 at 10:12 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:

> Sangamesh,
>
> If there is a version issue with OFED, it will be detected at configure
> time. If you manage to compile and install Open MPI, there should be no
> issues with OFED.
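>
> (For what it's worth, I believe the configure output contains a line along
> these lines when verbs support is detected, though the exact wording may
> differ between versions:
>
>   checking if MCA component btl:openib can compile... yes
>
> so you can grep your configure log for "openib" to confirm.)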
>
> What I can tell you is that "--mca btl openib,self" will not allow any
> network other than InfiniBand to be used for communications between MPI
> processes. However, our runtime is still allowed to use TCP, and this is
> what you see in your netstat output. These are not performance-critical
> communications (i.e., they are only used to start up the job, distribute
> the contact information, and so on).
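>
> If you want to double-check which BTL is actually being used for MPI
> traffic, one option (a rough sketch; the application name and process
> count are placeholders) is to raise the BTL verbosity:
>
>   mpirun --mca btl openib,self --mca btl_base_verbose 30 -np 16 ./your_app
>
> and look for the openib component in the selection messages.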
>
> Have you run the IB tests to validate the IB network?
>
>  george.
>
>
> On Oct 12, 2009, at 03:38, Sangamesh B wrote:
>
>> Any hint for the previous mail?
>>
>> Does Open MPI 1.3.3 support only a limited set of OFED versions, or is
>> any version OK?
>>
>> On Sun, Oct 11, 2009 at 3:55 PM, Sangamesh B <forum....@gmail.com> wrote:
>> Hi,
>>
>> A Fortran application was built with Intel Fortran 10.1, MKL 10, and
>> Open MPI 1.3.3 on a Rocks 5.1 HPC Linux cluster. The jobs do not scale
>> when more than one node is used. Each node has two quad-core Intel Xeon
>> E5472 processors @ 3.00 GHz (8 cores and 16 GB RAM per node), and the
>> nodes are connected with InfiniBand.
>>
>> Here are some of the timings:
>>
>> 12 cores (Node 1: 8 cores, Node 2: 4 cores)    -- no progress in the job
>>  8 cores (Node 1: 8 cores)                     -- 21 hours (38 CG move steps)
>>  4 cores (Node 1: 4 cores)                     -- 25 hours
>> 12 cores (Nodes 1, 2, 3: 4 cores each)         -- no progress
>>
>>
>> Later, to check whether Open MPI was using IB or not, I ran with --mca btl
>> openib. The job failed with the following error message:
>> # cat /home1/g03/apps_test/amber/test16/err.352.job16
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>  Process 1 ([[23671,1],12]) is on host: compute-0-12.local
>>  Process 2 ([[23671,1],12]) is on host: compute-0-12.local
>>  BTLs attempted: openib
>>
>> Your MPI job is now going to abort; sorry.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>  PML add procs failed
>>  --> Returned "Unreachable" (-12) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [compute-0-12.local:5496] Abort before MPI_INIT completed successfully;
>> not able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [compute-0-5.local:6916] Abort before MPI_INIT completed successfully; not
>> able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [compute-0-5.local:6914] Abort before MPI_INIT completed successfully; not
>> able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [compute-0-5.local:6915] Abort before MPI_INIT completed successfully; not
>> able to guarantee that all other processes were killed!
>> [compute-0-5.local:6913] Abort before MPI_INIT completed successfully; not
>> able to guarantee that all other processes were killed!
>> *** An error occurred in MPI_Init
>> *** before MPI was initialized
>> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> mpirun has exited due to process rank 12 with PID 5496 on
>> node compute-0-12.local exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> [compute-0-5.local:06910] 15 more processes have sent help message
>> help-mca-bml-r2.txt / unreachable proc
>> [compute-0-5.local:06910] Set MCA parameter "orte_base_help_aggregate" to
>> 0 to see all help / error messages
>> [compute-0-5.local:06910] 15 more processes have sent help message
>> help-mpi-runtime / mpi_init:startup:internal-failure
>> --------------------------------------------------------------------------
>> At least one pair of MPI processes are unable to reach each other for
>> MPI communications.  This means that no Open MPI device has indicated
>> that it can be used to communicate between these processes.  This is
>> an error; Open MPI requires that all MPI processes be able to reach
>> each other.  This error can sometimes be the result of forgetting to
>> specify the "self" BTL.
>>
>>  Process 1 ([[23958,1],2]) is on host: compute-0-5.local
>>  Process 2 ([[23958,1],2]) is on host: compute-0-5.local
>>  BTLs attempted: openib
>>
>> Then I added 'self', i.e. --mca btl openib,self. With this the job started
>> running, but it does not appear to be using IB, judging from the output of
>> the netstat -i command.
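>>
>> (One caveat: if I understand correctly, netstat -i on ib0 only counts IPoIB
>> traffic going through the IP stack, so native verbs/RDMA traffic from the
>> openib BTL would not show up in these counters. A rough alternative --
>> assuming the infiniband-diags package is installed -- is to watch the raw
>> port counters:
>>
>>   perfquery        # PortXmitData / PortRcvData counters of the local port
>>
>> so the snapshots below may not tell the whole story.)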
>>
>> 1st Snap:
>>
>> Every 2.0s: netstat -i                                Sun Oct 11 15:29:29 2009
>>
>> Kernel Interface table
>> Iface     MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0     1500   0  1847619      0      0      0  2073010      0      0      0 BMRU
>> ib0     65520   0      708      0      0      0      509      0      5      0 BMRU
>> lo      16436   0     5731      0      0      0     5731      0      0      0 LRU
>>
>> 2nd Snap:
>>
>> Every 2.0s: netstat -i                                Sun Oct 11 15:29:57 2009
>>
>> Kernel Interface table
>> Iface     MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
>> eth0     1500   0  1847647      0      0      0  2073073      0      0      0 BMRU
>> ib0     65520   0      708      0      0      0      509      0      5      0 BMRU
>> lo      16436   0     5731      0      0      0     5731      0      0      0 LRU
>>
>> Why is Open MPI not able to use IB?
>>
>> Running ldd on the executable shows that no IB libraries are linked. Is
>> this the reason? (See also the note after the ldd output.)
>> ldd /opt/apps/siesta/siesta_mpi
>>    
>> /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_lp64.so(0x00002aaaaaaad000)
>>    
>> /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_intel_thread.so(0x00002aaaaadc2000)
>>    /opt/intel/mkl/10.0.5.025/lib/em64t/libmkl_core.so(0x00002aaaab2ad000)
>>    libpthread.so.0 => /lib64/libpthread.so.0 (0x00000034a6200000)
>>    libmpi_f90.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f90.so.0
>> (0x00002aaaab4a0000)
>>    libmpi_f77.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi_f77.so.0
>> (0x00002aaaab6a3000)
>>    libmpi.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libmpi.so.0
>> (0x00002aaaab8db000)
>>    libopen-rte.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-rte.so.0
>> (0x00002aaaabbaa000)
>>    libopen-pal.so.0 => /opt/mpi/openmpi/1.3.3/intel/lib/libopen-pal.so.0
>> (0x00002aaaabe07000)
>>    libdl.so.2 => /lib64/libdl.so.2 (0x00000034a5e00000)
>>    libnsl.so.1 => /lib64/libnsl.so.1 (0x00000034a8200000)
>>    libutil.so.1 => /lib64/libutil.so.1 (0x00000034a6600000)
>>    libifport.so.5 => /opt/intel/fce/10.1.008/lib/libifport.so.5
>> (0x00002aaaac09a000)
>>    libifcoremt.so.5 => /opt/intel/fce/10.1.008/lib/libifcoremt.so.5
>> (0x00002aaaac1d0000)
>>    libimf.so => /opt/intel/cce/10.1.018/lib/libimf.so (0x00002aaaac401000)
>>    libsvml.so => /opt/intel/cce/10.1.018/lib/libsvml.so
>> (0x00002aaaac766000)
>>    libm.so.6 => /lib64/libm.so.6 (0x00000034a6e00000)
>>    libguide.so => 
>> /opt/intel/mkl/10.0.5.025/lib/em64t/libguide.so(0x00002aaaac8f1000)
>>    libintlc.so.5 => /opt/intel/cce/10.1.018/lib/libintlc.so.5
>> (0x00002aaaaca65000)
>>    libc.so.6 => /lib64/libc.so.6 (0x00000034a5a00000)
>>    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00000034a7e00000)
>>    /lib64/ld-linux-x86-64.so.2 (0x00000034a5600000)
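>>
>> (A note on the ldd output above: as far as I understand, Open MPI builds
>> its BTLs as dynamic plugins by default, so libibverbs would be linked into
>> the openib component rather than into the application itself. A rough
>> check, using the install prefix from above, would be:
>>
>>   ldd /opt/mpi/openmpi/1.3.3/intel/lib/openmpi/mca_btl_openib.so | grep ibverbs
>>
>> to see whether the component itself links against libibverbs.)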
>>
>> Following the Open MPI FAQ, I checked the openib BTL parameters:
>>
>> # /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info --param btl openib
>>                 MCA btl: parameter "btl_base_verbose" (current value: "0",
>> data source: default value)
>>                          Verbosity level of the BTL framework
>>                 MCA btl: parameter "btl" (current value: <none>, data
>> source: default value)
>>                          Default selection set of components for the btl
>> framework (<none> means use all components that can be found)
>>                 MCA btl: parameter "btl_openib_verbose" (current value:
>> "0", data source: default value)
>>                          Output some verbose OpenIB BTL information (0 =
>> no output, nonzero = output)
>>                 MCA btl: parameter
>> "btl_openib_warn_no_device_params_found" (current value: "1", data source:
>> default value, synonyms:
>>                          btl_openib_warn_no_hca_params_found)
>>                          Warn when no device-specific parameters are found
>> in the INI file specified by the btl_openib_device_param_files MCA parameter
>> (0 =
>>                          do not warn; any other value = warn)
>>                 MCA btl: parameter "btl_openib_warn_no_hca_params_found"
>> (current value: "1", data source: default value, deprecated, synonym of:
>>                          btl_openib_warn_no_device_params_found)
>>                          Warn when no device-specific parameters are found
>> in the INI file specified by the btl_openib_device_param_files MCA parameter
>> (0 =
>>                          do not warn; any other value = warn)
>>                 MCA btl: parameter "btl_openib_warn_default_gid_prefix"
>> (current value: "1", data source: default value)
>>                          Warn when there is more than one active ports and
>> at least one of them connected to the network with only default GID prefix
>>                          configured (0 = do not warn; any other value =
>> warn)
>>                 MCA btl: parameter "btl_openib_warn_nonexistent_if"
>> (current value: "1", data source: default value)
>>                          Warn if non-existent devices and/or ports are
>> specified in the btl_openib_if_[in|ex]clude MCA parameters (0 = do not warn;
>> any
>>                          other value = warn)
>>
>> During the Open MPI install I used --with-openib=/usr, so I believe it is
>> compiled with IB support.
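>>
>> (To double-check that the openib component was actually built, I believe
>> listing the installed components should show it, e.g.:
>>
>>   /opt/mpi/openmpi/1.3.3/intel/bin/ompi_info | grep openib
>>
>> which should print a "MCA btl: openib (...)" line if IB support was
>> compiled in.)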
>>
>> The IB utilities such as ibv_rc_pingpong are working fine.
>>
>> I don't understand why Open MPI is not using IB. Please help me resolve
>> this issue.
>>
>> Thanks
>>
>
