Hi there

I'm facing a strange issue with this HCA. A cluster I support was recently expanded with 4 new nodes, all using the mentioned HCA. 3 nodes are working fine, but one will not use the IB network when running jobs. Let's call the working one 'node a' and the non-working one 'node b'. Here's my scenario:



OS: Rocks Linux 6.1 (CentOS 6.5 x86_64)

MPI: Stock CentOS rpm. 'ompi_info' output below:
package:Open MPI mockbu...@c6b8.bsys.dev.centos.org Distribution
ompi:version:full:1.5.4
ompi:version:svn:r25060
ompi:version:release_date:Aug 18, 2011
orte:version:full:1.5.4
orte:version:svn:r25060
orte:version:release_date:Aug 18, 2011
opal:version:full:1.5.4
opal:version:svn:r25060
opal:version:release_date:Aug 18, 2011
ident:1.5.4

PATH: /usr/lib64/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/root/bin

LD_LIBRARY_PATH: /usr/lib64/openmpi/lib

OpenFabrics: Stock CentOS rpm
libibumad-1.3.8-1.el6.x86_64
libibmad-1.3.9-1.el6.x86_64
libibverbs-utils-1.1.7-1.el6.x86_64
libibverbs-1.1.7-1.el6.x86_64
librdmacm-1.0.17-1.el6.x86_64
infinipath-psm-3.0.1-115.1015_open.2.el6.x86_64

ulimit -l:
'unlimited' on both nodes



Here's where things get interesting. On all nodes with the QLogic HCA, 'ibv_devinfo' does not output what is expected, only:
        "libibverbs: Warning: no userspace device-specific driver found
        for /sys/class/infiniband_verbs/uverbs0
        No IB devices found"

But I've successfully run tests on 'node a', like IMB ping and hello world, from other working nodes of the cluster, so despite the output of 'ibv_devinfo', the HCA on 'node a' is working.

I can run 'hello world' from 'node b' to 'node a' without problems,
but the opposite does not work.

So this is my question: why is only the HCA on 'node b' not working?
Are there any other tests I can run to get closer to the source of the problem?


TIA
Fabricio
