Hi there,
I'm facing a strange issue with this HCA. A cluster I support has been
recently expanded with 4 new nodes, all using the mentioned HCA. 3 nodes
are working fine, but one will not use the IB network when running jobs.
Let's call the working one 'node a' and the non-working one 'node b'.
Here's my setup:
OS: Rocks Linux 6.1 (CentOS 6.5 x86_64)
MPI: Stock CentOS rpm. 'ompi_info' output below:
package:Open MPI mockbu...@c6b8.bsys.dev.centos.org Distribution
ompi:version:full:1.5.4
ompi:version:svn:r25060
ompi:version:release_date:Aug 18, 2011
orte:version:full:1.5.4
orte:version:svn:r25060
orte:version:release_date:Aug 18, 2011
opal:version:full:1.5.4
opal:version:svn:r25060
opal:version:release_date:Aug 18, 2011
ident:1.5.4
PATH:
/usr/lib64/openmpi/bin:/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/usr/java/latest/bin:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/root/bin
LD_LIBRARY_PATH: /usr/lib64/openmpi/lib
OpenFabrics: Stock CentOS rpm
libibumad-1.3.8-1.el6.x86_64
libibmad-1.3.9-1.el6.x86_64
libibverbs-utils-1.1.7-1.el6.x86_64
libibverbs-1.1.7-1.el6.x86_64
librdmacm-1.0.17-1.el6.x86_64
infinipath-psm-3.0.1-115.1015_open.2.el6.x86_64
ulimit -l:
'unlimited' on both nodes
Here's where things get interesting. On all nodes with the QLogic HCA,
'ibv_devinfo' does not output what is expected, only:
"libibverbs: Warning: no userspace device-specific driver found
for /sys/class/infiniband_verbs/uverbs0
No IB devices found"
But I've successfully run tests on 'node a', like IMB ping and hello
world, from other working nodes of the cluster, so despite the output of
'ibv_devinfo', the 'node a' HCA is working.
I can run 'hello world' from 'node b' to 'node a' without problems,
but the opposite direction ('node a' to 'node b') does not work.
So this is my question: why is only the 'node b' HCA not working?
Are there any other tests I can run to get closer to the source of the
problem?
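One comparison I can think of is gathering the same facts on 'node a' and
'node b' and diffing them. The script below is only a sketch of that idea.
The function name collect_ib_facts is made up, and the 'ib_qib' module name
and the /dev/ipath* device nodes are my assumptions about how the QLogic HCA
and PSM show up on these systems, not something I have confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: print the same set of facts on any node so that
# the output from two nodes can be compared with 'diff'.
collect_ib_facts() {
    echo "== memlock limit =="
    ulimit -l 2>/dev/null || echo "unknown"

    echo "== RDMA-related packages =="
    if command -v rpm >/dev/null 2>&1; then
        rpm -qa | grep -E -i 'ibverbs|ibumad|ibmad|rdmacm|infinipath|openmpi' | sort
    else
        echo "rpm not available"
    fi

    echo "== uverbs devices =="
    ls /sys/class/infiniband_verbs 2>/dev/null || echo "none"

    echo "== ibv_devinfo =="
    if command -v ibv_devinfo >/dev/null 2>&1; then
        ibv_devinfo 2>&1
    else
        echo "ibv_devinfo not available"
    fi

    # Assumption: the QLogic HCA is driven by the ib_qib kernel module.
    echo "== kernel module =="
    if command -v lsmod >/dev/null 2>&1; then
        lsmod | grep ib_qib || echo "ib_qib not loaded"
    else
        echo "lsmod not available"
    fi

    # Assumption: PSM talks to the HCA through /dev/ipath* device nodes.
    echo "== ipath device nodes =="
    ls -l /dev/ipath* 2>/dev/null || echo "none"
}

collect_ib_facts
```

Running it on each node with the output redirected to a file (for example
/tmp/ib_facts.nodea.txt and /tmp/ib_facts.nodeb.txt, both hypothetical names)
and then running 'diff' on the two files should show any package, module,
or device difference between the nodes.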
TIA
Fabricio