I'm trying to get openmpi over RoCE working with this setup:



card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
version:        8.37.0.20
license:        GPL
description:    QLogic FastLinQ 4xxxx Ethernet Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     A6AFD0788918644F2EFFF31
alias:          pci:v00001077d00008090sv*sd*bc*sc*i*
alias:          pci:v00001077d00008070sv*sd*bc*sc*i*
alias:          pci:v00001077d00001664sv*sd*bc*sc*i*
alias:          pci:v00001077d00001656sv*sd*bc*sc*i*
alias:          pci:v00001077d00001654sv*sd*bc*sc*i*
alias:          pci:v00001077d00001644sv*sd*bc*sc*i*
alias:          pci:v00001077d00001636sv*sd*bc*sc*i*
alias:          pci:v00001077d00001666sv*sd*bc*sc*i*
alias:          pci:v00001077d00001634sv*sd*bc*sc*i*
depends:        ptp,qed
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:           debug: Default debug msglevel (uint)

modinfo qedr

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
license:        Dual BSD/GPL
author:         QLogic Corporation
description:    QLogic 40G/100G ROCE Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     B5B65473217AA2B1F2F619B
depends:        qede,qed,ib_core
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can run RDMA ping-pong tests etc.
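For reference, a ping-pong check of this kind can be sketched as follows, using ibv_rc_pingpong from the libibverbs utilities. The device name qedr0 comes from the ibv_devinfo output above. The rc_pingpong wrapper name, the GID index 0, and the choice of node001 as the listening side are assumptions for illustration, not necessarily the exact invocation used here.

```shell
# Sketch of an RC ping-pong check between two nodes over a RoCE device.
# Assumptions: GID index 0 is usable, and node001 acts as the listener.
rc_pingpong() {
    dev="$1"          # verbs device, e.g. qedr0
    gid="$2"          # GID index (needed on Ethernet link layers)
    server="${3:-}"   # empty: listen; set: connect to that host
    ibv_rc_pingpong -d "$dev" -g "$gid" ${server:+"$server"}
}

# on node001 (listens):   rc_pingpong qedr0 0
# on node002 (connects):  rc_pingpong qedr0 0 node001
```

The listening side has to be started first; the connecting side then names it as the last argument.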



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: node001
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           node002
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       rdmacm
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.

...


I've tried several things such as:


1) upgrading the 3.10 kernel's qed* drivers to the latest stable version, 8.42.9

2) upgrading the CentOS kernel from 3.10 to 5.3 via elrepo

3) installing the latest OFED-4.17-1.tgz stack


but the error messages never go away and always remain the same.



Any advice is highly appreciated.
