I'm not using Mellanox OFED because the card is a Marvell OCP-type
25 Gb/s 2-port LAN card.

The kernel drivers in use are:

qede + qedr

Besides that, I did a quick test on two nodes, installing CentOS 7.6
and OFED-4.17-1:

ofed_info -s
OFED-4.17-1:

and now the error message is different:



--------------------------------------------------------------------------
[[30578,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node001

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------




________________________________
From: Jeff Squyres (jsquyres) <jsquy...@cisco.com>
Sent: Wednesday, November 13, 2019 7:16:41 PM
To: Open MPI User's List
Cc: Llolsten Kaonga; Matteo Guglielmi
Subject: Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

Have you tried using the UCX PML?

The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the 
openib BTL).
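
As a rough sketch of what trying the UCX PML could look like: the helper below only assembles the command line. The function name, the process count and the application path are placeholders, not values from this thread, and it assumes an Open MPI build that includes UCX support (`ompi_info | grep -i ucx` should list it). Whether UCX works with the qedr devices is not established here.

```shell
# Placeholders for illustration only (not taken from this thread).
NP=2
APP=./my_mpi_app   # hypothetical MPI binary

# Print an mpirun command line that selects the UCX PML.
#   --mca pml ucx      select the UCX point-to-point layer
#   --mca btl ^openib  exclude the openib BTL so it is not opened at all
ucx_cmd() {
    printf 'mpirun -np %s --mca pml ucx --mca btl ^openib %s\n' "$1" "$2"
}

ucx_cmd "$NP" "$APP"
```

Printing the command first makes it easy to review before running it on the nodes.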


> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users 
> <users@lists.open-mpi.org> wrote:
>
> I rolled everything back to stock CentOS 7.7, installing OFED via:
>
> yum groupinstall @infiniband
> yum install rdma-core-devel infiniband-diags-devel
>
> which does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
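
For what it is worth, ofed_info is normally shipped by OFED bundles rather than by the distribution's rdma-core packages, so its absence after a yum groupinstall would be expected. A small sketch for reporting whichever stack is present (the function names are illustrative, and the rpm query assumes an RPM-based system such as CentOS):

```shell
# Return success if the named command is on PATH.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# Print the OFED release if ofed_info exists, otherwise fall back to
# the versions of the distribution's RDMA packages.
report_rdma_stack() {
    if have_cmd ofed_info; then
        ofed_info -s
    elif have_cmd rpm; then
        rpm -q rdma-core libibverbs
    else
        echo "neither ofed_info nor rpm found"
    fi
}

# Usage:
#   report_rdma_stack
```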
>
> Open MPI is version 3.1.4.
>
> The firmware version should be 8.37.7.0.
>
> I will now try to upgrade the firmware, since changing the OS is not
> an option.
>
> Other suggestions?
>
> Thank you!
>
> ________________________________
> From: Llolsten Kaonga <l...@soft-forge.com>
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
> device.
>
> Hello Matteo,
>
> What version of openmpi are you running?
>
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
> supports CentOS 7.6.
>
> Apologies if you have already tried CentOS 7.6.
>
> We have been able to run openmpi (earlier this month):
>
> OS:                      CentOS 7.6
> mpirun --version:        3.1.4
> ofed_info -s:            OFED-4.17-1
>
> RNIC fw version          8.50.9.0
>
> Thanks.
> --
> Llolsten
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Matteo Guglielmi
> via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi <matteo.guglie...@dalco.ch>
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
> device.
>
> I'm trying to get openmpi over RoCE working with this setup:
>
>
>
>
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
>
>
> OS: CentOS 7.7
>
>
> modinfo qede
>
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
> version:        8.37.0.20
> license:        GPL
> description:    QLogic FastLinQ 4xxxx Ethernet Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     A6AFD0788918644F2EFFF31
> alias:          pci:v00001077d00008090sv*sd*bc*sc*i*
> alias:          pci:v00001077d00008070sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001664sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001656sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001654sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001644sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001636sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001666sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001634sv*sd*bc*sc*i*
> depends:        ptp,qed
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:           debug: Default debug msglevel (uint)
>
> modinfo qedr
>
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
> license:        Dual BSD/GPL
> author:         QLogic Corporation
> description:    QLogic 40G/100G ROCE Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     B5B65473217AA2B1F2F619B
> depends:        qede,qed,ib_core
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
>
> ibv_devinfo
>
> hca_id: qedr0
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:8439
> sys_image_guid: b62e:99ff:fea7:8439
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
>
> hca_id: qedr1
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:843a
> sys_image_guid: b62e:99ff:fea7:843a
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_DOWN (1)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
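
Since qedr1 reports PORT_DOWN above, one thing that might be worth trying is restricting the openib BTL to the active port. The sketch below (the function name active_ports is made up for illustration) filters ibv_devinfo output into device:port pairs, which is the form accepted by the btl_openib_if_include MCA parameter. Whether this changes the qelr_alloc_context failure is an open question.

```shell
# Read ibv_devinfo output on stdin and print "device:port" for every
# port whose state is PORT_ACTIVE.
active_ports() {
    awk '
        $1 == "hca_id:" { dev = $2 }      # remember the current device
        $1 == "port:"   { port = $2 }     # remember the current port number
        $1 == "state:" && $2 == "PORT_ACTIVE" { print dev ":" port }
    '
}

# Usage on a node with the RDMA tools installed:
#   ibv_devinfo | active_ports
# The result can then be passed to Open MPI, for example:
#   mpirun --mca btl openib,self,vader --mca btl_openib_if_include qedr0:1 ...
```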
>
>
>
>
> RDMA actually works at the system level, which means that I can run
> RDMA ping-pong tests, etc.
>
>
>
> But when I try to run openmpi with these options:
>
>
>
> mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...
>
>
>
>
>
> I get the following error messages:
>
>
>
>
> --------------------------------------------------------------------------
> WARNING: There is at least non-excluded one OpenFabrics device found, but
> there are no active ports detected (or Open MPI was unable to use them).
> This is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this job.
>
>  Local host: node001
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be used on
> a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>  Local host:           node002
>  Local device:         qedr0
>  Local port:           1
>  CPCs attempted:       rdmacm
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
>
> ...
>
>
> I've tried several things, such as:
>
> 1) upgrading the 3.10 kernel's qed* drivers to the latest stable version,
>    8.42.9
>
> 2) upgrading the CentOS kernel from 3.10 to 5.3 via ELRepo
>
> 3) installing the latest OFED-4.17-1.tgz stack
>
> but the error messages never go away and always remain the same.
>
>
>
> Any advice is highly appreciated.
>
>
>


--
Jeff Squyres
jsquy...@cisco.com
