In addition to what Olaf has said:
 
ESS upgrades include Mellanox module upgrades on the ESS nodes. In fact, on those nodes you should not update those modules on their own (unless support says so in your PMR), so if that has been the recommendation, I suggest you look into it.
 
Changelog for ESS 4.0.4 (I have no idea what ESS level you are running); a quick way to check the installed OFED level is sketched after the excerpt:
 
  c) Support of MLNX_OFED_LINUX-3.2-2.0.0.1
     - Updated from MLNX_OFED_LINUX-3.1-1.0.6.1 (ESS 4.0, 4.0.1, 4.0.2)
     - Updated from MLNX_OFED_LINUX-3.1-1.0.0.2 (ESS 3.5.x)
     - Updated from MLNX_OFED_LINUX-2.4-1.0.2 (ESS 3.0.x)
     - Support for PCIe3 LP 2-port 100 Gb EDR InfiniBand adapter x16 (FC EC3E)
       - Requires System FW level FW840.20 (SV840_104)
     - No changes from ESS 4.0.3
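
If you want to confirm what the nodes are actually running before comparing against that excerpt, a minimal sketch like the one below is usually enough; it assumes the ofed_info utility that ships with MLNX_OFED is installed and on the PATH.

#!/usr/bin/env python3
# Minimal sketch: print the installed MLNX_OFED level so it can be compared
# against the ESS changelog above. Assumes the ofed_info utility shipped
# with MLNX_OFED is installed and on the PATH.
import subprocess

def ofed_level():
    try:
        out = subprocess.check_output(["ofed_info", "-s"])
    except (OSError, subprocess.CalledProcessError):
        return "MLNX_OFED not detected"
    # ofed_info -s prints a single line such as "MLNX_OFED_LINUX-3.2-2.0.0.1:"
    return out.decode().strip().rstrip(":")

if __name__ == "__main__":
    print(ofed_level())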

--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/

IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
Phone: +358 503112585

"If you continually give you will continually have." Anonymous
 
 
----- Original message -----
From: "Olaf Weiser" <olaf.wei...@de.ibm.com>
Sent by: gpfsug-discuss-boun...@spectrumscale.org
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Cc:
Subject: Re: [gpfsug-discuss] nodes being ejected out of the cluster
Date: Wed, Jan 11, 2017 5:03 PM
 
Most likely there's something wrong with your IB fabric.
You say you run ~700 nodes?
Are you running with verbsRdmaSend enabled? If so, please consider disabling it, and discuss this within the PMR.
Another thing you may want to check: are you running IPoIB in connected mode or datagram mode? But as I said, please discuss this within the PMR; there are too many dependencies to cover here.
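
For what it's worth, both of those settings can be inspected without changing anything. Below is a minimal read-only sketch for a single node; it assumes the GPFS commands live in the usual /usr/lpp/mmfs/bin location and that the IPoIB interfaces follow the common ib0/ib1 naming. Any actual change should of course go through the PMR.

#!/usr/bin/env python3
# Minimal sketch of the two read-only checks mentioned above, run on one node.
# Assumptions: GPFS admin commands are in /usr/lpp/mmfs/bin and the IPoIB
# interfaces are named ib*.
import glob
import os
import subprocess

MMLSCONFIG = "/usr/lpp/mmfs/bin/mmlsconfig"

def verbs_rdma_send():
    # mmlsconfig <attribute> prints the current value of that attribute
    out = subprocess.check_output([MMLSCONFIG, "verbsRdmaSend"])
    return out.decode().strip()

def ipoib_modes():
    # Each IPoIB interface exposes its mode (connected or datagram) in sysfs
    modes = {}
    for path in glob.glob("/sys/class/net/ib*/mode"):
        iface = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            modes[iface] = f.read().strip()
    return modes

if __name__ == "__main__":
    print(verbs_rdma_send())
    for iface, mode in sorted(ipoib_modes().items()):
        print("%s: IPoIB %s mode" % (iface, mode))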


cheers

 
Mit freundlichen Grüßen / Kind regards

 
Olaf Weiser

EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
IBM Allee 1
71139 Ehningen
Phone: +49-170-579-44-66
E-Mail: olaf.wei...@de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland GmbH / Chairman of the Supervisory Board: Martin Jetter
Management: Martina Koederitz (Chair), Susanne Peter, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
Registered office: Ehningen / Registration court: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-No. DE 99369940




From:        Damir Krstic <damir.krs...@gmail.com>
To:        gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:        01/11/2017 03:39 PM
Subject:        [gpfsug-discuss] nodes being ejected out of the cluster
Sent by:        gpfsug-discuss-boun...@spectrumscale.org



We are running GPFS 4.2 on our cluster (around 700 compute nodes). Our storage (ESS GL6) is also running GPFS 4.2. Compute nodes and storage are connected via InfiniBand (FDR14). At the time of the ESS implementation, we were instructed to enable RDMA in addition to IPoIB. Previously we only ran IPoIB on our GPFS 3.5 cluster.

Ever since the implementation (sometime back in July 2016) we have seen a lot of compute nodes being ejected. What usually precedes an ejection are the following messages:

Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 vendor_err 135 
Jan 11 02:03:15 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 vendor_err 135 
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR index 1
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 vendor_err 135 
Jan 11 02:03:26 quser13 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 2
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 vendor_err 135 
Jan 11 02:06:38 quser11 mmfs: [E] VERBS RDMA closed connection to 172.41.2.5 (gssio2-fdr) on mlx4_0 port 1 fabnum 0 due to send error IBV_WC_WR_FLUSH_ERR index 400

Even our ESS IO server sometimes ends up being ejected (case in point: yesterday morning; a quick way to tally these errors is sketched after the excerpt):

Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 vendor_err 135
Jan 10 11:23:42 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 1 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 3001
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 (gssio1-fdr) on mlx5_1 port 2 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 2671
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 vendor_err 135
Jan 10 11:23:43 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 2 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 2495
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA rdma send error IBV_WC_RNR_RETRY_EXC_ERR to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 vendor_err 135
Jan 10 11:23:44 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.2.1 (gssio1-fdr) on mlx5_0 port 1 fabnum 0 due to send error IBV_WC_RNR_RETRY_EXC_ERR index 3077
Jan 10 11:24:11 gssio2 mmfs: [N] Node 172.41.2.1 (gssio1-fdr) lease renewal is overdue. Pinging to check if it is alive
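
To see whether the errors concentrate on one peer, adapter or port, or are spread across the fabric, I tally them roughly as in the sketch below; it only matches the message format shown above and assumes the entries land in /var/log/messages.

#!/usr/bin/env python3
# Rough sketch: tally the VERBS RDMA send errors shown above by peer, HCA,
# port and error code, to see whether the problem follows one link or is
# spread across the fabric. Assumes the messages are in /var/log/messages
# (or a file passed as the first argument) in the format quoted above.
import collections
import re
import sys

PATTERN = re.compile(
    r"VERBS RDMA rdma send error (\S+) to (\S+) \((\S+)\) on (\S+) port (\d+)"
)

def tally(lines):
    counts = collections.Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            err, _ip, peer, hca, port = m.groups()
            counts[(peer, hca, port, err)] += 1
    return counts

if __name__ == "__main__":
    log = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(log) as f:
        for (peer, hca, port, err), n in tally(f).most_common():
            print("%5d  %s  %s port %s  %s" % (n, peer, hca, port, err))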

I've had multiple PMRs open for this issue, and I am told that our ESS needs code-level upgrades in order to fix it. Looking at the errors, I think the issue is InfiniBand related, and I am wondering if anyone on this list has seen similar issues.

Thanks for your help in advance.

Damir

 
 

 

Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
