If you have multiple switches, this could be a faulty ISL (or a link to your
NSDs). I would look for SYMBOL errors on the ports; rapidly climbing counters
indicate a cable fault.
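A quick way to spot those counters across the fabric (assuming OFED's
infiniband-diags package is installed; the LID and port number below are
illustrative, not from this cluster) is something like:

```shell
# Walk the fabric and report ports whose error counters exceed the
# default thresholds -- SymbolErrorCounter and LinkDownedCounter
# climbing fast usually point at a bad cable or ISL.
ibqueryerrors

# Drill into one port's counters by LID and port number
# (replace 42 and 1 with values reported above -- illustrative only):
perfquery 42 1

# Reset the counters, then re-check after a few minutes to see how
# quickly the errors churn:
perfquery -R 42 1
```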

Simon


From: <[email protected]> on behalf of 
"[email protected]" <[email protected]>
Reply to: "[email protected]" <[email protected]>
Date: Friday, 9 July 2021 at 12:36
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR

Smells like a network problem.

IBV_WC_RETRY_EXC_ERR comes from OFED and clearly indicates that the data did
not get through successfully.

For further help, check:
ibstat
iblinkinfo
ibdiagnet
and sminfo (the output should be the same on all fabric members).
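For reference, a minimal pass over those tools might look like this (run from
a node with OFED's infiniband-diags installed):

```shell
# Local HCA state: each port should be Active/LinkUp at the expected rate.
ibstat

# Per-link view of the whole fabric: look for links that trained below
# their nominal width or speed (e.g. 1X instead of 4X).
iblinkinfo

# Full fabric diagnostic sweep; summarizes bad links and error counters.
ibdiagnet

# Subnet manager info -- every node should report the same master SM.
sminfo
```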




----- Original Message -----
From: "Iban Cabrillo" <[email protected]>
Sent by: [email protected]
To: "gpfsug-discuss" <[email protected]>
CC:
Subject: [EXTERNAL] [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR
Date: Fri, 9 Jul 2021 13:29

Dear all,
    For the past couple of hours we have been seeing lots of IB errors in the
GPFS logs on every IB node (GPFS version is 5.0.4-3):

  2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 
10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA 
write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 
(node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error 
IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 
(node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error 
IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 
(node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error 
IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 
(node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error 
IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 
(node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error 
IBV_WC_RETRY_EXC_ERR

and of course long waiters:

=== mmdiag: waiters ===
Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: 
delaying for 25.150686000 more seconds, reason: delaying for next reconnect 
attempt
Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: 
delaying for 25.375072000 more seconds, reason: delaying for next reconnect 
attempt
Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: 
delaying for 32.904264000 more seconds, reason: delaying for next reconnect 
attempt
Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA 
write completion fast on node 10.10.151.65 <c0n258>
Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA 
write completion fast on node 10.10.152.32 <c0n247>
Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA 
write completion fast on node 10.10.150.1 <c0n29>

[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0


[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0
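For completeness, per-node verbsPorts settings like the ones above are
normally inspected and changed with mmlsconfig/mmchconfig; a hedged sketch,
using the node names from the listing above:

```shell
# Show the current RDMA settings (as in the [common] stanza above).
mmlsconfig verbsRdma verbsPorts

# Example: point gpfs03 at both mlx5 ports, matching the listing.
# The change takes effect after GPFS is restarted on that node.
mmchconfig verbsPorts="mlx5_0/1/0 mlx5_1/1/0" -N gpfs03
```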

Any advice is welcome.
Regards, I


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



