Hi,

Recently, we have frequently seen OSTs being dropped at random by some client nodes.

We have 4 Lustre filesystems with a total of 126 OSTs. All clients are running
the 2.15.3 client on CentOS 7.
Servers run Lustre 2.12.8 on CentOS 7 (3 filesystems) and Lustre 2.15.3 on
Alma 8.8. Failures occur with both server versions. LNet runs over the OPA
interface.

Here is one example of the failure:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-ffff980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4
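
To run the same check across many clients, a small parser could pull out the
inactive devices from collected output. This is only a minimal sketch: it
assumes the column layout seen in the line above (index, status, type, name,
UUID, refcount), the function name inactive_devices is hypothetical, and the
"UP" line in the sample is made up for illustration.

```python
def inactive_devices(lctl_dl_output):
    """Return (type, name) pairs for devices whose status column is IN."""
    found = []
    for line in lctl_dl_output.splitlines():
        fields = line.split()
        # Assumed layout: index, status, type, name, uuid, refcount.
        if len(fields) >= 4 and fields[1] == "IN":
            found.append((fields[2], fields[3]))
    return found


# The first line is a hypothetical healthy device; the second is the
# inactive OSC reported on cedar5.
sample = (
    "125 UP osc cedar_sc-OST0009-osc-ffff980c76944800 "
    "52e66575-6443-4be9-a7ce-348b526a0836 4\n"
    "126 IN osc cedar_sc-OST000a-osc-ffff980c76944800 "
    "52e66575-6443-4be9-a7ce-348b526a0836 4\n"
)

for dev_type, name in inactive_devices(sample):
    print(dev_type, name)
```

Matching on the status column rather than grepping for ' IN ' avoids false
hits if that string ever appears inside another field.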

In syslog, we see:

Oct  4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-ffff980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct  4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-ffff980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct  4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-ffff980c76944800: granted 3407872 but already consumed 519700480
Oct  4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-ffff980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct  4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-ffff980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (ffff97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct  4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct  4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131072 > system dirty_max 131072
Oct  4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
....

This particular example is from a 2.15.3 server. Once this happens, the only
fix appears to be rebooting the client, after which the issue goes away.

Any ideas on where we should look?

Thank you very much.

Lixin.


