Hi,

Recently we have frequently seen OSTs being randomly dropped by some client nodes.
We have 4 Lustre filesystems with a total of 126 OSTs. All clients run the Lustre 2.15.3 client on CentOS 7. The servers run Lustre 2.12.8 on CentOS 7 (3 filesystems) and Lustre 2.15.3 on Alma 8.8 (the fourth filesystem). The failures occur with both server versions. LNet uses an OPA interface.

One example of the failure looks like this:

# lctl dl | grep ' IN '
126 IN osc cedar_sc-OST000a-osc-ffff980c76944800 52e66575-6443-4be9-a7ce-348b526a0836 4

In syslog, we see:

Oct 4 23:24:30 cedar5 kernel: LustreError: 11-0: cedar_sc-OST000a-osc-ffff980c76944800: operation ldlm_enqueue to node 172.19.128.33@o2ib failed: rc = -107
Oct 4 23:24:30 cedar5 kernel: Lustre: cedar_sc-OST000a-osc-ffff980c76944800: Connection to cedar_sc-OST000a (at 172.19.128.33@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 4 23:24:30 cedar5 kernel: LustreError: 5195:0:(osc_request.c:1037:osc_init_grant()) cedar_sc-OST000a-osc-ffff980c76944800: granted 3407872 but already consumed 519700480
Oct 4 23:24:30 cedar5 kernel: LustreError: 167-0: cedar_sc-OST000a-osc-ffff980c76944800: This client was evicted by cedar_sc-OST000a; in progress operations using this service will fail.
Oct 4 23:24:31 cedar5 kernel: LustreError: 62880:0:(ldlm_resource.c:1126:ldlm_resource_complain()) cedar_sc-OST000a-osc-ffff980c76944800: namespace resource [0x73fbbe2:0x0:0x0].0x0 (ffff97fe127e3080) refcount nonzero (1) after lock cleanup; forcing cleanup.
Oct 4 23:24:31 cedar5 kernel: LustreError: 5218:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:36 cedar5 kernel: LustreError: 5209:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
Oct 4 23:24:47 cedar5 kernel: LustreError: 5220:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131072 > system dirty_max 131072
Oct 4 23:25:36 cedar5 kernel: LustreError: 5242:0:(osc_request.c:711:osc_announce_cached()) cedar_sc-OST000a-osc-ffff980c76944800: dirty 131074 > system dirty_max 131072
....

This particular example is from the 2.15.3 server. Once this happens, the only fix we have found is to reboot the client, after which the issue goes away.

Any ideas on where we should look?

Thank you very much.

Lixin.