Hi all,

(This is on 2.15.4 with very limited modifications, none to speak of in ldlm or 
similar)

Very rarely, when attempting an lctl barrier_freeze, we run into a situation 
where it fails with EINVAL.  From that point on, all lctl barrier operations 
(including rescan) return EINVAL, except barrier_status, which simply reports 
failed.  Things stay stuck that way until we either reboot the MDS or reach the 
heat death of the universe.  We have only tested the former, and it does work, 
though it's extremely heavyweight.
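
For reference, the sequence we run looks roughly like the below.  The fsname is 
illustrative, and I'm writing the status step as barrier_stat, which I believe 
is the actual lctl subcommand name for what I called barrier_status above:

  # on the MDS node, freeze the write barrier (timeout in seconds)
  lctl barrier_freeze testfs 30

  # check the barrier state; in the bad state this just reports 'failed'
  lctl barrier_stat testfs

  # once we're wedged, these (and any other barrier operation) return EINVAL
  lctl barrier_rescan testfs
  lctl barrier_thaw testfs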

We've been able to reproduce this on a test system (which we're leaving in that 
state), so we've collected traces, and two entries stand out:

00010000:00010000:7.0:1713898822.714580:0:2921:0:(ldlm_lockd.c:2382:ldlm_callback_handler())
 callback on lock 0x831100496f36c2f4 - lock disappeared
00010000:00010000:4.0:1713898822.710567:0:2676:0:(ldlm_lockd.c:2286:ldlm_callback_errmsg())
 @@@ Operate with invalid parameter, NID=12345-0@lo lock=0x831100496f36c2f4: rc 
= 0  req@000000002ae045d9 x1783750461943232/t0(0) 
o106->8afc7dcc-fc22-4bb8-8ca1-b5df01779cf4@0@lo:64/0 lens 400/224 e 0 to 0 dl 
1713899139 ref 1 fl Interpret:/0/0 rc -22/-22 job:''

This seems to correlate with this code:
2390         /*
2391          * Force a known safe race, send a cancel to the server for a lock
2392          * which the server has already started a blocking callback on.
2393          */
2394         if (OBD_FAIL_CHECK(OBD_FAIL_LDLM_CANCEL_BL_CB_RACE) &&
2395             lustre_msg_get_opc(req->rq_reqmsg) == LDLM_BL_CALLBACK) {
2396                 rc = ldlm_cli_cancel(&dlm_req->lock_handle[0], 0);
2397                 if (rc < 0)
2398                         CERROR("ldlm_cli_cancel: %d\n", rc);
2399         }
2400
2401         lock = ldlm_handle2lock_long(&dlm_req->lock_handle[0], 0);
2402         if (!lock) {
2403                 CDEBUG(D_DLMTRACE,
2404                        "callback on lock %#llx - lock disappeared\n",
2405                        dlm_req->lock_handle[0].cookie);
2406                 rc = ldlm_callback_reply(req, -EINVAL);
2407                 ldlm_callback_errmsg(req, "Operate with invalid parameter", rc,
2408                                      &dlm_req->lock_handle[0]);
2409                 RETURN(0);
2410         }

The weird thing is that this lock never expires: 0x831100496f36c2f4 is here to 
stay.  Attempting to clear it by writing "clear" to lru_size does nothing.
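
For concreteness, the clearing attempt was roughly the following (namespace 
wildcarded because we weren't sure which namespace the stale handle belongs to):

  # drop all unused locks from every ldlm namespace LRU on this node
  lctl set_param ldlm.namespaces.*.lru_size=clear

The cookie is still there afterwards, exactly as before.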

Practical questions:
1. I'm assuming NID=12345-0@lo is basically equivalent to "localhost" for 
Lustre, but let me know if it's not.  I see the first part is defined as 
LNET_PID_LUSTRE, though I'm not 100% sure what PID stands for here (I doubt 
it's a process ID?).
2. Is there some tool I've failed to find that can instruct Lustre to drop a 
lock by ID?  That would be handy, though I realize I'm asking for a "turn off 
the airbags" button.
3. One of my engineers made the following observation, in case it's helpful (a 
rough sketch of how we gathered the dump follows the quote):
"It appears that the "lock disappeared" lock exists in the dump_namespaces 
output as the remote end of another lock but nowhere as a lock itself. It's 
also interesting that it seems like the same resource appears twice with 
different resource IDs but the same 3 part ID that looks like a FID:
00010000:00010000:10.0:1713987044.511342:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump())
 --- Resource: [0x736665727473756c:0x5:0x0].0x0 (00000000c3f4ce61) refcount = 3
00010000:00010000:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1787:ldlm_resource_dump())
 Granted locks (in reverse order):
00010000:00010000:10.0:1713987044.511343:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump())
 ### ### ns: MGS lock: 00000000e2096801/0x831100496f36eea6 lrc: 2/0,0 mode: 
CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 
0x40000000000000 nid: 0@lo remote: 0x831100496f36ee9f expref: 14 pid: 2699 
timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: 00000000c3f4ce61 count: 3
00010000:00010000:10.0:1713987044.511344:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump())
 ### ### ns: MGS lock: 000000001007e670/0x831100496f36c2fb lrc: 2/0,0 mode: 
CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 4 type: PLN flags: 
0x40000000000000 nid: 0@lo remote: 0x831100496f36c2f4 expref: 14 pid: 2700 
timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.511345:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: 00000000c3f4ce61 count: 3
00010000:00000040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: 00000000c3f4ce61 count: 2
00010000:00000040:10.0:1713987044.511346:0:1897178:0:(ldlm_resource.c:1566:ldlm_resource_getref())
 getref res: 000000001ab54d34 count: 35

00010000:00010000:10.0:1713987044.512238:0:1897178:0:(ldlm_resource.c:1783:ldlm_resource_dump())
 --- Resource: [0x736665727473756c:0x5:0x0].0x0 (0000000019b90cb9) refcount = 2
00010000:00010000:10.0:1713987044.512239:0:1897178:0:(ldlm_resource.c:1787:ldlm_resource_dump())
 Granted locks (in reverse order):
00010000:00010000:10.0:1713987044.512239:0:1897178:0:(ldlm_resource.c:1790:ldlm_resource_dump())
 ### ### ns: MGC10.0.1.6@tcp lock: 00000000b9bf701e/0x831100496f36ee9f lrc: 
1/0,0 mode: CR/CR res: [0x736665727473756c:0x5:0x0].0x0 rrc: 3 type: PLN flags: 
0x1000000000000 nid: local remote: 0x831100496f36eea6 expref: -99 pid: 2702 
timeout: 0 lvb_type: 0
00010000:00000040:10.0:1713987044.512240:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: 0000000019b90cb9 count: 2
00010000:00000040:10.0:1713987044.512240:0:1897178:0:(ldlm_resource.c:1600:ldlm_resource_putref())
 putref res: 0000000019b90cb9 count: 1
00010000:00000040:10.0:1713987044.512241:0:1897178:0:(ldlm_resource.c:1566:ldlm_resource_getref())
 getref res: 00000000d6eb813a count: 2"
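
For context, the dump above was gathered roughly like this (the dump_namespaces 
parameter name is from memory, so treat the exact path as approximate):

  # dump all ldlm namespaces into the kernel debug log, then read it out
  lctl set_param ldlm.dump_namespaces=1
  lctl dk > /tmp/ldlm_dump.txt

  # look for the stale cookie
  grep -i 831100496f36c2f4 /tmp/ldlm_dump.txt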

Finally, note that I found this bug upstream:
https://jira.whamcloud.com/browse/LU-17412
But that one appears to be triggered by:
https://review.whamcloud.com/c/fs/lustre-release/+/52498
which only lands in 2.16.0, not in our versions (both of these hits were on 
2.15.1; we're now on 2.15.4, but I see no changes to ldlm suggesting this would 
be fixed between those two).  That commit feels like it would help our cause 
here (since it handles EINVAL), but ironically, in the ticket above it appears 
to be correlated with causing that pathology rather than curing it.

Any recommendations on further data I can gather?  I'll open a bug this week, 
but I thought I'd first ask if somebody has good ideas or can mentally pattern 
match this to some other bug I've overlooked that's been fixed in a later 
release.

Thanks as always!

ellis