Hi Alexander,
On 2021/8/12 4:35, Alexander Aring wrote:
Hi,
On Wed, Aug 11, 2021 at 6:41 AM Gang He wrote:
Hello List,
I am using kernel 5.13.4 (some older kernel versions have the same problem).
Node A held a dlm lock in EX mode. When node B tried to acquire the same lock,
node A received a BAST message and then tried to downconvert the lock to NL,
but the dlm_lock function failed with error -16 (-EBUSY).
The failure does not happen every time, but I do hit it in some cases.
Why does the dlm_lock function fail when downconverting a dlm lock? Are there
any documents that describe these error cases?
If the code on node A ignores the error returned by dlm_lock, node B will never
get the dlm lock.
How should we handle this situation? Should we call dlm_lock again to
downconvert the lock?
What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?
ocfs2 file system.
I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? To get it, set "log_debug=1" in
your dlm.conf; the messages will then be reported at KERN_DEBUG level
in your kernel log.
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs from the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb
In fact, I can reproduce this problem reliably.
I would like to know whether this error is expected, since the test does
not put the cluster under any extreme pressure.
Second, how should we handle these error cases? Should we call dlm_lock
again? The call may fail again, which could lead to a kernel soft lockup
after multiple retries.
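To make the retry question concrete, here is a minimal userspace sketch of
a bounded retry with backoff, as opposed to retrying in a tight loop. It is
only an illustration of the idea, not ocfs2 or dlm code: retry_on_busy(),
convert_fn, struct fake_lock and fake_downconvert() are hypothetical names,
and the downconvert request is simulated. Whether retrying is the correct
reaction to -EBUSY here is exactly what I am asking.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback standing in for a downconvert request
 * (for example the dlm_lock() call made by the file system). */
typedef int (*convert_fn)(void *ctx);

/* Hypothetical helper: call fn() until it returns something other than
 * -EBUSY, but give up after max_tries attempts. The delay between
 * attempts doubles each time, capped at max_delay_ms. sleep_ms may be
 * NULL, in which case no waiting is done (useful for testing). */
int retry_on_busy(convert_fn fn, void *ctx, int max_tries,
                  unsigned int delay_ms, unsigned int max_delay_ms,
                  void (*sleep_ms)(unsigned int))
{
    int rv = -EINVAL;
    int i;

    assert(fn != NULL && max_tries > 0);

    for (i = 0; i < max_tries; i++) {
        rv = fn(ctx);
        if (rv != -EBUSY)
            return rv;              /* success, or a different error */

        if (i + 1 < max_tries) {    /* no need to wait after the last try */
            if (sleep_ms)
                sleep_ms(delay_ms);
            delay_ms *= 2;
            if (delay_ms > max_delay_ms)
                delay_ms = max_delay_ms;
        }
    }
    return rv;                      /* still -EBUSY: let the caller decide */
}

/* Simulated lock: reports -EBUSY busy_left times, then succeeds. */
struct fake_lock {
    int busy_left;
    int calls;
};

int fake_downconvert(void *ctx)
{
    struct fake_lock *l = (struct fake_lock *)ctx;

    l->calls++;
    if (l->busy_left > 0) {
        l->busy_left--;
        return -EBUSY;
    }
    return 0;
}
```

The point of the bound is that the caller gets -EBUSY back after a fixed
number of attempts and can report or requeue the work, instead of spinning
until the soft-lockup detector fires.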
Thanks
Gang
Thanks.
- Alex
[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886