Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
On Mon, Aug 16, 2021 at 09:41:18AM -0500, David Teigland wrote:
> On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> > Hi David,
> >
> > On 2021/8/13 1:45, David Teigland wrote:
> > > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > > In fact, I can reproduce this problem stably.
> > > > I want to know if this error is expected, since there is not any
> > > > extreme pressure test.
> > > > Second, how should we handle these error cases? Call dlm_lock again?
> > > > Maybe the function will fail again, which would lead to a kernel
> > > > soft lockup after multiple retries.
> > >
> > > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to
> > > cancel an in-progress dlm_lock() request. Before the cancel completes
> > > (or the original request completes), ocfs2 calls dlm_lock() again on
> > > the same resource. This dlm_lock() returns -EBUSY because the
> > > previous request has not completed, either normally or by
> > > cancellation. This is expected.
> >
> > Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> > different nodes?
>
> different

Sorry, same node
Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> Hi David,
>
> On 2021/8/13 1:45, David Teigland wrote:
> > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > In fact, I can reproduce this problem stably.
> > > I want to know if this error is expected, since there is not any
> > > extreme pressure test.
> > > Second, how should we handle these error cases? Call dlm_lock again?
> > > Maybe the function will fail again, which would lead to a kernel soft
> > > lockup after multiple retries.
> >
> > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > an in-progress dlm_lock() request. Before the cancel completes (or the
> > original request completes), ocfs2 calls dlm_lock() again on the same
> > resource. This dlm_lock() returns -EBUSY because the previous request has
> > not completed, either normally or by cancellation. This is expected.
>
> Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
> different nodes?

different

> > A couple options to try: wait for the original request to complete
> > (normally or by cancellation) before calling dlm_lock() again, or retry
> > dlm_lock() on -EBUSY.
>
> If I retry dlm_lock() repeatedly, I wonder if this will lead to a kernel
> soft lockup or waste a lot of CPU.

I'm not aware of other code doing this, so I can't tell you with
certainty. It would depend largely on the implementation in the caller.

> If dlm_lock() returns -EAGAIN, how should we handle this case? Retry it
> repeatedly?

Again, this is a question more about the implementation of the calling
code and what it wants to do. EAGAIN is specifically related to the
DLM_LKF_NOQUEUE flag.
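Since EAGAIN only arises with DLM_LKF_NOQUEUE, a caller can treat it
differently from -EBUSY instead of retrying both the same way. The sketch
below is a hedged userspace illustration of one possible dispatch, not part
of the DLM API: the `lock_action` enum and `classify_dlm_result()` are
made-up names for this example.

```c
#include <errno.h>

/* Illustrative outcomes a caller might distinguish; these names are
 * invented for this sketch and do not exist in the DLM headers. */
enum lock_action {
	LOCK_GRANT_PENDING,  /* request queued; the AST will report the grant */
	LOCK_RETRY_LATER,    /* -EBUSY: an earlier request/cancel is in flight */
	LOCK_NOT_AVAILABLE,  /* -EAGAIN with NOQUEUE: not grantable right now */
	LOCK_ERROR           /* genuine failure; surface it to the caller */
};

static enum lock_action classify_dlm_result(int ret, int used_noqueue)
{
	switch (ret) {
	case 0:
		return LOCK_GRANT_PENDING;
	case -EBUSY:
		return LOCK_RETRY_LATER;
	case -EAGAIN:
		/* Only expected when DLM_LKF_NOQUEUE was set: the lock could
		 * not be granted immediately and was not queued. Retrying in
		 * a tight loop just repeats the same answer; back off, or
		 * retry without NOQUEUE instead. */
		return used_noqueue ? LOCK_NOT_AVAILABLE : LOCK_ERROR;
	default:
		return LOCK_ERROR;
	}
}
```

The point of separating the two codes: -EBUSY means "try again once the
in-flight request settles", while -EAGAIN under NOQUEUE is a definitive
"not now" that blind retries will not change.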
Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
Hi David,

On 2021/8/13 1:45, David Teigland wrote:
> On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > In fact, I can reproduce this problem stably.
> > I want to know if this error is expected, since there is not any
> > extreme pressure test.
> > Second, how should we handle these error cases? Call dlm_lock again?
> > Maybe the function will fail again, which would lead to a kernel soft
> > lockup after multiple retries.
>
> What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> an in-progress dlm_lock() request. Before the cancel completes (or the
> original request completes), ocfs2 calls dlm_lock() again on the same
> resource. This dlm_lock() returns -EBUSY because the previous request has
> not completed, either normally or by cancellation. This is expected.

Are these dlm_lock and dlm_unlock calls invoked on the same node, or on
different nodes?

> A couple options to try: wait for the original request to complete
> (normally or by cancellation) before calling dlm_lock() again, or retry
> dlm_lock() on -EBUSY.

If I retry dlm_lock() repeatedly, I wonder if this will lead to a kernel
soft lockup or waste a lot of CPU.

If dlm_lock() returns -EAGAIN, how should we handle this case? Retry it
repeatedly?

Thanks
Gang

> Dave
Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> In fact, I can reproduce this problem stably.
> I want to know if this error is expected, since there is not any extreme
> pressure test.
> Second, how should we handle these error cases? Call dlm_lock again?
> Maybe the function will fail again, which would lead to a kernel soft
> lockup after multiple retries.

What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request. Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource. This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation. This is expected.

A couple options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.

Dave
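The second option, retrying on -EBUSY, can be bounded so that repeated
failures cannot spin the CPU or soft-lock the caller. The sketch below is a
userspace simulation, not real kernel code: `try_downconvert()` is a
hypothetical stand-in for the dlm_lock() downconvert call, and the retry
bound is an arbitrary choice by the caller.

```c
#include <errno.h>

static int busy_count;  /* simulated: how many attempts will hit -EBUSY */

/* Stand-in for a dlm_lock() downconvert request: returns -EBUSY while a
 * previous request (or its cancellation) is still in flight, 0 once the
 * request can be queued. The real API reports completion via the AST. */
static int try_downconvert(void)
{
	if (busy_count > 0) {
		busy_count--;
		return -EBUSY;
	}
	return 0;
}

/* Retry a bounded number of times rather than looping forever, so a
 * stuck resource cannot soft-lock the calling thread. */
static int downconvert_with_retry(int max_retries)
{
	int ret;

	do {
		ret = try_downconvert();
		if (ret != -EBUSY)
			return ret;
		/* Kernel code would msleep()/schedule() here instead of
		 * spinning, to avoid burning CPU between attempts. */
	} while (max_retries-- > 0);

	return ret;  /* still -EBUSY: give up and report it upward */
}
```

A real caller would more likely requeue the work item (as the ocfs2
downconvert thread processes a queue) than block in a loop, but the
bounded-retry shape is the same.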
Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
Hi Alexander,

On 2021/8/12 4:35, Alexander Aring wrote:
> Hi,
>
> On Wed, Aug 11, 2021 at 6:41 AM Gang He wrote:
> >
> > Hello List,
> >
> > I am using kernel 5.13.4 (some older kernel versions have the same
> > problem).
> > Node A acquired a DLM (EX) lock, then node B tried to get the lock and
> > node A got a BAST message. When node A downconverted the lock to NL,
> > the dlm_lock function failed with error -16.
> > The failure did not always happen, but in some cases I could encounter it.
> > Why does dlm_lock fail when downconverting a DLM lock? Are there any
> > documents describing these error cases?
> > If the code ignores the dlm_lock error on node A, node B will never
> > get the lock.
> > How should we handle this situation? Call dlm_lock to downconvert the
> > lock again?
>
> What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?

ocfs2 file system.

> I believe you are running into case [0]. Can you provide the
> corresponding log_debug() message? You need to set "log_debug=1" in your
> dlm.conf; the messages are then reported at KERN_DEBUG level in your
> kernel log.

[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap

The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The other two nodes' kernel logs:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem stably.
I want to know if this error is expected, since there is not any extreme
pressure test.
Second, how should we handle these error cases? Call dlm_lock again?
Maybe the function will fail again, which would lead to a kernel soft
lockup after multiple retries.

Thanks
Gang

> Thanks.
>
> - Alex
>
> [0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
Re: [Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He wrote:
>
> Hello List,
>
> I am using kernel 5.13.4 (some older kernel versions have the same
> problem).
> Node A acquired a DLM (EX) lock, then node B tried to get the lock and
> node A got a BAST message. When node A downconverted the lock to NL, the
> dlm_lock function failed with error -16.
> The failure did not always happen, but in some cases I could encounter it.
> Why does dlm_lock fail when downconverting a DLM lock? Are there any
> documents describing these error cases?
> If the code ignores the dlm_lock error on node A, node B will never get
> the lock.
> How should we handle this situation? Call dlm_lock to downconvert the
> lock again?

What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?

I believe you are running into case [0]. Can you provide the corresponding
log_debug() message? You need to set "log_debug=1" in your dlm.conf; the
messages are then reported at KERN_DEBUG level in your kernel log.

Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886
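For reference, the suggested setting is a one-line addition to dlm.conf
(the path below assumes the usual dlm_controld location; adjust for your
distribution):

```
# /etc/dlm/dlm.conf (path may vary by distribution)
# Enable DLM debug logging; messages appear at KERN_DEBUG in the kernel log.
log_debug=1
```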
[Cluster-devel] Why does dlm_lock fail when downconverting a DLM lock?
Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same
problem).
Node A acquired a DLM (EX) lock, then node B tried to get the lock and
node A got a BAST message. When node A downconverted the lock to NL, the
dlm_lock function failed with error -16.
The failure did not always happen, but in some cases I could encounter it.
Why does dlm_lock fail when downconverting a DLM lock? Are there any
documents describing these error cases?
If the code ignores the dlm_lock error on node A, node B will never get
the lock.
How should we handle this situation? Call dlm_lock to downconvert the
lock again?

Thanks
Gang