On Fri, Aug 13, 2021 at 02:49:04PM +0800, Gang He wrote:
> Hi David,
>
> On 2021/8/13 1:45, David Teigland wrote:
> > On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> > > In fact, I can reproduce this problem stably.
> > > I want to know if this error happen is by our expectation? since there is
> > > not any extreme pressure test.
> > > Second, how should we handle these error cases? call dlm_lock function
> > > again? maybe the function will fails again, that will lead to kernel
> > > soft-lockup after multiple re-tries.
> >
> > What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
> > an in-progress dlm_lock() request. Before the cancel completes (or the
> > original request completes), ocfs2 calls dlm_lock() again on the same
> > resource. This dlm_lock() returns -EBUSY because the previous request has
> > not completed, either normally or by cancellation. This is expected.
> These dlm_lock and dlm_unlock are invoked in the same node, or the different
> nodes?

different

> > A couple options to try: wait for the original request to complete
> > (normally or by cancellation) before calling dlm_lock() again, or retry
> > dlm_lock() on -EBUSY.
> If I retry dlm_lock() repeatedly, I just wonder if this will lead to kernel
> soft lockup or waste lots of CPU.

I'm not aware of other code doing this, so I can't tell you with certainty.
It would depend largely on the implementation in the caller.

> If dlm_lock() function returns -EAGAIN, how should we handle this case?
> retry it repeatedly?

Again, this is a question more about the implementation of the calling code
and what it wants to do. EAGAIN is specifically related to the
DLM_LKF_NOQUEUE flag.
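
For illustration only, here is a rough userspace sketch of the "retry on
-EBUSY" option with a bounded number of attempts and a sleep between them, so
the caller neither spins on the CPU nor retries forever. This is not ocfs2 or
dlm code: lock_attempt_fn, lock_with_bounded_retry(), sleep_ms() and the
fake_lock helper are hypothetical names made up for the example, and the
callback merely stands in for whatever the caller does around dlm_lock(). In
kernel code the sleep would have to be something appropriate for the calling
context. Any result other than -EBUSY (including -EAGAIN) is handed straight
back, since what to do with it is up to the caller.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a dlm_lock() call site:
 * returns 0 on success or a negative errno value. */
typedef int (*lock_attempt_fn)(void *ctx);

/* Userspace sleep helper (assumption: a kernel caller would use a
 * context-appropriate sleep instead). */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Retry only while the attempt reports -EBUSY, at most max_tries times,
 * doubling the delay after each failure up to max_delay_ms.
 * Returns the last result; -EINVAL if max_tries <= 0. */
static int lock_with_bounded_retry(lock_attempt_fn attempt, void *ctx,
                                   int max_tries, long delay_ms,
                                   long max_delay_ms)
{
    int rc = -EINVAL;
    int i;

    assert(attempt != NULL);

    for (i = 0; i < max_tries; i++) {
        rc = attempt(ctx);
        if (rc != -EBUSY)
            return rc;          /* success, or an error the caller must handle */

        if (i + 1 < max_tries) {
            sleep_ms(delay_ms); /* give the earlier request time to finish */
            delay_ms *= 2;
            if (delay_ms > max_delay_ms)
                delay_ms = max_delay_ms;
        }
    }
    return rc;                  /* still -EBUSY after max_tries attempts */
}

/* Fake lock used only to exercise the helper: reports -EBUSY for the
 * first busy_remaining calls, then succeeds. */
struct fake_lock {
    int busy_remaining;
    int calls;
};

static int fake_attempt(void *ctx)
{
    struct fake_lock *f = ctx;

    f->calls++;
    if (f->busy_remaining > 0) {
        f->busy_remaining--;
        return -EBUSY;
    }
    return 0;
}
```

The cap on attempts is the part that addresses the soft-lockup worry: once it
is exhausted the caller still sees -EBUSY and has to decide whether to give up
or wait for the original request to complete.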