Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model

2021-08-12 Thread Darrick J. Wong
On Thu, Aug 12, 2021 at 08:49:14AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 11, 2021 at 12:17:08PM -0700, Darrick J. Wong wrote:
> > > iter.c is also my preference, but in the end I don't care too much.
> > 
> > Ok.  My plan for this is to change this patch to add the new iter code
> > to apply.c, and change patch 24 to remove iomap_apply.  I'll add a patch
> > on the end to rename apply.c to iter.c, which will avoid breaking the
> > history.
> 
> What history?  There is no shared code, so no shared history.

The history of the gluecode that enables us to walk a bunch of extent
mappings.  In the beginning it was the _apply function, but now in our
spectre-weary world, you've switched it to a direct loop to reduce the
number of indirect calls in the hot path by 30-50%.

As you correctly point out, there's no /code/ shared by the two
implementations, but Dave and I would like to preserve the continuity
from one to the next.
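
For anyone skimming the thread, the shape of the new model is roughly the
following; this is only a sketch, and example_iomap_walk() and
process_one_mapping() are placeholder names, not code from the series:

#include <linux/iomap.h>

/*
 * Sketch only: each caller open-codes the loop, so the per-mapping work
 * is a direct call instead of an indirect call through an actor function
 * pointer for every extent, as iomap_apply() used to do.
 */
static s64 process_one_mapping(struct iomap_iter *iter)
{
	/* placeholder for real per-mapping work; claim the whole mapping */
	return iomap_length(iter);
}

static int example_iomap_walk(struct inode *inode, loff_t pos, u64 len,
			      unsigned flags, const struct iomap_ops *ops)
{
	struct iomap_iter iter = {
		.inode	= inode,
		.pos	= pos,
		.len	= len,
		.flags	= flags,
	};
	int ret;

	/*
	 * iomap_iter() maps the next extent and returns > 0 while there
	 * is more of the requested range left to process.
	 */
	while ((ret = iomap_iter(&iter, ops)) > 0)
		iter.processed = process_one_mapping(&iter);

	return ret;
}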

> > I'll send the updated patches as replies to this series to avoid
> > spamming the list, since I also have a patchset of bugfixes to send out
> > and don't want to overwhelm everyone.
> 
> Just as a clear statement:  I think this dance is obfuscation and doesn't
> help in any way.  But if that's what it takes...

I /would/ appreciate it if you'd rvb (or at least ack) patch 31 so I can
get the 5.15 iomap changes finalized next week.  Pretty please? :)

--D



Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a dlm lock?

2021-08-12 Thread David Teigland
On Thu, Aug 12, 2021 at 01:44:53PM +0800, Gang He wrote:
> In fact, I can reproduce this problem reliably.
> I want to know whether this error is expected, since there is no extreme
> pressure test involved.
> Second, how should we handle these error cases? Call dlm_lock() again?
> The function may fail again, which would lead to a kernel soft-lockup
> after multiple retries.

What's probably happening is that ocfs2 calls dlm_unlock(CANCEL) to cancel
an in-progress dlm_lock() request.  Before the cancel completes (or the
original request completes), ocfs2 calls dlm_lock() again on the same
resource.  This dlm_lock() returns -EBUSY because the previous request has
not completed, either normally or by cancellation.  This is expected.

A couple options to try: wait for the original request to complete
(normally or by cancellation) before calling dlm_lock() again, or retry
dlm_lock() on -EBUSY.
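
If you go the retry route, a minimal sketch (not ocfs2 code; the retry
bound and sleep are arbitrary, and waiting for the outstanding request to
complete is usually the better option) might look like:

#include <linux/delay.h>
#include <linux/dlm.h>

/*
 * Sketch only: retry a downconvert to NL while the previous request on
 * the same lock is still outstanding and dlm_lock() returns -EBUSY.
 * For DLM_LKF_CONVERT the lock is identified by lksb->sb_lkid, so no
 * resource name is passed.
 */
static int downconvert_retry_sketch(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
				    void (*ast)(void *astarg),
				    void (*bast)(void *astarg, int mode),
				    void *astarg)
{
	int ret, tries = 0;

	do {
		ret = dlm_lock(ls, DLM_LOCK_NL, lksb, DLM_LKF_CONVERT,
			       NULL, 0, 0, ast, astarg, bast);
		if (ret != -EBUSY)
			break;
		msleep(10);	/* previous request still in flight */
	} while (++tries < 100);

	return ret;
}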

Dave



Re: [Cluster-devel] Why does the dlm_lock function fail when downconverting a dlm lock?

2021-08-12 Thread Gang He

Hi Alexander,


On 2021/8/12 4:35, Alexander Aring wrote:

Hi,

On Wed, Aug 11, 2021 at 6:41 AM Gang He  wrote:


Hello List,

I am using kernel 5.13.4 (some older kernel versions have the same problem).
Node A acquired a dlm (EX) lock, node B tried to get the dlm lock, and node A
got a BAST message. Node A then downconverted the dlm lock to NL, and the
dlm_lock function failed with the error -16.
The failure does not always happen, but in some cases I can reproduce it.
Why does the dlm_lock function fail when downconverting a dlm lock? Are there
any documents describing these error cases?
If the code ignores the dlm_lock error returned on node A, node B will never
get the dlm lock.
How should we handle this situation? Call dlm_lock again to downconvert the
dlm lock?


What is your dlm user? Is it kernel (e.g. gfs2/ocfs2/md) or user (libdlm)?

ocfs2 file system.



I believe you are running into case [0]. Can you provide the
corresponding log_debug() message? You need to set "log_debug=1" in
your dlm.conf; the messages are then reported at KERN_DEBUG level in
your kernel log.
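
For reference, that is a single setting in the daemon configuration
(assuming the default dlm_controld config path /etc/dlm/dlm.conf):

# /etc/dlm/dlm.conf (read by dlm_controld at startup)
log_debug=1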
[Thu Aug 12 12:04:55 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:00 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: addwait 10 cur 2 overlap 4 count 2 f 10
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: validate_lock_args -16 10 10 10c 2 0 M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_downconvert_lock:3674 ERROR: DLM error -16 while calling ocfs2_dlm_lock on resource M046e02
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_unblock_lock:3918 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] (ocfs2dc-ED6296E,1602,1):ocfs2_process_blocked_lock:4317 ERROR: status = -16
[Thu Aug 12 12:05:05 2021] dlm: ED6296E929054DFF87853DD3610D838F: remwait 10 cancel_reply overlap


The whole kernel log for this node is here:
https://pastebin.com/FBn8Uwsu
The kernel logs from the other two nodes:
https://pastebin.com/XxrZw6ds
https://pastebin.com/2Jw1ZqVb

In fact, I can reproduce this problem reliably.
I want to know whether this error is expected, since there is no extreme
pressure test involved.
Second, how should we handle these error cases? Call dlm_lock() again?
The function may fail again, which would lead to a kernel soft-lockup
after multiple retries.


Thanks
Gang



Thanks.

- Alex

[0] https://elixir.bootlin.com/linux/v5.14-rc5/source/fs/dlm/lock.c#L2886