Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote:
> On 9/3/23 23:30, Dave Chinner wrote:
> > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> > > On 8/29/23 19:53, Matthew Wilcox wrote:
> > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu
> > > > > > >
> > > > > > > Add a boolean parameter for file_accessed() to support nowait
> > > > > > > semantics. Currently it is true only with io_uring as its
> > > > > > > initial caller.
> > > > > >
> > > > > > So why do we need to do this as part of this series?  Apparently it
> > > > > > hasn't caused any problems for filemap_read().
> > > > >
> > > > > We need this parameter to indicate if nowait semantics should be
> > > > > enforced in touch_atime(). There are locks and maybe IOs in it.
> > > >
> > > > That's not my point.  We currently call file_accessed() and
> > > > touch_atime() for nowait reads and nowait writes.  You haven't done
> > > > anything to fix those.
> > > >
> > > > I suspect you can trim this patchset down significantly by avoiding
> > > > fixing the file_accessed() problem.  And then come back with a later
> > > > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> > >
> > > I'm ok to do that.
> > >
> > > > first that fixes it for the existing nowait users, and then a second
> > > > series to do all the directory stuff.
> > > >
> > > > I'd do the first thing.  Just ignore the problem.  Directory atime
> > > > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > > > everyone uses relatime or nodiratime.
> > >
> > > Hi Matthew,
> > > The previous discussion shows this does cause issues in real
> > > productions:
> > > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> >
> > Then separate it out into its own patch set so we can have a
> > discussion on the merits of requiring noatime, relatime or lazytime
> > for really latency sensitive IO applications. Changing code is not
> > always the right solution...
>
> Separation sounds reasonable, but it can hardly be said that only
> latency sensitive apps would care about >1s nowait/async submission
> delays. Presumably, btrfs can improve on that, but it still looks
> like it's perfectly legit for filesystems to do heavy stuff in
> timestamping, like waiting for IO. Right?

Yes, it is; no-one is denying that. And some filesystems are worse
than others, but none of that means it has to be fixed so getdents
can be converted to NOWAIT semantics. i.e. this patchset is about the
getdents NOWAIT machinery, and fiddling around with timestamps has
much, much wider scope than just the NOWAIT getdents machinery.

We'll have this discussion about NOWAIT timestamp updates when an RFC
is proposed to address the wider problem of how timestamp updates
should behave in NOWAIT context.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
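Stripped to its essentials, the semantics being argued over are: with nowait set, a lock that would block turns into an -EAGAIN return so io_uring can retry from a worker thread that is allowed to sleep. A minimal userspace model of that shape (all names here are illustrative, not the kernel API; a pthread mutex stands in for the inode locking inside touch_atime()):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct file_model {
	pthread_mutex_t atime_lock;	/* stands in for inode locking in touch_atime() */
	time_t atime;
};

/*
 * Model of file_accessed() with a nowait flag: trylock instead of
 * lock, reporting -EAGAIN when the atime update would have blocked so
 * the caller can punt the operation to a context that may sleep.
 */
static int file_accessed_model(struct file_model *f, bool nowait)
{
	if (nowait) {
		if (pthread_mutex_trylock(&f->atime_lock) != 0)
			return -EAGAIN;
	} else {
		pthread_mutex_lock(&f->atime_lock);
	}
	f->atime = time(NULL);
	pthread_mutex_unlock(&f->atime_lock);
	return 0;
}
```

The model deliberately ignores the harder part of the thread above: a real touch_atime() may also issue IO, which a trylock alone cannot make non-blocking.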
Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> Implement NOWAIT semantics for readdir. Return EAGAIN error to the
> caller if it would block, like failing to get locks, or going to
> do IO.
>
> Co-developed-by: Dave Chinner

Not really. "Co-developed" implies equal development input between
all the parties, which is not the case here - this patch is based on
a prototype I wrote, whilst you're doing the refining, testing and
correctness work. In these cases with XFS code, we add a line in the
commit message to say:

"This is based on a patch originally written by Dave Chinner."

> Signed-off-by: Dave Chinner
> Signed-off-by: Hao Xu
> [fixes deadlock issue, tweak code style]

With a signoff chain like you already have. In the end you'll also
get a RVB from me, which seems rather wrong to me if I've apparently
been "co-developing" the code.

> @@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
> 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
> 		return 0;
> 
> -	error = xfs_dir3_block_read(args->trans, dp, &bp);
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> +		flags |= XFS_DABUF_NOWAIT;
> +	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
> 	if (error)
> 		return error;
> 

Given we do this same check in both block and leaf formats to set
XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in
xfs_readdir() as well, we should probably do this check once at the
higher level and pass flags down from there with XFS_DABUF_NOWAIT
already set.
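The restructuring suggested here - translate the VFS-level nowait flag into the XFS buffer flag once at the xfs_readdir() level and hand the result down - can be sketched in plain C (the flag values and the helper name are illustrative, not the kernel's):

```c
#include <assert.h>

/* Illustrative flag values; the kernel defines these elsewhere. */
#define DIR_CONTEXT_F_NOWAIT	(1u << 0)	/* VFS level: caller cannot block */
#define XFS_DABUF_NOWAIT	(1u << 4)	/* buffer level: trylock, no blocking IO */

struct dir_context_model {
	unsigned int flags;
};

/*
 * Done once at the top (xfs_readdir() in the patch under review); the
 * block/leaf getdents paths then only receive 'flags' and never need
 * to look at the VFS dir_context again.
 */
static unsigned int readdir_buf_flags(const struct dir_context_model *ctx)
{
	unsigned int flags = 0;

	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		flags |= XFS_DABUF_NOWAIT;
	return flags;
}
```

The payoff is the one Dave names: the low-level directory format code stays VFS-agnostic, and the nowait decision is made exactly once.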
> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
>  STATIC int
>  xfs_dir2_leaf_readbuf(
>  	struct xfs_da_args	*args,
> +	struct dir_context	*ctx,
>  	size_t			bufsize,
>  	xfs_dir2_off_t		*cur_off,
>  	xfs_dablk_t		*ra_blk,
> @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
>  	struct xfs_iext_cursor	icur;
>  	int			ra_want;
>  	int			error = 0;
> -
> -	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> -	if (error)
> -		goto out;
> +	unsigned int		flags = 0;
> +
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> +		flags |= XFS_DABUF_NOWAIT;
> +	} else {
> +		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> +		if (error)
> +			goto out;
> +	}

Especially as, in hindsight, this doesn't make a whole lot of sense.
If XFS_DABUF_NOWAIT is set, we keep going until
xfs_ilock_data_map_shared_nowait(), where we call
xfs_need_iread_extents() to see if we need to read the extents in and
abort at that point. So, really, we shouldn't get this far with
nowait semantics if we haven't read the extents in yet - we're
supposed to already have the inode locked here, and so we should have
already checked this condition before we bother locking the inode...

i.e. all we should be doing here is this:

	if (!(flags & XFS_DABUF_NOWAIT)) {
		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
		if (error)
			goto out;
	}

And then we don't need to pass the VFS dir_context down into low
level XFS functions unnecessarily.

> 
>  	/*
>  	 * Look for mapped directory blocks at or above the current offset.
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
>  		new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
>  		if (new_off > *cur_off)
>  			*cur_off = new_off;
> -		error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
> +		error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
>  		if (error)
>  			goto out;
> 
> @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
>  	int			byteoff;	/* offset in current block */
>  	unsigned int		offset = 0;
>  	int			error = 0;	/* error return value */
> +	int			written = 0;
> 
>  	/*
>  	 * If the offset is at or past the largest allowed value,
> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>  			bp = NULL;
>  		}
> 
> -		if (*lock_mode == 0)
> -			*lock_mode = xfs_ilock_data_map_shared(dp);
> -		error = xfs_dir2_leaf_readbuf(args, bufsize, ,
> -				, );
> +		if (*lock_mode == 0) {
> +			*lock_mode =
> +				xfs_ilock_data_map_shared_generic(dp, >
Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> On 8/29/23 19:53, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu
> > > > >
> > > > > Add a boolean parameter for file_accessed() to support nowait
> > > > > semantics. Currently it is true only with io_uring as its
> > > > > initial caller.
> > > >
> > > > So why do we need to do this as part of this series?  Apparently it
> > > > hasn't caused any problems for filemap_read().
> > >
> > > We need this parameter to indicate if nowait semantics should be
> > > enforced in touch_atime(). There are locks and maybe IOs in it.
> >
> > That's not my point.  We currently call file_accessed() and
> > touch_atime() for nowait reads and nowait writes.  You haven't done
> > anything to fix those.
> >
> > I suspect you can trim this patchset down significantly by avoiding
> > fixing the file_accessed() problem.  And then come back with a later
> > patchset that fixes it for all nowait i/o.  Or do a separate prep series
>
> I'm ok to do that.
>
> > first that fixes it for the existing nowait users, and then a second
> > series to do all the directory stuff.
> >
> > I'd do the first thing.  Just ignore the problem.  Directory atime
> > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > everyone uses relatime or nodiratime.
>
> Hi Matthew,
> The previous discussion shows this does cause issues in real
> productions:
> https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write

Then separate it out into its own patch set so we can have a
discussion on the merits of requiring noatime, relatime or lazytime
for really latency sensitive IO applications. Changing code is not
always the right solution...
-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH RFC v5 00/29] io_uring getdents
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> This series introduces getdents64 to io_uring; the code logic is
> similar to the synchronized version's. It first tries a nowait
> issue, and offloads it to io-wq threads if the first try fails.
>
> Patch 1 and Patch 2 are some preparation.
> Patch 3 supports nowait for the xfs getdents code.
> Patches 4-11 are vfs changes, including adding helpers and trylock
> for locks.
> Patches 12-29 support nowait for the involved xfs journal stuff.
> Note, Patches 24 and 27 are actually two questions and might be
> removed later. An xfs test may come later.

You need to drop all the XFS journal stuff. It's fundamentally broken
as it stands, and we cannot support non-blocking transactional
changes without first putting a massive investment into transaction
and intent chain rollback to allow correctly undoing partially
complete modifications.

Regardless, non-blocking transactions are completely unnecessary for
a non-blocking readdir implementation. readdir should only be
touching atime, and with relatime it should only occur once every 24
hours per inode. If that's a problem, then we have noatime mount
options. Hence I just don't see any point in worrying about having a
timestamp update block occasionally...

I also really don't see why you need to fiddle with xfs buffer cache
semantics - it already has the functionality "nowait" buffer reads
require (i.e. XBF_INCORE|XBF_TRYLOCK).

However, the readahead IO that the xfs readdir code issues cannot use
your defined NOWAIT semantics - it must be able to allocate memory
and issue IO. Readahead already avoids blocking on memory allocation
and blocking on IO via the XBF_READ_AHEAD flag. This sets
__GFP_NORETRY for buffer allocation and REQ_RAHEAD for IO.
Hence readahead only needs the existing XBF_TRYLOCK flag to be set to
be compatible with the required NOWAIT semantics.

As for the NOIO memory allocation restrictions io_uring requires,
those should be enforced at the io_uring layer before calling into
the VFS using memalloc_noio_save/restore. At that point no memory
allocation will trigger IO, and none of the code running under NOWAIT
conditions even needs to be aware that io_uring has a GFP_NOIO
restriction on memory allocation.

Please go back to the simple "do non-blocking buffer IO"
implementation we started with, and don't try to solve every little
blocking problem that might exist in the VFS and filesystems...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
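The memalloc_noio_save()/memalloc_noio_restore() pairing referred to here is a scoped, per-task flag: every allocation made inside the scope implicitly loses the ability to start IO, so none of the callees need to carry a gfp mask around. A userspace model of that scoping behaviour, including nesting (this only illustrates the pattern; the kernel implementation lives in the task struct and the page allocator):

```c
#include <assert.h>

#define PF_MEMALLOC_NOIO 0x1u

/* Stands in for current->flags in the kernel. */
static unsigned int task_flags;

/* Enter a no-IO allocation scope, returning the previous state. */
static unsigned int memalloc_noio_save_model(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOIO;

	task_flags |= PF_MEMALLOC_NOIO;
	return old;
}

/* Leave the scope; nested scopes restore correctly via the saved state. */
static void memalloc_noio_restore_model(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOIO) | old;
}

/* What the allocator would consult: may this allocation start IO? */
static int alloc_may_do_io_model(void)
{
	return !(task_flags & PF_MEMALLOC_NOIO);
}
```

This is why Dave can say the VFS code "doesn't even need to be aware" of the restriction: the constraint travels with the task, not with the call chain.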
Re: [Cluster-devel] [PATCH 25/29] xfs: support nowait for xfs_buf_item_init()
On Fri, Aug 25, 2023 at 09:54:27PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> support nowait for xfs_buf_item_init() and error out -EAGAIN to
> _xfs_trans_bjoin() when it would block.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_buf_item.c         |  9 +++--
>  fs/xfs/xfs_buf_item.h         |  2 +-
>  fs/xfs/xfs_buf_item_recover.c |  2 +-
>  fs/xfs/xfs_trans_buf.c        | 16 +---
>  4 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 023d4e0385dd..b1e63137d65b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -827,7 +827,8 @@ xfs_buf_item_free_format(
>  int
>  xfs_buf_item_init(
>  	struct xfs_buf		*bp,
> -	struct xfs_mount	*mp)
> +	struct xfs_mount	*mp,
> +	bool			nowait)
>  {
>  	struct xfs_buf_log_item	*bip = bp->b_log_item;
>  	int			chunks;
> @@ -847,7 +848,11 @@ xfs_buf_item_init(
>  		return 0;
>  	}
> 
> -	bip = kmem_cache_zalloc(xfs_buf_item_cache, GFP_KERNEL | __GFP_NOFAIL);
> +	bip = kmem_cache_zalloc(xfs_buf_item_cache,
> +			GFP_KERNEL | (nowait ? 0 : __GFP_NOFAIL));
> +	if (!bip)
> +		return -EAGAIN;
> +
>  	xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
>  	bip->bli_buf = bp;

I see filesystem shutdowns.

> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 016371f58f26..a1e4f2e8629a 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -57,13 +57,14 @@ xfs_trans_buf_item_match(
>   * If the buffer does not yet have a buf log item associated with it,
>   * then allocate one for it. Then add the buf item to the transaction.
>   */
> -STATIC void
> +STATIC int
>  _xfs_trans_bjoin(
>  	struct xfs_trans	*tp,
>  	struct xfs_buf		*bp,
>  	int			reset_recur)
>  {
>  	struct xfs_buf_log_item	*bip;
> +	int			ret;
> 
>  	ASSERT(bp->b_transp == NULL);
> 
> @@ -72,7 +73,11 @@ _xfs_trans_bjoin(
>  	 * it doesn't have one yet, then allocate one and initialize it.
>  	 * The checks to see if one is there are in xfs_buf_item_init().
>  	 */
> -	xfs_buf_item_init(bp, tp->t_mountp);
> +	ret = xfs_buf_item_init(bp, tp->t_mountp,
> +			tp->t_flags & XFS_TRANS_NOWAIT);
> +	if (ret < 0)
> +		return ret;
> +
>  	bip = bp->b_log_item;
>  	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>  	ASSERT(!(bip->__bli_format.blf_flags & XFS_BLF_CANCEL));
> @@ -92,6 +97,7 @@ _xfs_trans_bjoin(
>  	xfs_trans_add_item(tp, &bip->bli_item);
>  	bp->b_transp = tp;
> 
> +	return 0;
>  }
> 
>  void
> @@ -309,7 +315,11 @@ xfs_trans_read_buf_map(
>  	}
> 
>  	if (tp) {
> -		_xfs_trans_bjoin(tp, bp, 1);
> +		error = _xfs_trans_bjoin(tp, bp, 1);
> +		if (error) {
> +			xfs_buf_relse(bp);
> +			return error;
> +		}
>  		trace_xfs_trans_read_buf(bp->b_log_item);

So what happens at the callers when we have a dirty transaction and
joining a buffer fails with -EAGAIN? Apart from the fact this may
well propagate -EAGAIN up to userspace, cancelling a dirty
transaction at this point will result in a filesystem shutdown.

Indeed, this can happen in the "simple" timestamp update case that
this "nowait" semantic is being aimed at. We log the inode in the
timestamp update, which dirties the log item and registers a
precommit operation to be run. We commit the transaction, which then
runs xfs_inode_item_precommit(), and that may need to attach the
inode to the inode cluster buffer. This results in:

xfs_inode_item_precommit
  xfs_imap_to_bp
    xfs_trans_read_buf_map
      _xfs_trans_bjoin
        xfs_buf_item_init(XFS_TRANS_NOWAIT)
          kmem_cache_zalloc(GFP_NOFS)
            gets -EAGAIN error
        propagates -EAGAIN
      fails due to -EAGAIN

And now xfs_trans_commit() fails with a dirty transaction and the
filesystem shuts down.

IOWs, XFS_TRANS_NOWAIT as it stands is fundamentally broken. Once we
dirty an item in a transaction, we *cannot* back out of the
transaction.
We *must block* in every place that could fail - locking, memory
allocation and/or IO - until the transaction completes, because we
cannot undo the changes we've already made to the dirty items in the
transaction.

It's even worse than that - once we have committed intents, the whole
chain of intent processing must be run to completion. Hence we can't
tolerate backing out of that deferred processing chain half way
through because we might have to block.

Until we can roll back partial dirty transactions and partially
completed deferred intent chains at any random point of completion,
XFS_TRANS_NOWAIT will not work.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 28/29] xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()
On Fri, Aug 25, 2023 at 09:54:30PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> Apply trylock logic for xc_ctx_lock in xlog_cil_commit() in the
> nowait case and error out -EAGAIN for xlog_cil_commit().

Again, fundamentally broken. Any error from xlog_cil_commit() will
result in a filesystem shutdown, as we cannot back out from failure
with dirty log items gracefully at this point.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 26/29] xfs: return -EAGAIN when nowait meets sync in transaction commit
On Fri, Aug 25, 2023 at 09:54:28PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> if the log transaction is a sync one, let's fail the nowait try and
> return -EAGAIN directly since sync transaction means blocked by IO.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_trans.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 7988b4c7f36e..f1f84a3dd456 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -968,12 +968,24 @@ __xfs_trans_commit(
>  	xfs_csn_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
> +	bool			nowait = tp->t_flags & XFS_TRANS_NOWAIT;
> +	bool			perm_log = tp->t_flags & XFS_TRANS_PERM_LOG_RES;
> 
>  	trace_xfs_trans_commit(tp, _RET_IP_);
> 
> +	if (nowait && sync) {
> +		/*
> +		 * Currently nowait is only from xfs_vn_update_time()
> +		 * so perm_log is always false here, but let's make
> +		 * the code general.
> +		 */
> +		if (perm_log)
> +			xfs_defer_cancel(tp);
> +		goto out_unreserve;
> +	}

This is fundamentally broken. We cannot abort a transaction commit
with dirty items at this point without shutting down the filesystem.

This points to XFS_TRANS_NOWAIT being completely broken, too, because
we don't call xfs_trans_set_sync() until just before we commit the
transaction. At this point, it is -too late- for nowait+sync to be
handled gracefully, and it will *always* go bad.

IOWs, the whole transaction "nowait" semantics as designed and
implemented is not a workable solution.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 24/29] xfs: support nowait for xfs_buf_read_map()
On Fri, Aug 25, 2023 at 09:54:26PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> This causes xfstests generic/232 to hang in the umount process,
> waiting for ail push, so I comment it out for now; need some hints
> from xfs folks.
> Not a real patch.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_buf.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index cdad80e1ae25..284962a9f31a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -828,6 +828,13 @@ xfs_buf_read_map(
>  	trace_xfs_buf_read(bp, flags, _RET_IP_);
> 
>  	if (!(bp->b_flags & XBF_DONE)) {
> +//		/*
> +//		 * Let's bypass the _xfs_buf_read() for now
> +//		 */
> +//		if (flags & XBF_NOWAIT) {
> +//			xfs_buf_relse(bp);
> +//			return -EAGAIN;
> +//		}

This is *fundamentally broken*, and apart from anything else it
breaks readahead. If we asked for a read, we cannot instantiate the
buffer and then *not issue any IO on it* and release it. That leaves
an uninitialised buffer in memory, and there's every chance that
something then trips over it and bad things happen.

A buffer like this *must* be errored out and marked stale so that the
next access to it will then re-initialise the buffer state and
trigger any preparatory work that needs to be done for the new
operation.

This comes back to my first comments that XBF_TRYLOCK cannot simply
be replaced with XBF_NOWAIT semantics...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 02/29] xfs: rename XBF_TRYLOCK to XBF_NOWAIT
> @@ -45,7 +45,7 @@ struct xfs_buf;
> 
>  /* flags used only as arguments to access routines */
>  #define XBF_INCORE	(1u << 29)	/* lookup only, return if found in cache */
> -#define XBF_TRYLOCK	(1u << 30)	/* lock requested, but do not wait */
> +#define XBF_NOWAIT	(1u << 30)	/* mem/lock requested, but do not wait */

That's now a really poor comment. It doesn't describe the semantics
or constraints that NOWAIT might imply.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
> 		if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
> +			clear_bit(offset, unit->map);
> +			rcu_read_unlock();
> 			continue;
> 		}
> +		rcu_read_unlock();
> 
> 		/* Call non-slab shrinkers even though kmem is disabled */
> 		if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> 				set_shrinker_bit(memcg, nid, shrinker_id);
> 			}
> 			freed += ret;
> -
> -			if (rwsem_is_contended(&shrinker_rwsem)) {
> -				freed = freed ? : 1;
> -				goto unlock;
> -			}
> +			shrinker_put(shrinker);

Ok, so why is this safe to call without holding the rcu read lock?
The global shrinker has to hold the rcu_read_lock() whilst calling
shrinker_put() to guarantee the validity of the list next pointer,
but we don't hold off RCU here, so what guarantees that a racing
global shrinker walk doesn't trip over this shrinker_put() call
dropping the refcount to zero, with the freeing occurring in a
different context...

> +			/*
> +			 * We have already exited the read-side of rcu critical section
> +			 * before calling do_shrink_slab(), the shrinker_info may be
> +			 * released in expand_one_shrinker_info(), so reacquire the
> +			 * shrinker_info.
> +			 */
> +			index++;
> +			goto again;

With that, what makes the use of shrinker_info in
xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
> 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
> 
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
> 
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.

> +
>  	void *private_data;
> 
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
> 
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
> 
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This needs to be moved to the head of the function, and it needs to
document the whole list walk, get, put and completion parts of the
algorithm that make it safe. There's more to this than "we hold a
reference count", especially the tricky "we might see the shrinker
before it is fully initialised" case.

>  void shrinker_free(struct shrinker *shrinker)
>  {
>  	struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>  	if (!shrinker)
>  		return;
> 
> +	if (shrinker->flags & SHRINKER_REGISTERED) {
> +		shrinker_put(shrinker);
> +		wait_for_completion(&shrinker->done);
> +	}

Needs a comment explaining why we need to wait here...

> +
>  	down_write(&shrinker_rwsem);
>  	if (shrinker->flags & SHRINKER_REGISTERED) {
> -		list_del(&shrinker->list);
> +		/*
> +		 * Lookups on the shrinker are over and will fail in the future,
> +		 * so we can now remove it from the lists and free it.
> +		 */

... rather than here, after the wait has been done and provided the
guarantee that no shrinker is running or will run again...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
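The lifetime rule being reviewed above is: shrinker_try_get() must fail once teardown has begun, and the final shrinker_put() signals a completion so that shrinker_free() knows no list walker still holds the object. A single-threaded userspace model of that refcount protocol, using C11 atomics in place of the kernel's refcount_t and a plain flag in place of the completion (illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct shrinker_model {
	atomic_int refcount;	/* starts at 1: the "registered" reference */
	bool done;		/* stands in for complete(&shrinker->done) */
};

/*
 * inc-not-zero: once the registration reference has been dropped the
 * count is 0 and this fails, so a list walker can never resurrect a
 * shrinker that is being torn down.
 */
static bool shrinker_try_get_model(struct shrinker_model *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* The final put signals the waiter in shrinker_free(). */
static void shrinker_put_model(struct shrinker_model *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->done = true;
}
```

This is the part Dave is asking to have documented in the code: the walker's get/put brackets the only window in which it touches the shrinker, and the free path's put-then-wait guarantees no walker remains before memory is reclaimed.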
Re: [Cluster-devel] [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
>  		}
>  	}
>  }
>  up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> -	int i;
> +	int offset, index = 0;
> 
>  	if (!mem_cgroup_online(memcg))
>  		return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
> 
> -	for_each_set_bit(i, info->map, info->map_nr_max) {
> -		struct shrink_control sc = {
> -			.gfp_mask = gfp_mask,
> -			.nid = nid,
> -			.memcg = memcg,
> -		};
> -		struct shrinker *shrinker;
> +	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
> +		struct shrinker_info_unit *unit;

This adds another layer of indent to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm changes
in this patch much more obvious - this huge hunk of diff is pretty
much impossible to review...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in the shrinker
> subsystem, which protects most operations such as slab shrink,
> registration and unregistration of shrinkers, etc. This can easily
> cause problems in the following cases.
>
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab
> shrink is handled in the subsequent patch.
> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c            | 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together...

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
> 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
> 
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
> 
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>  	void *private_data;
> 
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
> 
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
> 
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This comment really should be at the head of the function, describing
the algorithm used within the function itself, i.e. how reference
counts are used w.r.t. the rcu_read_lock() usage to guarantee
existence of the shrinker and the validity of the list walk.

I'm not going to remember all these little details when I look at
this code in another 6 months time, and having to work it out from
first principles every time I look at the code will waste a lot of
time...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>>  		goto err;
> >>>  	}
> >>> +	zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +	zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +	zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +	zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +	shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this
> >> code all over the place as the new API call would be very similar to the
> >> current shrinker_register() call with static allocation.
> >
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
> 
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6
or 7 parameters with a return value that must be checked and handled.

This was done in the first versions of the patch set - the amount of
code in each caller does not go down and, IMO, it was much harder to
read and determine "this is obviously correct" than what we have now.

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having
to go look up a helper function every time I want to check that some
random shrinker has been set up correctly.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote: > Currently, the shrinker instances can be divided into the following three > types: > > a) global shrinker instance statically defined in the kernel, such as >workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel modules, such >as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. For case b, > the memory of shrinker instance will be freed after synchronize_rcu() when > the module is unloaded. For case c, the memory of shrinker instance will > be freed along with the structure it is embedded in. > > In preparation for implementing lockless slab shrink, we need to > dynamically allocate those shrinker instances in case c, then the memory > can be dynamically freed alone by calling kfree_rcu(). > > So this commit adds the following new APIs for dynamically allocating > shrinker, and add a private_data field to struct shrinker to record and > get the original embedded structure. > > 1. shrinker_alloc() > > Used to allocate shrinker instance itself and related memory, it will > return a pointer to the shrinker instance on success and NULL on failure. > > 2. shrinker_free_non_registered() > > Used to destroy the non-registered shrinker instance. This is a bit nasty > > 3. shrinker_register() > > Used to register the shrinker instance, which is same as the current > register_shrinker_prepared(). > > 4. shrinker_unregister() rename this "shrinker_free()" and key the two different freeing cases on the SHRINKER_REGISTERED bit rather than mostly duplicating the two. 
void shrinker_free(struct shrinker *shrinker)
{
	struct dentry *debugfs_entry = NULL;
	int debugfs_id;

	if (!shrinker)
		return;

	down_write(&shrinker_rwsem);
	if (shrinker->flags & SHRINKER_REGISTERED) {
		list_del(&shrinker->list);
		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
	} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
		kfree_const(shrinker->name);
	}

	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		unregister_memcg_shrinker(shrinker);
	up_write(&shrinker_rwsem);

	if (debugfs_entry)
		shrinker_debugfs_remove(debugfs_entry, debugfs_id);

	kfree(shrinker->nr_deferred);
	kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-- Dave Chinner da...@fromorbit.com
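For reference, a pseudocode sketch of the full lifecycle with that naming - illustrative only, the exact signatures are whatever the patchset settles on, and `my_count`/`my_scan`/`my_ctx` are placeholders:

```c
/*
 * Pseudocode sketch of the proposed dynamic shrinker lifecycle.
 * Not compilable against any particular tree.
 */
struct shrinker *s;

s = shrinker_alloc(0, "my-subsystem");	/* instance + nr_deferred */
if (!s)
	return -ENOMEM;

s->count_objects = my_count;		/* fill in before registering */
s->scan_objects = my_scan;
s->seeks = DEFAULT_SEEKS;
s->private_data = my_ctx;

shrinker_register(s);			/* sets SHRINKER_REGISTERED */

/*
 * Teardown: a single call handles both the registered and the
 * never-registered case, keyed on the SHRINKER_REGISTERED bit.
 */
shrinker_free(s);
```

The asymmetry worth noting: allocation can fail and must be checked, but freeing is unconditional and does the right thing whether or not registration ever happened.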
Re: [Cluster-devel] [PATCH v5 6/8] xfs: switch to multigrain timestamps
On Thu, Jul 13, 2023 at 07:00:55PM -0400, Jeff Layton wrote: > Enable multigrain timestamps, which should ensure that there is an > apparent change to the timestamp whenever it has been written after > being actively observed via getattr. > > Also, anytime the mtime changes, the ctime must also change, and those > are now the only two options for xfs_trans_ichgtime. Have that function > unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not > set. > > Signed-off-by: Jeff Layton > --- > fs/xfs/libxfs/xfs_trans_inode.c | 6 +++--- > fs/xfs/xfs_iops.c | 4 ++-- > fs/xfs/xfs_super.c | 2 +- > 3 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > index 0c9df8df6d4a..86f5ffce2d89 100644 > --- a/fs/xfs/libxfs/xfs_trans_inode.c > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > @@ -62,12 +62,12 @@ xfs_trans_ichgtime( > ASSERT(tp); > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > - tv = current_time(inode); > + /* If the mtime changes, then ctime must also change */ > + WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG)); Make that an ASSERT(flags & XFS_ICHGTIME_CHG), please. There's no need to verify this at runtime on production kernels. -Dave. -- Dave Chinner da...@fromorbit.com
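The distinction matters because the two checks compile differently. A user-space model - not XFS code; `ASSERT` here mimics XFS's DEBUG-only assert and `warn_on_once()` mimics the always-compiled-in `WARN_ON_ONCE()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int warn_count;

/* Models WARN_ON_ONCE(): the test is compiled into every build and
 * evaluated at runtime; it reports at most once. */
static bool warn_on_once(bool cond)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warn_count++;	/* the kernel would dump a backtrace here */
	}
	return cond;
}

/* Models XFS's ASSERT(): real code only in DEBUG builds; production
 * kernels emit no instructions at all for it. */
#ifdef DEBUG
#define ASSERT(expr)	do { if (!(expr)) abort(); } while (0)
#else
#define ASSERT(expr)	((void)0)
#endif

/* A caller in the style of xfs_trans_ichgtime(): the invariant is a
 * programming error, so it only needs checking on debug kernels. */
static int check_flags(int flags, int required)
{
	ASSERT(flags & required);	   /* free on production builds */
	warn_on_once(!(flags & required)); /* always costs a branch */
	return warn_count;
}
```

Since the "mtime implies ctime" rule can only be broken by a coding bug, the debug-only form catches it during development without taxing every production call.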
Re: [Cluster-devel] [PATCH v7 19/20] fs: iomap: use bio_add_folio_nofail where possible
On Thu, Jun 01, 2023 at 06:20:21AM +0200, Christoph Hellwig wrote: > On Thu, Jun 01, 2023 at 08:36:59AM +1000, Dave Chinner wrote: > > We lose adjacent page merging with this change. > > This is only used for adding the first folio to a brand new bio, > so there is nothing to merge with yet at this point. Ah, sorry, missed that. Carry on. :) -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v7 19/20] fs: iomap: use bio_add_folio_nofail where possible
On Wed, May 31, 2023 at 04:50:42AM -0700, Johannes Thumshirn wrote: > When the iomap buffered-io code can't add a folio to a bio, it allocates a > new bio and adds the folio to that one. This is done using bio_add_folio(), > but doesn't check for errors. > > As adding a folio to a newly created bio can't fail, use the newly > introduced bio_add_folio_nofail() function. > > Reviewed-by: Christoph Hellwig > Reviewed-by: Matthew Wilcox (Oracle) > Signed-off-by: Johannes Thumshirn > --- > fs/iomap/buffered-io.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 063133ec77f4..0edab9deae2a 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -312,7 +312,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter > *iter, > ctx->bio->bi_opf |= REQ_RAHEAD; > ctx->bio->bi_iter.bi_sector = sector; > ctx->bio->bi_end_io = iomap_read_end_io; > - bio_add_folio(ctx->bio, folio, plen, poff); > + bio_add_folio_nofail(ctx->bio, folio, plen, poff); > } > > done: > @@ -539,7 +539,7 @@ static int iomap_read_folio_sync(loff_t block_start, > struct folio *folio, > > bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ); > bio.bi_iter.bi_sector = iomap_sector(iomap, block_start); > - bio_add_folio(&bio, folio, plen, poff); > + bio_add_folio_nofail(&bio, folio, plen, poff); > return submit_bio_wait(&bio); > } > > @@ -1582,7 +1582,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, > struct folio *folio, > > if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) { > wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio); > - bio_add_folio(wpc->ioend->io_bio, folio, len, poff); > + bio_add_folio_nofail(wpc->ioend->io_bio, folio, len, poff); > } We lose adjacent page merging with this change. We've had performance regressions in the past that have been attributed to either the page allocator not handing out sequential adjacent pages for sequential writes and/or bios not merging adjacent pages.
Some hardware is much more performant when it only has to do a single large DMA instead of (potentially) hundreds of single page DMAs for the same amount of data... What performance regression testing has been done on this change? -Dave. -- Dave Chinner da...@fromorbit.com
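The merging being lost can be shown with a few lines of user-space C - a simplified model, not the block layer: a "page" here is just an address range, and a "segment" stands in for one DMA descriptor:

```c
#include <assert.h>
#include <stddef.h>

struct seg {
	unsigned long addr;
	unsigned long len;
};

/* Append a page to the segment list, merging with the tail when the
 * new page starts exactly where the previous segment ends - the
 * behaviour bio_add_page()/bio_add_folio() provide via their
 * contiguity check. Returns the new segment count. */
static size_t add_page(struct seg *segs, size_t nsegs,
		       unsigned long addr, unsigned long len)
{
	if (nsegs && segs[nsegs - 1].addr + segs[nsegs - 1].len == addr) {
		segs[nsegs - 1].len += len;	/* merged: one DMA descriptor */
		return nsegs;
	}
	segs[nsegs].addr = addr;		/* discontiguous: new segment */
	segs[nsegs].len = len;
	return nsegs + 1;
}
```

When the allocator hands out physically sequential pages, a sequential write collapses into a handful of large segments; an add path that skips the merge check would instead emit one segment per page, which is exactly the regression pattern the question is probing for.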
Re: [Cluster-devel] [PATCH 16/17] block: use iomap for writes to block devices
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote: > On 4/24/23 07:49, Christoph Hellwig wrote: > > Use iomap in buffer_head compat mode to write to block devices. > > > > Signed-off-by: Christoph Hellwig > > --- > > block/Kconfig | 1 + > > block/fops.c | 33 + > > 2 files changed, 30 insertions(+), 4 deletions(-) > > > > diff --git a/block/Kconfig b/block/Kconfig > > index 941b2dca70db73..672b08f0096ab4 100644 > > --- a/block/Kconfig > > +++ b/block/Kconfig > > @@ -5,6 +5,7 @@ > > menuconfig BLOCK > > bool "Enable the block layer" if EXPERT > > default y > > + select IOMAP > > select SBITMAP > > help > > Provide block layer support for the kernel. > > diff --git a/block/fops.c b/block/fops.c > > index 318247832a7bcf..7910636f8df33b 100644 > > --- a/block/fops.c > > +++ b/block/fops.c > > @@ -15,6 +15,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include "blk.h" > > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, > > struct iov_iter *iter) > > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages)); > > } > > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t > > length, > > + unsigned int flags, struct iomap *iomap, struct iomap *srcmap) > > +{ > > + struct block_device *bdev = I_BDEV(inode); > > + loff_t isize = i_size_read(inode); > > + > > + iomap->bdev = bdev; > > + iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > > + if (WARN_ON_ONCE(iomap->offset >= isize)) > > + return -EIO; > > I'm hitting this during booting: > [5.016324] > [5.030256] iomap_iter+0x11a/0x350 > [5.030264] iomap_readahead+0x1eb/0x2c0 > [5.030272] read_pages+0x5d/0x220 > [5.030279] page_cache_ra_unbounded+0x131/0x180 > [5.030284] filemap_get_pages+0xff/0x5a0 Why is filemap_get_pages() using unbounded readahead? Surely readahead should be limited to reading within EOF > [5.030292] filemap_read+0xca/0x320 > [5.030296] ? aa_file_perm+0x126/0x500 > [5.040216] ? 
touch_atime+0xc8/0x150 > [5.040224] blkdev_read_iter+0xb0/0x150 > [5.040228] vfs_read+0x226/0x2d0 > [5.040234] ksys_read+0xa5/0xe0 > [5.040238] do_syscall_64+0x5b/0x80 > > Maybe we should consider this patch: > > diff --git a/block/fops.c b/block/fops.c > index 524b8a828aad..d202fb663f25 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode, > loff_t offset, loff_t length, > > iomap->bdev = bdev; > iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > - if (WARN_ON_ONCE(iomap->offset >= isize)) > - return -EIO; > - iomap->type = IOMAP_MAPPED; > - iomap->addr = iomap->offset; > + if (WARN_ON_ONCE(iomap->offset >= isize)) { > + iomap->type = IOMAP_HOLE; > + iomap->addr = IOMAP_NULL_ADDR; > + } else { > + iomap->type = IOMAP_MAPPED; > + iomap->addr = iomap->offset; > + } I think Christoph's code is correct. IMO, any attempt to read beyond the end of the device should throw out a warning and return an error, not silently return zeros. If readahead is trying to read beyond the end of the device, then it really seems to me like the problem here is readahead, not the iomap code detecting the OOB read request. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 00/23] fs-verity support for XFS
On Tue, Apr 11, 2023 at 07:33:19PM -0700, Eric Biggers wrote: > On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote: > > Dave is going to hate me for this, but.. > > > > I've been looking over some of the interfaces here, and I'm starting > > to very seriously questioning the design decisions of storing the > > fsverity hashes in xattrs. > > > > Yes, storing them beyond i_size in the file is a bit of a hack, but > > it allows to reuse a lot of the existing infrastructure, and much > > of fsverity is based around it. So storing them in an xattrs causes > > a lot of churn in the interface. And the XFS side with special > > casing xattr indices also seems not exactly nice. > > It seems it's really just the Merkle tree caching interface that is causing > problems, as it's currently too closely tied to the page cache? That is just > an > implementation detail that could be reworked along the lines of what is being > discussed. > > But anyway, it is up to the XFS folks. Keep in mind there is also the option > of > doing what btrfs is doing, where it stores the Merkle tree separately from the > file data stream, but caches it past i_size in the page cache at runtime. Right. It's not entirely simple to store metadata on disk beyond EOF in XFS because of all the assumptions throughout the IO path and allocator interfaces that it can allocate space beyond EOF at will and something else will clean it up later if it is not needed. This impacts on truncate, delayed allocation, writeback, IO completion, EOF block removal on file close, background garbage collection, ENOSPC/EDQUOT driven space freeing, etc. Some of these things cross over into iomap infrastructure, too. AFAIC, it's far more intricate, complex and risky to try to store merkle tree data beyond EOF than it is to put it in an xattr namespace because IO path EOF handling bugs result in user data corruption. 
This happens over and over again, no matter how careful we are about these aspects of user data handling. OTOH, putting the merkle tree data in a different namespace avoids these issues completely. Yes, we now have to solve an API mismatch, but we aren't risking the addition of IO path data corruption bugs to every non-fsverity filesystem in production... Hence I think copying the btrfs approach (i.e. only caching the merkle tree data in the page cache beyond EOF) would be as far as I think we'd want to go. Realistically, there would be little practical difference between btrfs storing the merkle tree blocks in a separate internal btree and XFS storing them in an internal private xattr btree namespace. I would, however, prefer not to have to do this at all if we could simply map the blocks directly out of the xattr buffers as we already do internally for all the XFS code... > I guess there is also the issue of encryption, which hasn't come up yet since > we're talking about fsverity support only. The Merkle tree (including the > fsverity_descriptor) is supposed to be encrypted, just like the file contents > are. Having it be stored after the file contents accomplishes that easily... > Of course, it doesn't have to be that way; a separate key could be derived, or > the Merkle tree blocks could be encrypted with the file contents key using > indices past i_size, without them physically being stored in the data stream. I'm expecting that fscrypt for XFS will include encryption of the xattr names and values (just like we will need to do for directory names) except for the xattrs that hold the encryption keys themselves. That means the merkle tree blocks should get encrypted without any extra work needing to be done anywhere. This will simply require the fscrypt keys to be held in a private internal xattr namespace that isn't encrypted, but that's relatively trivial to do... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 10:54:06PM +, Eric Biggers wrote: > On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote: > > > We could certainly think about moving to a design where fs/verity/ asks > > > the > > > filesystem to just *read* a Merkle tree block, without adding it to a > > > cache, and > > > then fs/verity/ implements the caching itself. That would require some > > > large > > > changes to each filesystem, though, unless we were to double-cache the > > > Merkle > > > tree blocks which would be inefficient. > > > > No, that's unnecessary. > > > > All we need is for fsverity to require filesystems to pass it byte > > addressable data buffers that are externally reference counted. The > > filesystem can take a page reference before mapping the page and > > passing the kaddr to fsverity, then unmap and drop the reference > > when the merkle tree walk is done as per Andrey's new drop callout. > > > > fsverity doesn't need to care what the buffer is made from, how it > > is cached, what its life cycle is, etc. The caching mechanism and > > reference counting is entirely controlled by the filesystem callout > > implementations, and fsverity only needs to deal with memory buffers > > that are guaranteed to live for the entire walk of the merkle > > tree > > Sure. Just a couple notes: > > First, fs/verity/ does still need to be able to tell whether the buffer is > newly > instantiated or not. Boolean flag from the caller. > Second, fs/verity/ uses the ahash API to do the hashing. ahash is a > scatterlist-based API. Virtual addresses can still be used (see > sg_set_buf()), > but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to > work. > Does XFS use vmalloc'ed memory for these buffers? Not vmalloc'ed, but vmapped. We allocate the pages individually, but then call vm_map_ram() to present the higher level code with a single contiguous memory range if it is a multi-page buffer.
We do have the backing info held in the buffer, and that's what we use for IO. If fsverity needs a page based scatter/gather list for hardware offload, it could ask the filesystem to provide it for that given buffer... > BTW, converting fs/verity/ from ahash to shash is an option; I've really never > been a fan of the scatterlist-based crypto APIs! The disadvantage of doing > this, though, would be that it would remove support for all the hardware > crypto > drivers. > > That *might* actually be okay, as that approach to crypto acceleration > has mostly fallen out of favor, in favor of CPU-based acceleration. But I do > worry about e.g. someone coming out of the woodwork and saying they need to > use > fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, > and > they MUST use their crypto accelerator to get acceptable performance. True, but we are very unlikely to be using XFS on such small systems and I don't think we really care about XFS performance on android sized systems, either. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
time the cache is accessed. The problem with what you have now is that > > every > > time a single 32-byte hash is accessed, a full page (potentially 64KB!) > > will be > > allocated and filled. That's not very efficient. The need to allocate the > > temporary page can also cause ENOMEM (which will get reported as EIO). > > > > Did you consider alternatives that would work more efficiently? I think it > > would be worth designing something that works properly with how XFS is > > planned > > to cache the Merkle tree, instead of designing a workaround. > > ->read_merkle_tree_page was not really designed for what you are doing here. > > > > How about replacing ->read_merkle_tree_page with a function that takes in a > > Merkle tree block index (not a page index!) and hands back a (page, offset) > > pair > > that identifies where the Merkle tree block's data is located? Or (folio, > > offset), I suppose. {kaddr, len}, please. > > > > With that, would it be possible to directly return the XFS cache? > > > > - Eric > > > > Yeah, I also don't like it, I didn't want to change fs-verity much > so went with this workaround. But as it's ok, I will look into trying > to pass xfs buffers to fs-verity without direct use of > ->read_merkle_tree_page(). I think it's possible with (folio, > offset), the xfs buffers aren't xattr value align so the 4k merkle > tree block is stored in two pages. I don't think this is necessary to actually merge the code. We want it to work correctly as the primary concern, performance is a secondary concern. Regardless, as you mention, the xfs_buf is not made up of contiguous pages so the merkle tree block data will be split across two (or more) pages. AFAICT, the fsverity code doesn't work with data structures that span multiple disjoint pages... Another problem is that the xfs-buf might be backed by heap memory (e.g. 4kB fs block size on 64kB PAGE_SIZE) and so it cannot be treated like a page cache page by the fsverity merkle tree code. 
We most definitely do not want to be passing pages containing heap memory to functions expecting to be passed lru-resident page cache pages. That said, xfs-bufs do have a stable method of addressing the data in the buffers, and all the XFS code uses this to access and manipulate data directly in the buffers. That is, xfs_buf_offset() returns a mapped kaddr that points to the contiguous memory region containing the metadata in the buffer. If the xfs_buf spans multiple pages, it will return a kaddr pointing into the contiguous vmapped memory address that maps the entire buffer data range. If it is heap memory, it simply returns a pointer into that heap memory. If it's a single page, then it returns the kaddr for the data within the page. If you move all the assumptions about how the merkle tree data is managed out of fsverity and require the filesystems to do the mapping to kaddrs and reference counting to guarantee life times, then the need for multiple different methods for reading merkle tree data goes away... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
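To make that suggestion concrete, here is a pseudocode sketch of what a {kaddr, len} based callout pair could look like. None of these fsverity hooks exist today, and `xfs_attr_get_merkle_block()` is a hypothetical helper; only `xfs_buf_offset()`, `xfs_buf_rele()` and `BBTOB()` are real XFS interfaces:

```c
/*
 * Hypothetical byte-buffer interface: fsverity is handed a mapped
 * address and a length, plus an opaque cookie so the filesystem can
 * drop its reference when the merkle tree walk is done.
 */
struct fsverity_blockbuf {
	void		*kaddr;		/* mapped merkle tree block */
	unsigned int	len;		/* merkle tree block size */
	void		*fs_private;	/* cookie for the drop callout */
};

static int
xfs_read_merkle_tree_block(
	struct inode		*inode,
	u64			blkno,
	struct fsverity_blockbuf *buf)
{
	struct xfs_buf		*bp;
	int			error;

	/* Hypothetical: look up (and read, if not cached) the xattr
	 * buffer holding this merkle tree block, taking a reference
	 * on the xfs_buf. */
	error = xfs_attr_get_merkle_block(inode, blkno, &bp);
	if (error)
		return error;

	/*
	 * xfs_buf_offset() returns a kaddr into the contiguous buffer
	 * memory regardless of whether the buffer is backed by a
	 * single page, vmapped pages or heap memory - no copying.
	 */
	buf->kaddr = xfs_buf_offset(bp, 0);
	buf->len = BBTOB(bp->b_length);
	buf->fs_private = bp;
	return 0;
}

static void
xfs_drop_merkle_tree_block(
	struct fsverity_blockbuf *buf)
{
	xfs_buf_rele(buf->fs_private);	/* reference from the read callout */
}
```

With this shape, fsverity never sees pages at all - the page cache, xfs_buf cache and heap-backed cases all collapse into the one interface.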
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 06:16:00PM +, Eric Biggers wrote: > On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote: > > > The merkle tree pages are dropped after verification. When page is > > > dropped xfs_buf is marked as verified. If fs-verity wants to > > > verify again it will get the same verified buffer. If buffer is > > > evicted it won't have verified state. > > > > > > So, with enough memory pressure buffers will be dropped and need to > > > be reverified. > > > > Please excuse me if this was discussed and rejected long ago, but > > perhaps fsverity should try to hang on to the merkle tree pages that > > this function returns for as long as possible until reclaim comes for > > them? > > > > With the merkle tree page lifetimes extended, you then don't need to > > attach the xfs_buf to page->private, nor does xfs have to extend the > > buffer cache to stash XBF_VERITY_CHECKED. > > Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs) > just cache the Merkle tree pages in the inode's page cache. It's an approach > that I know some people aren't a fan of, but it's efficient and it works. Which puts pages beyond EOF in the page cache. Given that XFS also allows persistent block allocation beyond EOF, having both data in the page cache and blocks beyond EOF that contain unrelated information is a Real Bad Idea. Just because putting metadata in the file data address space works for one filesystem, it doesn't mean it's a good idea or that it works for every filesystem. > We could certainly think about moving to a design where fs/verity/ asks the > filesystem to just *read* a Merkle tree block, without adding it to a cache, > and > then fs/verity/ implements the caching itself. That would require some large > changes to each filesystem, though, unless we were to double-cache the Merkle > tree blocks which would be inefficient. No, that's unnecessary.
All we need is for fsverity to require filesystems to pass it byte addressable data buffers that are externally reference counted. The filesystem can take a page reference before mapping the page and passing the kaddr to fsverity, then unmap and drop the reference when the merkle tree walk is done as per Andrey's new drop callout. fsverity doesn't need to care what the buffer is made from, how it is cached, what its life cycle is, etc. The caching mechanism and reference counting is entirely controlled by the filesystem callout implementations, and fsverity only needs to deal with memory buffers that are guaranteed to live for the entire walk of the merkle tree. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 06:02:47PM +, Eric Biggers wrote: > And I really hope that you don't want to do DIO to the *Merkle tree*, as that Not for XFS - the merkle tree is not held as file data. That said, the merkle tree in XFS is not page cache or block aligned metadata either, so we really want the same memory buffer based interface for the merkle tree reading so that the merkle tree code can read directly from the xfs-buf rather than requiring us to copy it out of the xfsbuf into temporary pages... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote: > On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote: > > On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote: > > > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote: > > > > The direct path is not supported on verity files. Attempts to use direct > > > > I/O path on such files should fall back to buffered I/O path. > > > > > > > > Signed-off-by: Andrey Albershteyn > > > > --- > > > > fs/xfs/xfs_file.c | 14 +++--- > > > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > > > index 947b5c436172..9e072e82f6c1 100644 > > > > --- a/fs/xfs/xfs_file.c > > > > +++ b/fs/xfs/xfs_file.c > > > > @@ -244,7 +244,8 @@ xfs_file_dax_read( > > > > struct kiocb*iocb, > > > > struct iov_iter *to) > > > > { > > > > - struct xfs_inode*ip = > > > > XFS_I(iocb->ki_filp->f_mapping->host); > > > > + struct inode*inode = iocb->ki_filp->f_mapping->host; > > > > + struct xfs_inode*ip = XFS_I(inode); > > > > ssize_t ret = 0; > > > > > > > > trace_xfs_file_dax_read(iocb, to); > > > > @@ -297,10 +298,17 @@ xfs_file_read_iter( > > > > > > > > if (IS_DAX(inode)) > > > > ret = xfs_file_dax_read(iocb, to); > > > > - else if (iocb->ki_flags & IOCB_DIRECT) > > > > + else if (iocb->ki_flags & IOCB_DIRECT && > > > > !fsverity_active(inode)) > > > > ret = xfs_file_dio_read(iocb, to); > > > > - else > > > > + else { > > > > + /* > > > > +* In case fs-verity is enabled, we also fallback to the > > > > +* buffered read from the direct read path. Therefore, > > > > +* IOCB_DIRECT is set and need to be cleared > > > > +*/ > > > > + iocb->ki_flags &= ~IOCB_DIRECT; > > > > ret = xfs_file_buffered_read(iocb, to); > > > > > > XFS doesn't usually allow directio fallback to the pagecache. Why > > > would fsverity be any different? > > > > Didn't know that, this is what happens on ext4 so I did the same. 
> > Then it probably make sense to just error on DIRECT on verity > > sealed file. > > Thinking about this a little more -- I suppose we shouldn't just go > breaking directio reads from a verity file if we can help it. Is there > a way to ask fsverity to perform its validation against some arbitrary > memory buffer that happens to be fs-block aligned? The memory buffer doesn't even need to be fs-block aligned - it just needs to be a pointer to memory the kernel can read... We also need fsverity to be able to handle being passed mapped kernel memory rather than pages/folios for the merkle tree interfaces. That way we can just pass it the mapped buffer memory straight from the xfs-buf and we don't have to do the whacky "copy from xattr xfs_bufs into pages so fsverity can take temporary reference counts on what it thinks are page cache pages" as it walks the merkle tree. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag
On Tue, Apr 04, 2023 at 03:41:23PM -0700, Eric Biggers wrote: > Hi Andrey, > > On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote: > > Add flag to mark inodes which have fs-verity enabled on them (i.e. > > descriptor exist and tree is built). > > > > Signed-off-by: Andrey Albershteyn > > --- > > fs/ioctl.c | 4 > > fs/xfs/libxfs/xfs_format.h | 4 +++- > > fs/xfs/xfs_inode.c | 2 ++ > > fs/xfs/xfs_iops.c | 2 ++ > > include/uapi/linux/fs.h| 1 + > > 5 files changed, 12 insertions(+), 1 deletion(-) > [...] > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > index b7b56871029c..5172a2eb902c 100644 > > --- a/include/uapi/linux/fs.h > > +++ b/include/uapi/linux/fs.h > > @@ -140,6 +140,7 @@ struct fsxattr { > > #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream > > allocator */ > > #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */ > > #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size > > allocator hint */ > > +#define FS_XFLAG_VERITY 0x00020000 /* fs-verity sealed > > inode */ > > #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ > > > > I don't think "xfs: add inode on-disk VERITY flag" is an accurate description > of > a patch that involves adding something to the UAPI. Well it does that, but it also adds the UAPI for querying the on-disk flag via the FS_IOC_FSGETXATTR interface as well. It probably should be split up into two patches. > Should the other filesystems support this new flag too? I think they should get it automatically now that it has been defined for FS_IOC_FSGETXATTR and added to the generic fileattr flag fill functions in fs/ioctl.c. > I'd also like all ways of getting the verity flag to continue to be mentioned > in > Documentation/filesystems/fsverity.rst. The existing methods (FS_IOC_GETFLAGS > and statx) are already mentioned there. *nod* -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 06/23] fsverity: add drop_page() callout
On Tue, Apr 04, 2023 at 04:53:02PM +0200, Andrey Albershteyn wrote: > Allow filesystem to make additional processing on verified pages > instead of just dropping a reference. This will be used by XFS for > internal buffer cache manipulation in further patches. The btrfs, > ext4, and f2fs just drop the reference. > > Signed-off-by: Andrey Albershteyn > --- > fs/btrfs/verity.c | 12 > fs/ext4/verity.c | 6 ++ > fs/f2fs/verity.c | 6 ++ > fs/verity/read_metadata.c | 4 ++-- > fs/verity/verify.c| 6 +++--- > include/linux/fsverity.h | 10 ++ > 6 files changed, 39 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c > index c5ff16f9e9fa..4c2c09204bb4 100644 > --- a/fs/btrfs/verity.c > +++ b/fs/btrfs/verity.c > @@ -804,10 +804,22 @@ static int btrfs_write_merkle_tree_block(struct inode > *inode, const void *buf, > pos, buf, size); > } > > +/* > + * fsverity op that releases the reference obtained by > ->read_merkle_tree_page() > + * > + * @page: reference to the page which can be released > + * > + */ > +static void btrfs_drop_page(struct page *page) > +{ > + put_page(page); > +} > + > const struct fsverity_operations btrfs_verityops = { > .begin_enable_verity = btrfs_begin_enable_verity, > .end_enable_verity = btrfs_end_enable_verity, > .get_verity_descriptor = btrfs_get_verity_descriptor, > .read_merkle_tree_page = btrfs_read_merkle_tree_page, > .write_merkle_tree_block = btrfs_write_merkle_tree_block, > + .drop_page = &btrfs_drop_page, > }; Ok, that's a generic put_page() call.
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c > index f50e3b5b52c9..c2fc4c86af34 100644 > --- a/fs/verity/verify.c > +++ b/fs/verity/verify.c > @@ -210,7 +210,7 @@ verify_data_block(struct inode *inode, struct > fsverity_info *vi, > if (is_hash_block_verified(vi, hpage, hblock_idx)) { > memcpy_from_page(_want_hash, hpage, hoffset, hsize); > want_hash = _want_hash; > - put_page(hpage); > + inode->i_sb->s_vop->drop_page(hpage); > goto descend; fsverity_drop_page(hpage); static inline void fsverity_drop_page(struct inode *inode, struct page *page) { if (inode->i_sb->s_vop->drop_page) inode->i_sb->s_vop->drop_page(page); else put_page(page); } And then you don't need to add the functions to each of the filesystems nor make an indirect call just to run put_page(). Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote: > 2. Do we need to revalidate mappings for directio writes? I think the > answer is no (for xfs) because the ->iomap_begin call will allocate > whatever blocks are needed and truncate/punch/reflink block on the > iolock while the directio writes are pending, so you'll never end up > with a stale mapping. But I don't know if that statement applies > generally... The issue is not truncate/punch/reflink for either DIO or buffered IO - the issue that leads to stale iomaps is async extent state. i.e. IO completion doing unwritten extent conversion. For DIO, AIO doesn't hold the IOLOCK at all when completion is run (like buffered writeback), but non-AIO DIO writes hold the IOLOCK shared while waiting for completion. This means that we can have DIO submission and completion still running concurrently, and so stale iomaps are a definite possibility. From my notes when I looked at this: 1. the race condition for a DIO write mapping to go stale is an overlapping DIO completion and converting the block from unwritten to written, and then the dio write incorrectly issuing sub-block zeroing because the mapping is now stale. 2. DIO read into a hole or unwritten extent zeroes the entire range in the user buffer in one operation. If this is a large range, this could race with small DIO writes within that range that have completed. 3. There is a window between dio write completion doing unwritten extent conversion (by ->end_io) and the page cache being invalidated, providing a window where buffered read maps can be stale and incorrect read behaviour exposed to userspace before the page cache is invalidated. These all stem from IO having overlapping ranges, which is largely unsupported but can't be entirely prevented (e.g. backup applications running in the background). Largely the problems are confined to sub-block IOs. i.e.
when sub-block DIO writes to the same block are being performed, we have the possibility that one write completes whilst the other is deciding what to zero, unaware that the range is now MAPPED rather than UNWRITTEN. We currently avoid issues with sub-block dio writes by using IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the unaligned IO fits entirely within a MAPPED extent so no sub-block zeroing is required. If allocation or sub-block zeroing is required, then we force the filesystem to fall back to exclusive IO locking and wait for all concurrent DIO in flight to complete so that it can't race with any other DIO write that might cause the map to become stale while we are doing the zeroing. This does not avoid potential issues with DIO write vs buffered read, nor DIO write vs mmap IO. It's not totally clear to me whether we need ->iomap_valid checks in the buffered read paths to avoid the completion races with DIO writes, but there are windows there where cached iomaps could be considered stale. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
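The locking decision described above can be sketched as a small predicate. This is a userspace model of the idea, not the real XFS/iomap code; the names and exact conditions are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the IOMAP_DIO_OVERWRITE_ONLY idea: a sub-block (unaligned)
 * DIO write may proceed under shared locking only if it is a pure
 * overwrite of an already-MAPPED extent, so no allocation and no
 * sub-block zeroing is needed. Otherwise the filesystem must fall back
 * to exclusive locking and drain in-flight DIO. Illustrative only. */

enum extent_state { HOLE, UNWRITTEN, MAPPED };

struct extent {
	long long start, len;
	enum extent_state state;
};

static bool dio_can_use_shared_lock(const struct extent *ext,
				    long long pos, long long len,
				    int blocksize)
{
	bool unaligned = (pos % blocksize) || (len % blocksize);

	if (!unaligned)
		return true;	/* block-aligned: no sub-block zeroing races */

	/* Unaligned: safe only if entirely within a MAPPED extent. */
	return ext->state == MAPPED &&
	       pos >= ext->start &&
	       pos + len <= ext->start + ext->len;
}
```

An unaligned write over an UNWRITTEN extent fails the check and takes the slow exclusive path, which is what prevents the "one write completes whilst the other is deciding what to zero" race above.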
Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
On Wed, Jan 11, 2023 at 07:36:26PM +, Matthew Wilcox wrote: > On Tue, Jan 10, 2023 at 07:24:27AM -0800, Christoph Hellwig wrote: > > On Tue, Jan 10, 2023 at 01:34:16PM +, Matthew Wilcox wrote: > > > > Exactly. And as I already pointed out in reply to Dave's original > > > > patch what we really should be doing is returning an ERR_PTR from > > > > __filemap_get_folio instead of reverse-engineering the expected > > > > error code. > > > > > > Ouch, we have a nasty problem. > > > > > > If somebody passes FGP_ENTRY, we can return a shadow entry. And the > > > encodings for shadow entries overlap with the encodings for ERR_PTR, > > > meaning that some shadow entries will look like errors. The way I > > > solved this in the XArray code is by shifting the error values by > > > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example). > > > > > > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS, > > > but so far we haven't, and I'd like to make that decision intentionally. > > > > So what would be an alternative way to tell the callers why no folio > > was found instead of trying to reverse engineer that? Return an errno > > and the folio by reference? The would work, but the calling conventions > > would be awful. > > Agreed. How about an xa_filemap_get_folio()? > > (there are a number of things to fix here; haven't decided if XA_ERROR > should return void *, or whether i should use a separate 'entry' and > 'folio' until I know the entry is actually a folio ...) That's awful. Exposing internal implementation details in the API that is supposed to abstract away the internal implementation details from users doesn't seem like a great idea to me. Exactly what are we trying to fix here? Do we really need to punch a hole through the abstraction layers like this just to remove half a dozen lines of -slow path- context specific error handling from a single caller? 
If there's half a dozen cases that need this sort of handling, then maybe it's the right thing to do. But for a single calling context that only needs to add a null return check in one specific case? There's absolutely no need to make generic infrastructure violate layering abstractions to handle that... -Dave. -- Dave Chinner da...@fromorbit.com
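The encoding clash Willy describes earlier in this thread (ERR_PTR values vs FGP_ENTRY shadow entries) can be demonstrated in a few lines. This is a userspace model mirroring the shape of the kernel's encodings; the real definitions live in `<linux/err.h>` and `<linux/xarray.h>`:

```c
#include <assert.h>
#include <limits.h>

/* Userspace model of the two overlapping encodings. */

#define MAX_ERRNO	4095UL

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* XArray value entries (used for shadow entries): (value << 1) | 1. */
static inline void *xa_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static inline int xa_is_value(const void *entry)
{
	return (unsigned long)entry & 1;
}
```

An odd errno such as `ERR_PTR(-1)` has its low bit set and therefore tests true as a value entry, while a value entry carrying a large payload lands in the top 4095 addresses and tests true as an error pointer. That two-way ambiguity is why Willy's XArray code shifts errors by two bits (`XA_ERROR()`), and why naively returning ERR_PTR from `__filemap_get_folio()` with FGP_ENTRY in play is a problem.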
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote: > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner wrote: > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote: > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio() > > > handler and validating the mapping there. > > > > > > Signed-off-by: Andreas Gruenbacher > > > > I think this is wrong. > > > > The ->iomap_valid() function handles a fundamental architectural > > issue with cached iomaps: the iomap can become stale at any time > > whilst it is in use by the iomap core code. > > > > The current problem it solves in the iomap_write_begin() path has to > > do with writeback and memory reclaim races over unwritten extents, > > but the general case is that we must be able to check the iomap > > at any point in time to assess it's validity. > > > > Indeed, we also have this same "iomap valid check" functionality in the > > writeback code as cached iomaps can become stale due to racing > > writeback, truncated, etc. But you wouldn't know it by looking at the iomap > > writeback code - this is currently hidden by XFS by embedding > > the checks into the iomap writeback ->map_blocks function. > > > > That is, the first thing that xfs_map_blocks() does is check if the > > cached iomap is valid, and if it is valid it returns immediately and > > the iomap writeback code uses it without question. > > > > The reason that this is embedded like this is that the iomap did not > > have a validity cookie field in it, and so the validity information > > was wrapped around the outside of the iomap_writepage_ctx and the > > filesystem has to decode it from that private wrapping structure. > > > > However, the validity information iin the structure wrapper is > > indentical to the iomap validity cookie, > > Then could that part of the xfs code be converted to use > iomap->validity_cookie so that struct iomap_writepage_ctx can be > eliminated? Yes, that is the plan. 
> > > and so the direction I've > > been working towards is to replace this implicit, hidden cached > > iomap validity check with an explicit ->iomap_valid call and then > > only call ->map_blocks if the validity check fails (or is not > > implemented). > > > > I want to use the same code for all the iomap validity checks in all > > the iomap core code - this is an iomap issue, the conditions where > > we need to check for iomap validity are different for depending on > > the iomap context being run, and the checks are not necessarily > > dependent on first having locked a folio. > > > > Yes, the validity cookie needs to be decoded by the filesystem, but > > that does not dictate where the validity checking needs to be done > > by the iomap core. > > > > Hence I think removing ->iomap_valid is a big step backwards for the > > iomap core code - the iomap core needs to be able to formally verify > > the iomap is valid at any point in time, not just at the point in > > time a folio in the page cache has been locked... > > We don't need to validate an iomap "at any time". It's two specific > places in the code in which we need to check, and we're not going to > end up with ten more such places tomorrow. Not immediately, but that doesn't change the fact this is not a filesystem specific issue - it's an inherent characteristic of cached iomaps and unsynchronised extent state changes that occur outside exclusive inode->i_rwsem IO context (e.g. in writeback and IO completion contexts). Racing mmap + buffered writes can expose these state changes as the iomap buffered write IO path is not serialised against the iomap mmap IO path except via folio locks. Hence a mmap page fault can invalidate a cached buffered write iomap by causing a hole -> unwritten, hole -> delalloc or hole -> written conversion in the middle of the buffered write range. The buffered write still has a hole mapping cached for that entire range, and it is now incorrect. 
If the mmap write happens to change extent state at the trailing edge of a partial buffered write, data corruption will occur if we race just right with writeback and memory reclaim. I'm pretty sure that this corruption can be reproduced on gfs2 if we try hard enough - generic/346 triggers the mmap/write race condition; all that is needed from that point is for writeback and reclaiming pages at exactly the right time... > I'd prefer to keep those > filesystem internals in the filesystem specific code instead of > exposing them to the iomap layer. But that's just me ... My point is that there is noth
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote: > Eliminate the ->iomap_valid() handler by switching to a ->get_folio() > handler and validating the mapping there. > > Signed-off-by: Andreas Gruenbacher I think this is wrong. The ->iomap_valid() function handles a fundamental architectural issue with cached iomaps: the iomap can become stale at any time whilst it is in use by the iomap core code. The current problem it solves in the iomap_write_begin() path has to do with writeback and memory reclaim races over unwritten extents, but the general case is that we must be able to check the iomap at any point in time to assess its validity. Indeed, we also have this same "iomap valid check" functionality in the writeback code as cached iomaps can become stale due to racing writeback, truncated, etc. But you wouldn't know it by looking at the iomap writeback code - this is currently hidden by XFS by embedding the checks into the iomap writeback ->map_blocks function. That is, the first thing that xfs_map_blocks() does is check if the cached iomap is valid, and if it is valid it returns immediately and the iomap writeback code uses it without question. The reason that this is embedded like this is that the iomap did not have a validity cookie field in it, and so the validity information was wrapped around the outside of the iomap_writepage_ctx and the filesystem has to decode it from that private wrapping structure. However, the validity information in the structure wrapper is identical to the iomap validity cookie, and so the direction I've been working towards is to replace this implicit, hidden cached iomap validity check with an explicit ->iomap_valid call and then only call ->map_blocks if the validity check fails (or is not implemented). 
I want to use the same code for all the iomap validity checks in all the iomap core code - this is an iomap issue, the conditions where we need to check for iomap validity are different depending on the iomap context being run, and the checks are not necessarily dependent on first having locked a folio. Yes, the validity cookie needs to be decoded by the filesystem, but that does not dictate where the validity checking needs to be done by the iomap core. Hence I think removing ->iomap_valid is a big step backwards for the iomap core code - the iomap core needs to be able to formally verify the iomap is valid at any point in time, not just at the point in time a folio in the page cache has been locked... -Dave. -- Dave Chinner da...@fromorbit.com
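The validity-cookie scheme described above reduces to a sequence-count comparison. Here is a minimal userspace model of the idea; the field and function names are illustrative, not the real iomap/XFS code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model: the filesystem bumps a sequence number on every extent state
 * change; a cached mapping records the sequence at lookup time, and an
 * explicit ->iomap_valid-style check compares the two before the cached
 * mapping is used. All names are illustrative. */

struct inode_seq {
	unsigned int extent_seq;	/* bumped on any extent change */
};

struct cached_iomap {
	long long offset, length;
	unsigned int validity_cookie;	/* extent_seq when mapping was built */
};

static void iomap_begin(const struct inode_seq *ip, struct cached_iomap *map,
			long long off, long long len)
{
	map->offset = off;
	map->length = len;
	map->validity_cookie = ip->extent_seq;	/* sampled under the fs lock */
}

/* Writeback, IO completion or mmap faults change extent state. */
static void extent_state_changed(struct inode_seq *ip)
{
	ip->extent_seq++;
}

static bool iomap_valid(const struct inode_seq *ip,
			const struct cached_iomap *map)
{
	return map->validity_cookie == ip->extent_seq;
}
```

The point of the cookie living in the iomap itself (rather than in a filesystem-private wrapper like `iomap_writepage_ctx`) is that the core can run this check in any context, folio locked or not, and only fall back to `->map_blocks`/`->iomap_begin` when it fails.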
Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote: > Add an iomap_get_folio() helper that gets a folio reference based on > an iomap iterator and an offset into the address space. Use it in > iomap_write_begin(). > > Signed-off-by: Andreas Gruenbacher > Reviewed-by: Darrick J. Wong > Reviewed-by: Christoph Hellwig > --- > fs/iomap/buffered-io.c | 39 ++- > include/linux/iomap.h | 1 + > 2 files changed, 31 insertions(+), 9 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index d4b444e44861..de4a8e5f721a 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, > size_t from, size_t count) > } > EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > > +/** > + * iomap_get_folio - get a folio reference for writing > + * @iter: iteration structure > + * @pos: start offset of write > + * > + * Returns a locked reference to the folio at @pos, or an error pointer if > the > + * folio could not be obtained. > + */ > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) > +{ > + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; > + struct folio *folio; > + > + if (iter->flags & IOMAP_NOWAIT) > + fgp |= FGP_NOWAIT; > + > + folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > + fgp, mapping_gfp_mask(iter->inode->i_mapping)); > + if (folio) > + return folio; > + > + if (iter->flags & IOMAP_NOWAIT) > + return ERR_PTR(-EAGAIN); > + return ERR_PTR(-ENOMEM); > +} > +EXPORT_SYMBOL_GPL(iomap_get_folio); Hmmm. This is where things start to get complex. I have sent a patch to fix a problem with iomap_zero_range() failing to zero cached dirty pages over UNWRITTEN extents, and that requires making FGP_CREAT optional. 
This is an iomap bug, and needs to be fixed in the core iomap code: https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ Essentially, we need to pass fgp flags to iomap_write_begin() so the callers can supply 0 or FGP_CREAT as appropriate. This allows iomap_write_begin() to act only on pre-cached pages rather than always instantiating a new page if one does not exist in cache. This allows iomap_write_begin() to return a NULL folio successfully, and this is perfectly OK for callers that pass in fgp = 0 as they are expected to handle a NULL folio return indicating there was no cached data over the range... Exposing the folio allocation as an external interface makes bug fixes like this rather messy - it's taking a core abstraction (iomap hides all the folio and page cache manipulations from the filesystem) and punching a big hole in it by requiring filesystems to actually allocate page cache folios on behalf of the iomap core. Given that I recently got major push-back for fixing an XFS-only bug by walking the page cache directly instead of abstracting it via the iomap core, punching an even bigger hole in the abstraction layer to fix a GFS2-only problem is just as bad. -Dave. -- Dave Chinner da...@fromorbit.com
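The calling convention argued for here - with FGP_CREAT clear, a NULL return is a *successful* "nothing cached" answer rather than an allocation failure - can be modelled in a few lines. The cache below is a tiny slot array and the flag value is illustrative; this is a sketch of the semantics, not the page cache API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch: caller-supplied fgp flags decide whether a cache miss
 * allocates or simply reports "no cached folio here". Illustrative. */

#define FGP_CREAT	0x1

#define NSLOTS		16

struct folio_model { int dirty; };

struct cache_model {
	struct folio_model *slots[NSLOTS];
};

static struct folio_model *
get_folio(struct cache_model *c, unsigned idx, unsigned fgp)
{
	if (idx >= NSLOTS)
		return NULL;
	if (!c->slots[idx] && (fgp & FGP_CREAT))
		c->slots[idx] = calloc(1, sizeof(struct folio_model));
	return c->slots[idx];	/* NULL: nothing cached, none created */
}
```

A zero-range style caller passes `fgp = 0` and treats NULL as "no cached data over this block, nothing to zero", while the normal write path passes `FGP_CREAT` and treats NULL (or an error) as a real failure - which is why conflating the two returns behind one external helper gets messy.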
Re: [Cluster-devel] [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops
On Fri, Dec 02, 2022 at 02:54:00AM +0100, Andreas Gruenbacher wrote: > On Thu, Dec 1, 2022 at 10:30 PM Dave Chinner wrote: > > On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote: > > > Hi again, > > > > > > [Same thing, but with the patches split correctly this time.] > > > > > > we're seeing a race between journaled data writes and the shrinker on > > > gfs2. What's happening is that gfs2_iomap_page_done() is called after > > > the page has been unlocked, so try_to_free_buffers() can come in and > > > free the buffers while gfs2_iomap_page_done() is trying to add them to > > > the transaction. Not good. > > > > > > This is a proposal to change iomap_page_ops so that page_prepare() > > > prepares the write and grabs the locked page, and page_done() unlocks > > > and puts that page again. While at it, this also converts the hooks > > > from pages to folios. > > > > > > To move the pagecache_isize_extended() call in iomap_write_end() out of > > > the way, a new folio_may_straddle_isize() helper is introduced that > > > takes a locked folio. That is then used when the inode size is updated, > > > before the folio is unlocked. > > > > > > I've also converted the other applicable folio_may_straddle_isize() > > > users, namely generic_write_end(), ext4_write_end(), and > > > ext4_journalled_write_end(). > > > > > > Any thoughts? > > > > I doubt that moving page cache operations from the iomap core to > > filesystem specific callouts will be acceptible. I recently proposed > > patches that added page cache walking to an XFS iomap callout to fix > > a data corruption, but they were NAKd on the basis that iomap is > > supposed to completely abstract away the folio and page cache > > manipulations from the filesystem. > > Right. The resulting code is really quite disgusting, for a > fundamentalist dream of abstraction. > > > This patchset seems to be doing the same thing - moving page cache > > and folio management directly in filesystem specific callouts. 
Hence > > I'm going to assume that the same architectural demarcation is > > going to apply here, too... > > > > FYI, there is already significant change committed to the iomap > > write path in the current XFS tree as a result of the changes I > > mention - there is stale IOMAP detection which adds a new page ops > > method and adds new error paths with a locked folio in > > iomap_write_begin(). > > That would have belonged on the iomap-for-next branch rather than in > the middle of a bunch of xfs commits. Damned if you do, damned if you don't. There were non-trivial cross dependencies between XFS and iomap in that patch set. The initial IOMAP_F_STALE infrastructure needed XFS changes first, otherwise it could deadlock at ENOSPC on write page faults. i.e. the iomap change in isolation broke stuff, so we're forced to either carry XFS changes in iomap or iomap changes in XFS so that there are no regressions in a given tree. Then we had to move XFS functionality to iomap to fix another data corruption that the IOMAP_F_STALE infrastructure exposed in XFS via generic/346. Once the code was moved, then we could build it up into the page cache scanning functionality in iomap. And only then could we add the XFS IOMAP_F_STALE validation to XFS to solve the original data corruption that started all this off. IOWs, there were so many cross dependencies between XFS and iomap that it was largely impossible to break it up into two separate sets of independent patches that didn't cause regressions in one or the other tree. And in the end, we'd still have to merge the iomap tree into XFS or vice versa to actually test that the data corruption fix worked. In situations like this, we commonly take the entire series into one of the two trees rather than make a whole lot more work for ourselves by trying to separate them out. 
And in this case, because it was XFS data corruption and race conditions that needed fixing, it made sense to take it through the XFS tree so that it gets coverage from all the XFS testing that happens - the iomap tree gets a lot less early coverage than the XFS tree... > > And this other data corruption (and performance) fix for handling > > zeroing over unwritten extents properly: > > > > https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ > > > > changes the way folios are looked up and instantiated in the page > > cache in iomap_write_begin(). It also adds new error conditions that > > need to be returned to callers so to implement conditional "folio > > must be present and dirty"
Re: [Cluster-devel] [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops
On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote: > Hi again, > > [Same thing, but with the patches split correctly this time.] > > we're seeing a race between journaled data writes and the shrinker on > gfs2. What's happening is that gfs2_iomap_page_done() is called after > the page has been unlocked, so try_to_free_buffers() can come in and > free the buffers while gfs2_iomap_page_done() is trying to add them to > the transaction. Not good. > > This is a proposal to change iomap_page_ops so that page_prepare() > prepares the write and grabs the locked page, and page_done() unlocks > and puts that page again. While at it, this also converts the hooks > from pages to folios. > > To move the pagecache_isize_extended() call in iomap_write_end() out of > the way, a new folio_may_straddle_isize() helper is introduced that > takes a locked folio. That is then used when the inode size is updated, > before the folio is unlocked. > > I've also converted the other applicable folio_may_straddle_isize() > users, namely generic_write_end(), ext4_write_end(), and > ext4_journalled_write_end(). > > Any thoughts? I doubt that moving page cache operations from the iomap core to filesystem specific callouts will be acceptable. I recently proposed patches that added page cache walking to an XFS iomap callout to fix a data corruption, but they were NAKd on the basis that iomap is supposed to completely abstract away the folio and page cache manipulations from the filesystem. This patchset seems to be doing the same thing - moving page cache and folio management directly into filesystem specific callouts. Hence I'm going to assume that the same architectural demarcation is going to apply here, too... FYI, there is already significant change committed to the iomap write path in the current XFS tree as a result of the changes I mention - there is stale IOMAP detection which adds a new page ops method and adds new error paths with a locked folio in iomap_write_begin(). 
And this other data corruption (and performance) fix for handling zeroing over unwritten extents properly: https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ changes the way folios are looked up and instantiated in the page cache in iomap_write_begin(). It also adds new error conditions that need to be returned to callers so as to implement conditional "folio must be present and dirty" page cache zeroing from iomap_zero_iter(). Those semantics would also have to be supported by gfs2, and that greatly complicates modifying and testing iomap core changes. To avoid all this, can we simply move the ->page_done() callout in the error path and iomap_write_end() to before we unlock the folio? You've already done that for pagecache_isize_extended(), and I can't see anything obvious in the gfs2 ->page_done callout that would cause issues if it is called with a locked dirty folio... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
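The race Andreas reports, and why calling the completion hook before the unlock closes it, can be shown with a single-threaded model. This is a sketch of the ordering argument only; all names are illustrative and the "shrinker" is simulated inline rather than raced from another thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the gfs2 race: if ->page_done runs after folio_unlock(),
 * try_to_free_buffers() can slip in between and free the buffers the
 * hook wants to add to the journal transaction. Illustrative only. */

struct folio_model {
	bool locked;
	bool has_buffers;
};

static bool try_to_free_buffers(struct folio_model *f)
{
	if (f->locked)
		return false;	/* reclaim cannot touch a locked folio */
	f->has_buffers = false;
	return true;
}

/* Returns true if the buffers could be added to the transaction. */
static bool page_done(struct folio_model *f)
{
	return f->has_buffers;
}

static bool write_end(struct folio_model *f, bool done_before_unlock)
{
	bool ok;

	if (done_before_unlock) {
		ok = page_done(f);	/* buffers pinned by the folio lock */
		f->locked = false;
	} else {
		f->locked = false;
		try_to_free_buffers(f);	/* the shrinker wins the race */
		ok = page_done(f);
	}
	return ok;
}
```

With the hook after the unlock there is a window in which reclaim frees the buffers first; running it under the folio lock removes the window entirely, which is the reordering Dave proposes above.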
Re: [Cluster-devel] [syzbot] WARNING in iomap_read_inline_data
> _raw_spin_unlock_irq+0x3c/0x70 kernel/locking/spinlock.c:202 > hardirqs last disabled at (104804): [] el1_dbg+0x24/0x80 > arch/arm64/kernel/entry-common.c:405 > softirqs last enabled at (104756): [] > local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 > softirqs last disabled at (104754): [] > local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 > ---[ end trace ]--- > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > syzbot can test patches for this issue, for details see: > https://goo.gl/tpsmEJ#testing-patches > -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 07:45:01PM -0700, Darrick J. Wong wrote: > On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote: > > On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > > > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > > > - BUG_ON(PageWriteback(page)); > > > > > - if (!clear_page_dirty_for_io(page)) > > > > > + BUG_ON(folio_test_writeback(folio)); > > > > > + if (!folio_clear_dirty_for_io(folio)) > > > > > goto continue_unlock; > > > > > > > > > > trace_wbc_writepage(wbc, > > > > > inode_to_bdi(mapping->host)); > > > > > - error = (*writepage)(page, wbc, data); > > > > > + error = writepage(&folio->page, wbc, data); > > > > Yet, IIUC, this treats all folios as if they are single page folios. > > > > i.e. it passes the head page of a multi-page folio to a callback > > > > that will treat it as a single PAGE_SIZE page, because that's all > > > > the writepage callbacks are currently expected to be passed... > > > > > > > > So won't this break writeback of dirty multipage folios? > > > > > > Yes, it appears it would. But it wouldn't because its already 'broken'. > > > > It is? Then why isn't XFS broken on existing kernels? Oh, we don't > > know because it hasn't been tested? > > > > Seriously - if this really is broken, and this patchset further > > propagating the brokeness, then somebody needs to explain to me why > > this is not corrupting data in XFS. > > It looks like iomap_do_writepage finds the folio size correctly > > end_pos = folio_pos(folio) + folio_size(folio); > > and iomap_writpage_map will map out the correct number of blocks > > unsigned nblocks = i_blocks_per_folio(inode, folio); > > for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) { > > right? Yup, that's how I read it, too. 
But my recent experience with folios involved being repeatedly burnt by edge case corruptions due to multipage folios showing up when and where I least expected them. Hence doing a 1:1 conversion of page based code to folio based code and just assuming large folios will work without any testing seems akin to playing Russian roulette with loose cannons that have been doused with napalm and then set on fire by an air-dropped barrel bomb... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > Converted function to use folios throughout. This is in preparation for > > > the removal of find_get_pages_range_tag(). > > > > > > Signed-off-by: Vishal Moola (Oracle) > > > --- > > > mm/page-writeback.c | 44 +++- > > > 1 file changed, 23 insertions(+), 21 deletions(-) > > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > > index 032a7bf8d259..087165357a5a 100644 > > > --- a/mm/page-writeback.c > > > +++ b/mm/page-writeback.c > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space > > > *mapping, > > > int ret = 0; > > > int done = 0; > > > int error; > > > - struct pagevec pvec; > > > - int nr_pages; > > > + struct folio_batch fbatch; > > > + int nr_folios; > > > pgoff_t index; > > > pgoff_t end;/* Inclusive */ > > > pgoff_t done_index; > > > int range_whole = 0; > > > xa_mark_t tag; > > > > > > - pagevec_init(&pvec); > > > + folio_batch_init(&fbatch); > > > if (wbc->range_cyclic) { > > > index = mapping->writeback_index; /* prev offset */ > > > end = -1; > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space > > > *mapping, > > > while (!done && (index <= end)) { > > > int i; > > > > > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > > > - tag); > > > - if (nr_pages == 0) > > > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > > > + tag, &fbatch); > > > > This can find and return dirty multi-page folios if the filesystem > > enables them in the mapping at instantiation time, right? > > Yup, it will. 
> > > > + > > > + if (nr_folios == 0) > > > break; > > > > > > - for (i = 0; i < nr_pages; i++) { > > > - struct page *page = pvec.pages[i]; > > > + for (i = 0; i < nr_folios; i++) { > > > + struct folio *folio = fbatch.folios[i]; > > > > > > - done_index = page->index; > > > + done_index = folio->index; > > > > > > - lock_page(page); > > > + folio_lock(folio); > > > > > > /* > > >* Page truncated or invalidated. We can freely skip it > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space > > > *mapping, > > >* even if there is now a new, dirty page at the same > > >* pagecache address. > > >*/ > > > - if (unlikely(page->mapping != mapping)) { > > > + if (unlikely(folio->mapping != mapping)) { > > > continue_unlock: > > > - unlock_page(page); > > > + folio_unlock(folio); > > > continue; > > > } > > > > > > - if (!PageDirty(page)) { > > > + if (!folio_test_dirty(folio)) { > > > /* someone wrote it for us */ > > > goto continue_unlock; > > > } > > > > > > - if (PageWriteback(page)) { > > > + if (folio_test_writeback(folio)) { > > > if (wbc->sync_mode != WB_SYNC_NONE) > > > - wait_on_page_writeback(page); > > > + folio_wait_writeback(folio); > > > else > > > goto continue_unlock; > > > } > > > > > > - BUG_ON(PageWriteback(page)); > > > - if (!clear_page_dirty_for_io(page)) > > > + BUG_ON(folio_test_writeback(folio)); > > > + if (!folio_clear_dirty_for_io(folio)) > > > goto continue_unlock; > > > > > &
Re: [Cluster-devel] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 09:38:48AM -0700, Vishal Moola wrote: > On Thu, Nov 3, 2022 at 12:08 AM Dave Chinner wrote: > > > > On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > > > This patch series replaces find_get_pages_range_tag() with > > > filemap_get_folios_tag(). This also allows the removal of multiple > > > calls to compound_head() throughout. > > > It also makes a good chunk of the straightforward conversions to folios, > > > and takes the opportunity to introduce a function that grabs a folio > > > from the pagecache. > > > > > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > > > for now those patches only have the changes necessary for the removal of > > > find_get_pages_range_tag(), and only support folios of size 1 (which is > > > all they use right now anyways). > > > > > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may > > > be > > > beneficial. The page-writeback and filemap changes implicitly work. > > > Testing > > > and review of the other changes (afs, ceph, cifs, gfs2) would be > > > appreciated. > > > > Same question as last time: have you tested this with multipage > > folios enabled? If you haven't tested XFS, then I'm guessing the > > answer is no, and you haven't fixed the bug I pointed out in > > the write_cache_pages() implementation > > > > I haven't tested the series with multipage folios or XFS. > > I don't seem to have gotten your earlier comments, and I > can't seem to find them on the mailing lists. Could you > please send them again so I can take a look? They are in the lore -fsdevel archive - no idea why you couldn't find them https://lore.kernel.org/linux-fsdevel/20221018210152.gh2703...@dread.disaster.area/ https://lore.kernel.org/linux-fsdevel/20221018214544.gi2703...@dread.disaster.area/ -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. The page-writeback and filemap changes implicitly work. Testing > and review of the other changes (afs, ceph, cifs, gfs2) would be appreciated. Same question as last time: have you tested this with multipage folios enabled? If you haven't tested XFS, then I'm guessing the answer is no, and you haven't fixed the bug I pointed out in the write_cache_pages() implementation -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 00/23] Convert to filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:15PM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite alot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. Well, that answers my question about how filesystems that enable multi-page folios were tested: they weren't. I'd suggest that anyone working on further extending the filemap/folio infrastructure really needs to be testing XFS as a first priority, and then other filesystems as a secondary concern. That's because XFS (via the fs/iomap infrastructure) is one of only 3 filesystems in the kernel (AFS and tmpfs are the others) that interact with the page cache and page cache "pages" solely via folio interfaces. As such they are able to support multi-page folios in the page cache. All of the tested filesystems still use the fixed PAGE_SIZE page interfaces to interact with the page cache, so they don't actually exercise interactions with multi-page folios at all. Hence if you are converting generic infrastructure that looks up pages in the page cache to look up folios in the page cache, the code that processes the returned folios also needs to be updated and validated to ensure that it correctly handles multi-page folios. And the only way you can do that fully at this point in time is via testing XFS or AFS... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > Converted function to use folios throughout. This is in preparation for > the removal of find_get_pages_range_tag(). > > Signed-off-by: Vishal Moola (Oracle) > --- > mm/page-writeback.c | 44 +++- > 1 file changed, 23 insertions(+), 21 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 032a7bf8d259..087165357a5a 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping, > int ret = 0; > int done = 0; > int error; > - struct pagevec pvec; > - int nr_pages; > + struct folio_batch fbatch; > + int nr_folios; > pgoff_t index; > pgoff_t end; /* Inclusive */ > pgoff_t done_index; > int range_whole = 0; > xa_mark_t tag; > > - pagevec_init(&pvec); > + folio_batch_init(&fbatch); > if (wbc->range_cyclic) { > index = mapping->writeback_index; /* prev offset */ > end = -1; > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping, > while (!done && (index <= end)) { > int i; > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > - tag); > - if (nr_pages == 0) > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > + tag, &fbatch); This can find and return dirty multi-page folios if the filesystem enables them in the mapping at instantiation time, right? > + > + if (nr_folios == 0) > break; > > - for (i = 0; i < nr_pages; i++) { > - struct page *page = pvec.pages[i]; > + for (i = 0; i < nr_folios; i++) { > + struct folio *folio = fbatch.folios[i]; > > - done_index = page->index; > + done_index = folio->index; > > - lock_page(page); > + folio_lock(folio); > > /* >* Page truncated or invalidated. We can freely skip it > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping, >* even if there is now a new, dirty page at the same >* pagecache address. 
>*/ > - if (unlikely(page->mapping != mapping)) { > + if (unlikely(folio->mapping != mapping)) { > continue_unlock: > - unlock_page(page); > + folio_unlock(folio); > continue; > } > > - if (!PageDirty(page)) { > + if (!folio_test_dirty(folio)) { > /* someone wrote it for us */ > goto continue_unlock; > } > > - if (PageWriteback(page)) { > + if (folio_test_writeback(folio)) { > if (wbc->sync_mode != WB_SYNC_NONE) > - wait_on_page_writeback(page); > + folio_wait_writeback(folio); > else > goto continue_unlock; > } > > - BUG_ON(PageWriteback(page)); > - if (!clear_page_dirty_for_io(page)) > + BUG_ON(folio_test_writeback(folio)); > + if (!folio_clear_dirty_for_io(folio)) > goto continue_unlock; > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host)); > - error = (*writepage)(page, wbc, data); > + error = writepage(&folio->page, wbc, data); Yet, IIUC, this treats all folios as if they are single page folios. i.e. it passes the head page of a multi-page folio to a callback that will treat it as a single PAGE_SIZE page, because that's all the writepage callbacks are currently expected to be passed... So won't this break writeback of dirty multipage folios? -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote: > On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > > Hi Christoph! > > > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > > this series removes iomap_writepage and it's callers, following what xfs > > > has been doing for a long time. > > > > So this effectively means "no writeback from page reclaim for these > > filesystems" AFAICT (page migration of dirty pages seems to be handled by > > iomap_migrate_page()) which is going to make life somewhat harder for > > memory reclaim when memory pressure is high enough that dirty pages are > > reaching end of the LRU list. I don't expect this to be a problem on big > > machines but it could have some undesirable effects for small ones > > (embedded, small VMs). I agree per-page writeback has been a bad idea for > > efficiency reasons for at least last 10-15 years and most filesystems > > stopped dealing with more complex situations (like block allocation) from > > ->writepage() already quite a few years ago without any bug reports AFAIK. > > So it all seems like a sensible idea from FS POV but are MM people on board > > or at least aware of this movement in the fs land? > > I mentioned it during my folio session at LSFMM, but didn't put a huge > emphasis on it. > > For XFS, writeback should already be in progress on other pages if > we're getting to the point of trying to call ->writepage() in vmscan. > Surely this is also true for other filesystems? Yes. It's definitely true for btrfs, too, because btrfs_writepage does: static int btrfs_writepage(struct page *page, struct writeback_control *wbc) { struct inode *inode = page->mapping->host; int ret; if (current->flags & PF_MEMALLOC) { redirty_page_for_writepage(wbc, page); unlock_page(page); return 0; } It also rejects all calls to write dirty pages from memory reclaim contexts. 
ext4 will also reject writepage calls from memory allocation if block allocation is required (due to delayed allocation) or unwritten extents need converting to written. i.e. if it has to run blocking transactions. So all three major filesystems will either partially or wholly reject ->writepage calls from memory reclaim context. IOWs, if memory reclaim is depending on ->writepage() to make reclaim progress, it's not working as advertised on the vast majority of production Linux systems. The reality is that ->writepage is a relic of a bygone era of OS and filesystem design. It was useful in the days where writing a dirty page just involved looking up the bufferhead attached to the page to get the disk mapping and then submitting it for IO. Those days are long gone - filesystems have complex IO submission paths now that have to handle delayed allocation, copy-on-write, unwritten extents, have unbound memory demand, etc. All the filesystems that support these 1990s era filesystem technologies simply turn off ->writepage in memory reclaim contexts. Hence for the vast majority of linux users (i.e. everyone using ext4, btrfs and XFS), ->writepage no longer plays any part in memory reclaim on their systems. So why should we try to maintain the fiction that ->writepage is required functionality in a filesystem when it clearly isn't? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
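The pattern all three filesystems use — detect reclaim context, redirty the page, and punt to the flusher threads — can be sketched in userspace C. This is a stand-in model only: `PF_MEMALLOC` is a real kernel task flag, but the helper and its signature below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the "reject writeback from reclaim" pattern described
 * above (cf. the quoted btrfs_writepage).  The flag value and helper
 * are stand-ins, not the kernel's definitions.
 */
#define PF_MEMALLOC 0x1u

static bool try_writepage(unsigned int task_flags)
{
	if (task_flags & PF_MEMALLOC)
		return false;	/* reclaim context: redirty and punt */
	return true;		/* flusher context: actually write */
}

int main(void)
{
	assert(!try_writepage(PF_MEMALLOC));	/* reclaim: rejected */
	assert(try_writepage(0));		/* flusher: written */
	puts("ok");
	return 0;
}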
Re: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
On Wed, Mar 02, 2022 at 02:03:28PM +0100, Andreas Gruenbacher wrote: > On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana wrote: > > On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote: > > > On Mon, Feb 28, 2022 at 02:32:03PM +, fdman...@kernel.org wrote: > > > > From: Filipe Manana > > > . > > > > > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K > > > > and return 4K (the amount of io done) to iomap_dio_complete_work(); > > > > > > > > 12) iomap_dio_complete_work() calls the iocb completion callback, > > > > iocb->ki_complete() with a second argument value of 4K (total io > > > > done) and the iocb with the adjust ki_pos of X + 4K. This results > > > > in completing the read request for io_uring, leaving it with a > > > > result of 4K bytes read, and only the first page of the buffer > > > > filled in, while the remaining 3 pages, corresponding to the other > > > > 3 extents, were not filled; > > > > > > > > 13) For the application, the result is unexpected because if we ask > > > > to read N bytes, it expects to get N bytes read as long as those > > > > N bytes don't cross the EOF (i_size). > > > > > > Yeah, that's exactly the sort of problem we were having with XFS > > > with partial DIO completions due to needing multiple iomap iteration > > > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the > > > above range check and aborts before we start... > > > > Interesting. > > Dave, this seems to affect all users of iomap_dio_rw in the same way, > so would it make sense to move this check there? Perhaps, but I'm not sure it makes sense because filesystems need to abort ->iomap_begin with -EAGAIN in IOMAP_NOWAIT contexts before they make any changes. Hence detecting short extents in the generic code becomes ... difficult because we might now need to undo changes that have been made in ->iomap_begin. e.g. 
if the filesystem allocates space and the iomap core says "not long enough" because IOMAP_NOWAIT is set, then we may have to punch out that allocation in ->iomap_end before returning -EAGAIN. That means filesystems like XFS may now need to supply a ->iomap_end function to undo stuff the core decides it shouldn't have done, instead of the filesystem ensuring it never does the operation in the first place... IOWs, the correct behaviour here is for the filesystem ->iomap_begin method to see that it needs to allocate and return -EAGAIN if IOMAP_NOWAIT is set, not do the allocation and hope that it ends up being long enough to cover the entire IO we have to do. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
fragile, IMO. I'd much prefer that *_iomap_begin_write() implementations simply follow IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple mappings is punted to a context that can block... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 23, 2022 at 10:50:09PM -0500, Theodore Ts'o wrote: > On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote: > > > Fair enough; on the other hand, we could also view this as making ext4 > > > more robust against buggy code in other subsystems, and while other > > > file systems may be losing user data if they are actually trying to do > > > remote memory access to file-backed memory, apparently other file > > > systems aren't noticing and so they're not crashing. > > > > Oh, we've noticed them, no question about that. We've got bug > > reports going back years for systems being crashed, triggering BUGs > > and/or corrupting data on both XFS and ext4 filesystems due to users > > trying to run RDMA applications with file backed pages. > > Is this issue causing XFS to crash? I didn't know that. I have no idea if it crashes nowadays - go back a few years and search for XFS BUGging out in ->invalidate_page (or was it ->release_page?) because of unexpected dirty pages. I think it could also trigger BUGs in writeback when ->writepages tripped over a dirty page without a delayed allocation mapping over the hole... We were pretty aggressive about telling people reporting such issues that they get to keep all the borken bits to themselves and to stop wasting our time with unsolvable problems caused by their broken-by-design RDMA applications. Hence people have largely stopped bothering us with random filesystem crashes on systems using RDMA on file-backed pages... > I tried the Syzbot reproducer with XFS mounted, and it didn't trigger > any crashes. I'm sure data was getting corrupted, but I figured I > should bring ext4 to the XFS level of "at least we're not reliably > killing the kernel". Oh, well, good to know XFS didn't die a horrible death immediately. Thanks for checking, Ted. > On ext4, an unprivileged process can use process_vm_writev(2) to crash > the system. 
I don't know how quickly we can get a fix into mm/gup.c, > but if some other kernel path tries calling set_page_dirty() on a > file-backed page without first asking permission from the file system, > it seems to be nice if the file system doesn't BUG() --- as near as I > can tell, xfs isn't crashing in this case, but ext4 is. iomap is probably refusing to map holes for writepage - we've cleaned up most of the weird edge cases to return errors, so I'm guessing iomap is just ignoring such pages these days. Yeah, see iomap_writepage_map(): error = wpc->ops->map_blocks(wpc, inode, pos); if (error) break; if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE)) continue; if (wpc->iomap.type == IOMAP_HOLE) continue; Yeah, so if writeback maps a hole rather than converts a delalloc region to IOMAP_MAPPED, it'll just skip over the block/page. IIRC, they essentially become uncleanable pages, and I think eventually inode reclaim will just toss them out of memory. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 23, 2022 at 06:35:54PM -0500, Theodore Ts'o wrote: > On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote: > > > The challenge is that fixing this "the right away" is probably not > > > something we can backport into an LTS kernel, whether it's 5.15 or > > > 5.10... or 4.19. > > > > Don't worry about stable backports to start with. Do it the "right way" > > first and then we can consider if it needs to be backported or not. > > Fair enough; on the other hand, we could also view this as making ext4 > more robust against buggy code in other subsystems, and while other > file systems may be losing user data if they are actually trying to do > remote memory access to file-backed memory, apparently other file > systems aren't noticing and so they're not crashing. Oh, we've noticed them, no question about that. We've got bug reports going back years for systems being crashed, triggering BUGs and/or corrupting data on both XFS and ext4 filesystems due to users trying to run RDMA applications with file backed pages. Most of the people doing this now know that we won't support such applications until the RDMA stack/hardware can trigger on-demand write page faults the same way CPUs do when they first write to a clean page. They don't have this, so mostly these people don't bother reporting these class of problems to us anymore. The gup/RDMA infrastructure to make this all work is slowly moving forwards, but it's not here yet. > Issuing a > warning and then not crashing is arguably a better way for ext4 to > react, especially if there are other parts of the kernel that are > randomly calling set_page_dirty() on file-backed memory without > properly first informing the file system in a context where it can > block and potentially do I/O to do things like allocate blocks. 
I'm not sure that replacing the BUG() with a warning is good enough - it's still indicative of an application doing something dangerous that could result in silent data corruption and/or other problems. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 11/30] iomap: add the new iomap_iter model
On Wed, Aug 11, 2021 at 12:17:20PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > [djwong: add to apply.c to preserve git history of iomap loop control] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong Looks like a straight translation of Christoph's original. Seems fine to me as a simple step towards preserving the git history. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 31/30] iomap: move iomap iteration code to iter.c
On Wed, Aug 11, 2021 at 12:19:26PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong > > Now that we've moved iomap to the iterator model, rename this file to be > in sync with the functions contained inside of it. > > Signed-off-by: Darrick J. Wong > --- > fs/iomap/Makefile |2 +- > fs/iomap/iter.c |0 > 2 files changed, 1 insertion(+), 1 deletion(-) > rename fs/iomap/{apply.c => iter.c} (100%) > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index e46f936dde81..bb64215ae256 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -26,9 +26,9 @@ ccflags-y += -I $(srctree)/$(src) # needed for > trace events > obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > -apply.o \ > buffered-io.o \ > direct-io.o \ > fiemap.o \ > +iter.o \ > seek.o > iomap-$(CONFIG_SWAP) += swapfile.o > diff --git a/fs/iomap/apply.c b/fs/iomap/iter.c > similarity index 100% > rename from fs/iomap/apply.c > rename to fs/iomap/iter.c LGTM, Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 24/30] iomap: remove iomap_apply
On Wed, Aug 11, 2021 at 12:18:26PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > iomap_apply is unused now, so remove it. > > Signed-off-by: Christoph Hellwig > [djwong: rebase this patch to preserve git history of iomap loop control] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong > --- > fs/iomap/apply.c | 91 > - > fs/iomap/trace.h | 40 -- > include/linux/iomap.h | 10 - > 3 files changed, 141 deletions(-) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 19/30] iomap: switch iomap_bmap to use iomap_iter
On Wed, Aug 11, 2021 at 12:18:00PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > Rewrite the ->bmap implementation based on iomap_iter. > > Signed-off-by: Christoph Hellwig > [djwong: restructure the loop to make its behavior a little clearer] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong > --- > fs/iomap/fiemap.c | 31 +-- > 1 file changed, 13 insertions(+), 18 deletions(-) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model
On Mon, Aug 09, 2021 at 08:12:25AM +0200, Christoph Hellwig wrote: > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > --- > fs/iomap/Makefile | 1 + > fs/iomap/core.c | 79 +++ > fs/iomap/trace.h | 37 +++- > include/linux/iomap.h | 56 ++ > 4 files changed, 172 insertions(+), 1 deletion(-) > create mode 100644 fs/iomap/core.c > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index eef2722d93a183..6b56b10ded347a 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > apply.o \ > +core.o \ This creates a discontinuity in the iomap git history. Can you add these new functions to iomap/apply.c, then when the old apply code is removed later in the series rename the file to core.c? At least that way 'git log --follow fs/iomap/core.c' will walk back into the current history of fs/iomap/apply.c and the older pre-disaggregation fs/iomap.c without having to take the tree back in time to find those files... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 08/27] iomap: add the new iomap_iter model
On Mon, Jul 19, 2021 at 12:35:01PM +0200, Christoph Hellwig wrote: > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > --- > fs/iomap/Makefile | 1 + > fs/iomap/iter.c | 74 +++ > fs/iomap/trace.h | 37 +- > include/linux/iomap.h | 56 > 4 files changed, 167 insertions(+), 1 deletion(-) > create mode 100644 fs/iomap/iter.c > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index eef2722d93a183..85034deb5a2f19 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > apply.o \ > +iter.o \ Can we break this cycle of creating new files and removing old files when changing the iomap core code? It breaks the ability to troll git history easily through git blame and other techniques that are file based. If we are going to create a new file, then the core iomap code that everything depends on should just be in a neutrally named file such as "iomap.c" so that we don't need to play these games in future. > +/** > + * iomap_iter - iterate over a ranges in a file > + * @iter: iteration structue > + * @ops: iomap ops provided by the file system > + * > + * Iterate over file system provided contiguous ranges of blocks with the > same > + * state. Should be called in a loop that continues as long as this function > + * returns a positive value. 
If 0 or a negative value is returned the caller > + * should break out of the loop - a negative value is an error either from > the > + * file system or from the last iteration stored in @iter.copied. > + */ > +int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops) > +{ We should avoid namespace conflicts where function names shadow object types. iomap_iterate() is fine as the function name - there's no need for abbreviation here because it's not an overly long name. This will make it clearly different to the struct iomap_iter that is passed to it and it will also make grep, cscope and other code searching tools much more precise... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 20/27] fsdax: switch dax_iomap_rw to use iomap_iter
On Mon, Jul 19, 2021 at 12:35:13PM +0200, Christoph Hellwig wrote: > Switch the dax_iomap_rw implementation to use iomap_iter. > > Signed-off-by: Christoph Hellwig > --- > fs/dax.c | 49 - > 1 file changed, 24 insertions(+), 25 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 4d63040fd71f56..51da45301350a6 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1103,20 +1103,21 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct > iomap *iomap) > return size; > } > > -static loff_t > -dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > - struct iomap *iomap, struct iomap *srcmap) > +static loff_t dax_iomap_iter(const struct iomap_iter *iomi, > + struct iov_iter *iter) At first I wondered "iomi? Strange name, why is this one-off name used?" and then I realised it's because this function also takes a struct iov_iter named "iter". That's going to cause confusion in the long run - iov_iter and iomap_iter both being generally named "iter", and then one or the other randomly changing when both are used in the same function. Would it be better to avoid any possible confusion simply by using "iomi" for all iomap_iter variables throughout the patchset from the start? That way nobody is going to confuse iov_iter with iomap_iter iteration variables and code that uses both types will naturally have different, well known names... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: iomap write invalidation
On Mon, Jul 13, 2020 at 09:46:31AM +0200, Christoph Hellwig wrote: > Hi all, > > this series has two parts: the first one picks up Dave's patch to avoid > invalidation entierly for reads, picked up deep down from the btrfs iomap > thread. The second one falls back to buffered writes if invalidation fails > instead of leaving a stale cache around. Let me know what you think about > this approch. Either we maintain application level concurrency for direct IOs and ignore the stale data in the page cache, or we kill application IO concurrency and keep the page cache coherent. It's a lose-lose choice and I'm on the fence as to which is the lesser of two evils. The main factor is whether the buffered IO fallback can be diagnosed. There's a new tracepoint for that case, so at least we will be able to tell if the fallback co-incides with application performance cratering. Hopefully this will only be a rare event. So, to hoist myself on my own petard: correctness first, performance second. Acked-by: Dave Chinner Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Thu, Jul 09, 2020 at 10:10:38AM -0700, Darrick J. Wong wrote: > On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote: > > On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote: > > > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote: > > > > -*/ > > > > - ret = invalidate_inode_pages2_range(mapping, > > > > - pos >> PAGE_SHIFT, end >> PAGE_SHIFT); > > > > - if (ret) > > > > - dio_warn_stale_pagecache(iocb->ki_filp); > > > > - ret = 0; > > > > + if (iov_iter_rw(iter) == WRITE) { > > > > + /* > > > > +* Try to invalidate cache pages for the range we're > > > > direct > > > > +* writing. If this invalidation fails, tough, the > > > > write will > > > > +* still work, but racing two incompatible write paths > > > > is a > > > > +* pretty crazy thing to do, so we don't support it > > > > 100%. > > > > +*/ > > > > + ret = invalidate_inode_pages2_range(mapping, > > > > + pos >> PAGE_SHIFT, end >> PAGE_SHIFT); > > > > + if (ret) > > > > + dio_warn_stale_pagecache(iocb->ki_filp); > > > > + ret = 0; > > > > > > > > - if (iov_iter_rw(iter) == WRITE && !wait_for_completion && > > > > - !inode->i_sb->s_dio_done_wq) { > > > > - ret = sb_init_dio_done_wq(inode->i_sb); > > > > - if (ret < 0) > > > > - goto out_free_dio; > > > > + if (!wait_for_completion && > > > > + !inode->i_sb->s_dio_done_wq) { > > > > + ret = sb_init_dio_done_wq(inode->i_sb); > > > > + if (ret < 0) > > > > + goto out_free_dio; > > ...and yes I did add in the closing brace here. :P Doh! I forgot to refresh the patch after fixing that. :/ Thanks! Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote: > On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote: > > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote: > > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote: > > > > Indeed, I'm in favour of not invalidating > > > > the page cache at all for direct I/O. For reads, I think the page cache > > > > should be used to satisfy any portion of the read which is currently > > > > cached. For writes, I think we should write into the page cache pages > > > > which currently exist, and then force those pages to be written back, > > > > but left in cache. > > > > > > Something like that, yes. > > > > So are we really willing to take the performance regression that > > occurs from copying out of the page cache consuming lots more CPU > > than an actual direct IO read? Or that direct IO writes suddenly > > serialise because there are page cache pages and now we have to do > > buffered IO? > > > > Direct IO should be a deterministic, zero-copy IO path to/from > > storage. Using the CPU to copy data during direct IO is the complete > > opposite of the intended functionality, not to mention the behaviour > > that many applications have been carefully designed and tuned for. > > Direct I/O isn't deterministic though. When all the application is doing is direct IO, it is as deterministic as the underlying storage behaviour. This is the best we can possibly do from the perspective of the filesystem, and this is largely what Direct IO requires from the filesystem. Direct IO starts by delegating all responsibility for IO synchronisation, data coherency and integrity to userspace, and then follows up by requiring the filesystem and kernel to stay out of the IO path except where it is absolutely necessary to read or write data to/from the underlying storage hardware. Serving Direct IO from the page cache violates the second of these requirements. 
> If the file isn't shared, then > it works great, but as soon as you get mixed buffered and direct I/O, > everything is already terrible. Right, but that's *the rare exception* for applications using direct IO, not the common fast path. It is the slow path where -shit has already gone wrong on the production machine-, and it most definitely does not change the DIO requirements that the filesystem needs to provide userspace via the direct IO path. Optimising the slow exception path is fine if it does not affect the guarantees we try to provide through the common/fast path. If it does affect behaviour of the fast path, then we must be able to either turn it off or provide our own alternative implementation that does not have that cost. > Direct I/Os perform pagecache lookups > already, but instead of using the data that we found in the cache, we > (if it's dirty) write it back, wait for the write to complete, remove > the page from the pagecache and then perform another I/O to get the data > that we just wrote out! And then the app that's using buffered I/O has > to read it back in again. Yup, that's because we have a history going back 20+ years of people finding performance regressions in applications using direct IO when we incorrectly leave pages in the page cache rather than invalidating them and continuing to do direct IO. > Nobody's proposing changing Direct I/O to exclusively work through the > pagecache. The proposal is to behave less weirdly when there's already > data in the pagecache. No, the proposal is to make direct IO behave *less deterministically* if there is data in the page cache. e.g. Instead of having a predictable submission CPU overhead and read latency of 100us for your data, this proposal makes the claim that it is always better to burn 10x the IO submission CPU for a single IO to copy the data and give that specific IO 10x lower latency than it is to submit 10 async IOs to keep the IO pipeline full. 
What it fails to take into account is that in spending that CPU time to copy the data, we haven't submitted 10 other IOs and so the actual in-flight IO for the application has decreased. If performance comes from keeping the IO pipeline as close to 100% full as possible, then copying the data out of the page cache will cause performance regressions. i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue depth craters because we've only fulfilled 5 complete IOs instead of submitting 50 entire IOs. This is the hidden cost of synchronous IO via CPU data copying vs async IO via hardware offload, and if we take that into account we must look at future hardware performance trends to determine if this cost is going to increase or decrease in future. That is: CPUs are not getting faster anytime soon. IO subsyst
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote: > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote: > > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote: > > > On 9:53 01/07, Christoph Hellwig wrote: > > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote: > > > > > From: Goldwyn Rodrigues > > > > > > > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to > > > > > indicate > > > > > that if the page invalidation fails, return back control to the > > > > > filesystem so it may fallback to buffered mode. > > > > > > > > > > Reviewed-by: Darrick J. Wong > > > > > Signed-off-by: Goldwyn Rodrigues > > > > > > > > I'd like to start a discussion of this shouldn't really be the > > > > default behavior. If we have page cache that can't be invalidated it > > > > actually makes a whole lot of sense to not do direct I/O, avoid the > > > > warnings, etc. > > > > > > > > Adding all the relevant lists. > > > > > > Since no one responded so far, let me see if I can stir the cauldron :) > > > > > > What error should be returned in case of such an error? I think the > > > > Christoph's message is ambiguous. I don't know if he means "fail the > > I/O with an error" or "satisfy the I/O through the page cache". I'm > > strongly in favour of the latter. > > Same here. Sorry if my previous mail was unclear. > > > Indeed, I'm in favour of not invalidating > > the page cache at all for direct I/O. For reads, I think the page cache > > should be used to satisfy any portion of the read which is currently > > cached. For writes, I think we should write into the page cache pages > > which currently exist, and then force those pages to be written back, > > but left in cache. > > Something like that, yes. So are we really willing to take the performance regression that occurs from copying out of the page cache consuming lots more CPU than an actual direct IO read? 
Or that direct IO writes suddenly serialise because there are page cache pages and now we have to do buffered IO? Direct IO should be a deterministic, zero-copy IO path to/from storage. Using the CPU to copy data during direct IO is the complete opposite of the intended functionality, not to mention the behaviour that many applications have been carefully designed and tuned for. Hence I think that forcing iomap to use cached pages for DIO is a non-starter. I have no problems with providing infrastructure that allows filesystems to -opt in- to using buffered IO for the direct IO path. However, the change in IO behaviour caused by unpredictably switching between direct IO and buffered IO (e.g. suddenly DIO writes serialise -all IO-) will cause unacceptable performance regressions for many applications and be -very difficult to diagnose- in production systems. IOWs, we need to let the individual filesystems decide how they want to use the page cache for direct IO. Just because we have new direct IO infrastructure (i.e. iomap) it does not mean we can just make wholesale changes to the direct IO path behaviour... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t > > > pos, loff_t length, > > > } > > > ret = iomap_readpage_actor(inode, pos + done, length - done, > > > ctx, iomap, srcmap); > > > + if (WARN_ON(ret == 0)) > > > + break; > > This error case now leaks ctx->cur_page > Yes ... and I see the consequence. I mean, this is a "shouldn't happen", > so do we want to put effort into cleanup here ... Well, the normal thing for XFS is that a production kernel cleans up and handles the error gracefully with a WARN_ON_ONCE, while a debug kernel build will chuck a tanty and burn the house down so as to make the developers aware that there is a "should not happen" situation occurring. > > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, > > > struct list_head *pages, > > > done: > > > if (ctx.bio) > > > submit_bio(ctx.bio); > > > - if (ctx.cur_page) { > > > - if (!ctx.cur_page_in_bio) > > > - unlock_page(ctx.cur_page); > > > - put_page(ctx.cur_page); > > > - } > > > + BUG_ON(ctx.cur_page); > > And so will now trigger both a warn and a bug > ... or do we just want to run slap bang into this bug? > > Option 1: Remove the check for 'ret == 0' altogether, as we had it before. > That puts us into endless loop territory for a failure mode, and it's not > parallel with iomap_readpage(). > > Option 2: Remove the WARN_ON from the check. Then we just hit the BUG_ON, > but we don't know why we did it. > > Option 3: Set cur_page to NULL. We'll hit the WARN_ON, avoid the BUG_ON, > might end up with a page in the page cache which is never unlocked. None of these are appealing. > Option 4: Do the unlock/put page dance before setting the cur_page to NULL. > We might double-unlock the page. why would we double unlock the page? 
Oh, the readahead cursor doesn't handle the case of partial page submission, which would result in IO completion unlocking the page. Ok, that's what the ctx.cur_page_in_bio check is used to detect i.e. if we've got a page that the readahead cursor points at, and we haven't actually added it to a bio, then we can leave it to the read_pages() to unlock and clean up. If it's in a bio, then IO completion will unlock it and so we only have to drop the submission reference and move the readahead cursor forwards so read_pages() doesn't try to unlock this page. i.e:

	/* clean up partial page submission failures */
	if (ctx.cur_page && ctx.cur_page_in_bio) {
		put_page(ctx.cur_page);
		readahead_next(rac);
	}

looks to me like it will handle the case of "ret == 0" in the actor function just fine. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote: > > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > > > Latest version in your git tree: > > > > > > > > > > $ ▶ glo -n 5 willy/readahead > > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > > > $ > > > > > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > > > > > [2.423116] [ cut here ] > > > > > [2.424957] list_add double add: new=ea000efff4c8, > > > > > prev=8883bfffee60, next=ea000efff4c8. > > > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > > > __list_add_valid+0x67/0x70 > > > > > [2.457484] Call Trace: > > > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > > > [2.462787] lru_add_drain+0x18/0x20 > > > > > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > > > > > Yes, because it's the only version I've actually merged into my > > > working tree, compiled and tried to run. :P > > > > > > > version pushed to that git tree that did contain a list double-add > > > > (due to a mismerge when shuffling patches). I noticed it and fixed > > > > it, and 4be497096c04 doesn't have that problem. 
I also test with > > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > > > probabilistic because it'll depend on the timing between whatever other > > > > list is being used and the page actually being added to the LRU. > > > > > > I'll see if I can reproduce it. > > > > Just updated to a current TOT Linus kernel and your latest branch, > > and so far this is 100% reproducable. > > > > Not sure how I'm going to debug it yet, because it's init that is > > triggering it > > Eric found it ... Yeah, just saw that and am applying his patch to test it... > still not sure why I don't see it. No readahead configured on your device? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > Latest version in your git tree: > > > > > > $ ▶ glo -n 5 willy/readahead > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > $ > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > [2.423116] [ cut here ] > > > [2.424957] list_add double add: new=ea000efff4c8, > > > prev=8883bfffee60, next=ea000efff4c8. > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > __list_add_valid+0x67/0x70 > > > [2.457484] Call Trace: > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > [2.462787] lru_add_drain+0x18/0x20 > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > Yes, because it's the only version I've actually merged into my > working tree, compiled and tried to run. :P > > > version pushed to that git tree that did contain a list double-add > > (due to a mismerge when shuffling patches). I noticed it and fixed > > it, and 4be497096c04 doesn't have that problem. I also test with > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > probabilistic because it'll depend on the timing between whatever other > > list is being used and the page actually being added to the LRU. > > I'll see if I can reproduce it. Just updated to a current TOT Linus kernel and your latest branch, and so far this is 100% reproducable. 
Not sure how I'm going to debug it yet, because it's init that is triggering it Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 18/19] iomap: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in iomap. Convert XFS and ZoneFS to > use it. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 91 +++--- > fs/iomap/trace.h | 2 +- > fs/xfs/xfs_aops.c | 13 +++--- > fs/zonefs/super.c | 7 ++-- > include/linux/iomap.h | 3 +- > 5 files changed, 42 insertions(+), 74 deletions(-) All pretty straight forward... > @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > break; > done += ret; > if (offset_in_page(pos + done) == 0) { > + readahead_next(ctx->rac); > if (!ctx->cur_page_in_bio) > unlock_page(ctx->cur_page); > put_page(ctx->cur_page); Though now I look at the addition here, this might be better restructured to mention how we handle partial page submission, such as:

		done += ret;
		/*
		 * Keep working on a partially complete page, otherwise ready
		 * the ctx for the next page to be acted on.
		 */
		if (offset_in_page(pos + done))
			continue;
		if (!ctx->cur_page_in_bio)
			unlock_page(ctx->cur_page);
		put_page(ctx->cur_page);
		ctx->cur_page = NULL;
		readahead_next(ctx->rac);
	}

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > By putting the 'have we reached the end of the page' condition at the end > of the loop instead of the beginning, we can remove the 'submit the last > page' code from iomap_readpages(). Also check that iomap_readpage_actor() > didn't return 0, which would lead to an endless loop. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 25 - > 1 file changed, 12 insertions(+), 13 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index cb3511eb152a..44303f370b2d 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > void *data, struct iomap *iomap, struct iomap *srcmap) > { > struct iomap_readpage_ctx *ctx = data; > - loff_t done, ret; > + loff_t ret, done = 0; > > - for (done = 0; done < length; done += ret) { > - if (ctx->cur_page && offset_in_page(pos + done) == 0) { > - if (!ctx->cur_page_in_bio) > - unlock_page(ctx->cur_page); > - put_page(ctx->cur_page); > - ctx->cur_page = NULL; > - } > + while (done < length) { > if (!ctx->cur_page) { > ctx->cur_page = iomap_next_page(inode, ctx->pages, > pos, length, ); > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > } > ret = iomap_readpage_actor(inode, pos + done, length - done, > ctx, iomap, srcmap); > + if (WARN_ON(ret == 0)) > + break; This error case now leaks ctx->cur_page > + done += ret; > + if (offset_in_page(pos + done) == 0) { > + if (!ctx->cur_page_in_bio) > + unlock_page(ctx->cur_page); > + put_page(ctx->cur_page); > + ctx->cur_page = NULL; > + } > } > > return done; > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct > list_head *pages, > done: > if (ctx.bio) > submit_bio(ctx.bio); > - if (ctx.cur_page) { > - if (!ctx.cur_page_in_bio) > - unlock_page(ctx.cur_page); > - 
put_page(ctx.cur_page); > - } > + BUG_ON(ctx.cur_page); And so will now trigger both a warn and a bug Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 16/19] fuse: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in fuse. Switching away from the > read_cache_pages() helper gets rid of an implicit call to put_page(), > so we can get rid of the get_page() call in fuse_readpages_fill(). > > Signed-off-by: Matthew Wilcox (Oracle) Looks OK. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 14/19] ext4: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in ext4 > > Signed-off-by: Matthew Wilcox (Oracle) There's nothing I can see in this that would cause that list corruption I saw with ext4. I'll re-introduce the patch and see if it falls over again. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/zdata.c | 29 + > 1 file changed, 9 insertions(+), 20 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/data.c | 39 +--- > fs/erofs/zdata.c | 2 +- > include/trace/events/erofs.h | 6 +++--- > 3 files changed, 18 insertions(+), 29 deletions(-) Looks fine from the perspective of page iteration and error handling. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > - if (nr) { > > > - u64 contig_start = page_offset(pagepool[0]); > > > + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); > > > > Ok, yes it does. :) > > > > I don't see how readahead_for_each_batch() guarantees that, though. > > I ... don't see how it doesn't? We start at rac->_start and iterate > through the consecutive pages in the page cache. readahead_for_each_batch() > does assume that __do_page_cache_readahead() has its current behaviour > of putting the pages in the page cache in order, and kicks off a new > call to ->readahead() every time it has to skip an index for whatever > reason (eg page already in page cache). And there is the comment I was looking for while reading readahead_for_each_batch() :) > > > > - if (bio) > > > - return submit_one_bio(bio, 0, bio_flags); > > > - return 0; > > > + if (bio) { > > > + if (submit_one_bio(bio, 0, bio_flags)) > > > + return; > > > + } > > > } > > > > Shouldn't that just be > > > > if (bio) > > submit_one_bio(bio, 0, bio_flags); > > It should, but some overzealous person decided to mark submit_one_bio() > as __must_check, so I have to work around that. /me looks at code Ngggh. I rather dislike functions that are named in a way that look like they belong to core kernel APIs but in reality are local static functions. I'd ask for this to be fixed if it was generic code, but it's btrfs specific code so they can deal with the ugliness of their own creation. :/ > > Confusing when put alongside rac->_batch_count counting the number > > of pages in the batch, and "batch" being the index into the page > > array, and they aren't the same counts > > Yes. Renamed to 'i'. 
> > > > + XA_STATE(xas, >mapping->i_pages, rac->_start); > > > + struct page *page; > > > + > > > + rac->_batch_count = 0; > > > + xas_for_each(, page, rac->_start + rac->_nr_pages - 1) { > > > > That just iterates pages in the start,end doesn't it? What > > guarantees that this fills the array with a contiguous page range? > > The behaviour of __do_page_cache_readahead(). Dave Howells also has a > usecase for xas_for_each_contig(), so I'm going to add that soon. > > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + VM_BUG_ON_PAGE(PageTail(page), page); > > > + array[batch++] = page; > > > + rac->_batch_count += hpage_nr_pages(page); > > > + if (PageHead(page)) > > > + xas_set(, rac->_start + rac->_batch_count); > > > > What on earth does this do? Comments please! > > /* >* The page cache isn't using multi-index entries yet, >* so xas_for_each() won't do the right thing for >* large pages. This can be removed once the page cache >* is converted. >*/ Oh, it's changing the internal xarray lookup cursor position to point at the correct next page index? Perhaps it's better to say that instead of "won't do the right thing"? > > > +#define readahead_for_each_batch(rac, array, size, nr) > > > \ > > > + for (; (nr = readahead_page_batch(rac, array, size)); \ > > > + readahead_next(rac)) > > > > I had to go look at the caller to work out what "size" refered to > > here. > > > > This is complex enough that it needs proper API documentation. > > How about just: > > -#define readahead_for_each_batch(rac, array, size, nr) \ > - for (; (nr = readahead_page_batch(rac, array, size)); \ > +#define readahead_for_each_batch(rac, array, array_sz, nr) \ > + for (; (nr = readahead_page_batch(rac, array, array_sz)); \ Yup, that's fine - now the macro documents itself. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > ext4 and f2fs have duplicated the guts of the readahead code so > > > they can read past i_size. Instead, separate out the guts of the > > > readahead code so they can call it directly. > > > > Gross and nasty (hosting non-stale data beyond EOF in the page > > cache, that is). > > I thought you meant sneaking changes into the VFS (that were rejected) by > copying VFS code and modifying it ... Well, now that you mention it... :P > > > +/** > > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > > > + * @mapping: File address space. > > > + * @file: This instance of the open file; used for authentication. > > > + * @offset: First page index to read. > > > + * @end_index: The maximum page index to read. > > > + * @nr_to_read: The number of pages to read. > > > + * @lookahead_size: Where to start the next readahead. > > > + * > > > + * This function is for filesystems to call when they want to start > > > + * readahead potentially beyond a file's stated i_size. If you want > > > + * to start readahead on a normal file, you probably want to call > > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > > > + * > > > + * Context: File is referenced by caller. Mutexes may be held by caller. > > > + * May sleep, but will not reenter filesystem to reclaim memory. > > > */ > > > -void __do_page_cache_readahead(struct address_space *mapping, > > > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > > > - unsigned long lookahead_size) > > > +void page_cache_readahead_limit(struct address_space *mapping, > > > > ... I don't think the function name conveys it's purpose. It's > > really a ranged readahead that ignores where i_size lies. 
i.e > > > > page_cache_readahead_range(mapping, start, end, nr_to_read) > > > > seems like a better API to me, and then you can drop the "start > > readahead beyond i_size" comments and replace it with "Range is not > > limited by the inode's i_size and hence can be used to read data > > stored beyond EOF into the page cache." > > I'm concerned that calling it 'range' implies "I want to read between > start and end" rather than "I want to read nr_to_read at start, oh but > don't go past end". > > Maybe the right way to do this is have the three callers cap nr_to_read. > Well, the one caller ... after all, f2fs and ext4 have no desire to > cap the length. Then we can call it page_cache_readahead_exceed() or > page_cache_readahead_dangerous() or something else like that to make it > clear that you shouldn't be calling it. Fair point. And in reading this, it occurred to me that what we are enabling is an "out of bounds" readahead function. so page_cache_readahead_OOB() or *_unbounded() might be a better name > * Like add_to_page_cache_locked, but used to add newly allocated pages: > diff --git a/mm/readahead.c b/mm/readahead.c > index 9dd431fa16c9..cad26287ad8b 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > blk_finish_plug(); > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_exceed - Start unchecked readahead. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @index: First page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. 
> + * > + * This function is for filesystems to call when they want to start > + * readahead beyond a file's stated i_size. This is almost certainly > + * not the function you want to call. Use page_cache_async_readahead() > + * or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. Yup, looks much better. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation
On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > This replaces ->readpages with a saner interface: > > > - Return void instead of an ignored error code. > > > - Pages are already in the page cache when ->readahead is called. > > > > Might read better as: > > > > - Page cache is already populated with locked pages when > >->readahead is called. > > Will do. > > > > - Implementation looks up the pages in the page cache instead of > > >having them passed in a linked list. > > > > Add: > > > > - cleanup of unused readahead handled by ->readahead caller, not > >the method implementation. > > The readpages caller does that cleanup too, so it's not an advantage > to the readahead interface. Right. I kind of read the list as "the reasons the new API is sane" not as "how readahead is different to readpages" > > > ``readpages`` > > > called by the VM to read pages associated with the address_space > > > object. This is essentially just a vector version of readpage. > > > Instead of just one page, several pages are requested. > > > readpages is only used for read-ahead, so read errors are > > > ignored. If anything goes wrong, feel free to give up. > > > + This interface is deprecated; implement readahead instead. > > > > What is the removal schedule for the deprecated interface? > > I mentioned that in the cover letter; once Dave Howells has the fscache > branch merged, I'll do the remaining filesystems. Should be within the > next couple of merge windows. Sure, but I like to see actual release tags with the deprecation notice so that it's obvious to the reader as to whether this is something new and/or when they can expect it to go away. 
> > > +/* The byte offset into the file of this readahead block */ > > > +static inline loff_t readahead_offset(struct readahead_control *rac) > > > +{ > > > + return (loff_t)rac->_start * PAGE_SIZE; > > > +} > > > > Urk. Didn't an early page use "offset" for the page index? That > > was was "mm: Remove 'page_offset' from readahead loop" did, right? > > > > That's just going to cause confusion to have different units for > > readahead "offsets" > > We are ... not consistent anywhere in the VM/VFS with our naming. > Unfortunately. > > $ grep -n offset mm/filemap.c > 391: * @start:offset in bytes where the range starts > ... > 815: pgoff_t offset = old->index; > ... > 2020: unsigned long offset; /* offset into pagecache page */ > ... > 2257: *ppos = ((loff_t)index << PAGE_SHIFT) + offset; > > That last one's my favourite. Not to mention the fine distinction you > and I discussed recently between offset_in_page() and page_offset(). > > Best of all, even our types encode the ambiguity of an 'offset'. We have > pgoff_t and loff_t (replacing the earlier off_t). > > So, new rule. 'pos' is the number of bytes into a file. 'index' is the > number of PAGE_SIZE pages into a file. We don't use the word 'offset' > at all. 'length' as a byte count and 'count' as a page count seem like > fine names to me. That sounds very reasonable to me. Another patchset in the making? :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > At allocation time, put the pages in the cache unless we're using > > > ->readpages. Add the readahead_for_each() iterator for the benefit of > > > the ->readpage fallback. This iterator supports huge pages, even though > > > none of the filesystems to be converted do yet. > > > > This could be better written - took me some time to get my head > > around it and the code. > > > > "When populating the page cache for readahead, mappings that don't > > use ->readpages need to have their pages added to the page cache > > before ->readpage is called. Do this insertion earlier so that the > > pages can be looked up immediately prior to ->readpage calls rather > > than passing them on a linked list. This early insert functionality > > is also required by the upcoming ->readahead method that will > > replace ->readpages. > > > > Optimise and simplify the readpage loop by adding a > > readahead_for_each() iterator to provide the pages we need to read. > > This iterator also supports huge pages, even though none of the > > filesystems have been converted to use them yet." > > Thanks, I'll use that. > > > > +static inline struct page *readahead_page(struct readahead_control *rac) > > > +{ > > > + struct page *page; > > > + > > > + if (!rac->_nr_pages) > > > + return NULL; > > > > H. > > > > > + > > > + page = xa_load(>mapping->i_pages, rac->_start); > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + rac->_batch_count = hpage_nr_pages(page); > > > > So we could have rac->_nr_pages = 2, and then we get an order 2 > > large page returned, and so rac->_batch_count = 4. > > Well, no, we couldn't. rac->_nr_pages is incremented by 4 when we add > an order-2 page to the readahead. I don't see any code that does that. :) i.e. 
we aren't actually putting high order pages into the page cache here (page_alloc() allocates order-0 pages), so there's nothing in the patch that tells me how rac->_nr_pages behaves when allocating large pages... IOWs, we have an undocumented assumption in the implementation... > I can put a > BUG_ON(rac->_batch_count > rac->_nr_pages) > in here to be sure to catch any logic error like that. Definitely necessary given that we don't insert large pages for readahead yet. A comment explaining the assumptions that the code makes for large pages is probably in order, too. > > > - page->index = offset; > > > - list_add(>lru, _pool); > > > + if (use_list) { > > > + page->index = offset; > > > + list_add(>lru, _pool); > > > + } else if (add_to_page_cache_lru(page, mapping, offset, > > > + gfp_mask) < 0) { > > > + put_page(page); > > > + goto read; > > > + } > > Ok, so that's why you put read code at the end of the loop. To turn > > the code into spaghetti :/ > > How much does this simplify down when we get rid of ->readpages and > > can restructure the loop? This really seems like you're trying to > > flatten two nested loops into one by the use of goto > I see it as having two failure cases in this loop. One for "page is > already present" (which already existed) and one for "allocated a page, > but failed to add it to the page cache" (which used to be done later). > I didn't want to duplicate the "call read_pages()" code. So I reshuffled > the code rather than add a nested loop. I don't think the nested loop > is easier to read (we'll be at 5 levels of indentation for some statements). > Could do it this way ... Can we move the update of @rac inside read_pages()? The next start offset^Windex we start at is rac._start + rac._nr_pages, right? 
so read_pages() could do:

	{
		if (readahead_count(rac)) {
			/* do readahead */
		}
		/* advance the readahead cursor */
		rac->_start += rac->_nr_pages;
		rac->_nr_pages = 0;
	}

and then we only need to call read_pages() in these cases, and so the need to duplicate code goes away... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 04/19] mm: Rearrange readahead loop
On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > Move the declaration of 'page' to inside the loop and move the 'kick > > > off a fresh batch' code to the end of the function for easier use in > > > subsequent patches. > > > > Stale? the "kick off" code is moved to the tail of the loop, not the > > end of the function. > > Braino; I meant to write end of the loop. > > > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > page = xa_load(&mapping->i_pages, page_offset); > > > if (page && !xa_is_value(page)) { > > > /* > > > - * Page already present? Kick off the current batch of > > > - * contiguous pages before continuing with the next > > > - * batch. > > > + * Page already present? Kick off the current batch > > > + * of contiguous pages before continuing with the > > > + * next batch. This page may be the one we would > > > + * have intended to mark as Readahead, but we don't > > > + * have a stable reference to this page, and it's > > > + * not worth getting one just for that. > > >*/ > > > - if (readahead_count(&rac)) > > > - read_pages(&rac, &page_pool, gfp_mask); > > > - rac._nr_pages = 0; > > > - continue; > > > + goto read; > > > } > > > > > > page = __page_cache_alloc(gfp_mask); > > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > if (page_idx == nr_to_read - lookahead_size) > > > SetPageReadahead(page); > > > rac._nr_pages++; > > > + continue; > > > +read: > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > } > > > > Also, why? This adds a goto from branched code that continues, then > > adds a continue so the unbranched code doesn't execute the code the > > goto jumps to. 
In absence of any explanation, this isn't an > > improvement and doesn't make any sense... > > I thought I was explaining it ... "for easier use in subsequent patches". Sorry, my braino there. :) I commented on the problem with the first part of the sentence, then the rest of the sentence completely failed to sink in. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > > > +static void read_pages(struct readahead_control *rac, struct list_head > > > *pages, > > > + gfp_t gfp) > > > { > > > + const struct address_space_operations *aops = rac->mapping->a_ops; > > > struct blk_plug plug; > > > unsigned page_idx; > > > > Splitting out the aops rather than the mapping here just looks > > weird, especially as you need the mapping later in the function. > > Using aops doesn't even reduce the code size > > It does in subsequent patches ... I agree it looks a little weird here, > but I think in the final form, it makes sense: Ok. Perhaps just an additional commit comment to say "read_pages() is changed to be aops centric as @rac abstracts away all other implementation details by the end of the patchset." > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > > Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead > > control structure - if we have to pass the gfp_mask down all the > > way alongside the rac, then I think it makes sense to do that... > > So we end up removing it later on in this series, but I do wonder if > it would make sense anyway. By the end of the series, we still have > this in iomap: > > if (ctx->rac) /* same as readahead_gfp_mask */ > gfp |= __GFP_NORETRY | __GFP_NOWARN; > > and we could get rid of that by passing gfp flags down in the rac. On the > other hand, I don't know why it doesn't just use readahead_gfp_mask() > here anyway ... Christoph? mapping->gfp_mask is awful. Is it a mask, or is it a valid set of allocation flags? Or both? 
Some callers to mapping_gfp_constraint() use it as a mask, some callers to mapping_gfp_constraint() use it as base flags that context specific flags get masked out of, readahead_gfp_mask() callers use it as the entire set of gfp flags for allocation. That whole API sucks - undocumented as to what it's supposed to do and how it's supposed to be used. Hence it's difficult to use correctly or understand whether it's being used correctly. And reading callers only leads to more confusion and crazy code like in do_mpage_readpage() where readahead returns a mask that is used as base flags and normal reads return a masked set of base flags... The iomap code is obviously correct when it comes to gfp flag manipulation. We start with GFP_KERNEL context, then constrain it via the mask held in mapping->gfp_mask, then if it's readahead we allow the allocation to silently fail. Simple to read and understand code, versus having weird code that requires the reader to decipher an undocumented and inconsistent API to understand how the gfp flags have been calculated and are valid. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
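The distinction Dave draws - gfp_mask used as a constraining mask versus used as a complete set of allocation flags - can be illustrated with a small userspace model. The flag values below are arbitrary stand-ins, not the kernel's real gfp bits, and iomap_readpage_gfp() is a hypothetical name for the iomap pattern described above.

```c
#include <assert.h>

/* Stand-in flag bits for illustration only. */
#define GFP_KERNEL      0x01u
#define __GFP_NORETRY   0x02u
#define __GFP_NOWARN    0x04u
#define __GFP_FS        0x08u

typedef unsigned int gfp_t;

struct address_space { gfp_t gfp_mask; };

/* gfp_mask used as a mask: constrain @gfp to the bits the mapping allows. */
static gfp_t mapping_gfp_constraint(struct address_space *m, gfp_t gfp)
{
    return m->gfp_mask & gfp;
}

/* The pattern Dave calls "obviously correct": start from full GFP_KERNEL
 * context, constrain via the mapping's mask, then for readahead relax the
 * allocation so it may fail silently. */
static gfp_t iomap_readpage_gfp(struct address_space *m, int is_readahead)
{
    gfp_t gfp = mapping_gfp_constraint(m, GFP_KERNEL | __GFP_FS);

    if (is_readahead)
        gfp |= __GFP_NORETRY | __GFP_NOWARN; /* allow silent failure */
    return gfp;
}
```

The confusion Dave complains about arises when the same field is sometimes ANDed in (a mask) and sometimes used wholesale as the allocation flags.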
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > Latest version in your git tree: > > > > $ ▶ glo -n 5 willy/readahead > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > ff63497fcb98 iomap: Convert from readpages to readahead > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > 8115bcca7312 fuse: Convert from readpages to readahead > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > $ > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > [2.423116] [ cut here ] > > [2.424957] list_add double add: new=ea000efff4c8, > > prev=8883bfffee60, next=ea000efff4c8. > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > __list_add_valid+0x67/0x70 > > [2.457484] Call Trace: > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > [2.462787] lru_add_drain+0x18/0x20 > > Are you sure that was 4be497096c04 ? I ask because there was a Yes, because it's the only version I've actually merged into my working tree, compiled and tried to run. :P > version pushed to that git tree that did contain a list double-add > (due to a mismerge when shuffling patches). I noticed it and fixed > it, and 4be497096c04 doesn't have that problem. I also test with > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > probabilistic because it'll depend on the timing between whatever other > list is being used and the page actually being added to the LRU. I'll see if I can reproduce it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(PageTail(page), page); > + array[batch++] = page; > + rac->_batch_count += hpage_nr_pages(page); > + if (PageHead(page)) > + xas_set(&xas, rac->_start + rac->_batch_count); What on earth does this do? Comments please! > + > + if (batch == size) > + break; > + } > + > + return batch; > +} Seems a bit big for an inline function. > + > +#define readahead_for_each_batch(rac, array, size, nr) \ > + for (; (nr = readahead_page_batch(rac, array, size)); \ > + readahead_next(rac)) I had to go look at the caller to work out what "size" referred to here. This is complex enough that it needs proper API documentation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
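The batch iterator under discussion can be sketched in miniature. The model below is userspace C with page indexes standing in for struct page, and it assumes order-0 pages (each page contributes one index), matching the no-large-pages caveat raised elsewhere in the thread. "size" is the capacity of the caller's array; the return value is how many entries were filled for this batch.

```c
#include <assert.h>

/* Userspace model of the batch iterator; field names follow the thread. */
struct readahead_control {
    unsigned long _start;      /* index of first page not yet consumed */
    unsigned int _nr_pages;    /* pages remaining in the readahead block */
    unsigned int _batch_count; /* pages handed out in the current batch */
};

/* Fill @array with up to @size page indexes from the current position.
 * Returns the number filled; 0 terminates readahead_for_each_batch(). */
static unsigned int readahead_page_batch(struct readahead_control *rac,
                                         unsigned long *array,
                                         unsigned int size)
{
    unsigned int batch = 0;

    rac->_batch_count = 0;
    while (batch < size && rac->_batch_count < rac->_nr_pages) {
        array[batch++] = rac->_start + rac->_batch_count;
        rac->_batch_count++; /* order-0: one index per page */
    }
    return batch;
}

/* Advance past the batch just handed out. */
static void readahead_next(struct readahead_control *rac)
{
    rac->_start += rac->_batch_count;
    rac->_nr_pages -= rac->_batch_count;
}
```

A 5-page block iterated with a 2-entry array yields batches of 2, 2 and 1, then terminates - which is the behaviour the macro's emptiness test relies on.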
Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Implement the new readahead aop and convert all callers (block_dev, > exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, > reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Junxiao Bi # ocfs2 > --- > drivers/staging/exfat/exfat_super.c | 7 +++--- > fs/block_dev.c | 7 +++--- > fs/ext2/inode.c | 10 +++- > fs/fat/inode.c | 7 +++--- > fs/gfs2/aops.c | 23 ++--- > fs/hpfs/file.c | 7 +++--- > fs/iomap/buffered-io.c | 2 +- > fs/isofs/inode.c| 7 +++--- > fs/jfs/inode.c | 7 +++--- > fs/mpage.c | 38 + > fs/nilfs2/inode.c | 15 +++- > fs/ocfs2/aops.c | 34 ++ > fs/omfs/file.c | 7 +++--- > fs/qnx6/inode.c | 7 +++--- > fs/reiserfs/inode.c | 8 +++--- > fs/udf/inode.c | 7 +++--- > include/linux/mpage.h | 4 +-- > mm/migrate.c| 2 +- > 18 files changed, 73 insertions(+), 126 deletions(-) That's actually pretty simple changeover. Nothing really scary there. :) Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > ext4 and f2fs have duplicated the guts of the readahead code so > they can read past i_size. Instead, separate out the guts of the > readahead code so they can call it directly. Gross and nasty (hosting non-stale data beyond EOF in the page cache, that is). Code is pretty simple, but... > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @offset: First page index to read. > + * @end_index: The maximum page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. > + * > + * This function is for filesystems to call when they want to start > + * readahead potentially beyond a file's stated i_size. If you want > + * to start readahead on a normal file, you probably want to call > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. > */ > -void __do_page_cache_readahead(struct address_space *mapping, > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > - unsigned long lookahead_size) > +void page_cache_readahead_limit(struct address_space *mapping, ... I don't think the function name conveys its purpose. It's really a ranged readahead that ignores where i_size lies. 
i.e. page_cache_readahead_range(mapping, start, end, nr_to_read) seems like a better API to me, and then you can drop the "start readahead beyond i_size" comments and replace it with "Range is not limited by the inode's i_size and hence can be used to read data stored beyond EOF into the page cache." Also: "This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead." Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 3613154e79e4..bd4291f78f41 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -665,6 +665,24 @@ static inline void readahead_next(struct > readahead_control *rac) > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); readahead_next(rac)) > > +/* The byte offset into the file of this readahead block */ > +static inline loff_t readahead_offset(struct readahead_control *rac) > +{ > + return (loff_t)rac->_start * PAGE_SIZE; > +} Urk. Didn't an earlier patch use "offset" for the page index? That was what "mm: Remove 'page_offset' from readahead loop" did, right? That's just going to cause confusion to have different units for readahead "offsets" > + > +/* The number of bytes in this readahead block */ > +static inline loff_t readahead_length(struct readahead_control *rac) > +{ > + return (loff_t)rac->_nr_pages * PAGE_SIZE; > +} > + > +/* The index of the first page in this readahead block */ > +static inline unsigned int readahead_index(struct readahead_control *rac) > +{ > + return rac->_start; > +} Based on this, I suspect the earlier patch should use "index" rather than "offset" when walking the page cache indexes... > + > /* The number of pages in this readahead block */ > static inline unsigned int readahead_count(struct readahead_control *rac) > { > diff --git a/mm/readahead.c b/mm/readahead.c > index 9e430daae42f..975ff5e387be 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > > blk_start_plug(&plug); > > - if (aops->readpages) { > + if (aops->readahead) { > + aops->readahead(rac); > + readahead_for_each(rac, page) { > + unlock_page(page); > + put_page(page); > + } This needs a comment to explain the unwinding that needs to be done here. I'm not going to remember in a year's time that this is just for the pages that weren't submitted by ->readahead Cheers, Dave. -- Dave Chinner da...@fromorbit.com
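The unit mismatch Dave flags - readahead_offset() in bytes, readahead_index() in pages, both derived from the same _start field - is easy to demonstrate with a standalone model of the accessors (userspace C, with PAGE_SIZE hardcoded to 4096 for illustration):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT) /* 4096 for this model */

typedef long long loff_t;

/* Stand-in for the struct in the quoted patch. */
struct readahead_control {
    unsigned long _start;   /* a PAGE index, not a byte offset */
    unsigned int _nr_pages;
};

/* Byte offset into the file: the page index scaled by the page size. */
static loff_t readahead_offset(struct readahead_control *rac)
{
    return (loff_t)rac->_start * PAGE_SIZE;
}

/* Length of the block in bytes. */
static loff_t readahead_length(struct readahead_control *rac)
{
    return (loff_t)rac->_nr_pages * PAGE_SIZE;
}

/* Page-cache index of the first page: the same quantity, unscaled. */
static unsigned long readahead_index(struct readahead_control *rac)
{
    return rac->_start;
}
```

Two accessors that differ only by a PAGE_SHIFT scaling but share the "offset" vocabulary are exactly the confusion being objected to.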
Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
page->index = offset; > - list_add(&page->lru, &page_pool); > + if (use_list) { > + page->index = offset; > + list_add(&page->lru, &page_pool); > + } else if (add_to_page_cache_lru(page, mapping, offset, > + gfp_mask) < 0) { > + put_page(page); > + goto read; > + } Ok, so that's why you put read code at the end of the loop. To turn the code into spaghetti :/ How much does this simplify down when we get rid of ->readpages and can restructure the loop? This really seems like you're trying to flatten two nested loops into one by the use of goto Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Change the type of page_idx to unsigned long, and rename it -- it's > just a loop counter, not a page index. > > Suggested-by: John Hubbard > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Eliminate the page_offset variable which was confusing with the > 'offset' parameter and record the start of each consecutive run of > pages in the readahead_control. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) Looks ok, but having the readahead dispatch out of line from the case that triggers it makes it hard to follow. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 04/19] mm: Rearrange readahead loop
On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Move the declaration of 'page' to inside the loop and move the 'kick > off a fresh batch' code to the end of the function for easier use in > subsequent patches. Stale? the "kick off" code is moved to the tail of the loop, not the end of the function. > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > *mapping, > page = xa_load(&mapping->i_pages, page_offset); > if (page && !xa_is_value(page)) { > /* > - * Page already present? Kick off the current batch of > - * contiguous pages before continuing with the next > - * batch. > + * Page already present? Kick off the current batch > + * of contiguous pages before continuing with the > + * next batch. This page may be the one we would > + * have intended to mark as Readahead, but we don't > + * have a stable reference to this page, and it's > + * not worth getting one just for that. >*/ > - if (readahead_count(&rac)) > - read_pages(&rac, &page_pool, gfp_mask); > - rac._nr_pages = 0; > - continue; > + goto read; > } > > page = __page_cache_alloc(gfp_mask); > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > *mapping, > if (page_idx == nr_to_read - lookahead_size) > SetPageReadahead(page); > rac._nr_pages++; > + continue; > +read: > + if (readahead_count(&rac)) > + read_pages(&rac, &page_pool, gfp_mask); > + rac._nr_pages = 0; > } Also, why? This adds a goto from branched code that continues, then adds a continue so the unbranched code doesn't execute the code the goto jumps to. In the absence of any explanation, this isn't an improvement and doesn't make any sense... -- Dave Chinner da...@fromorbit.com
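Whichever control flow is used - the original duplicated dispatch or the goto/continue form being debated - the batching behaviour is identical. The standalone model below (not kernel code) captures that behaviour: contiguous runs of absent pages form a batch, a batch is dispatched whenever an already-present page interrupts the run, and any trailing batch is dispatched after the loop.

```c
#include <assert.h>

/* Model of the __do_page_cache_readahead() loop: walk @nr page slots,
 * where present[i] != 0 marks a page already in the cache. Returns the
 * number of read_pages() dispatches; *total gets the pages submitted. */
static int readahead_batches(const int *present, int nr, int *total)
{
    int nr_pages = 0, dispatches = 0;

    *total = 0;
    for (int i = 0; i < nr; i++) {
        if (present[i]) {
            /* "Kick off the current batch" - the code the goto jumps to */
            if (nr_pages) {
                dispatches++;
                *total += nr_pages;
            }
            nr_pages = 0;
            continue;
        }
        nr_pages++; /* page allocated and appended to the batch */
    }
    if (nr_pages) { /* trailing batch submitted after the loop */
        dispatches++;
        *total += nr_pages;
    }
    return dispatches;
}
```

The goto form and the duplicated form both reduce to this: the disagreement in the thread is purely about readability, not behaviour.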
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > In this patch, only between __do_page_cache_readahead() and > read_pages(), but it will be extended in upcoming patches. Also add > the readahead_count() accessor. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > include/linux/pagemap.h | 17 + > mm/readahead.c | 36 +--- > 2 files changed, 38 insertions(+), 15 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index ccb14b6a16b5..982ecda2d4a2 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page, > return error; > } > > +/* > + * Readahead is of a block of consecutive pages. > + */ > +struct readahead_control { > + struct file *file; > + struct address_space *mapping; > +/* private: use the readahead_* accessors instead */ > + pgoff_t _start; > + unsigned int _nr_pages; > +}; > + > +/* The number of pages in this readahead block */ > +static inline unsigned int readahead_count(struct readahead_control *rac) > +{ > + return rac->_nr_pages; > +} > + > static inline unsigned long dir_pages(struct inode *inode) > { > return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >> > diff --git a/mm/readahead.c b/mm/readahead.c > index 12d13b7792da..15329309231f 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, > struct list_head *pages, > > EXPORT_SYMBOL(read_cache_pages); > > -static void read_pages(struct address_space *mapping, struct file *filp, > - struct list_head *pages, unsigned int nr_pages, gfp_t gfp) > +static void read_pages(struct readahead_control *rac, struct list_head > *pages, > + gfp_t gfp) > { > + const struct address_space_operations *aops = rac->mapping->a_ops; > struct blk_plug plug; > unsigned page_idx; Splitting out the aops rather than the mapping here just looks weird, especially as you need the 
mapping later in the function. Using aops doesn't even reduce the code size > > blk_start_plug(&plug); > > - if (mapping->a_ops->readpages) { > - mapping->a_ops->readpages(filp, mapping, pages, nr_pages); > + if (aops->readpages) { > + aops->readpages(rac->file, rac->mapping, pages, > + readahead_count(rac)); > /* Clean up the remaining pages */ > put_pages_list(pages); > goto out; > } > > - for (page_idx = 0; page_idx < nr_pages; page_idx++) { > + for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) { > struct page *page = lru_to_page(pages); > list_del(&page->lru); > - if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) > - mapping->a_ops->readpage(filp, page); > + if (!add_to_page_cache_lru(page, rac->mapping, page->index, > + gfp)) > + aops->readpage(rac->file, page); ... it just makes this less readable by splitting the if() over two lines... > put_page(page); > } > > @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space > *mapping, > unsigned long end_index; /* The last page we want to read */ > LIST_HEAD(page_pool); > int page_idx; > - unsigned int nr_pages = 0; > loff_t isize = i_size_read(inode); > gfp_t gfp_mask = readahead_gfp_mask(mapping); > + struct readahead_control rac = { > + .mapping = mapping, > + .file = filp, > + ._nr_pages = 0, > + }; No need to initialise _nr_pages to zero, leaving it out will do the same thing. > > if (isize == 0) > return; > @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space > *mapping, >* contiguous pages before continuing with the next >* batch. >*/ > - if (nr_pages) > - read_pages(mapping, filp, &page_pool, nr_pages, > - gfp_mask); > - nr_pages = 0; > + if (readahead_count(&rac)) > + read_pages(&rac, &page_pool, gfp_mask); > + rac._nr_pages = 0; Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead control structure - if we have to pass the gfp_mask down all the way alongside the rac, then I think it makes sense to do that... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This series adds a readahead address_space operation to eventually > replace the readpages operation. The key difference is that > pages are added to the page cache as they are allocated (and > then looked up by the filesystem) instead of passing them on a > list to the readpages operation and having the filesystem add > them to the page cache. It's a net reduction in code for each > implementation, more efficient than walking a list, and solves > the direct-write vs buffered-read problem reported by yu kuai at > https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yuku...@huawei.com/ > > The only unconverted filesystems are those which use fscache. > Their conversion is pending Dave Howells' rewrite which will make the > conversion substantially easier. Latest version in your git tree: $ ▶ glo -n 5 willy/readahead 4be497096c04 mm: Use memalloc_nofs_save in readahead path ff63497fcb98 iomap: Convert from readpages to readahead 26aee60e89b5 iomap: Restructure iomap_readpages_actor 8115bcca7312 fuse: Convert from readpages to readahead 3db3d10d9ea1 f2fs: Convert from readpages to readahead $ merged into a 5.6-rc2 tree fails at boot on my test vm: [2.423116] [ cut here ] [2.424957] list_add double add: new=ea000efff4c8, prev=8883bfffee60, next=ea000efff4c8. 
[2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70 [2.430617] CPU: 4 PID: 1 Comm: sh Not tainted 5.6.0-rc2-dgc+ #1800 [2.432405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [2.434744] RIP: 0010:__list_add_valid+0x67/0x70 [2.436107] Code: c6 4c 89 ca 48 c7 c7 10 41 58 82 e8 55 29 89 ff 0f 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 60 41 58 82 e8 3b 29 89 ff <0f> 0b 31 c7 [2.441161] RSP: :c900018a3bb0 EFLAGS: 00010082 [2.442548] RAX: RBX: ea000efff4c0 RCX: 0256 [2.32] RDX: 0001 RSI: 0086 RDI: 8288a8b0 [2.446315] RBP: ea000efff4c8 R08: c900018a3a65 R09: 0256 [2.448199] R10: 0008 R11: c900018a3a65 R12: ea000efff4c8 [2.450072] R13: 8883bfffee60 R14: 0010 R15: 0001 [2.451959] FS: () GS:8883b9c0() knlGS: [2.454083] CS: 0010 DS: ES: CR0: 80050033 [2.455604] CR2: CR3: 0003b9a37002 CR4: 00060ee0 [2.457484] Call Trace: [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 [2.459376] pagevec_lru_move_fn+0x87/0xd0 [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 [2.461712] lru_add_drain_cpu+0x8d/0x160 [2.462787] lru_add_drain+0x18/0x20 [2.463757] shift_arg_pages+0xb8/0x180 [2.464789] ? vprintk_emit+0x101/0x1c0 [2.465813] ? printk+0x58/0x6f [2.466659] setup_arg_pages+0x205/0x240 [2.467716] load_elf_binary+0x34a/0x1560 [2.468789] ? get_user_pages_remote+0x159/0x280 [2.470024] ? selinux_inode_permission+0x10d/0x1e0 [2.471323] ? _raw_read_unlock+0xa/0x20 [2.472375] ? load_misc_binary+0x2b2/0x410 [2.473492] search_binary_handler+0x60/0xe0 [2.474634] __do_execve_file.isra.0+0x512/0x850 [2.475888] ? rest_init+0xc6/0xc6 [2.476801] do_execve+0x21/0x30 [2.477671] try_to_run_init_process+0x10/0x34 [2.478855] kernel_init+0xe2/0xfa [2.479776] ret_from_fork+0x1f/0x30 [2.480737] ---[ end trace e77079de9b22dc6a ]--- I just dropped the ext4 conversion from my local tree so I can boot the machine and test XFS. Might have some more info when that crashes and burns... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 02/19] mm: Ignore return value of ->readpages
On Mon, Feb 17, 2020 at 10:45:43AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > We used to assign the return value to a variable, which we then ignored. > Remove the pretence of caring. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Christoph Hellwig > --- > mm/readahead.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) Simple enough. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 01/19] mm: Return void from various readahead functions
On Mon, Feb 17, 2020 at 10:45:42AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > ondemand_readahead has two callers, neither of which use the return value. > That means that both ra_submit and __do_page_cache_readahead() can return > void, and we don't need to worry that a present page in the readahead > window causes us to return a smaller nr_pages than we ought to have. > > Signed-off-by: Matthew Wilcox (Oracle) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v5 04/13] mm: Add readahead address space operation
On Tue, Feb 11, 2020 at 04:54:13AM -0800, Matthew Wilcox wrote: > On Tue, Feb 11, 2020 at 03:52:30PM +1100, Dave Chinner wrote: > > > +struct readahead_control { > > > + struct file *file; > > > + struct address_space *mapping; > > > +/* private: use the readahead_* accessors instead */ > > > + pgoff_t start; > > > + unsigned int nr_pages; > > > + unsigned int batch_count; > > > +}; > > > + > > > +static inline struct page *readahead_page(struct readahead_control *rac) > > > +{ > > > + struct page *page; > > > + > > > + if (!rac->nr_pages) > > > + return NULL; > > > + > > > + page = xa_load(&rac->mapping->i_pages, rac->start); > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + rac->batch_count = hpage_nr_pages(page); > > > + rac->start += rac->batch_count; > > > > There's no mention of large page support in the patch description > > and I don't recall this sort of large page batching in previous > > iterations. > > > > This seems like new functionality to me, not directly related to > > the initial ->readahead API change? What have I missed? > > I had a crisis of confidence when I was working on this -- the loop > originally looked like this: > > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); rac->nr_pages--) > > and then I started thinking about what I'd need to do to support large > pages, and that turned into > > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); \ > rac->nr_pages -= hpage_nr_pages(page)) > > but I realised that was potentially a use-after-free because 'page' has > certainly had put_page() called on it by then. I had a brief period > where I looked at moving put_page() away from being the filesystem's > responsibility and into the iterator, but that would introduce more > changes into the patchset, as well as causing problems for filesystems > that want to break out of the loop. 
> > By this point, I was also looking at the readahead_for_each_batch() > iterator that btrfs uses, and so we have the batch count anyway, and we > might as well use it to store the number of subpages of the large page. > And so it became easier to just put the whole ball of wax into the initial > patch set, rather than introduce the iterator now and then fix it up in > the patch set that I'm basing on this. > > So yes, there's a certain amount of excess functionality in this patch > set ... I can remove it for the next release. I'd say "Just document it" as that was the main reason I noticed it. Or perhaps add the batching function as a stand-alone patch so it's clear that the batch interface solves two problems at once - large pages and the btrfs page batching implementation... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v5 04/13] mm: Add readahead address space operation
On Mon, Feb 10, 2020 at 05:03:39PM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This replaces ->readpages with a saner interface: > - Return void instead of an ignored error code. > - Pages are already in the page cache when ->readahead is called. > - Implementation looks up the pages in the page cache instead of >having them passed in a linked list. > > Signed-off-by: Matthew Wilcox (Oracle) > > +/* > + * Readahead is of a block of consecutive pages. > + */ > +struct readahead_control { > + struct file *file; > + struct address_space *mapping; > +/* private: use the readahead_* accessors instead */ > + pgoff_t start; > + unsigned int nr_pages; > + unsigned int batch_count; > +}; > + > +static inline struct page *readahead_page(struct readahead_control *rac) > +{ > + struct page *page; > + > + if (!rac->nr_pages) > + return NULL; > + > + page = xa_load(&rac->mapping->i_pages, rac->start); > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + rac->batch_count = hpage_nr_pages(page); > + rac->start += rac->batch_count; There's no mention of large page support in the patch description and I don't recall this sort of large page batching in previous iterations. This seems like new functionality to me, not directly related to the initial ->readahead API change? What have I missed? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: hold i_rwsem until aio completes
On Mon, Feb 03, 2020 at 06:46:41PM +0100, Christoph Hellwig wrote: > On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote: > > I think it's pretty gross, actually. It makes the same mistake made > > with locking in the old direct IO code - it encodes specific lock > > operations via flags into random locations in the DIO path. This is > > a very slippery slope, and IMO it is an layering violation to encode > > specific filesystem locking semantics into a layer that is supposed > > to be generic and completely filesystem agnostic. i.e. this > > mechanism breaks if a filesystem moves to a different type of lock > > (e.g. range locks), and history teaches us that we'll end up making > > a horrible, unmaintainable mess to support different locking > > mechanisms and contexts. > > > > I think that we should be moving to a model where the filesystem > > provides an unlock method in the iomap operations structure, and if > > the method is present in iomap_dio_complete() it gets called for the > > filesystem to unlock the inode at the appropriate point. This also > > allows the filesystem to provide a different method for read or > > write unlock, depending on what type of lock it held at submission. > > This gets rid of the need for the iomap code to know what type of > > lock the caller holds, too. > > I'd rather avoid yet another method. But I think with a little > tweaking we can move the unlock into the ->end_io method. That would work, too :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/12] mm: Add readahead address space operation
On Fri, Jan 24, 2020 at 05:35:45PM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This replaces ->readpages with a saner interface: > - Return the number of pages not read instead of an ignored error code. > - Pages are already in the page cache when ->readahead is called. > - Implementation looks up the pages in the page cache instead of >having them passed in a linked list. > diff --git a/mm/readahead.c b/mm/readahead.c > index 5a6676640f20..6d65dae6dad0 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,18 @@ static void read_pages(struct address_space *mapping, > struct file *filp, > > blk_start_plug(&plug); > > - if (mapping->a_ops->readpages) { > + if (mapping->a_ops->readahead) { > + unsigned left = mapping->a_ops->readahead(filp, mapping, > + start, nr_pages); > + > + while (left) { > + struct page *page = readahead_page(mapping, > + start + nr_pages - left - 1); Off by one? start = 2, nr_pages = 2, left = 1, this looks up the page at index 2, which is the one we issued IO on, not the one we "left behind" which is at index 3. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
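The suspected off-by-one can be checked with plain arithmetic. The helpers below isolate the index computation from the quoted hunk; leftover_index_expected() is a hypothetical corrected form for comparison, not code from the patch.

```c
#include <assert.h>

/* The computation as written in the patch: the "left behind" page is
 * looked up at start + nr_pages - left - 1. */
static unsigned long leftover_index_as_written(unsigned long start,
                unsigned long nr_pages, unsigned long left)
{
    return start + nr_pages - left - 1;
}

/* What the lookup presumably should be: with @left pages not yet read,
 * the first leftover page sits just past the submitted ones. */
static unsigned long leftover_index_expected(unsigned long start,
                unsigned long nr_pages, unsigned long left)
{
    return start + nr_pages - left;
}
```

With start = 2 and nr_pages = 2 the range covers indexes 2 and 3; for left = 1 the patch's expression lands on index 2 (already under I/O) while the page actually left behind is index 3 - exactly Dave's worked example.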
Re: [Cluster-devel] RFC: hold i_rwsem until aio completes
On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote: > Hi all, > > Asynchronous read/write operations currently use a rather magic locking > scheme, where access to file data is normally protected using a rw_semaphore, > but if we are doing aio where the syscall returns to userspace before the > I/O has completed we also use an atomic_t to track the outstanding aio > ops. This scheme has led to lots of subtle bugs in file systems where we > didn't wait for the count to reach zero, and due to its ad hoc nature also > means we have to serialize direct I/O writes that are smaller than the > file system block size. > > All this is solved by releasing i_rwsem only when the I/O has actually > completed, but doing so is against two mantras of Linux locking primitives: > > (1) no unlocking by another process than the one that acquired it > (2) no return to userspace with locks held > > As it happens, we have various places that work around this. A few > callers do non-owner unlocks of rwsems, which are pretty nasty for > PREEMPT_RT as the owner tracking doesn't work. OTOH the file system > freeze code has both problems and works around them a little better, > although in a somewhat awkward way, in that it releases the lockdep > object when returning to userspace, and reacquires it when done, and > also clears the rwsem owner when returning to userspace, and then sets > the new owner before unlocking. > > This series tries to follow that scheme, although it doesn't fully work. The > first issue is that the rwsem code has a bug where it doesn't properly > handle clearing the owner. This series has a patch to fix that, but it > is ugly and might not be correct so some help is needed. Second, I/O > completions often come from interrupt context, which means the re-acquire > is recorded as from irq context, leading to warnings about incorrect > contexts. I wonder if we could just have a bit in lockdep that says > returning to userspace is ok for this particular lock? 
That would also > clean up the fsfreeze situation a lot. > > Let me know what you think of all this. While I converted all the iomap > using file systems, only XFS is actually tested. I think it's pretty gross, actually. It makes the same mistake made with locking in the old direct IO code - it encodes specific lock operations via flags into random locations in the DIO path. This is a very slippery slope, and IMO it is a layering violation to encode specific filesystem locking semantics into a layer that is supposed to be generic and completely filesystem agnostic. i.e. this mechanism breaks if a filesystem moves to a different type of lock (e.g. range locks), and history teaches us that we'll end up making a horrible, unmaintainable mess to support different locking mechanisms and contexts. I think that we should be moving to a model where the filesystem provides an unlock method in the iomap operations structure, and if the method is present in iomap_dio_complete() it gets called for the filesystem to unlock the inode at the appropriate point. This also allows the filesystem to provide a different method for read or write unlock, depending on what type of lock it held at submission. This gets rid of the need for the iomap code to know what type of lock the caller holds, too. This way we always have a consistent point in the AIO/DIO completion model where the inode gets unlocked, it only gets called for the IO contexts where the filesystem wants/needs unlock on IO completion semantics, and it's completely filesystem IO-lock implementation agnostic. And for filesystems that use the inode i_rwsem, we can just provide simple helper functions for their read/write unlock methods. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?
On Thu, Oct 10, 2019 at 04:41:56PM +, Hayes, Bill wrote: > Has there been any discussion or interest in DAX support in OCFS2? > Is there interest from the OCFS2 development community to see DAX support > developed and put upstream? > > Has there been any discussion or interest in DAX support in GFS2? > Is there interest from the GFS2 development community to see DAX support > developed and put upstream? You're probably best from a DAX implementation POV to head down the GFS2 path, as FS-DAX requires the filesystem to use the fs/iomap/ extent mapping implementation. GFS2 is already partially ported to use the iomap infrastructure, though more work is needed to provide the iomap functionality DAX requires. OCFS2 would require a lot more work on this front. Cheers, Dave. -- Dave Chinner dchin...@redhat.com
Re: [Cluster-devel] [Q] gfs2: mmap write vs. punch_hole consistency
On Fri, Sep 06, 2019 at 11:48:31PM +0200, Andreas Gruenbacher wrote: > On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner wrote: > > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > > > Hi, > > > > > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > > > filesystems with a block size smaller than the page size [1]. > > > > > > It turns out that the same problem exists between mmap write and hole > > > punching, and since xfstests doesn't seem to cover that, > > > > AFAIA, fsx exercises it pretty often. Certainly it's found problems > > with XFS in the past w.r.t. these operations. > > > > > I've written a > > > new test [2]. > > > > I suspect that what we really want is a test that runs > > _test_generic_punch using mmap rather than pwrite... > > > > > Ext4 and xfs both pass that test; they both apparently > > > mark the pages that have a hole punched in them as read-only so that > > > page_mkwrite is called before those pages can be written to again. > > > > XFS invalidates the range being hole punched (see > > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL), which means any > > attempt to fault that page back in will block on the MMAPLOCK until > > the hole punch finishes. > > This isn't about writes during the hole punching, this is about writes > once the hole is punched. How do you prevent this:

    hole punch process:          other process:
      clean page
      invalidate page
                                 write page fault
                                 instantiate new page
                                 page_mkwrite
                                 page dirty
      do hole punch
                                 overwrite hole

Firstly, you end up with a dirty page over a hole with no backing store, and second, you get no page fault on overwrite because the pages are already dirty. That's the problem the MMAPLOCK in XFS solves - it's integral to ensuring the page faults on mmap write are correctly serialised with both the page cache invalidation and the underlying extent manipulation that the hole punch operation then does. i.e. 
if you want a page fault /after/ a hole punch, you have to prevent it occurring during the hole punch after the page has already been marked clean and/or invalidated. > For example, the test case I've posted > creates the following file layout with 1k blocksize: > > > > Then it punches a hole like this: > > DDHH HHDD > > Then it fills the hole again with mwrite: > > > > As far as I can tell, that needs to trigger page faults on all three > pages. Yes, but only /after/ the hole has been punched. > I did get these on ext4; judging from the fact that xfs works, > the also seem to occur there; but on gfs2, page_mkwrite isn't called > for the two partially mapped pages, only for the page in the middle > that's entirely within the hole. And I don't see where those pages are > marked read-only; it appears like pagecache_isize_extended isn't > called on ext4 or xfs. So how does this work there? As I said in my last response, the answer is in xfs_flush_unmap_range(). That uses truncate_pagecache_range() to do the necessary page cache manipulation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [Q] gfs2: mmap write vs. punch_hole consistency
On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > Hi, > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > filesystems with a block size smaller than the page size [1]. > > It turns out that the same problem exists between mmap write and hole > punching, and since xfstests doesn't seem to cover that, AFAIA, fsx exercises it pretty often. Certainly it's found problems with XFS in the past w.r.t. these operations. > I've written a > new test [2]. I suspect that what we really want is a test that runs _test_generic_punch using mmap rather than pwrite... > Ext4 and xfs both pass that test; they both apparently > mark the pages that have a hole punched in them as read-only so that > page_mkwrite is called before those pages can be written to again. XFS invalidates the range being hole punched (see xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL), which means any attempt to fault that page back in will block on the MMAPLOCK until the hole punch finishes. > gfs2 fails that: for some reason, the partially block-mapped pages are > not marked read-only on gfs2, and so page_mkwrite is not called for the > partially block-mapped pages, and the hole is not filled in correctly. > > The attached patch fixes the problem, but this really doesn't look right > as neither ext4 nor xfs require this kind of hack. So what am I > overlooking, how does this work on ext4 and xfs? XFS uses XFS_MMAPLOCK_* to serialise page faults against extent manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses a similar locking mechanism to do the same thing. 
Fundamentally, the page cache does not provide the necessary mechanisms to detect and prevent invalidation races inside EOF. > > Signed-off-by: Andreas Gruenbacher > --- > fs/gfs2/bmap.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c > index 9ef543dd38e2..e677e813be4c 100644 > --- a/fs/gfs2/bmap.c > +++ b/fs/gfs2/bmap.c > @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t > offset, loff_t length) > if (error) > goto out; > } > + /* > + * If the first or last page partially lies in the hole, mark > + * the page read-only so that memory-mapped writes will trigger > + * page_mkwrite. > + */ > + pagecache_isize_extended(inode, offset, inode->i_size); > + pagecache_isize_extended(inode, offset + length, inode->i_size); See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL to serialise against incoming page faults... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: use the iomap writepage path in gfs2
On Mon, Jul 01, 2019 at 11:54:24PM +0200, Christoph Hellwig wrote: > Hi all, > > in this straight from the jetplane edition I present the series to > convert gfs2 to full iomap usage for the ordered and writeback mode, > that is we use iomap_page everywhere and entirely get rid of > buffer_heads in the data path. This has only seen basic testing > which ensured neither 4k or 1k blocksize in ordered mode regressed > vs the xfstests baseline, although that baseline tends to look > pretty bleak. > > The series is to be applied on top of my "lift the xfs writepage code > into iomap v2" series. Ok, this doesn't look too bad from the iomap perspective, though it does raise more questions. :) gfs2 now has two iopaths, right? One that uses bufferheads for journalled data, and the other that uses iomap? That seems like it's only a partial conversion - what needs to be done to iomap and gfs2 to support the journalled data path so there's a single data IO path? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
On Tue, Jun 18, 2019 at 04:47:16PM +0200, Andreas Gruenbacher wrote: > Remove the mark_inode_dirty call from __generic_write_end and add it to > generic_write_end and the high-level iomap functions where necessary. > That way, large writes will only dirty inodes at the end instead of > dirtying them once per page. This fixes a performance bottleneck on > gfs2. > > Signed-off-by: Andreas Gruenbacher > --- > fs/buffer.c | 26 ++ > fs/iomap.c | 42 ++ > 2 files changed, 56 insertions(+), 12 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index 23ef63fd1669..9454568a7f5e 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -881,6 +881,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct > iov_iter *iter, > { > struct inode *inode = iocb->ki_filp->f_mapping->host; > loff_t pos = iocb->ki_pos, ret = 0, written = 0; > + loff_t old_size; > + > +/* > + * No need to use i_size_read() here, the i_size cannot change under us > + * because we hold i_rwsem. > + */ > + old_size = inode->i_size; > > while (iov_iter_count(iter)) { > ret = iomap_apply(inode, pos, iov_iter_count(iter), > @@ -891,6 +898,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct > iov_iter *iter, > written += ret; > } > > + if (old_size != inode->i_size) > + mark_inode_dirty(inode); > + > return written ? written : ret; > } > EXPORT_SYMBOL_GPL(iomap_file_buffered_write); > @@ -961,18 +971,30 @@ int > iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len, > const struct iomap_ops *ops) > { > + loff_t old_size; > loff_t ret; > > +/* > + * No need to use i_size_read() here, the i_size cannot change under us > + * because we hold i_rwsem. > + */ > + old_size = inode->i_size; > + > while (len) { > ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL, > iomap_dirty_actor); > if (ret <= 0) > - return ret; > + goto out; > pos += ret; > len -= ret; > } > + ret = 0; > > - return 0; > +out: > + if (old_size != inode->i_size) > + mark_inode_dirty(inode); I don't think we want to do this. 
The patches I have that add range locking for XFS allow buffered writes to run concurrently with operations that change the inode size as long as the ranges don't overlap. To do this, XFS will not hold the i_rwsem over any iomap call it makes in future - it will hold a range lock instead. Hence we can have writes and other IO operations occurring at the same time some other operation is changing the size of the file, and that means this code no longer does what you are intending it to do because the inode->i_size is no longer constant across these operations... Hence I think adding code that depends on i_rwsem to be held to function correctly is the wrong direction to be taking the iomap infrastructure. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v7 0/5] iomap and gfs2 fixes
On Mon, Apr 29, 2019 at 07:50:28PM -0700, Darrick J. Wong wrote: > On Tue, Apr 30, 2019 at 12:09:29AM +0200, Andreas Gruenbacher wrote: > > Here's another update of this patch queue, hopefully with all wrinkles > > ironed out now. > > > > Darrick, I think Linus would be unhappy seeing the first four patches in > > the gfs2 tree; could you put them into the xfs tree instead like we did > > some time ago already? > > Sure. When I'm done reviewing them I'll put them in the iomap tree, > though, since we now have a separate one. :) I'd just keep the iomap stuff in the xfs tree as a separate set of branches and merge them into the XFS for-next when composing it. That way it still gets plenty of test coverage from all the XFS devs and linux next without anyone having to think about it. You really only need to send separate pull requests for the iomap and XFS branches - IMO, there's no real need to have a complete new tree for it... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] gfs2 iomap dealock, IOMAP_F_UNBALANCED
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote: > Hi Christoph, > > we need your help fixing a gfs2 deadlock involving iomap. What's going > on is the following: > > * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush > lock and keeps it until gfs2_iomap_end. It currently always does that > even though there is no point other than for journaled data writes. > > * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited. > If that ends up calling gfs2_write_inode, gfs2 will try to grab the > log flush lock again and deadlock. > > We can stop gfs2_iomap_begin from keeping the log flush lock held for > non-journaled data writes, but that still leaves us with the deadlock > for journaled data writes: for them, we need to add the data pages to > the running transaction, so dropping the log flush lock wouldn't work. > > I've tried to work around the unwanted balance_dirty_pages_ratelimited > in gfs2_write_inode as originally suggested by Ross. That works to a > certain degree, but only if we always skip inodes in the WB_SYNC_NONE > case, and that means that gfs2_write_inode becomes quite ineffective. > > So it seems that it would be more reasonable to avoid the dirty page > balancing under the log flush lock in the first place. > > The attached patch changes gfs2_iomap_begin to only hold on to the log > flush lock for journaled data writes. In that case, we also make sure > to limit the write size to not overrun dirty limits using a similar > logic as in balance_dirty_pages_ratelimited; there is precedent for that > approach in btrfs_buffered_write. Then, we prevent iomap from balancing > dirty pages via the new IOMAP_F_UNBALANCED flag. 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c > index 58a768e59712e..542bd149c4aa3 100644 > --- a/fs/gfs2/file.c > +++ b/fs/gfs2/file.c > @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, > struct iov_iter *from) > iocb->ki_pos += ret; > } > > + balance_dirty_pages_ratelimited(file->f_mapping); > + > out2: > current->backing_dev_info = NULL; > out: The problem is calling balance_dirty_pages() inside the ->iomap_begin/->iomap_end calls and not that it is called by the iomap infrastructure itself, right? If so, I'd prefer to see this in iomap_apply() after the call to ops->iomap_end because iomap_file_buffered_write() can iterate and call iomap_apply() multiple times. This would keep the balancing to a per-iomap granularity, rather than a per-syscall granularity. i.e. if we do write(2GB), we want more than one balancing call during that syscall, so it would be up to the filesystem to a) limit the size of write mappings to something smaller (e.g. 1024 pages) so that there are still frequent balancing calls for large writes. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 06/11] iomap: Add write_{begin, end} iomap operations
On Mon, May 14, 2018 at 05:36:19PM +0200, Andreas Gruenbacher wrote: > Add write_begin and write_end operations to struct iomap_ops to provide > a way of overriding the default behavior of iomap_write_begin and > iomap_write_end. This is needed for implementing data journaling: in > the data journaling case, pages are written into the journal before > being written back to their proper on-disk locations. I'm not sure adding a per-page filesystem callout abstraction is what we want in the iomap code. The commit message does not explain why gfs2 needs per-page callouts, nor why the per-page work cannot be done/avoided by running code in the ->iomap_begin callout (e.g. by unstuffing the inode so nothing needs to be done per-page). My main concern is that the callouts to the filesystem in iomap are - by design - per IO, not per page. The whole point of using iomap was to get away from doing per-page filesystem operations on multi-page IOs - it's highly inefficient when we only need to call into the filesystem on a per-extent basis, and we simplified the code a *lot* by avoiding such behaviours. And to that point: this change has serious conflicts with the next step for the iomap infrastructure: Christoph's recent "remove bufferheads from iomap" series. Apart from the obvious code conflicts, there's a fairly major architectural/functional conflict. https://marc.info/?l=linux-fsdevel=152585212000411=2 That is, Christoph's patchset pushes us further away from filesystems doing their own per-page processing of page state. It converts the iomap IO path to store it's own per-page state tracking information in page->private, greatly reducing memory overhead (16 bytes on 4k page instead of 104 bytes per bufferhead) and streamlining the code. This, however, essentially makes the buffered IO path through the iomap infrastructure incompatible with existing bufferhead based filesystems. 
Despite this, we can still provide limited support for buffered writes on bufferhead based filesystems through the iomap infrastructure, but I only see that as a mechanism for filesystems moving away from dependencies on bufferheads. It would require a different refactoring of the iomap code - I'd much prefer to keep bufferhead dependent IO paths completely separate (e.g. via a new actor function) to minimise impact and dependencies on the internal bufferhead free iomap IO path. Let's see what Christoph thinks first, though. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] fs: gfs2: Adding new return type vm_fault_t
On Mon, Apr 16, 2018 at 11:20:59PM +0530, Souptick Joarder wrote: > > Hi, > > > > This patch is straightforward enough, but there are a lot of other > > file systems that need similar patches. Shouldn't you do one big > > patch set that fixes several file systems at once and run it through > > Viro's kernel or Linus's kernel or something? > > Adding Viro and linux-fsdevel for more opinions. > > The plan for these patches is to introduce the typedef, initially just > as documentation ("These functions should return a VM_FAULT_ status"). > We'll trickle the patches to individual drivers/filesystems in through > the maintainers, as far as possible. Then we'll change the typedef to > an unsigned int and break the compilation of any unconverted > drivers/filesystems. > > We have already started sending out drivers/filesystems changes > to different maintainers. Yes, we can see that. The response you are getting is "this is not how we do cross-subsystem API changes. Why are you doing it this way?" i.e. the problem being pointed out is that your process has not followed the correct/normal process for proposing, reviewing and merging cross-subsystem API changes. Bob has raised the same questions as both Christoph and Darrick have asked in response to the XFS patch. I only implied these questions by asking about introducing useless typedefs with no context for reviewers... I'd really like to have Darrick's questions answered(*) in a constructive, non-abusive manner - I'll quote it here to get it all in one thread on -fsdevel: | ...hm, the original mm patch wasn't cc'd to fsdevel either, so that's | probably why I never heard of any of this until now. | | So, uh, why wasn't this whole series (all the mm changes and all the | required fs changes) sent out for review prior to the merge window? 
We're not asking for a description of what you are doing - we are asking why the normal processes for proposing and merging such a change is not being followed, and how you plan to rectify that. Cheers, Dave. https://marc.info/?l=linux-xfs=152389824107375=2 -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] gfs2: Fsync parent directories
On Tue, Feb 20, 2018 at 09:53:59PM +0100, Andreas Gruenbacher wrote: > On 20 February 2018 at 20:46, Christoph Hellwig <h...@infradead.org> wrote: > > On Tue, Feb 20, 2018 at 12:22:01AM +0100, Andreas Gruenbacher wrote: > >> When fsyncing a new file, also fsync the directory the file is in, > >> recursively. This is how Linux filesystems should behave nowadays, > >> even if not mandated by POSIX. > > > > I think that is bullshit. Maybe it is what google wants for ext4 > > non-journal mode which no one else uses anyway. but it certainly > > is anything but normal Linux semantics. > > Here's some code from xfstest generic/322: > > _mount_flakey > $XFS_IO_PROG -f -c "pwrite 0 1M" -c "fsync" $SCRATCH_MNT/foo \ > > $seqres.full 2>&1 || _fail "xfs_io failed" > mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > md5sum $SCRATCH_MNT/bar | _filter_scratch > > _flakey_drop_and_remount > > md5sum $SCRATCH_MNT/bar | _filter_scratch > _unmount_flakey > > Note that there is no fsync for the parent directory ($SCRATCH_MNT), > yet the test obviously expects the directory to be synced as well. > This isn't implemented as in this patch on all filesystems, but the > major ones all show this behavior. So where's the bullshit? This test is for filesystems that have strictly ordered metadata journalling. All the filesystems that fstests supports via _require_metadata_journalling() have strictly ordered metadata journalling/crash recovery semantics. (i.e. xfs, ext4, btrfs, and f2fs (IIRC)). IOWs, if the filesystem is designed with strictly ordered metadata, then fsync()ing a new file also implies that all references to the new file are also on stable storage because they happened before the fsync on the file was issued. i.e. the directory is fsync'd implicitly because it was modified by the same operation that created the file. Hence if the file creation is made stable, so must be the directory modification done during file creation. 
This has nothing to do with POSIX or what the "linux standard" is - this is testing whether the implementation of strictly ordered metadata journalling is correct or not. If gfs2 does not have strictly ordered metadata journalling, then it probably shouldn't run these tests. Cheers, Dave. -- Dave Chinner dchin...@redhat.com