Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote:
> On 9/3/23 23:30, Dave Chinner wrote:
> > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> > > On 8/29/23 19:53, Matthew Wilcox wrote:
> > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu
> > > > > > >
> > > > > > > Add a boolean parameter for file_accessed() to support nowait
> > > > > > > semantics. Currently it is true only with io_uring as its
> > > > > > > initial caller.
> > > > > >
> > > > > > So why do we need to do this as part of this series?  Apparently it
> > > > > > hasn't caused any problems for filemap_read().
> > > > >
> > > > > We need this parameter to indicate if nowait semantics should be
> > > > > enforced in touch_atime(). There are locks and maybe IOs in it.
> > > >
> > > > That's not my point.  We currently call file_accessed() and
> > > > touch_atime() for nowait reads and nowait writes.  You haven't done
> > > > anything to fix those.
> > > >
> > > > I suspect you can trim this patchset down significantly by avoiding
> > > > fixing the file_accessed() problem.  And then come back with a later
> > > > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> > >
> > > I'm ok to do that.
> > >
> > > > first that fixes it for the existing nowait users, and then a second
> > > > series to do all the directory stuff.
> > > >
> > > > I'd do the first thing.  Just ignore the problem.  Directory atime
> > > > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > > > everyone uses relatime or nodiratime.
> > >
> > > Hi Matthew,
> > > The previous discussion shows this does cause issues in real
> > > productions:
> > > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> >
> > Then separate it out into its own patch set so we can have a
> > discussion on the merits of requiring noatime, relatime or lazytime
> > for really latency sensitive IO applications. Changing code is not
> > always the right solution...
>
> Separation sounds reasonable, but it can hardly be said that only
> latency sensitive apps would care about >1s nowait/async submission
> delays. Presumably, btrfs can improve on that, but it still looks
> like it's perfectly legit for filesystems to do heavy stuff in
> timestamping, like waiting for IO. Right?

Yes, it is; no-one is denying that. And some filesystems are worse
than others, but none of that means it has to be fixed so getdents
can be converted to NOWAIT semantics. i.e. this patchset is about the
getdents NOWAIT machinery, and fiddling around with timestamps has
much, much wider scope than just the NOWAIT getdents machinery.

We'll have this discussion about NOWAIT timestamp updates when an RFC
is proposed to address the wider problem of how timestamp updates
should behave in NOWAIT context.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
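Stripped to its essentials, the semantics being argued over are: with nowait set, a lock that would block turns into an -EAGAIN return so io_uring can retry from a worker thread that is allowed to sleep. A minimal userspace model of that shape (all names here are illustrative, not the kernel API; a pthread mutex stands in for the inode locking inside touch_atime()):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

struct file_model {
	pthread_mutex_t atime_lock;	/* stands in for inode locking in touch_atime() */
	time_t atime;
};

/*
 * Model of file_accessed() with a nowait flag: trylock instead of
 * lock, reporting -EAGAIN when the atime update would have blocked so
 * the caller can punt the operation to a context that may sleep.
 */
static int file_accessed_model(struct file_model *f, bool nowait)
{
	if (nowait) {
		if (pthread_mutex_trylock(&f->atime_lock) != 0)
			return -EAGAIN;
	} else {
		pthread_mutex_lock(&f->atime_lock);
	}
	f->atime = time(NULL);
	pthread_mutex_unlock(&f->atime_lock);
	return 0;
}
```

The model deliberately ignores the harder part of the thread above: a real touch_atime() may also issue IO, which a trylock alone cannot make non-blocking.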
Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> Implement NOWAIT semantics for readdir. Return EAGAIN error to the
> caller if it would block, like failing to get locks, or going to
> do IO.
>
> Co-developed-by: Dave Chinner

Not really. "Co-developed" implies equal development input between
all the parties, which is not the case here - this patch is based on
a prototype I wrote, whilst you're doing the refining, testing and
correctness work. In these cases with XFS code, we add a line in the
commit message to say:

"This is based on a patch originally written by Dave Chinner."

> Signed-off-by: Dave Chinner
> Signed-off-by: Hao Xu
> [fixes deadlock issue, tweak code style]

With a signoff chain like you already have. In the end you'll also
get a RVB from me, which seems rather wrong to me if I've apparently
been "co-developing" the code.

> @@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
> 	if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
> 		return 0;
> 
> -	error = xfs_dir3_block_read(args->trans, dp, &bp);
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> +		flags |= XFS_DABUF_NOWAIT;
> +	error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
> 	if (error)
> 		return error;
> 

Given we do this same check in both block and leaf formats to set
XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in
xfs_readdir() as well, we should probably do this check once at the
higher level and pass flags down from there with XFS_DABUF_NOWAIT
already set.
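The restructuring suggested here - translate the VFS-level nowait flag into the XFS buffer flag once at the xfs_readdir() level and hand the result down - can be sketched in plain C (the flag values and the helper name are illustrative, not the kernel's):

```c
#include <assert.h>

/* Illustrative flag values; the kernel defines these elsewhere. */
#define DIR_CONTEXT_F_NOWAIT	(1u << 0)	/* VFS level: caller cannot block */
#define XFS_DABUF_NOWAIT	(1u << 4)	/* buffer level: trylock, no blocking IO */

struct dir_context_model {
	unsigned int flags;
};

/*
 * Done once at the top (xfs_readdir() in the patch under review); the
 * block/leaf getdents paths then only receive 'flags' and never need
 * to look at the VFS dir_context again.
 */
static unsigned int readdir_buf_flags(const struct dir_context_model *ctx)
{
	unsigned int flags = 0;

	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		flags |= XFS_DABUF_NOWAIT;
	return flags;
}
```

The payoff is the one Dave names: the low-level directory format code stays VFS-agnostic, and the nowait decision is made exactly once.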
> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
>  STATIC int
>  xfs_dir2_leaf_readbuf(
>  	struct xfs_da_args	*args,
> +	struct dir_context	*ctx,
>  	size_t			bufsize,
>  	xfs_dir2_off_t		*cur_off,
>  	xfs_dablk_t		*ra_blk,
> @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
>  	struct xfs_iext_cursor	icur;
>  	int			ra_want;
>  	int			error = 0;
> -
> -	error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> -	if (error)
> -		goto out;
> +	unsigned int		flags = 0;
> +
> +	if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> +		flags |= XFS_DABUF_NOWAIT;
> +	} else {
> +		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> +		if (error)
> +			goto out;
> +	}

Especially as, in hindsight, this doesn't make a whole lot of sense.
If XFS_DABUF_NOWAIT is set, we keep going until
xfs_ilock_data_map_shared_nowait(), where we call
xfs_need_iread_extents() to see if we need to read the extents in and
abort at that point. So, really, we shouldn't get this far with
nowait semantics if we haven't read the extents in yet - we're
supposed to already have the inode locked here, and so we should have
already checked this condition before we bother locking the inode...

i.e. all we should be doing here is this:

	if (!(flags & XFS_DABUF_NOWAIT)) {
		error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
		if (error)
			goto out;
	}

And then we don't need to pass the VFS dir_context down into low
level XFS functions unnecessarily.

> 
>  	/*
>  	 * Look for mapped directory blocks at or above the current offset.
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
>  		new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
>  		if (new_off > *cur_off)
>  			*cur_off = new_off;
> -		error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
> +		error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, &bp);
>  		if (error)
>  			goto out;
> 
> @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
>  	int			byteoff;	/* offset in current block */
>  	unsigned int		offset = 0;
>  	int			error = 0;	/* error return value */
> +	int			written = 0;
> 
>  	/*
>  	 * If the offset is at or past the largest allowed value,
> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>  			bp = NULL;
>  		}
> 
> -		if (*lock_mode == 0)
> -			*lock_mode = xfs_ilock_data_map_shared(dp);
> -		error = xfs_dir2_leaf_readbuf(args, bufsize, ,
> -				, );
> +		if (*lock_mode == 0) {
> +			*lock_mode =
> +				xfs_ilock_data_map_shared_generic(dp, >
Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()
On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> On 8/29/23 19:53, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu
> > > > >
> > > > > Add a boolean parameter for file_accessed() to support nowait
> > > > > semantics. Currently it is true only with io_uring as its
> > > > > initial caller.
> > > >
> > > > So why do we need to do this as part of this series?  Apparently it
> > > > hasn't caused any problems for filemap_read().
> > >
> > > We need this parameter to indicate if nowait semantics should be
> > > enforced in touch_atime(). There are locks and maybe IOs in it.
> >
> > That's not my point.  We currently call file_accessed() and
> > touch_atime() for nowait reads and nowait writes.  You haven't done
> > anything to fix those.
> >
> > I suspect you can trim this patchset down significantly by avoiding
> > fixing the file_accessed() problem.  And then come back with a later
> > patchset that fixes it for all nowait i/o.  Or do a separate prep series
>
> I'm ok to do that.
>
> > first that fixes it for the existing nowait users, and then a second
> > series to do all the directory stuff.
> >
> > I'd do the first thing.  Just ignore the problem.  Directory atime
> > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > everyone uses relatime or nodiratime.
>
> Hi Matthew,
> The previous discussion shows this does cause issues in real
> productions:
> https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write

Then separate it out into its own patch set so we can have a
discussion on the merits of requiring noatime, relatime or lazytime
for really latency sensitive IO applications. Changing code is not
always the right solution...
-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH RFC v5 00/29] io_uring getdents
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> This series introduces getdents64 to io_uring; the code logic is
> similar to the synchronized version's. It first tries a nowait
> issue, and offloads it to io-wq threads if the first try fails.
>
> Patch 1 and Patch 2 are some preparation.
> Patch 3 supports nowait for the xfs getdents code.
> Patches 4-11 are vfs changes, including adding helpers and trylock
> for locks.
> Patches 12-29 support nowait for the involved xfs journal stuff.
> Note, Patches 24 and 27 are actually two questions and might be
> removed later. An xfs test may come later.

You need to drop all the XFS journal stuff. It's fundamentally broken
as it stands, and we cannot support non-blocking transactional
changes without first putting a massive investment into transaction
and intent chain rollback to allow correctly undoing partially
complete modifications.

Regardless, non-blocking transactions are completely unnecessary for
a non-blocking readdir implementation. readdir should only be
touching atime, and with relatime it should only occur once every 24
hours per inode. If that's a problem, then we have noatime mount
options. Hence I just don't see any point in worrying about having a
timestamp update block occasionally...

I also really don't see why you need to fiddle with xfs buffer cache
semantics - it already has the functionality "nowait" buffer reads
require (i.e. XBF_INCORE|XBF_TRYLOCK).

However, the readahead IO that the xfs readdir code issues cannot use
your defined NOWAIT semantics - it must be able to allocate memory
and issue IO. Readahead already avoids blocking on memory allocation
and blocking on IO via the XBF_READ_AHEAD flag. This sets
__GFP_NORETRY for buffer allocation and REQ_RAHEAD for IO.
Hence readahead only needs the existing XBF_TRYLOCK flag to be set to
be compatible with the required NOWAIT semantics.

As for the NOIO memory allocation restrictions io_uring requires,
those should be enforced at the io_uring layer before calling into
the VFS using memalloc_noio_save/restore. At that point no memory
allocation will trigger IO, and none of the code running under NOWAIT
conditions even needs to be aware that io_uring has a GFP_NOIO
restriction on memory allocation.

Please go back to the simple "do non-blocking buffer IO"
implementation we started with, and don't try to solve every little
blocking problem that might exist in the VFS and filesystems...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
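The memalloc_noio_save()/memalloc_noio_restore() pairing referred to here is a scoped, per-task flag: every allocation made inside the scope implicitly loses the ability to start IO, so none of the callees need to carry a gfp mask around. A userspace model of that scoping behaviour, including nesting (this only illustrates the pattern; the kernel implementation lives in the task struct and the page allocator):

```c
#include <assert.h>

#define PF_MEMALLOC_NOIO 0x1u

/* Stands in for current->flags in the kernel. */
static unsigned int task_flags;

/* Enter a no-IO allocation scope, returning the previous state. */
static unsigned int memalloc_noio_save_model(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOIO;

	task_flags |= PF_MEMALLOC_NOIO;
	return old;
}

/* Leave the scope; nested scopes restore correctly via the saved state. */
static void memalloc_noio_restore_model(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOIO) | old;
}

/* What the allocator would consult: may this allocation start IO? */
static int alloc_may_do_io_model(void)
{
	return !(task_flags & PF_MEMALLOC_NOIO);
}
```

This is why Dave can say the VFS code "doesn't even need to be aware" of the restriction: the constraint travels with the task, not with the call chain.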
Re: [Cluster-devel] [PATCH 25/29] xfs: support nowait for xfs_buf_item_init()
On Fri, Aug 25, 2023 at 09:54:27PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> support nowait for xfs_buf_item_init() and error out -EAGAIN to
> _xfs_trans_bjoin() when it would block.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_buf_item.c         |  9 +++--
>  fs/xfs/xfs_buf_item.h         |  2 +-
>  fs/xfs/xfs_buf_item_recover.c |  2 +-
>  fs/xfs/xfs_trans_buf.c        | 16 +---
>  4 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 023d4e0385dd..b1e63137d65b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -827,7 +827,8 @@ xfs_buf_item_free_format(
>  int
>  xfs_buf_item_init(
>  	struct xfs_buf		*bp,
> -	struct xfs_mount	*mp)
> +	struct xfs_mount	*mp,
> +	bool			nowait)
>  {
>  	struct xfs_buf_log_item	*bip = bp->b_log_item;
>  	int			chunks;
> @@ -847,7 +848,11 @@ xfs_buf_item_init(
>  		return 0;
>  	}
> 
> -	bip = kmem_cache_zalloc(xfs_buf_item_cache, GFP_KERNEL | __GFP_NOFAIL);
> +	bip = kmem_cache_zalloc(xfs_buf_item_cache,
> +			GFP_KERNEL | (nowait ? 0 : __GFP_NOFAIL));
> +	if (!bip)
> +		return -EAGAIN;
> +
>  	xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
>  	bip->bli_buf = bp;

I see filesystem shutdowns.

> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 016371f58f26..a1e4f2e8629a 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -57,13 +57,14 @@ xfs_trans_buf_item_match(
>   * If the buffer does not yet have a buf log item associated with it,
>   * then allocate one for it. Then add the buf item to the transaction.
>   */
> -STATIC void
> +STATIC int
>  _xfs_trans_bjoin(
>  	struct xfs_trans	*tp,
>  	struct xfs_buf		*bp,
>  	int			reset_recur)
>  {
>  	struct xfs_buf_log_item	*bip;
> +	int			ret;
> 
>  	ASSERT(bp->b_transp == NULL);
> 
> @@ -72,7 +73,11 @@ _xfs_trans_bjoin(
>  	 * it doesn't have one yet, then allocate one and initialize it.
>  	 * The checks to see if one is there are in xfs_buf_item_init().
>  	 */
> -	xfs_buf_item_init(bp, tp->t_mountp);
> +	ret = xfs_buf_item_init(bp, tp->t_mountp,
> +			tp->t_flags & XFS_TRANS_NOWAIT);
> +	if (ret < 0)
> +		return ret;
> +
>  	bip = bp->b_log_item;
>  	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>  	ASSERT(!(bip->__bli_format.blf_flags & XFS_BLF_CANCEL));
> @@ -92,6 +97,7 @@ _xfs_trans_bjoin(
>  	xfs_trans_add_item(tp, &bip->bli_item);
>  	bp->b_transp = tp;
> 
> +	return 0;
>  }
> 
>  void
> @@ -309,7 +315,11 @@ xfs_trans_read_buf_map(
>  	}
> 
>  	if (tp) {
> -		_xfs_trans_bjoin(tp, bp, 1);
> +		error = _xfs_trans_bjoin(tp, bp, 1);
> +		if (error) {
> +			xfs_buf_relse(bp);
> +			return error;
> +		}
>  		trace_xfs_trans_read_buf(bp->b_log_item);

So what happens at the callers when we have a dirty transaction and
joining a buffer fails with -EAGAIN? Apart from the fact this may
well propagate -EAGAIN up to userspace, cancelling a dirty
transaction at this point will result in a filesystem shutdown.

Indeed, this can happen in the "simple" timestamp update case that
this "nowait" semantic is being aimed at. We log the inode in the
timestamp update, which dirties the log item and registers a
precommit operation to be run. We commit the transaction, which then
runs xfs_inode_item_precommit(), and that may need to attach the
inode to the inode cluster buffer. This results in:

xfs_inode_item_precommit
  xfs_imap_to_bp
    xfs_trans_read_buf_map
      _xfs_trans_bjoin
        xfs_buf_item_init(XFS_TRANS_NOWAIT)
          kmem_cache_zalloc(GFP_NOFS)
            gets -EAGAIN error
        propagates -EAGAIN
      fails due to -EAGAIN

And now xfs_trans_commit() fails with a dirty transaction and the
filesystem shuts down.

IOWs, XFS_TRANS_NOWAIT as it stands is fundamentally broken. Once we
dirty an item in a transaction, we *cannot* back out of the
transaction.
We *must block* in every place that could fail - locking, memory
allocation and/or IO - until the transaction completes, because we
cannot undo the changes we've already made to the dirty items in the
transaction.

It's even worse than that - once we have committed intents, the whole
chain of intent processing must be run to completion. Hence we can't
tolerate backing out of that deferred processing chain half way
through because we might have to block.

Until we can roll back partial dirty transactions and partially
completed deferred intent chains at any random point of completion,
XFS_TRANS_NOWAIT will not work.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 28/29] xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()
On Fri, Aug 25, 2023 at 09:54:30PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> Apply trylock logic for xc_ctx_lock in xlog_cil_commit() in the
> nowait case and error out -EAGAIN for xlog_cil_commit().

Again, fundamentally broken. Any error from xlog_cil_commit() will
result in a filesystem shutdown, as we cannot back out from failure
with dirty log items gracefully at this point.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 26/29] xfs: return -EAGAIN when nowait meets sync in transaction commit
On Fri, Aug 25, 2023 at 09:54:28PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> if the log transaction is a sync one, let's fail the nowait try and
> return -EAGAIN directly since sync transaction means blocked by IO.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_trans.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 7988b4c7f36e..f1f84a3dd456 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -968,12 +968,24 @@ __xfs_trans_commit(
>  	xfs_csn_t		commit_seq = 0;
>  	int			error = 0;
>  	int			sync = tp->t_flags & XFS_TRANS_SYNC;
> +	bool			nowait = tp->t_flags & XFS_TRANS_NOWAIT;
> +	bool			perm_log = tp->t_flags & XFS_TRANS_PERM_LOG_RES;
> 
>  	trace_xfs_trans_commit(tp, _RET_IP_);
> 
> +	if (nowait && sync) {
> +		/*
> +		 * Currently nowait is only from xfs_vn_update_time()
> +		 * so perm_log is always false here, but let's make
> +		 * the code general.
> +		 */
> +		if (perm_log)
> +			xfs_defer_cancel(tp);
> +		goto out_unreserve;
> +	}

This is fundamentally broken. We cannot abort a transaction commit
with dirty items at this point without shutting down the filesystem.

This points to XFS_TRANS_NOWAIT being completely broken, too, because
we don't call xfs_trans_set_sync() until just before we commit the
transaction. At this point, it is -too late- for nowait+sync to be
handled gracefully, and it will *always* go bad.

IOWs, the whole transaction "nowait" semantics as designed and
implemented is not a workable solution.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 24/29] xfs: support nowait for xfs_buf_read_map()
On Fri, Aug 25, 2023 at 09:54:26PM +0800, Hao Xu wrote:
> From: Hao Xu
>
> This causes xfstests generic/232 to hang in the umount process,
> waiting for ail push, so I comment it out for now; need some hints
> from xfs folks.
> Not a real patch.
>
> Signed-off-by: Hao Xu
> ---
>  fs/xfs/xfs_buf.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index cdad80e1ae25..284962a9f31a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -828,6 +828,13 @@ xfs_buf_read_map(
>  	trace_xfs_buf_read(bp, flags, _RET_IP_);
> 
>  	if (!(bp->b_flags & XBF_DONE)) {
> +//		/*
> +//		 * Let's bypass the _xfs_buf_read() for now
> +//		 */
> +//		if (flags & XBF_NOWAIT) {
> +//			xfs_buf_relse(bp);
> +//			return -EAGAIN;
> +//		}

This is *fundamentally broken*, and apart from anything else it
breaks readahead. If we asked for a read, we cannot instantiate the
buffer and then *not issue any IO on it* and release it. That leaves
an uninitialised buffer in memory, and there's every chance that
something then trips over it and bad things happen.

A buffer like this *must* be errored out and marked stale so that the
next access to it will then re-initialise the buffer state and
trigger any preparatory work that needs to be done for the new
operation.

This comes back to my first comments that XBF_TRYLOCK cannot simply
be replaced with XBF_NOWAIT semantics...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH 02/29] xfs: rename XBF_TRYLOCK to XBF_NOWAIT
> @@ -45,7 +45,7 @@ struct xfs_buf;
> 
>  /* flags used only as arguments to access routines */
>  #define XBF_INCORE	(1u << 29)	/* lookup only, return if found in cache */
> -#define XBF_TRYLOCK	(1u << 30)	/* lock requested, but do not wait */
> +#define XBF_NOWAIT	(1u << 30)	/* mem/lock requested, but do not wait */

That's now a really poor comment. It doesn't describe the semantics
or constraints that NOWAIT might imply.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless
> 		if (unlikely(!shrinker || !shrinker_try_get(shrinker))) {
> +			clear_bit(offset, unit->map);
> +			rcu_read_unlock();
> 			continue;
> 		}
> +		rcu_read_unlock();
> 
> 		/* Call non-slab shrinkers even though kmem is disabled */
> 		if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> 				set_shrinker_bit(memcg, nid, shrinker_id);
> 			}
> 			freed += ret;
> -
> -			if (rwsem_is_contended(&shrinker_rwsem)) {
> -				freed = freed ? : 1;
> -				goto unlock;
> -			}
> +			shrinker_put(shrinker);

Ok, so why is this safe to call without holding the rcu read lock?
The global shrinker has to hold the rcu_read_lock() whilst calling
shrinker_put() to guarantee the validity of the list next pointer,
but we don't hold off RCU here, so what guarantees that a racing
global shrinker walk doesn't trip over this shrinker_put() call
dropping the refcount to zero, with the freeing occurring in a
different context...

> +			/*
> +			 * We have already exited the read-side of rcu critical section
> +			 * before calling do_shrink_slab(), the shrinker_info may be
> +			 * released in expand_one_shrinker_info(), so reacquire the
> +			 * shrinker_info.
> +			 */
> +			index++;
> +			goto again;

With that, what makes the use of shrinker_info in
xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
> 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
> 
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
> 
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.

> +
>  	void *private_data;
> 
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
> 
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
> 
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This needs to be moved to the head of the function, and it needs to
document the whole list walk, get, put and completion parts of the
algorithm that make it safe. There's more to this than "we hold a
reference count", especially the tricky "we might see the shrinker
before it is fully initialised" case.

>  void shrinker_free(struct shrinker *shrinker)
>  {
>  	struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>  	if (!shrinker)
>  		return;
> 
> +	if (shrinker->flags & SHRINKER_REGISTERED) {
> +		shrinker_put(shrinker);
> +		wait_for_completion(&shrinker->done);
> +	}

Needs a comment explaining why we need to wait here...

> +
>  	down_write(&shrinker_rwsem);
>  	if (shrinker->flags & SHRINKER_REGISTERED) {
> -		list_del(&shrinker->list);
> +		/*
> +		 * Lookups on the shrinker are over and will fail in the future,
> +		 * so we can now remove it from the lists and free it.
> +		 */

... rather than here, after the wait has been done and provided the
guarantee that no shrinker is running or will run again...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
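The lifetime rule being reviewed above is: shrinker_try_get() must fail once teardown has begun, and the final shrinker_put() signals a completion so that shrinker_free() knows no list walker still holds the object. A single-threaded userspace model of that refcount protocol, using C11 atomics in place of the kernel's refcount_t and a plain flag in place of the completion (illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct shrinker_model {
	atomic_int refcount;	/* starts at 1: the "registered" reference */
	bool done;		/* stands in for complete(&shrinker->done) */
};

/*
 * inc-not-zero: once the registration reference has been dropped the
 * count is 0 and this fails, so a list walker can never resurrect a
 * shrinker that is being torn down.
 */
static bool shrinker_try_get_model(struct shrinker_model *s)
{
	int old = atomic_load(&s->refcount);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* The final put signals the waiter in shrinker_free(). */
static void shrinker_put_model(struct shrinker_model *s)
{
	if (atomic_fetch_sub(&s->refcount, 1) == 1)
		s->done = true;
}
```

This is the part Dave is asking to have documented in the code: the walker's get/put brackets the only window in which it touches the shrinker, and the free path's put-then-wait guarantees no walker remains before memory is reclaimed.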
Re: [Cluster-devel] [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}
>  		}
>  	}
>  }
>  up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  {
>  	struct shrinker_info *info;
>  	unsigned long ret, freed = 0;
> -	int i;
> +	int offset, index = 0;
> 
>  	if (!mem_cgroup_online(memcg))
>  		return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
>  	if (unlikely(!info))
>  		goto unlock;
> 
> -	for_each_set_bit(i, info->map, info->map_nr_max) {
> -		struct shrink_control sc = {
> -			.gfp_mask = gfp_mask,
> -			.nid = nid,
> -			.memcg = memcg,
> -		};
> -		struct shrinker *shrinker;
> +	for (; index < shrinker_id_to_index(info->map_nr_max); index++) {
> +		struct shrinker_info_unit *unit;

This adds another layer of indent to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm changes
in this patch much more obvious - this huge hunk of diff is pretty
much impossible to review...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in the shrinker
> subsystem, which protects most operations such as slab shrink,
> registration and unregistration of shrinkers, etc. This can easily
> cause problems in the following cases.
>
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab
> shrink is handled in the subsequent patch.
> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c            | 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together...

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
> 
>  #include 
>  #include 
> +#include 
> +#include 
> 
>  #define SHRINKER_UNIT_BITS	BITS_PER_LONG
> 
> @@ -87,6 +89,10 @@ struct shrinker {
>  	int seeks;	/* seeks to recreate an obj */
>  	unsigned flags;
> 
> +	refcount_t refcount;
> +	struct completion done;
> +	struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>  	void *private_data;
> 
>  	/* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
> 
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> +	return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> +	if (refcount_dec_and_test(&shrinker->refcount))
> +		complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
>  						  const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
> 
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, struct mem_cgroup *memcg,
>  	if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>  		return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
> 
> -	if (!down_read_trylock(&shrinker_rwsem))
> -		goto out;
> -
> -	list_for_each_entry(shrinker, &shrinker_list, list) {
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>  		struct shrink_control sc = {
>  			.gfp_mask = gfp_mask,
>  			.nid = nid,
>  			.memcg = memcg,
>  		};
> 
> +		if (!shrinker_try_get(shrinker))
> +			continue;
> +
> +		/*
> +		 * We can safely unlock the RCU lock here since we already
> +		 * hold the refcount of the shrinker.
> +		 */
> +		rcu_read_unlock();
> +
>  		ret = do_shrink_slab(&sc, shrinker, priority);
>  		if (ret == SHRINK_EMPTY)
>  			ret = 0;
>  		freed += ret;
> +
>  		/*
> -		 * Bail out if someone want to register a new shrinker to
> -		 * prevent the registration from being stalled for long periods
> -		 * by parallel ongoing shrinking.
> +		 * This shrinker may be deleted from shrinker_list and freed
> +		 * after the shrinker_put() below, but this shrinker is still
> +		 * used for the next traversal. So it is necessary to hold the
> +		 * RCU lock first to prevent this shrinker from being freed,
> +		 * which also ensures that the next shrinker that is traversed
> +		 * will not be freed (even if it is deleted from shrinker_list
> +		 * at the same time).
>  		 */

This comment really should be at the head of the function, describing
the algorithm used within the function itself, i.e. how reference
counts are used w.r.t. the rcu_read_lock() usage to guarantee
existence of the shrinker and the validity of the list walk.

I'm not going to remember all these little details when I look at
this code in another 6 months time, and having to work it out from
first principles every time I look at the code will waste a lot of
time...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>>  		goto err;
> >>>  	}
> >>> +	zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +	zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +	zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +	zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +	shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this
> >> code all over the place as the new API call would be very similar to the
> >> current shrinker_register() call with static allocation.
> >
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
> 
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6
or 7 parameters with a return value that must be checked and handled.

This was done in the first versions of the patch set - the amount of
code in each caller does not go down and, IMO, it was much harder to
read and determine "this is obviously correct" than what we have now.

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having
to go look up a helper function every time I want to check that some
random shrinker has been set up correctly.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote: > Currently, the shrinker instances can be divided into the following three > types: > > a) global shrinker instance statically defined in the kernel, such as >workingset_shadow_shrinker. > > b) global shrinker instance statically defined in the kernel modules, such >as mmu_shrinker in x86. > > c) shrinker instance embedded in other structures. > > For case a, the memory of shrinker instance is never freed. For case b, > the memory of shrinker instance will be freed after synchronize_rcu() when > the module is unloaded. For case c, the memory of shrinker instance will > be freed along with the structure it is embedded in. > > In preparation for implementing lockless slab shrink, we need to > dynamically allocate those shrinker instances in case c, then the memory > can be dynamically freed alone by calling kfree_rcu(). > > So this commit adds the following new APIs for dynamically allocating > shrinker, and add a private_data field to struct shrinker to record and > get the original embedded structure. > > 1. shrinker_alloc() > > Used to allocate shrinker instance itself and related memory, it will > return a pointer to the shrinker instance on success and NULL on failure. > > 2. shrinker_free_non_registered() > > Used to destroy the non-registered shrinker instance. This is a bit nasty > > 3. shrinker_register() > > Used to register the shrinker instance, which is same as the current > register_shrinker_prepared(). > > 4. shrinker_unregister() rename this "shrinker_free()" and key the two different freeing cases on the SHRINKER_REGISTERED bit rather than mostly duplicating the two. 
void shrinker_free(struct shrinker *shrinker)
{
	struct dentry *debugfs_entry = NULL;
	int debugfs_id;

	if (!shrinker)
		return;

	down_write(&shrinker_rwsem);
	if (shrinker->flags & SHRINKER_REGISTERED) {
		list_del(&shrinker->list);
		debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
	} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
		kfree_const(shrinker->name);
	}

	if (shrinker->flags & SHRINKER_MEMCG_AWARE)
		unregister_memcg_shrinker(shrinker);
	up_write(&shrinker_rwsem);

	if (debugfs_entry)
		shrinker_debugfs_remove(debugfs_entry, debugfs_id);

	kfree(shrinker->nr_deferred);
	kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-- Dave Chinner da...@fromorbit.com
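For reference, a pseudocode sketch of the full lifecycle with that naming - illustrative only, the exact signatures are whatever the patchset settles on, and `my_count`/`my_scan`/`my_ctx` are placeholders:

```c
/*
 * Pseudocode sketch of the proposed dynamic shrinker lifecycle.
 * Not compilable against any particular tree.
 */
struct shrinker *s;

s = shrinker_alloc(0, "my-subsystem");	/* instance + nr_deferred */
if (!s)
	return -ENOMEM;

s->count_objects = my_count;		/* fill in before registering */
s->scan_objects = my_scan;
s->seeks = DEFAULT_SEEKS;
s->private_data = my_ctx;

shrinker_register(s);			/* sets SHRINKER_REGISTERED */

/*
 * Teardown: a single call handles both the registered and the
 * never-registered case, keyed on the SHRINKER_REGISTERED bit.
 */
shrinker_free(s);
```

The asymmetry worth noting: allocation can fail and must be checked, but freeing is unconditional and does the right thing whether or not registration ever happened.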
Re: [Cluster-devel] [PATCH v5 6/8] xfs: switch to multigrain timestamps
On Thu, Jul 13, 2023 at 07:00:55PM -0400, Jeff Layton wrote: > Enable multigrain timestamps, which should ensure that there is an > apparent change to the timestamp whenever it has been written after > being actively observed via getattr. > > Also, anytime the mtime changes, the ctime must also change, and those > are now the only two options for xfs_trans_ichgtime. Have that function > unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not > set. > > Signed-off-by: Jeff Layton > --- > fs/xfs/libxfs/xfs_trans_inode.c | 6 +++--- > fs/xfs/xfs_iops.c | 4 ++-- > fs/xfs/xfs_super.c | 2 +- > 3 files changed, 6 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c > index 0c9df8df6d4a..86f5ffce2d89 100644 > --- a/fs/xfs/libxfs/xfs_trans_inode.c > +++ b/fs/xfs/libxfs/xfs_trans_inode.c > @@ -62,12 +62,12 @@ xfs_trans_ichgtime( > ASSERT(tp); > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); > > - tv = current_time(inode); > + /* If the mtime changes, then ctime must also change */ > + WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG)); Make that an ASSERT(flags & XFS_ICHGTIME_CHG), please. There's no need to verify this at runtime on production kernels. -Dave. -- Dave Chinner da...@fromorbit.com
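The distinction matters because the two checks compile differently. A user-space model - not XFS code; `ASSERT` here mimics XFS's DEBUG-only assert and `warn_on_once()` mimics the always-compiled-in `WARN_ON_ONCE()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int warn_count;

/* Models WARN_ON_ONCE(): the test is compiled into every build and
 * evaluated at runtime; it reports at most once. */
static bool warn_on_once(bool cond)
{
	static bool warned;

	if (cond && !warned) {
		warned = true;
		warn_count++;	/* the kernel would dump a backtrace here */
	}
	return cond;
}

/* Models XFS's ASSERT(): real code only in DEBUG builds; production
 * kernels emit no instructions at all for it. */
#ifdef DEBUG
#define ASSERT(expr)	do { if (!(expr)) abort(); } while (0)
#else
#define ASSERT(expr)	((void)0)
#endif

/* A caller in the style of xfs_trans_ichgtime(): the invariant is a
 * programming error, so it only needs checking on debug kernels. */
static int check_flags(int flags, int required)
{
	ASSERT(flags & required);	   /* free on production builds */
	warn_on_once(!(flags & required)); /* always costs a branch */
	return warn_count;
}
```

Since the "mtime implies ctime" rule can only be broken by a coding bug, the debug-only form catches it during development without taxing every production call.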
Re: [Cluster-devel] [PATCH v7 19/20] fs: iomap: use bio_add_folio_nofail where possible
On Thu, Jun 01, 2023 at 06:20:21AM +0200, Christoph Hellwig wrote: > On Thu, Jun 01, 2023 at 08:36:59AM +1000, Dave Chinner wrote: > > We lose adjacent page merging with this change. > > This is only used for adding the first folio to a brand new bio, > so there is nothing to merge with yet at this point. Ah, sorry, missed that. Carry on. :) -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v7 19/20] fs: iomap: use bio_add_folio_nofail where possible
On Wed, May 31, 2023 at 04:50:42AM -0700, Johannes Thumshirn wrote: > When the iomap buffered-io code can't add a folio to a bio, it allocates a > new bio and adds the folio to that one. This is done using bio_add_folio(), > but doesn't check for errors. > > As adding a folio to a newly created bio can't fail, use the newly > introduced bio_add_folio_nofail() function. > > Reviewed-by: Christoph Hellwig > Reviewed-by: Matthew Wilcox (Oracle) > Signed-off-by: Johannes Thumshirn > --- > fs/iomap/buffered-io.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index 063133ec77f4..0edab9deae2a 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -312,7 +312,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter > *iter, > ctx->bio->bi_opf |= REQ_RAHEAD; > ctx->bio->bi_iter.bi_sector = sector; > ctx->bio->bi_end_io = iomap_read_end_io; > - bio_add_folio(ctx->bio, folio, plen, poff); > + bio_add_folio_nofail(ctx->bio, folio, plen, poff); > } > > done: > @@ -539,7 +539,7 @@ static int iomap_read_folio_sync(loff_t block_start, > struct folio *folio, > > bio_init(&bio, iomap->bdev, &bvec, 1, REQ_OP_READ); > bio.bi_iter.bi_sector = iomap_sector(iomap, block_start); > - bio_add_folio(&bio, folio, plen, poff); > + bio_add_folio_nofail(&bio, folio, plen, poff); > return submit_bio_wait(&bio); > } > > @@ -1582,7 +1582,7 @@ iomap_add_to_ioend(struct inode *inode, loff_t pos, > struct folio *folio, > > if (!bio_add_folio(wpc->ioend->io_bio, folio, len, poff)) { > wpc->ioend->io_bio = iomap_chain_bio(wpc->ioend->io_bio); > - bio_add_folio(wpc->ioend->io_bio, folio, len, poff); > + bio_add_folio_nofail(wpc->ioend->io_bio, folio, len, poff); > } We lose adjacent page merging with this change. We've had performance regressions in the past that have been attributed to either the page allocator not handing out sequential adjacent pages for sequential writes and/or bios not merging adjacent pages.
Some hardware is much more performant when it only has to do a single large DMA instead of (potentially) hundreds of single page DMAs for the same amount of data... What performance regression testing has been done on this change? -Dave. -- Dave Chinner da...@fromorbit.com
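The merging being lost can be shown with a few lines of user-space C - a simplified model, not the block layer: a "page" here is just an address range, and a "segment" stands in for one DMA descriptor:

```c
#include <assert.h>
#include <stddef.h>

struct seg {
	unsigned long addr;
	unsigned long len;
};

/* Append a page to the segment list, merging with the tail when the
 * new page starts exactly where the previous segment ends - the
 * behaviour bio_add_page()/bio_add_folio() provide via their
 * contiguity check. Returns the new segment count. */
static size_t add_page(struct seg *segs, size_t nsegs,
		       unsigned long addr, unsigned long len)
{
	if (nsegs && segs[nsegs - 1].addr + segs[nsegs - 1].len == addr) {
		segs[nsegs - 1].len += len;	/* merged: one DMA descriptor */
		return nsegs;
	}
	segs[nsegs].addr = addr;		/* discontiguous: new segment */
	segs[nsegs].len = len;
	return nsegs + 1;
}
```

When the allocator hands out physically sequential pages, a sequential write collapses into a handful of large segments; an add path that skips the merge check would instead emit one segment per page, which is exactly the regression pattern the question is probing for.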
Re: [Cluster-devel] [PATCH 16/17] block: use iomap for writes to block devices
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote: > On 4/24/23 07:49, Christoph Hellwig wrote: > > Use iomap in buffer_head compat mode to write to block devices. > > > > Signed-off-by: Christoph Hellwig > > --- > > block/Kconfig | 1 + > > block/fops.c | 33 + > > 2 files changed, 30 insertions(+), 4 deletions(-) > > > > diff --git a/block/Kconfig b/block/Kconfig > > index 941b2dca70db73..672b08f0096ab4 100644 > > --- a/block/Kconfig > > +++ b/block/Kconfig > > @@ -5,6 +5,7 @@ > > menuconfig BLOCK > > bool "Enable the block layer" if EXPERT > > default y > > + select IOMAP > > select SBITMAP > > help > > Provide block layer support for the kernel. > > diff --git a/block/fops.c b/block/fops.c > > index 318247832a7bcf..7910636f8df33b 100644 > > --- a/block/fops.c > > +++ b/block/fops.c > > @@ -15,6 +15,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include "blk.h" > > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, > > struct iov_iter *iter) > > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages)); > > } > > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t > > length, > > + unsigned int flags, struct iomap *iomap, struct iomap *srcmap) > > +{ > > + struct block_device *bdev = I_BDEV(inode); > > + loff_t isize = i_size_read(inode); > > + > > + iomap->bdev = bdev; > > + iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > > + if (WARN_ON_ONCE(iomap->offset >= isize)) > > + return -EIO; > > I'm hitting this during booting: > [5.016324] > [5.030256] iomap_iter+0x11a/0x350 > [5.030264] iomap_readahead+0x1eb/0x2c0 > [5.030272] read_pages+0x5d/0x220 > [5.030279] page_cache_ra_unbounded+0x131/0x180 > [5.030284] filemap_get_pages+0xff/0x5a0 Why is filemap_get_pages() using unbounded readahead? Surely readahead should be limited to reading within EOF > [5.030292] filemap_read+0xca/0x320 > [5.030296] ? aa_file_perm+0x126/0x500 > [5.040216] ? 
touch_atime+0xc8/0x150 > [5.040224] blkdev_read_iter+0xb0/0x150 > [5.040228] vfs_read+0x226/0x2d0 > [5.040234] ksys_read+0xa5/0xe0 > [5.040238] do_syscall_64+0x5b/0x80 > > Maybe we should consider this patch: > > diff --git a/block/fops.c b/block/fops.c > index 524b8a828aad..d202fb663f25 100644 > --- a/block/fops.c > +++ b/block/fops.c > @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode, > loff_t offset, loff_t length, > > iomap->bdev = bdev; > iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev)); > - if (WARN_ON_ONCE(iomap->offset >= isize)) > - return -EIO; > - iomap->type = IOMAP_MAPPED; > - iomap->addr = iomap->offset; > + if (WARN_ON_ONCE(iomap->offset >= isize)) { > + iomap->type = IOMAP_HOLE; > + iomap->addr = IOMAP_NULL_ADDR; > + } else { > + iomap->type = IOMAP_MAPPED; > + iomap->addr = iomap->offset; > + } I think Christoph's code is correct. IMO, any attempt to read beyond the end of the device should throw out a warning and return an error, not silently return zeros. If readahead is trying to read beyond the end of the device, then it really seems to me like the problem here is readahead, not the iomap code detecting the OOB read request. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 00/23] fs-verity support for XFS
On Tue, Apr 11, 2023 at 07:33:19PM -0700, Eric Biggers wrote: > On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote: > > Dave is going to hate me for this, but.. > > > > I've been looking over some of the interfaces here, and I'm starting > > to very seriously questioning the design decisions of storing the > > fsverity hashes in xattrs. > > > > Yes, storing them beyond i_size in the file is a bit of a hack, but > > it allows to reuse a lot of the existing infrastructure, and much > > of fsverity is based around it. So storing them in an xattrs causes > > a lot of churn in the interface. And the XFS side with special > > casing xattr indices also seems not exactly nice. > > It seems it's really just the Merkle tree caching interface that is causing > problems, as it's currently too closely tied to the page cache? That is just > an > implementation detail that could be reworked along the lines of what is being > discussed. > > But anyway, it is up to the XFS folks. Keep in mind there is also the option > of > doing what btrfs is doing, where it stores the Merkle tree separately from the > file data stream, but caches it past i_size in the page cache at runtime. Right. It's not entirely simple to store metadata on disk beyond EOF in XFS because of all the assumptions throughout the IO path and allocator interfaces that it can allocate space beyond EOF at will and something else will clean it up later if it is not needed. This impacts on truncate, delayed allocation, writeback, IO completion, EOF block removal on file close, background garbage collection, ENOSPC/EDQUOT driven space freeing, etc. Some of these things cross over into iomap infrastructure, too. AFAIC, it's far more intricate, complex and risky to try to store merkle tree data beyond EOF than it is to put it in an xattr namespace because IO path EOF handling bugs result in user data corruption. 
This happens over and over again, no matter how careful we are about these aspects of user data handling. OTOH, putting the merkle tree data in a different namespace avoids these issues completely. Yes, we now have to solve an API mismatch, but we aren't risking the addition of IO path data corruption bugs to every non-fsverity filesystem in production... Hence I think copying the btrfs approach (i.e. only caching the merkle tree data in the page cache beyond EOF) would be as far as I think we'd want to go. Realistically, there would be little practical difference between btrfs storing the merkle tree blocks in a separate internal btree and XFS storing them in an internal private xattr btree namespace. I would, however, prefer not to have to do this at all if we could simply map the blocks directly out of the xattr buffers as we already do internally for all the XFS code... > I guess there is also the issue of encryption, which hasn't come up yet since > we're talking about fsverity support only. The Merkle tree (including the > fsverity_descriptor) is supposed to be encrypted, just like the file contents > are. Having it be stored after the file contents accomplishes that easily... > Of course, it doesn't have to be that way; a separate key could be derived, or > the Merkle tree blocks could be encrypted with the file contents key using > indices past i_size, without them physically being stored in the data stream. I'm expecting that fscrypt for XFS will include encryption of the xattr names and values (just like we will need to do for directory names) except for the xattrs that hold the encryption keys themselves. That means the merkle tree blocks should get encrypted without any extra work needing to be done anywhere. This will simply require the fscrypt keys to be held in a private internal xattr namespace that isn't encrypted, but that's relatively trivial to do... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 10:54:06PM +, Eric Biggers wrote: > On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote: > > > We could certainly think about moving to a design where fs/verity/ asks > > > the > > > filesystem to just *read* a Merkle tree block, without adding it to a > > > cache, and > > > then fs/verity/ implements the caching itself. That would require some > > > large > > > changes to each filesystem, though, unless we were to double-cache the > > > Merkle > > > tree blocks which would be inefficient. > > > > No, that's unnecessary. > > > > All we need is for fsverity to require filesystems to pass it byte > > addressable data buffers that are externally reference counted. The > > filesystem can take a page reference before mapping the page and > > passing the kaddr to fsverity, then unmap and drop the reference > > when the merkle tree walk is done as per Andrey's new drop callout. > > > > fsverity doesn't need to care what the buffer is made from, how it > > is cached, what its life cycle is, etc. The caching mechanism and > > reference counting is entirely controlled by the filesystem callout > > implementations, and fsverity only needs to deal with memory buffers > > that are guaranteed to live for the entire walk of the merkle > > tree > > Sure. Just a couple notes: > > First, fs/verity/ does still need to be able to tell whether the buffer is > newly > instantiated or not. Boolean flag from the caller. > Second, fs/verity/ uses the ahash API to do the hashing. ahash is a > scatterlist-based API. Virtual addresses can still be used (see > sg_set_buf()), > but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to > work. > Does XFS use vmalloc'ed memory for these buffers? Not vmalloc'ed, but vmapped. We allocate the pages individually, but then call vm_map_ram() to present the higher level code with a single contiguous memory range if it is a multi-page buffer.
We do have the backing info held in the buffer, and that's what we use for IO. If fsverity needs a page based scatter/gather list for hardware offload, it could ask the filesystem to provide it for that given buffer... > BTW, converting fs/verity/ from ahash to shash is an option; I've really never > been a fan of the scatterlist-based crypto APIs! The disadvantage of doing > this, though, would be that it would remove support for all the hardware > crypto > drivers. > > That *might* actually be okay, as that approach to crypto acceleration > has mostly fallen out of favor, in favor of CPU-based acceleration. But I do > worry about e.g. someone coming out of the woodwork and saying they need to > use > fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, > and > they MUST use their crypto accelerator to get acceptable performance. True, but we are very unlikely to be using XFS on such small systems and I don't think we really care about XFS performance on android sized systems, either. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
time the cache is accessed. The problem with what you have now is that > > every > > time a single 32-byte hash is accessed, a full page (potentially 64KB!) > > will be > > allocated and filled. That's not very efficient. The need to allocate the > > temporary page can also cause ENOMEM (which will get reported as EIO). > > > > Did you consider alternatives that would work more efficiently? I think it > > would be worth designing something that works properly with how XFS is > > planned > > to cache the Merkle tree, instead of designing a workaround. > > ->read_merkle_tree_page was not really designed for what you are doing here. > > > > How about replacing ->read_merkle_tree_page with a function that takes in a > > Merkle tree block index (not a page index!) and hands back a (page, offset) > > pair > > that identifies where the Merkle tree block's data is located? Or (folio, > > offset), I suppose. {kaddr, len}, please. > > > > With that, would it be possible to directly return the XFS cache? > > > > - Eric > > > > Yeah, I also don't like it, I didn't want to change fs-verity much > so went with this workaround. But as it's ok, I will look into trying > to pass xfs buffers to fs-verity without direct use of > ->read_merkle_tree_page(). I think it's possible with (folio, > offset), the xfs buffers aren't xattr value align so the 4k merkle > tree block is stored in two pages. I don't think this is necessary to actually merge the code. We want it to work correctly as the primary concern, performance is a secondary concern. Regardless, as you mention, the xfs_buf is not made up of contiguous pages so the merkle tree block data will be split across two (or more) pages. AFAICT, the fsverity code doesn't work with data structures that span multiple disjoint pages... Another problem is that the xfs-buf might be backed by heap memory (e.g. 4kB fs block size on 64kB PAGE_SIZE) and so it cannot be treated like a page cache page by the fsverity merkle tree code. 
We most definitely do not want to be passing pages containing heap memory to functions expecting to be passed lru-resident page cache pages. That said, xfs-bufs do have a stable method of addressing the data in the buffers, and all the XFS code uses this to access and manipulate data directly in the buffers. That is, xfs_buf_offset() returns a mapped kaddr that points to the contiguous memory region containing the metadata in the buffer. If the xfs_buf spans multiple pages, it will return a kaddr pointing into the contiguous vmapped memory address that maps the entire buffer data range. If it is heap memory, it simply returns a pointer into that heap memory. If it's a single page, then it returns the kaddr for the data within the page. If you move all the assumptions about how the merkle tree data is managed out of fsverity and require the filesystems to do the mapping to kaddrs and reference counting to guarantee life times, then the need for multiple different methods for reading merkle tree data goes away... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
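To make that suggestion concrete, here is a pseudocode sketch of what a {kaddr, len} based callout pair could look like. None of these fsverity hooks exist today, and `xfs_attr_get_merkle_block()` is a hypothetical helper; only `xfs_buf_offset()`, `xfs_buf_rele()` and `BBTOB()` are real XFS interfaces:

```c
/*
 * Hypothetical byte-buffer interface: fsverity is handed a mapped
 * address and a length, plus an opaque cookie so the filesystem can
 * drop its reference when the merkle tree walk is done.
 */
struct fsverity_blockbuf {
	void		*kaddr;		/* mapped merkle tree block */
	unsigned int	len;		/* merkle tree block size */
	void		*fs_private;	/* cookie for the drop callout */
};

static int
xfs_read_merkle_tree_block(
	struct inode		*inode,
	u64			blkno,
	struct fsverity_blockbuf *buf)
{
	struct xfs_buf		*bp;
	int			error;

	/* Hypothetical: look up (and read, if not cached) the xattr
	 * buffer holding this merkle tree block, taking a reference
	 * on the xfs_buf. */
	error = xfs_attr_get_merkle_block(inode, blkno, &bp);
	if (error)
		return error;

	/*
	 * xfs_buf_offset() returns a kaddr into the contiguous buffer
	 * memory regardless of whether the buffer is backed by a
	 * single page, vmapped pages or heap memory - no copying.
	 */
	buf->kaddr = xfs_buf_offset(bp, 0);
	buf->len = BBTOB(bp->b_length);
	buf->fs_private = bp;
	return 0;
}

static void
xfs_drop_merkle_tree_block(
	struct fsverity_blockbuf *buf)
{
	xfs_buf_rele(buf->fs_private);	/* reference from the read callout */
}
```

With this shape, fsverity never sees pages at all - the page cache, xfs_buf cache and heap-backed cases all collapse into the one interface.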
Re: [Cluster-devel] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
On Wed, Apr 05, 2023 at 06:16:00PM +, Eric Biggers wrote: > On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote: > > > The merkle tree pages are dropped after verification. When page is > > > dropped xfs_buf is marked as verified. If fs-verity wants to > > > verify again it will get the same verified buffer. If buffer is > > > evicted it won't have verified state. > > > > > > So, with enough memory pressure buffers will be dropped and need to > > > be reverified. > > > > Please excuse me if this was discussed and rejected long ago, but > > perhaps fsverity should try to hang on to the merkle tree pages that > > this function returns for as long as possible until reclaim comes for > > them? > > > > With the merkle tree page lifetimes extended, you then don't need to > > attach the xfs_buf to page->private, nor does xfs have to extend the > > buffer cache to stash XBF_VERITY_CHECKED. > > Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs) > just cache the Merkle tree pages in the inode's page cache. It's an approach > that I know some people aren't a fan of, but it's efficient and it works. Which puts pages beyond EOF in the page cache. Given that XFS also allows persistent block allocation beyond EOF, having both data in the page cache and blocks beyond EOF that contain unrelated information is a Real Bad Idea. Just because putting metadata in the file data address space works for one filesystem, it doesn't mean it's a good idea or that it works for every filesystem. > We could certainly think about moving to a design where fs/verity/ asks the > filesystem to just *read* a Merkle tree block, without adding it to a cache, > and > then fs/verity/ implements the caching itself. That would require some large > changes to each filesystem, though, unless we were to double-cache the Merkle > tree blocks which would be inefficient. No, that's unnecessary.
All we need is for fsverity to require filesystems to pass it byte addressable data buffers that are externally reference counted. The filesystem can take a page reference before mapping the page and passing the kaddr to fsverity, then unmap and drop the reference when the merkle tree walk is done as per Andrey's new drop callout. fsverity doesn't need to care what the buffer is made from, how it is cached, what its life cycle is, etc. The caching mechanism and reference counting is entirely controlled by the filesystem callout implementations, and fsverity only needs to deal with memory buffers that are guaranteed to live for the entire walk of the merkle tree. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 06:02:47PM +, Eric Biggers wrote: > And I really hope that you don't want to do DIO to the *Merkle tree*, as that Not for XFS - the merkle tree is not held as file data. That said, the merkle tree in XFS is not page cache or block aligned metadata either, so we really want the same memory buffer based interface for the merkle tree reading so that the merkle tree code can read directly from the xfs-buf rather than requiring us to copy it out of the xfsbuf into temporary pages... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files
On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote: > On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote: > > On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote: > > > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote: > > > > The direct path is not supported on verity files. Attempts to use direct > > > > I/O path on such files should fall back to buffered I/O path. > > > > > > > > Signed-off-by: Andrey Albershteyn > > > > --- > > > > fs/xfs/xfs_file.c | 14 +++--- > > > > 1 file changed, 11 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > > > > index 947b5c436172..9e072e82f6c1 100644 > > > > --- a/fs/xfs/xfs_file.c > > > > +++ b/fs/xfs/xfs_file.c > > > > @@ -244,7 +244,8 @@ xfs_file_dax_read( > > > > struct kiocb*iocb, > > > > struct iov_iter *to) > > > > { > > > > - struct xfs_inode*ip = > > > > XFS_I(iocb->ki_filp->f_mapping->host); > > > > + struct inode*inode = iocb->ki_filp->f_mapping->host; > > > > + struct xfs_inode*ip = XFS_I(inode); > > > > ssize_t ret = 0; > > > > > > > > trace_xfs_file_dax_read(iocb, to); > > > > @@ -297,10 +298,17 @@ xfs_file_read_iter( > > > > > > > > if (IS_DAX(inode)) > > > > ret = xfs_file_dax_read(iocb, to); > > > > - else if (iocb->ki_flags & IOCB_DIRECT) > > > > + else if (iocb->ki_flags & IOCB_DIRECT && > > > > !fsverity_active(inode)) > > > > ret = xfs_file_dio_read(iocb, to); > > > > - else > > > > + else { > > > > + /* > > > > +* In case fs-verity is enabled, we also fallback to the > > > > +* buffered read from the direct read path. Therefore, > > > > +* IOCB_DIRECT is set and need to be cleared > > > > +*/ > > > > + iocb->ki_flags &= ~IOCB_DIRECT; > > > > ret = xfs_file_buffered_read(iocb, to); > > > > > > XFS doesn't usually allow directio fallback to the pagecache. Why > > > would fsverity be any different? > > > > Didn't know that, this is what happens on ext4 so I did the same. 
> > Then it probably make sense to just error on DIRECT on verity > > sealed file. > > Thinking about this a little more -- I suppose we shouldn't just go > breaking directio reads from a verity file if we can help it. Is there > a way to ask fsverity to perform its validation against some arbitrary > memory buffer that happens to be fs-block aligned? The memory buffer doesn't even need to be fs-block aligned - it just needs to be a pointer to memory the kernel can read... We also need fsverity to be able to handle being passed mapped kernel memory rather than pages/folios for the merkle tree interfaces. That way we can just pass it the mapped buffer memory straight from the xfs-buf and we don't have to do the whacky "copy from xattr xfs_bufs into pages so fsverity can take temporary reference counts on what it thinks are page cache pages" as it walks the merkle tree. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag
On Tue, Apr 04, 2023 at 03:41:23PM -0700, Eric Biggers wrote: > Hi Andrey, > > On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote: > > Add flag to mark inodes which have fs-verity enabled on them (i.e. > > descriptor exist and tree is built). > > > > Signed-off-by: Andrey Albershteyn > > --- > > fs/ioctl.c | 4 > > fs/xfs/libxfs/xfs_format.h | 4 +++- > > fs/xfs/xfs_inode.c | 2 ++ > > fs/xfs/xfs_iops.c | 2 ++ > > include/uapi/linux/fs.h| 1 + > > 5 files changed, 12 insertions(+), 1 deletion(-) > [...] > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > > index b7b56871029c..5172a2eb902c 100644 > > --- a/include/uapi/linux/fs.h > > +++ b/include/uapi/linux/fs.h > > @@ -140,6 +140,7 @@ struct fsxattr { > > #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream > > allocator */ > > #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */ > > #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size > > allocator hint */ > > +#define FS_XFLAG_VERITY 0x00020000 /* fs-verity sealed > > inode */ > > #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */ > > > > I don't think "xfs: add inode on-disk VERITY flag" is an accurate description > of > a patch that involves adding something to the UAPI. Well it does that, but it also adds the UAPI for querying the on-disk flag via the FS_IOC_FSGETXATTR interface as well. It probably should be split up into two patches. > Should the other filesystems support this new flag too? I think they should get it automatically now that it has been defined for FS_IOC_FSGETXATTR and added to the generic fileattr flag fill functions in fs/ioctl.c. > I'd also like all ways of getting the verity flag to continue to be mentioned > in > Documentation/filesystems/fsverity.rst. The existing methods (FS_IOC_GETFLAGS > and statx) are already mentioned there. *nod* -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2 06/23] fsverity: add drop_page() callout
On Tue, Apr 04, 2023 at 04:53:02PM +0200, Andrey Albershteyn wrote: > Allow filesystem to make additional processing on verified pages > instead of just dropping a reference. This will be used by XFS for > internal buffer cache manipulation in further patches. The btrfs, > ext4, and f2fs just drop the reference. > > Signed-off-by: Andrey Albershteyn > --- > fs/btrfs/verity.c | 12 > fs/ext4/verity.c | 6 ++ > fs/f2fs/verity.c | 6 ++ > fs/verity/read_metadata.c | 4 ++-- > fs/verity/verify.c| 6 +++--- > include/linux/fsverity.h | 10 ++ > 6 files changed, 39 insertions(+), 5 deletions(-) > > diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c > index c5ff16f9e9fa..4c2c09204bb4 100644 > --- a/fs/btrfs/verity.c > +++ b/fs/btrfs/verity.c > @@ -804,10 +804,22 @@ static int btrfs_write_merkle_tree_block(struct inode > *inode, const void *buf, > pos, buf, size); > } > > +/* > + * fsverity op that releases the reference obtained by > ->read_merkle_tree_page() > + * > + * @page: reference to the page which can be released > + * > + */ > +static void btrfs_drop_page(struct page *page) > +{ > + put_page(page); > +} > + > const struct fsverity_operations btrfs_verityops = { > .begin_enable_verity = btrfs_begin_enable_verity, > .end_enable_verity = btrfs_end_enable_verity, > .get_verity_descriptor = btrfs_get_verity_descriptor, > .read_merkle_tree_page = btrfs_read_merkle_tree_page, > .write_merkle_tree_block = btrfs_write_merkle_tree_block, > + .drop_page = &btrfs_drop_page, > }; Ok, that's a generic put_page() call.
> diff --git a/fs/verity/verify.c b/fs/verity/verify.c > index f50e3b5b52c9..c2fc4c86af34 100644 > --- a/fs/verity/verify.c > +++ b/fs/verity/verify.c > @@ -210,7 +210,7 @@ verify_data_block(struct inode *inode, struct > fsverity_info *vi, > if (is_hash_block_verified(vi, hpage, hblock_idx)) { > memcpy_from_page(_want_hash, hpage, hoffset, hsize); > want_hash = _want_hash; > - put_page(hpage); > + inode->i_sb->s_vop->drop_page(hpage); > goto descend; fsverity_drop_page(hpage); static inline void fsverity_drop_page(struct inode *inode, struct page *page) { if (inode->i_sb->s_vop->drop_page) inode->i_sb->s_vop->drop_page(page); else put_page(page); } And then you don't need to add the functions to each of the filesystems nor make an indirect call just to run put_page(). Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Sun, Jan 15, 2023 at 09:29:58AM -0800, Darrick J. Wong wrote: > 2. Do we need to revalidate mappings for directio writes? I think the > answer is no (for xfs) because the ->iomap_begin call will allocate > whatever blocks are needed and truncate/punch/reflink block on the > iolock while the directio writes are pending, so you'll never end up > with a stale mapping. But I don't know if that statement applies > generally... The issue is not truncate/punch/reflink for either DIO or buffered IO - the issue that leads to stale iomaps is async extent state. i.e. IO completion doing unwritten extent conversion. For DIO, AIO doesn't hold the IOLOCK at all when completion is run (like buffered writeback), but non-AIO DIO writes hold the IOLOCK shared while waiting for completion. This means that we can have DIO submission and completion still running concurrently, and so stale iomaps are a definite possibility. From my notes when I looked at this: 1. the race condition for a DIO write mapping to go stale is an overlapping DIO completion and converting the block from unwritten to written, and then the dio write incorrectly issuing sub-block zeroing because the mapping is now stale. 2. DIO read into a hole or unwritten extent zeroes the entire range in the user buffer in one operation. If this is a large range, this could race with small DIO writes within that range that have completed. 3. There is a window between dio write completion doing unwritten extent conversion (by ->end_io) and the page cache being invalidated, providing a window where buffered read maps can be stale and incorrect read behaviour exposed to userspace before the page cache is invalidated. These all stem from IO having overlapping ranges, which is largely unsupported but can't be entirely prevented (e.g. backup applications running in the background). Largely the problems are confined to sub-block IOs. i.e.
when sub-block DIO writes to the same block are being performed, we have the possibility that one write completes whilst the other is deciding what to zero, unaware that the range is now MAPPED rather than UNWRITTEN. We currently avoid issues with sub-block dio writes by using IOMAP_DIO_OVERWRITE_ONLY with shared locking. This ensures that the unaligned IO fits entirely within a MAPPED extent so no sub-block zeroing is required. If allocation or sub-block zeroing is required, then we force the filesystem to fall back to exclusive IO locking and wait for all concurrent DIO in flight to complete so that it can't race with any other DIO write that might cause the map to become stale while we are doing the zeroing. This does not avoid potential issues with DIO write vs buffered read, nor DIO write vs mmap IO. It's not totally clear to me whether we need ->iomap_valid checks in the buffered read paths to avoid the completion races with DIO writes, but there are windows there where cached iomaps could be considered stale. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
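The locking decision described above can be sketched as a small predicate. This is a userspace model of the idea, not the real XFS/iomap code; the names and exact conditions are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the IOMAP_DIO_OVERWRITE_ONLY idea: a sub-block (unaligned)
 * DIO write may proceed under shared locking only if it is a pure
 * overwrite of an already-MAPPED extent, so no allocation and no
 * sub-block zeroing is needed. Otherwise the filesystem must fall back
 * to exclusive locking and drain in-flight DIO. Illustrative only. */

enum extent_state { HOLE, UNWRITTEN, MAPPED };

struct extent {
	long long start, len;
	enum extent_state state;
};

static bool dio_can_use_shared_lock(const struct extent *ext,
				    long long pos, long long len,
				    int blocksize)
{
	bool unaligned = (pos % blocksize) || (len % blocksize);

	if (!unaligned)
		return true;	/* block-aligned: no sub-block zeroing races */

	/* Unaligned: safe only if entirely within a MAPPED extent. */
	return ext->state == MAPPED &&
	       pos >= ext->start &&
	       pos + len <= ext->start + ext->len;
}
```

An unaligned write over an UNWRITTEN extent fails the check and takes the slow exclusive path, which is what prevents the "one write completes whilst the other is deciding what to zero" race above.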
Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
On Wed, Jan 11, 2023 at 07:36:26PM +, Matthew Wilcox wrote: > On Tue, Jan 10, 2023 at 07:24:27AM -0800, Christoph Hellwig wrote: > > On Tue, Jan 10, 2023 at 01:34:16PM +, Matthew Wilcox wrote: > > > > Exactly. And as I already pointed out in reply to Dave's original > > > > patch what we really should be doing is returning an ERR_PTR from > > > > __filemap_get_folio instead of reverse-engineering the expected > > > > error code. > > > > > > Ouch, we have a nasty problem. > > > > > > If somebody passes FGP_ENTRY, we can return a shadow entry. And the > > > encodings for shadow entries overlap with the encodings for ERR_PTR, > > > meaning that some shadow entries will look like errors. The way I > > > solved this in the XArray code is by shifting the error values by > > > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example). > > > > > > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS, > > > but so far we haven't, and I'd like to make that decision intentionally. > > > > So what would be an alternative way to tell the callers why no folio > > was found instead of trying to reverse engineer that? Return an errno > > and the folio by reference? The would work, but the calling conventions > > would be awful. > > Agreed. How about an xa_filemap_get_folio()? > > (there are a number of things to fix here; haven't decided if XA_ERROR > should return void *, or whether i should use a separate 'entry' and > 'folio' until I know the entry is actually a folio ...) That's awful. Exposing internal implementation details in the API that is supposed to abstract away the internal implementation details from users doesn't seem like a great idea to me. Exactly what are we trying to fix here? Do we really need to punch a hole through the abstraction layers like this just to remove half a dozen lines of -slow path- context specific error handling from a single caller? 
If there's half a dozen cases that need this sort of handling, then maybe it's the right thing to do. But for a single calling context that only needs to add a null return check in one specific case? There's absolutely no need to make generic infrastructure violate layering abstractions to handle that... -Dave. -- Dave Chinner da...@fromorbit.com
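The encoding clash Willy describes earlier in this thread (ERR_PTR values vs FGP_ENTRY shadow entries) can be demonstrated in a few lines. This is a userspace model mirroring the shape of the kernel's encodings; the real definitions live in `<linux/err.h>` and `<linux/xarray.h>`:

```c
#include <assert.h>
#include <limits.h>

/* Userspace model of the two overlapping encodings. */

#define MAX_ERRNO	4095UL

static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* XArray value entries (used for shadow entries): (value << 1) | 1. */
static inline void *xa_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static inline int xa_is_value(const void *entry)
{
	return (unsigned long)entry & 1;
}
```

An odd errno such as `ERR_PTR(-1)` has its low bit set and therefore tests true as a value entry, while a value entry carrying a large payload lands in the top 4095 addresses and tests true as an error pointer. That two-way ambiguity is why Willy's XArray code shifts errors by two bits (`XA_ERROR()`), and why naively returning ERR_PTR from `__filemap_get_folio()` with FGP_ENTRY in play is a problem.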
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Mon, Jan 09, 2023 at 07:45:27PM +0100, Andreas Gruenbacher wrote: > On Sun, Jan 8, 2023 at 10:59 PM Dave Chinner wrote: > > On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote: > > > Eliminate the ->iomap_valid() handler by switching to a ->get_folio() > > > handler and validating the mapping there. > > > > > > Signed-off-by: Andreas Gruenbacher > > > > I think this is wrong. > > > > The ->iomap_valid() function handles a fundamental architectural > > issue with cached iomaps: the iomap can become stale at any time > > whilst it is in use by the iomap core code. > > > > The current problem it solves in the iomap_write_begin() path has to > > do with writeback and memory reclaim races over unwritten extents, > > but the general case is that we must be able to check the iomap > > at any point in time to assess it's validity. > > > > Indeed, we also have this same "iomap valid check" functionality in the > > writeback code as cached iomaps can become stale due to racing > > writeback, truncated, etc. But you wouldn't know it by looking at the iomap > > writeback code - this is currently hidden by XFS by embedding > > the checks into the iomap writeback ->map_blocks function. > > > > That is, the first thing that xfs_map_blocks() does is check if the > > cached iomap is valid, and if it is valid it returns immediately and > > the iomap writeback code uses it without question. > > > > The reason that this is embedded like this is that the iomap did not > > have a validity cookie field in it, and so the validity information > > was wrapped around the outside of the iomap_writepage_ctx and the > > filesystem has to decode it from that private wrapping structure. > > > > However, the validity information iin the structure wrapper is > > indentical to the iomap validity cookie, > > Then could that part of the xfs code be converted to use > iomap->validity_cookie so that struct iomap_writepage_ctx can be > eliminated? Yes, that is the plan. 
> > > and so the direction I've > > been working towards is to replace this implicit, hidden cached > > iomap validity check with an explicit ->iomap_valid call and then > > only call ->map_blocks if the validity check fails (or is not > > implemented). > > > > I want to use the same code for all the iomap validity checks in all > > the iomap core code - this is an iomap issue, the conditions where > > we need to check for iomap validity are different for depending on > > the iomap context being run, and the checks are not necessarily > > dependent on first having locked a folio. > > > > Yes, the validity cookie needs to be decoded by the filesystem, but > > that does not dictate where the validity checking needs to be done > > by the iomap core. > > > > Hence I think removing ->iomap_valid is a big step backwards for the > > iomap core code - the iomap core needs to be able to formally verify > > the iomap is valid at any point in time, not just at the point in > > time a folio in the page cache has been locked... > > We don't need to validate an iomap "at any time". It's two specific > places in the code in which we need to check, and we're not going to > end up with ten more such places tomorrow. Not immediately, but that doesn't change the fact this is not a filesystem specific issue - it's an inherent characteristic of cached iomaps and unsynchronised extent state changes that occur outside exclusive inode->i_rwsem IO context (e.g. in writeback and IO completion contexts). Racing mmap + buffered writes can expose these state changes as the iomap buffered write IO path is not serialised against the iomap mmap IO path except via folio locks. Hence a mmap page fault can invalidate a cached buffered write iomap by causing a hole -> unwritten, hole -> delalloc or hole -> written conversion in the middle of the buffered write range. The buffered write still has a hole mapping cached for that entire range, and it is now incorrect. 
If the mmap write happens to change extent state at the trailing edge of a partial buffered write, data corruption will occur if we race just right with writeback and memory reclaim. I'm pretty sure that this corruption can be reproduced on gfs2 if we try hard enough - generic/346 triggers the mmap/write race condition; all that is needed from that point is for writeback and reclaiming pages at exactly the right time... > I'd prefer to keep those > filesystem internals in the filesystem specific code instead of > exposing them to the iomap layer. But that's just me ... My point is that there is noth
Re: [Cluster-devel] [RFC v6 08/10] iomap/xfs: Eliminate the iomap_valid handler
On Sun, Jan 08, 2023 at 08:40:32PM +0100, Andreas Gruenbacher wrote: > Eliminate the ->iomap_valid() handler by switching to a ->get_folio() > handler and validating the mapping there. > > Signed-off-by: Andreas Gruenbacher I think this is wrong. The ->iomap_valid() function handles a fundamental architectural issue with cached iomaps: the iomap can become stale at any time whilst it is in use by the iomap core code. The current problem it solves in the iomap_write_begin() path has to do with writeback and memory reclaim races over unwritten extents, but the general case is that we must be able to check the iomap at any point in time to assess its validity. Indeed, we also have this same "iomap valid check" functionality in the writeback code as cached iomaps can become stale due to racing writeback, truncated, etc. But you wouldn't know it by looking at the iomap writeback code - this is currently hidden by XFS by embedding the checks into the iomap writeback ->map_blocks function. That is, the first thing that xfs_map_blocks() does is check if the cached iomap is valid, and if it is valid it returns immediately and the iomap writeback code uses it without question. The reason that this is embedded like this is that the iomap did not have a validity cookie field in it, and so the validity information was wrapped around the outside of the iomap_writepage_ctx and the filesystem has to decode it from that private wrapping structure. However, the validity information in the structure wrapper is identical to the iomap validity cookie, and so the direction I've been working towards is to replace this implicit, hidden cached iomap validity check with an explicit ->iomap_valid call and then only call ->map_blocks if the validity check fails (or is not implemented). 
I want to use the same code for all the iomap validity checks in all the iomap core code - this is an iomap issue, the conditions where we need to check for iomap validity are different depending on the iomap context being run, and the checks are not necessarily dependent on first having locked a folio. Yes, the validity cookie needs to be decoded by the filesystem, but that does not dictate where the validity checking needs to be done by the iomap core. Hence I think removing ->iomap_valid is a big step backwards for the iomap core code - the iomap core needs to be able to formally verify the iomap is valid at any point in time, not just at the point in time a folio in the page cache has been locked... -Dave. -- Dave Chinner da...@fromorbit.com
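The validity-cookie scheme described above reduces to a sequence-count comparison. Here is a minimal userspace model of the idea; the field and function names are illustrative, not the real iomap/XFS code:

```c
#include <assert.h>
#include <stdbool.h>

/* Model: the filesystem bumps a sequence number on every extent state
 * change; a cached mapping records the sequence at lookup time, and an
 * explicit ->iomap_valid-style check compares the two before the cached
 * mapping is used. All names are illustrative. */

struct inode_seq {
	unsigned int extent_seq;	/* bumped on any extent change */
};

struct cached_iomap {
	long long offset, length;
	unsigned int validity_cookie;	/* extent_seq when mapping was built */
};

static void iomap_begin(const struct inode_seq *ip, struct cached_iomap *map,
			long long off, long long len)
{
	map->offset = off;
	map->length = len;
	map->validity_cookie = ip->extent_seq;	/* sampled under the fs lock */
}

/* Writeback, IO completion or mmap faults change extent state. */
static void extent_state_changed(struct inode_seq *ip)
{
	ip->extent_seq++;
}

static bool iomap_valid(const struct inode_seq *ip,
			const struct cached_iomap *map)
{
	return map->validity_cookie == ip->extent_seq;
}
```

The point of the cookie living in the iomap itself (rather than in a filesystem-private wrapper like `iomap_writepage_ctx`) is that the core can run this check in any context, folio locked or not, and only fall back to `->map_blocks`/`->iomap_begin` when it fails.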
Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper
On Sun, Jan 08, 2023 at 08:40:28PM +0100, Andreas Gruenbacher wrote: > Add an iomap_get_folio() helper that gets a folio reference based on > an iomap iterator and an offset into the address space. Use it in > iomap_write_begin(). > > Signed-off-by: Andreas Gruenbacher > Reviewed-by: Darrick J. Wong > Reviewed-by: Christoph Hellwig > --- > fs/iomap/buffered-io.c | 39 ++- > include/linux/iomap.h | 1 + > 2 files changed, 31 insertions(+), 9 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index d4b444e44861..de4a8e5f721a 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -457,6 +457,33 @@ bool iomap_is_partially_uptodate(struct folio *folio, > size_t from, size_t count) > } > EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate); > > +/** > + * iomap_get_folio - get a folio reference for writing > + * @iter: iteration structure > + * @pos: start offset of write > + * > + * Returns a locked reference to the folio at @pos, or an error pointer if > the > + * folio could not be obtained. > + */ > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos) > +{ > + unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE | FGP_NOFS; > + struct folio *folio; > + > + if (iter->flags & IOMAP_NOWAIT) > + fgp |= FGP_NOWAIT; > + > + folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT, > + fgp, mapping_gfp_mask(iter->inode->i_mapping)); > + if (folio) > + return folio; > + > + if (iter->flags & IOMAP_NOWAIT) > + return ERR_PTR(-EAGAIN); > + return ERR_PTR(-ENOMEM); > +} > +EXPORT_SYMBOL_GPL(iomap_get_folio); Hmmm. This is where things start to get complex. I have sent a patch to fix a problem with iomap_zero_range() failing to zero cached dirty pages over UNWRITTEN extents, and that requires making FGP_CREAT optional. 
This is an iomap bug, and needs to be fixed in the core iomap code: https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ Essentially, we need to pass fgp flags to iomap_write_begin() so the callers can supply 0 or FGP_CREAT as appropriate. This allows iomap_write_begin() to act only on pre-cached pages rather than always instantiating a new page if one does not exist in cache. This allows iomap_write_begin() to return a NULL folio successfully, and this is perfectly OK for callers that pass in fgp = 0 as they are expected to handle a NULL folio return indicating there was no cached data over the range... Exposing the folio allocation as an external interface makes bug fixes like this rather messy - it's taking a core abstraction (iomap hides all the folio and page cache manipulations from the filesystem) and punching a big hole in it by requiring filesystems to actually allocate page cache folios on behalf of the iomap core. Given that I recently got major push-back for fixing an XFS-only bug by walking the page cache directly instead of abstracting it via the iomap core, punching an even bigger hole in the abstraction layer to fix a GFS2-only problem is just as bad. -Dave. -- Dave Chinner da...@fromorbit.com
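The calling convention argued for here - with FGP_CREAT clear, a NULL return is a *successful* "nothing cached" answer rather than an allocation failure - can be modelled in a few lines. The cache below is a tiny slot array and the flag value is illustrative; this is a sketch of the semantics, not the page cache API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Sketch: caller-supplied fgp flags decide whether a cache miss
 * allocates or simply reports "no cached folio here". Illustrative. */

#define FGP_CREAT	0x1

#define NSLOTS		16

struct folio_model { int dirty; };

struct cache_model {
	struct folio_model *slots[NSLOTS];
};

static struct folio_model *
get_folio(struct cache_model *c, unsigned idx, unsigned fgp)
{
	if (idx >= NSLOTS)
		return NULL;
	if (!c->slots[idx] && (fgp & FGP_CREAT))
		c->slots[idx] = calloc(1, sizeof(struct folio_model));
	return c->slots[idx];	/* NULL: nothing cached, none created */
}
```

A zero-range style caller passes `fgp = 0` and treats NULL as "no cached data over this block, nothing to zero", while the normal write path passes `FGP_CREAT` and treats NULL (or an error) as a real failure - which is why conflating the two returns behind one external helper gets messy.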
Re: [Cluster-devel] [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops
On Fri, Dec 02, 2022 at 02:54:00AM +0100, Andreas Gruenbacher wrote: > On Thu, Dec 1, 2022 at 10:30 PM Dave Chinner wrote: > > On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote: > > > Hi again, > > > > > > [Same thing, but with the patches split correctly this time.] > > > > > > we're seeing a race between journaled data writes and the shrinker on > > > gfs2. What's happening is that gfs2_iomap_page_done() is called after > > > the page has been unlocked, so try_to_free_buffers() can come in and > > > free the buffers while gfs2_iomap_page_done() is trying to add them to > > > the transaction. Not good. > > > > > > This is a proposal to change iomap_page_ops so that page_prepare() > > > prepares the write and grabs the locked page, and page_done() unlocks > > > and puts that page again. While at it, this also converts the hooks > > > from pages to folios. > > > > > > To move the pagecache_isize_extended() call in iomap_write_end() out of > > > the way, a new folio_may_straddle_isize() helper is introduced that > > > takes a locked folio. That is then used when the inode size is updated, > > > before the folio is unlocked. > > > > > > I've also converted the other applicable folio_may_straddle_isize() > > > users, namely generic_write_end(), ext4_write_end(), and > > > ext4_journalled_write_end(). > > > > > > Any thoughts? > > > > I doubt that moving page cache operations from the iomap core to > > filesystem specific callouts will be acceptible. I recently proposed > > patches that added page cache walking to an XFS iomap callout to fix > > a data corruption, but they were NAKd on the basis that iomap is > > supposed to completely abstract away the folio and page cache > > manipulations from the filesystem. > > Right. The resulting code is really quite disgusting, for a > fundamentalist dream of abstraction. > > > This patchset seems to be doing the same thing - moving page cache > > and folio management directly in filesystem specific callouts. 
Hence > > I'm going to assume that the same architectural demarcation is > > going to apply here, too... > > > > FYI, there is already significant change committed to the iomap > > write path in the current XFS tree as a result of the changes I > > mention - there is stale IOMAP detection which adds a new page ops > > method and adds new error paths with a locked folio in > > iomap_write_begin(). > > That would have belonged on the iomap-for-next branch rather than in > the middle of a bunch of xfs commits. Damned if you do, damned if you don't. There were non-trivial cross dependencies between XFS and iomap in that patch set. The initial IOMAP_F_STALE infrastructure needed XFS changes first, otherwise it could deadlock at ENOSPC on write page faults. i.e. the iomap change in isolation broke stuff, so we're forced to either carry XFS changes in iomap or iomap changes in XFS so that there are no regressions in a given tree. Then we had to move XFS functionality to iomap to fix another data corruption that the IOMAP_F_STALE infrastructure exposed in XFS via generic/346. Once the code was moved, then we could build it up into the page cache scanning functionality in iomap. And only then could we add the XFS IOMAP_F_STALE validation to XFS to solve the original data corruption that started all this off. IOWs, there were so many cross dependencies between XFS and iomap that it was largely impossible to break it up into two separate sets of independent patches that didn't cause regressions in one or the other tree. And in the end, we'd still have to merge the iomap tree into XFS or vice versa to actually test that the data corruption fix worked. In situations like this, we commonly take the entire series into one of the two trees rather than make a whole lot more work for ourselves by trying to separate them out. 
And in this case, because it was XFS data corruption and race conditions that needed fixing, it made sense to take it through the XFS tree so that it gets coverage from all the XFS testing that happens - the iomap tree gets a lot less early coverage than the XFS tree... > > And this other data corruption (and performance) fix for handling > > zeroing over unwritten extents properly: > > > > https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ > > > > changes the way folios are looked up and instantiated in the page > > cache in iomap_write_begin(). It also adds new error conditions that > > need to be returned to callers so to implement conditional "folio > > must be present and dirty"
Re: [Cluster-devel] [RFC v2 0/3] Turn iomap_page_ops into iomap_folio_ops
On Thu, Dec 01, 2022 at 07:09:54PM +0100, Andreas Gruenbacher wrote: > Hi again, > > [Same thing, but with the patches split correctly this time.] > > we're seeing a race between journaled data writes and the shrinker on > gfs2. What's happening is that gfs2_iomap_page_done() is called after > the page has been unlocked, so try_to_free_buffers() can come in and > free the buffers while gfs2_iomap_page_done() is trying to add them to > the transaction. Not good. > > This is a proposal to change iomap_page_ops so that page_prepare() > prepares the write and grabs the locked page, and page_done() unlocks > and puts that page again. While at it, this also converts the hooks > from pages to folios. > > To move the pagecache_isize_extended() call in iomap_write_end() out of > the way, a new folio_may_straddle_isize() helper is introduced that > takes a locked folio. That is then used when the inode size is updated, > before the folio is unlocked. > > I've also converted the other applicable folio_may_straddle_isize() > users, namely generic_write_end(), ext4_write_end(), and > ext4_journalled_write_end(). > > Any thoughts? I doubt that moving page cache operations from the iomap core to filesystem specific callouts will be acceptable. I recently proposed patches that added page cache walking to an XFS iomap callout to fix a data corruption, but they were NAKd on the basis that iomap is supposed to completely abstract away the folio and page cache manipulations from the filesystem. This patchset seems to be doing the same thing - moving page cache and folio management directly into filesystem specific callouts. Hence I'm going to assume that the same architectural demarcation is going to apply here, too... FYI, there is already significant change committed to the iomap write path in the current XFS tree as a result of the changes I mention - there is stale IOMAP detection which adds a new page ops method and adds new error paths with a locked folio in iomap_write_begin(). 
And this other data corruption (and performance) fix for handling zeroing over unwritten extents properly: https://lore.kernel.org/linux-xfs/20221201005214.3836105-1-da...@fromorbit.com/ changes the way folios are looked up and instantiated in the page cache in iomap_write_begin(). It also adds new error conditions that need to be returned to callers so as to implement conditional "folio must be present and dirty" page cache zeroing from iomap_zero_iter(). Those semantics would also have to be supported by gfs2, and that greatly complicates modifying and testing iomap core changes. To avoid all this, can we simply move the ->page_done() callout in the error path and iomap_write_end() to before we unlock the folio? You've already done that for pagecache_isize_extended(), and I can't see anything obvious in the gfs2 ->page_done callout that would cause issues if it is called with a locked dirty folio... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
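The race Andreas reports, and why calling the completion hook before the unlock closes it, can be shown with a single-threaded model. This is a sketch of the ordering argument only; all names are illustrative and the "shrinker" is simulated inline rather than raced from another thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the gfs2 race: if ->page_done runs after folio_unlock(),
 * try_to_free_buffers() can slip in between and free the buffers the
 * hook wants to add to the journal transaction. Illustrative only. */

struct folio_model {
	bool locked;
	bool has_buffers;
};

static bool try_to_free_buffers(struct folio_model *f)
{
	if (f->locked)
		return false;	/* reclaim cannot touch a locked folio */
	f->has_buffers = false;
	return true;
}

/* Returns true if the buffers could be added to the transaction. */
static bool page_done(struct folio_model *f)
{
	return f->has_buffers;
}

static bool write_end(struct folio_model *f, bool done_before_unlock)
{
	bool ok;

	if (done_before_unlock) {
		ok = page_done(f);	/* buffers pinned by the folio lock */
		f->locked = false;
	} else {
		f->locked = false;
		try_to_free_buffers(f);	/* the shrinker wins the race */
		ok = page_done(f);
	}
	return ok;
}
```

With the hook after the unlock there is a window in which reclaim frees the buffers first; running it under the folio lock removes the window entirely, which is the reordering Dave proposes above.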
Re: [Cluster-devel] [syzbot] WARNING in iomap_read_inline_data
> _raw_spin_unlock_irq+0x3c/0x70 kernel/locking/spinlock.c:202 > hardirqs last disabled at (104804): [] el1_dbg+0x24/0x80 > arch/arm64/kernel/entry-common.c:405 > softirqs last enabled at (104756): [] > local_bh_enable+0x10/0x34 include/linux/bottom_half.h:32 > softirqs last disabled at (104754): [] > local_bh_disable+0x10/0x34 include/linux/bottom_half.h:19 > ---[ end trace ]--- > > > --- > This report is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this issue. See: > https://goo.gl/tpsmEJ#status for how to communicate with syzbot. > syzbot can test patches for this issue, for details see: > https://goo.gl/tpsmEJ#testing-patches > -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 07:45:01PM -0700, Darrick J. Wong wrote: > On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote: > > On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > > > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > > > - BUG_ON(PageWriteback(page)); > > > > > - if (!clear_page_dirty_for_io(page)) > > > > > + BUG_ON(folio_test_writeback(folio)); > > > > > + if (!folio_clear_dirty_for_io(folio)) > > > > > goto continue_unlock; > > > > > > > > > > trace_wbc_writepage(wbc, > > > > > inode_to_bdi(mapping->host)); > > > > > - error = (*writepage)(page, wbc, data); > > > > > + error = writepage(&folio->page, wbc, data); > > > > Yet, IIUC, this treats all folios as if they are single page folios. > > > > i.e. it passes the head page of a multi-page folio to a callback > > > > that will treat it as a single PAGE_SIZE page, because that's all > > > > the writepage callbacks are currently expected to be passed... > > > > > > > > So won't this break writeback of dirty multipage folios? > > > > > > Yes, it appears it would. But it wouldn't because its already 'broken'. > > > > It is? Then why isn't XFS broken on existing kernels? Oh, we don't > > know because it hasn't been tested? > > > > Seriously - if this really is broken, and this patchset further > > propagating the brokeness, then somebody needs to explain to me why > > this is not corrupting data in XFS. > > It looks like iomap_do_writepage finds the folio size correctly > > end_pos = folio_pos(folio) + folio_size(folio); > > and iomap_writpage_map will map out the correct number of blocks > > unsigned nblocks = i_blocks_per_folio(inode, folio); > > for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) { > > right? Yup, that's how I read it, too. 
But my recent experience with folios involved being repeatedly burnt by edge case corruptions due to multipage folios showing up when and where I least expected them. Hence doing a 1:1 conversion of page based code to folio based code and just assuming large folios will work without any testing seems akin to playing Russian roulette with loose cannons that have been doused with napalm and then set on fire by an air-dropped barrel bomb... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote: > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote: > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > > > Converted function to use folios throughout. This is in preparation for > > > the removal of find_get_pages_range_tag(). > > > > > > Signed-off-by: Vishal Moola (Oracle) > > > --- > > > mm/page-writeback.c | 44 +++- > > > 1 file changed, 23 insertions(+), 21 deletions(-) > > > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > > > index 032a7bf8d259..087165357a5a 100644 > > > --- a/mm/page-writeback.c > > > +++ b/mm/page-writeback.c > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space > > > *mapping, > > > int ret = 0; > > > int done = 0; > > > int error; > > > - struct pagevec pvec; > > > - int nr_pages; > > > + struct folio_batch fbatch; > > > + int nr_folios; > > > pgoff_t index; > > > pgoff_t end;/* Inclusive */ > > > pgoff_t done_index; > > > int range_whole = 0; > > > xa_mark_t tag; > > > > > > - pagevec_init(&pvec); > > > + folio_batch_init(&fbatch); > > > if (wbc->range_cyclic) { > > > index = mapping->writeback_index; /* prev offset */ > > > end = -1; > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space > > > *mapping, > > > while (!done && (index <= end)) { > > > int i; > > > > > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > > > - tag); > > > - if (nr_pages == 0) > > > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > > > + tag, &fbatch); > > > > This can find and return dirty multi-page folios if the filesystem > > enables them in the mapping at instantiation time, right? > > Yup, it will. 
> > > > + > > > + if (nr_folios == 0) > > > break; > > > > > > - for (i = 0; i < nr_pages; i++) { > > > - struct page *page = pvec.pages[i]; > > > + for (i = 0; i < nr_folios; i++) { > > > + struct folio *folio = fbatch.folios[i]; > > > > > > - done_index = page->index; > > > + done_index = folio->index; > > > > > > - lock_page(page); > > > + folio_lock(folio); > > > > > > /* > > >* Page truncated or invalidated. We can freely skip it > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space > > > *mapping, > > >* even if there is now a new, dirty page at the same > > >* pagecache address. > > >*/ > > > - if (unlikely(page->mapping != mapping)) { > > > + if (unlikely(folio->mapping != mapping)) { > > > continue_unlock: > > > - unlock_page(page); > > > + folio_unlock(folio); > > > continue; > > > } > > > > > > - if (!PageDirty(page)) { > > > + if (!folio_test_dirty(folio)) { > > > /* someone wrote it for us */ > > > goto continue_unlock; > > > } > > > > > > - if (PageWriteback(page)) { > > > + if (folio_test_writeback(folio)) { > > > if (wbc->sync_mode != WB_SYNC_NONE) > > > - wait_on_page_writeback(page); > > > + folio_wait_writeback(folio); > > > else > > > goto continue_unlock; > > > } > > > > > > - BUG_ON(PageWriteback(page)); > > > - if (!clear_page_dirty_for_io(page)) > > > + BUG_ON(folio_test_writeback(folio)); > > > + if (!folio_clear_dirty_for_io(folio)) > > > goto continue_unlock; > > > > > &
Re: [Cluster-devel] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Thu, Nov 03, 2022 at 09:38:48AM -0700, Vishal Moola wrote: > On Thu, Nov 3, 2022 at 12:08 AM Dave Chinner wrote: > > > > On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > > > This patch series replaces find_get_pages_range_tag() with > > > filemap_get_folios_tag(). This also allows the removal of multiple > > > calls to compound_head() throughout. > > > It also makes a good chunk of the straightforward conversions to folios, > > > and takes the opportunity to introduce a function that grabs a folio > > > from the pagecache. > > > > > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > > > for now those patches only have the changes necessary for the removal of > > > find_get_pages_range_tag(), and only support folios of size 1 (which is > > > all they use right now anyways). > > > > > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may > > > be > > > beneficial. The page-writeback and filemap changes implicitly work. > > > Testing > > > and review of the other changes (afs, ceph, cifs, gfs2) would be > > > appreciated. > > > > Same question as last time: have you tested this with multipage > > folios enabled? If you haven't tested XFS, then I'm guessing the > > answer is no, and you haven't fixed the bug I pointed out in > > the write_cache_pages() implementation > > > > I haven't tested the series with multipage folios or XFS. > > I don't seem to have gotten your earlier comments, and I > can't seem to find them on the mailing lists. Could you > please send them again so I can take a look? They are in the lore -fsdevel archive - no idea why you couldn't find them https://lore.kernel.org/linux-fsdevel/20221018210152.gh2703...@dread.disaster.area/ https://lore.kernel.org/linux-fsdevel/20221018214544.gi2703...@dread.disaster.area/ -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 00/23] Convert to filemap_get_folios_tag()
On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite a lot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. The page-writeback and filemap changes implicitly work. Testing > and review of the other changes (afs, ceph, cifs, gfs2) would be appreciated. Same question as last time: have you tested this with multipage folios enabled? If you haven't tested XFS, then I'm guessing the answer is no, and you haven't fixed the bug I pointed out in the write_cache_pages() implementation -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 00/23] Convert to filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:15PM -0700, Vishal Moola (Oracle) wrote: > This patch series replaces find_get_pages_range_tag() with > filemap_get_folios_tag(). This also allows the removal of multiple > calls to compound_head() throughout. > It also makes a good chunk of the straightforward conversions to folios, > and takes the opportunity to introduce a function that grabs a folio > from the pagecache. > > F2fs and Ceph have quite alot of work to be done regarding folios, so > for now those patches only have the changes necessary for the removal of > find_get_pages_range_tag(), and only support folios of size 1 (which is > all they use right now anyways). > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be > beneficial. Well, that answers my question about how filesystems that enable multi-page folios were tested: they weren't. I'd suggest that anyone working on further extending the filemap/folio infrastructure really needs to be testing XFS as a first priority, and then other filesystems as a secondary concern. That's because XFS (via the fs/iomap infrastructure) is one of only 3 filesystems in the kernel (AFS and tmpfs are the others) that interact with the page cache and page cache "pages" solely via folio interfaces. As such they are able to support multi-page folios in the page cache. All of the tested filesystems still use the fixed PAGE_SIZE page interfaces to interact with the page cache, so they don't actually exercise interactions with multi-page folios at all. Hence if you are converting generic infrastructure that looks up pages in the page cache to look up folios in the page cache, the code that processes the returned folios also needs to be updated and validated to ensure that it correctly handles multi-page folios. And the only way you can do that fully at this point in time is via testing XFS or AFS... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()
On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote: > Converted function to use folios throughout. This is in preparation for > the removal of find_get_pages_range_tag(). > > Signed-off-by: Vishal Moola (Oracle) > --- > mm/page-writeback.c | 44 +++- > 1 file changed, 23 insertions(+), 21 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 032a7bf8d259..087165357a5a 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping, > int ret = 0; > int done = 0; > int error; > - struct pagevec pvec; > - int nr_pages; > + struct folio_batch fbatch; > + int nr_folios; > pgoff_t index; > pgoff_t end; /* Inclusive */ > pgoff_t done_index; > int range_whole = 0; > xa_mark_t tag; > > - pagevec_init(&pvec); > + folio_batch_init(&fbatch); > if (wbc->range_cyclic) { > index = mapping->writeback_index; /* prev offset */ > end = -1; > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping, > while (!done && (index <= end)) { > int i; > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, > - tag); > - if (nr_pages == 0) > + nr_folios = filemap_get_folios_tag(mapping, &index, end, > + tag, &fbatch); This can find and return dirty multi-page folios if the filesystem enables them in the mapping at instantiation time, right? > + > + if (nr_folios == 0) > break; > > - for (i = 0; i < nr_pages; i++) { > - struct page *page = pvec.pages[i]; > + for (i = 0; i < nr_folios; i++) { > + struct folio *folio = fbatch.folios[i]; > > - done_index = page->index; > + done_index = folio->index; > > - lock_page(page); > + folio_lock(folio); > > /* >* Page truncated or invalidated. We can freely skip it > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping, >* even if there is now a new, dirty page at the same >* pagecache address. 
>*/ > - if (unlikely(page->mapping != mapping)) { > + if (unlikely(folio->mapping != mapping)) { > continue_unlock: > - unlock_page(page); > + folio_unlock(folio); > continue; > } > > - if (!PageDirty(page)) { > + if (!folio_test_dirty(folio)) { > /* someone wrote it for us */ > goto continue_unlock; > } > > - if (PageWriteback(page)) { > + if (folio_test_writeback(folio)) { > if (wbc->sync_mode != WB_SYNC_NONE) > - wait_on_page_writeback(page); > + folio_wait_writeback(folio); > else > goto continue_unlock; > } > > - BUG_ON(PageWriteback(page)); > - if (!clear_page_dirty_for_io(page)) > + BUG_ON(folio_test_writeback(folio)); > + if (!folio_clear_dirty_for_io(folio)) > goto continue_unlock; > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host)); > - error = (*writepage)(page, wbc, data); > + error = writepage(&folio->page, wbc, data); Yet, IIUC, this treats all folios as if they are single page folios. i.e. it passes the head page of a multi-page folio to a callback that will treat it as a single PAGE_SIZE page, because that's all the writepage callbacks are currently expected to be passed... So won't this break writeback of dirty multipage folios? -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote: > On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > > Hi Christoph! > > > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > > this series removes iomap_writepage and it's callers, following what xfs > > > has been doing for a long time. > > > > So this effectively means "no writeback from page reclaim for these > > filesystems" AFAICT (page migration of dirty pages seems to be handled by > > iomap_migrate_page()) which is going to make life somewhat harder for > > memory reclaim when memory pressure is high enough that dirty pages are > > reaching end of the LRU list. I don't expect this to be a problem on big > > machines but it could have some undesirable effects for small ones > > (embedded, small VMs). I agree per-page writeback has been a bad idea for > > efficiency reasons for at least last 10-15 years and most filesystems > > stopped dealing with more complex situations (like block allocation) from > > ->writepage() already quite a few years ago without any bug reports AFAIK. > > So it all seems like a sensible idea from FS POV but are MM people on board > > or at least aware of this movement in the fs land? > > I mentioned it during my folio session at LSFMM, but didn't put a huge > emphasis on it. > > For XFS, writeback should already be in progress on other pages if > we're getting to the point of trying to call ->writepage() in vmscan. > Surely this is also true for other filesystems? Yes. It's definitely true for btrfs, too, because btrfs_writepage does: static int btrfs_writepage(struct page *page, struct writeback_control *wbc) { struct inode *inode = page->mapping->host; int ret; if (current->flags & PF_MEMALLOC) { redirty_page_for_writepage(wbc, page); unlock_page(page); return 0; } It also rejects all calls to write dirty pages from memory reclaim contexts. 
ext4 will also reject writepage calls from memory allocation if block allocation is required (due to delayed allocation) or unwritten extents need converting to written. i.e. if it has to run blocking transactions. So all three major filesystems will either partially or wholly reject ->writepage calls from memory reclaim context. IOWs, if memory reclaim is depending on ->writepage() to make reclaim progress, it's not working as advertised on the vast majority of production Linux systems. The reality is that ->writepage is a relic of a bygone era of OS and filesystem design. It was useful in the days where writing a dirty page just involved looking up the bufferhead attached to the page to get the disk mapping and then submitting it for IO. Those days are long gone - filesystems have complex IO submission paths now that have to handle delayed allocation, copy-on-write, unwritten extents, have unbound memory demand, etc. All the filesystems that support these 1990s era filesystem technologies simply turn off ->writepage in memory reclaim contexts. Hence for the vast majority of linux users (i.e. everyone using ext4, btrfs and XFS), ->writepage no longer plays any part in memory reclaim on their systems. So why should we try to maintain the fiction that ->writepage is required functionality in a filesystem when it clearly isn't? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
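The pattern all three filesystems use — detect reclaim context, redirty the page, and punt to the flusher threads — can be sketched in userspace C. This is a stand-in model only: `PF_MEMALLOC` is a real kernel task flag, but the helper and its signature below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the "reject writeback from reclaim" pattern described
 * above (cf. the quoted btrfs_writepage).  The flag value and helper
 * are stand-ins, not the kernel's definitions.
 */
#define PF_MEMALLOC 0x1u

static bool try_writepage(unsigned int task_flags)
{
	if (task_flags & PF_MEMALLOC)
		return false;	/* reclaim context: redirty and punt */
	return true;		/* flusher context: actually write */
}

int main(void)
{
	assert(!try_writepage(PF_MEMALLOC));	/* reclaim: rejected */
	assert(try_writepage(0));		/* flusher: written */
	puts("ok");
	return 0;
}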
Re: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
On Wed, Mar 02, 2022 at 02:03:28PM +0100, Andreas Gruenbacher wrote: > On Wed, Mar 2, 2022 at 11:17 AM Filipe Manana wrote: > > On Tue, Mar 01, 2022 at 09:38:30AM +1100, Dave Chinner wrote: > > > On Mon, Feb 28, 2022 at 02:32:03PM +, fdman...@kernel.org wrote: > > > > From: Filipe Manana > > > . > > > > > > > 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K > > > > and return 4K (the amount of io done) to iomap_dio_complete_work(); > > > > > > > > 12) iomap_dio_complete_work() calls the iocb completion callback, > > > > iocb->ki_complete() with a second argument value of 4K (total io > > > > done) and the iocb with the adjust ki_pos of X + 4K. This results > > > > in completing the read request for io_uring, leaving it with a > > > > result of 4K bytes read, and only the first page of the buffer > > > > filled in, while the remaining 3 pages, corresponding to the other > > > > 3 extents, were not filled; > > > > > > > > 13) For the application, the result is unexpected because if we ask > > > > to read N bytes, it expects to get N bytes read as long as those > > > > N bytes don't cross the EOF (i_size). > > > > > > Yeah, that's exactly the sort of problem we were having with XFS > > > with partial DIO completions due to needing multiple iomap iteration > > > loops to complete a single IO. Hence IOMAP_NOWAIT now triggers the > > > above range check and aborts before we start... > > > > Interesting. > > Dave, this seems to affect all users of iomap_dio_rw in the same way, > so would it make sense to move this check there? Perhaps, but I'm not sure it makes sense because filesystems need to abort ->iomap_begin with -EAGAIN in IOMAP_NOWAIT contexts before they make any changes. Hence detecting short extents in the generic code becomes ... difficult because we might now need to undo changes that have been made in ->iomap_begin. e.g. 
if the filesystem allocates space and the iomap core says "not long enough" because IOMAP_NOWAIT is set, then we may have to punch out that allocation in ->iomap_end before returning -EAGAIN. That means filesystems like XFS may now need to supply a ->iomap_end function to undo stuff the core decides it shouldn't have done, instead of the filesystem ensuring it never does the operation in the first place... IOWs, the correct behaviour here is for the filesystem ->iomap_begin method to see that it needs to allocate and return -EAGAIN if IOMAP_NOWAIT is set, not do the allocation and hope that it ends up being long enough to cover the entire IO we have to do. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] iomap: fix incomplete async dio reads when using IOMAP_DIO_PARTIAL
fragile, IMO. I'd much prefer that *_iomap_begin_write() implementations simply follow IOMAP_NOWAIT requirements to ensure that any DIO that needs multiple mappings is punted to a context that can block... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 23, 2022 at 10:50:09PM -0500, Theodore Ts'o wrote: > On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote: > > > Fair enough; on the other hand, we could also view this as making ext4 > > > more robust against buggy code in other subsystems, and while other > > > file systems may be losing user data if they are actually trying to do > > > remote memory access to file-backed memory, apparently other file > > > systems aren't noticing and so they're not crashing. > > > > Oh, we've noticed them, no question about that. We've got bug > > reports going back years for systems being crashed, triggering BUGs > > and/or corrupting data on both XFS and ext4 filesystems due to users > > trying to run RDMA applications with file backed pages. > > Is this issue causing XFS to crash? I didn't know that. I have no idea if it crashes nowadays - go back a few years and search for XFS BUGging out in ->invalidate_page (or was it ->release_page?) because of unexpected dirty pages. I think it could also trigger BUGs in writeback when ->writepages tripped over a dirty page without a delayed allocation mapping over the hole... We were pretty aggressive about telling people reporting such issues that they get to keep all the borken bits to themselves and to stop wasting our time with unsolvable problems caused by their broken-by-design RDMA applications. Hence people have largely stopped bothering us with random filesystem crashes on systems using RDMA on file-backed pages... > I tried the Syzbot reproducer with XFS mounted, and it didn't trigger > any crashes. I'm sure data was getting corrupted, but I figured I > should bring ext4 to the XFS level of "at least we're not reliably > killing the kernel". Oh, well, good to know XFS didn't die a horrible death immediately. Thanks for checking, Ted. > On ext4, an unprivileged process can use process_vm_writev(2) to crash > the system. 
I don't know how quickly we can get a fix into mm/gup.c, > but if some other kernel path tries calling set_page_dirty() on a > file-backed page without first asking permission from the file system, > it seems to be nice if the file system doesn't BUG() --- as near as I > can tell, xfs isn't crashing in this case, but ext4 is. iomap is probably refusing to map holes for writepage - we've cleaned up most of the weird edge cases to return errors, so I'm guessing iomap is just ignoring such pages these days. Yeah, see iomap_writepage_map(): error = wpc->ops->map_blocks(wpc, inode, pos); if (error) break; if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE)) continue; if (wpc->iomap.type == IOMAP_HOLE) continue; Yeah, so if writeback maps a hole rather than converts a delalloc region to IOMAP_MAPPED, it'll just skip over the block/page. IIRC, they essentially become uncleanable pages, and I think eventually inode reclaim will just toss them out of memory. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 23, 2022 at 06:35:54PM -0500, Theodore Ts'o wrote: > On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote: > > > The challenge is that fixing this "the right away" is probably not > > > something we can backport into an LTS kernel, whether it's 5.15 or > > > 5.10... or 4.19. > > > > Don't worry about stable backports to start with. Do it the "right way" > > first and then we can consider if it needs to be backported or not. > > Fair enough; on the other hand, we could also view this as making ext4 > more robust against buggy code in other subsystems, and while other > file systems may be losing user data if they are actually trying to do > remote memory access to file-backed memory, apparently other file > systems aren't noticing and so they're not crashing. Oh, we've noticed them, no question about that. We've got bug reports going back years for systems being crashed, triggering BUGs and/or corrupting data on both XFS and ext4 filesystems due to users trying to run RDMA applications with file backed pages. Most of the people doing this now know that we won't support such applications until the RDMA stack/hardware can trigger on-demand write page faults the same way CPUs do when they first write to a clean page. They don't have this, so mostly these people don't bother reporting these class of problems to us anymore. The gup/RDMA infrastructure to make this all work is slowly moving forwards, but it's not here yet. > Issuing a > warning and then not crashing is arguably a better way for ext4 to > react, especially if there are other parts of the kernel that are > randomly calling set_page_dirty() on file-backed memory without > properly first informing the file system in a context where it can > block and potentially do I/O to do things like allocate blocks. 
I'm not sure that replacing the BUG() with a warning is good enough - it's still indicative of an application doing something dangerous that could result in silent data corruption and/or other problems. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 11/30] iomap: add the new iomap_iter model
On Wed, Aug 11, 2021 at 12:17:20PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > [djwong: add to apply.c to preserve git history of iomap loop control] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong Looks like a straight translation of Christoph's original. Seems fine to me as a simple step towards preserving the git history. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 31/30] iomap: move iomap iteration code to iter.c
On Wed, Aug 11, 2021 at 12:19:26PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong > > Now that we've moved iomap to the iterator model, rename this file to be > in sync with the functions contained inside of it. > > Signed-off-by: Darrick J. Wong > --- > fs/iomap/Makefile |2 +- > fs/iomap/iter.c |0 > 2 files changed, 1 insertion(+), 1 deletion(-) > rename fs/iomap/{apply.c => iter.c} (100%) > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index e46f936dde81..bb64215ae256 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -26,9 +26,9 @@ ccflags-y += -I $(srctree)/$(src) # needed for > trace events > obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > -apply.o \ > buffered-io.o \ > direct-io.o \ > fiemap.o \ > +iter.o \ > seek.o > iomap-$(CONFIG_SWAP) += swapfile.o > diff --git a/fs/iomap/apply.c b/fs/iomap/iter.c > similarity index 100% > rename from fs/iomap/apply.c > rename to fs/iomap/iter.c LGTM, Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 24/30] iomap: remove iomap_apply
On Wed, Aug 11, 2021 at 12:18:26PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > iomap_apply is unused now, so remove it. > > Signed-off-by: Christoph Hellwig > [djwong: rebase this patch to preserve git history of iomap loop control] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong > --- > fs/iomap/apply.c | 91 > - > fs/iomap/trace.h | 40 -- > include/linux/iomap.h | 10 - > 3 files changed, 141 deletions(-) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v2.1 19/30] iomap: switch iomap_bmap to use iomap_iter
On Wed, Aug 11, 2021 at 12:18:00PM -0700, Darrick J. Wong wrote: > From: Christoph Hellwig > > Rewrite the ->bmap implementation based on iomap_iter. > > Signed-off-by: Christoph Hellwig > [djwong: restructure the loop to make its behavior a little clearer] > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong > --- > fs/iomap/fiemap.c | 31 +-- > 1 file changed, 13 insertions(+), 18 deletions(-) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 11/30] iomap: add the new iomap_iter model
On Mon, Aug 09, 2021 at 08:12:25AM +0200, Christoph Hellwig wrote: > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > --- > fs/iomap/Makefile | 1 + > fs/iomap/core.c | 79 +++ > fs/iomap/trace.h | 37 +++- > include/linux/iomap.h | 56 ++ > 4 files changed, 172 insertions(+), 1 deletion(-) > create mode 100644 fs/iomap/core.c > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index eef2722d93a183..6b56b10ded347a 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > apply.o \ > +core.o \ This creates a discontinuity in the iomap git history. Can you add these new functions to iomap/apply.c, then when the old apply code is removed later in the series rename the file to core.c? At least that way 'git log --follow fs/iomap/core.c' will walk back into the current history of fs/iomap/apply.c and the older pre-disaggregation fs/iomap.c without having to take the tree back in time to find those files... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 08/27] iomap: add the new iomap_iter model
On Mon, Jul 19, 2021 at 12:35:01PM +0200, Christoph Hellwig wrote: > The iomap_iter struct provides a convenient way to package up and > maintain all the arguments to the various mapping and operation > functions. It is operated on using the iomap_iter() function that > is called in loop until the whole range has been processed. Compared > to the existing iomap_apply() function this avoid an indirect call > for each iteration. > > For now iomap_iter() calls back into the existing ->iomap_begin and > ->iomap_end methods, but in the future this could be further optimized > to avoid indirect calls entirely. > > Based on an earlier patch from Matthew Wilcox . > > Signed-off-by: Christoph Hellwig > --- > fs/iomap/Makefile | 1 + > fs/iomap/iter.c | 74 +++ > fs/iomap/trace.h | 37 +- > include/linux/iomap.h | 56 > 4 files changed, 167 insertions(+), 1 deletion(-) > create mode 100644 fs/iomap/iter.c > > diff --git a/fs/iomap/Makefile b/fs/iomap/Makefile > index eef2722d93a183..85034deb5a2f19 100644 > --- a/fs/iomap/Makefile > +++ b/fs/iomap/Makefile > @@ -10,6 +10,7 @@ obj-$(CONFIG_FS_IOMAP) += iomap.o > > iomap-y += trace.o \ > apply.o \ > +iter.o \ Can we break this cycle of creating new files and removing old files when changing the iomap core code? It breaks the ability to troll git history easily through git blame and other techniques that are file based. If we are going to create a new file, then the core iomap code that everything depends on should just be in a neutrally named file such as "iomap.c" so that we don't need to play these games in future. > +/** > + * iomap_iter - iterate over a ranges in a file > + * @iter: iteration structue > + * @ops: iomap ops provided by the file system > + * > + * Iterate over file system provided contiguous ranges of blocks with the > same > + * state. Should be called in a loop that continues as long as this function > + * returns a positive value. 
If 0 or a negative value is returned the caller > + * should break out of the loop - a negative value is an error either from > the > + * file system or from the last iteration stored in @iter.copied. > + */ > +int iomap_iter(struct iomap_iter *iter, const struct iomap_ops *ops) > +{ We should avoid namespace conflicts where function names shadow object types. iomap_iterate() is fine as the function name - there's no need for abbreviation here because it's not an overly long name. This will make it clearly different to the struct iomap_iter that is passed to it and it will also make grep, cscope and other code searching tools much more precise... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 20/27] fsdax: switch dax_iomap_rw to use iomap_iter
On Mon, Jul 19, 2021 at 12:35:13PM +0200, Christoph Hellwig wrote: > Switch the dax_iomap_rw implementation to use iomap_iter. > > Signed-off-by: Christoph Hellwig > --- > fs/dax.c | 49 - > 1 file changed, 24 insertions(+), 25 deletions(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 4d63040fd71f56..51da45301350a6 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -1103,20 +1103,21 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct > iomap *iomap) > return size; > } > > -static loff_t > -dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > - struct iomap *iomap, struct iomap *srcmap) > +static loff_t dax_iomap_iter(const struct iomap_iter *iomi, > + struct iov_iter *iter) At first I wondered "iomi? Strange name, why is this one-off name used?" and then I realised it's because this function also takes a struct iov_iter named "iter". That's going to cause confusion in the long run - iov_iter and iomap_iter both being generally named "iter", and then one or the other randomly changing when both are used in the same function. Would it be better to avoid any possible confusion simply by using "iomi" for all iomap_iter variables throughout the patchset from the start? That way nobody is going to confuse iov_iter with iomap_iter iteration variables and code that uses both types will naturally have different, well known names... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: iomap write invalidation
On Mon, Jul 13, 2020 at 09:46:31AM +0200, Christoph Hellwig wrote: > Hi all, > > this series has two parts: the first one picks up Dave's patch to avoid > invalidation entierly for reads, picked up deep down from the btrfs iomap > thread. The second one falls back to buffered writes if invalidation fails > instead of leaving a stale cache around. Let me know what you think about > this approch. Either we maintain application level concurrency for direct IOs and ignore the stale data in the page cache, or we kill application IO concurrency and keep the page cache coherent. It's a lose-lose choice and I'm on the fence as to which is the lesser of two evils. The main factor is whether the buffered IO fallback can be diagnosed. There's a new tracepoint for that case, so at least we will be able to tell if the fallback co-incides with application performance cratering. Hopefully this will only be a rare event. So, to hoist myself on my own petard: correctness first, performance second. Acked-by: Dave Chinner Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Thu, Jul 09, 2020 at 10:10:38AM -0700, Darrick J. Wong wrote: > On Thu, Jul 09, 2020 at 06:05:19PM +0100, Matthew Wilcox wrote: > > On Thu, Jul 09, 2020 at 09:09:26AM -0700, Darrick J. Wong wrote: > > > On Thu, Jul 09, 2020 at 12:25:27PM +1000, Dave Chinner wrote: > > > > -*/ > > > > - ret = invalidate_inode_pages2_range(mapping, > > > > - pos >> PAGE_SHIFT, end >> PAGE_SHIFT); > > > > - if (ret) > > > > - dio_warn_stale_pagecache(iocb->ki_filp); > > > > - ret = 0; > > > > + if (iov_iter_rw(iter) == WRITE) { > > > > + /* > > > > +* Try to invalidate cache pages for the range we're > > > > direct > > > > +* writing. If this invalidation fails, tough, the > > > > write will > > > > +* still work, but racing two incompatible write paths > > > > is a > > > > +* pretty crazy thing to do, so we don't support it > > > > 100%. > > > > +*/ > > > > + ret = invalidate_inode_pages2_range(mapping, > > > > + pos >> PAGE_SHIFT, end >> PAGE_SHIFT); > > > > + if (ret) > > > > + dio_warn_stale_pagecache(iocb->ki_filp); > > > > + ret = 0; > > > > > > > > - if (iov_iter_rw(iter) == WRITE && !wait_for_completion && > > > > - !inode->i_sb->s_dio_done_wq) { > > > > - ret = sb_init_dio_done_wq(inode->i_sb); > > > > - if (ret < 0) > > > > - goto out_free_dio; > > > > + if (!wait_for_completion && > > > > + !inode->i_sb->s_dio_done_wq) { > > > > + ret = sb_init_dio_done_wq(inode->i_sb); > > > > + if (ret < 0) > > > > + goto out_free_dio; > > ...and yes I did add in the closing brace here. :P Doh! I forgot to refresh the patch after fixing that. :/ Thanks! Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote: > On Wed, Jul 08, 2020 at 04:51:27PM +1000, Dave Chinner wrote: > > On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote: > > > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote: > > > > Indeed, I'm in favour of not invalidating > > > > the page cache at all for direct I/O. For reads, I think the page cache > > > > should be used to satisfy any portion of the read which is currently > > > > cached. For writes, I think we should write into the page cache pages > > > > which currently exist, and then force those pages to be written back, > > > > but left in cache. > > > > > > Something like that, yes. > > > > So are we really willing to take the performance regression that > > occurs from copying out of the page cache consuming lots more CPU > > than an actual direct IO read? Or that direct IO writes suddenly > > serialise because there are page cache pages and now we have to do > > buffered IO? > > > > Direct IO should be a deterministic, zero-copy IO path to/from > > storage. Using the CPU to copy data during direct IO is the complete > > opposite of the intended functionality, not to mention the behaviour > > that many applications have been carefully designed and tuned for. > > Direct I/O isn't deterministic though. When all the application is doing is direct IO, it is as deterministic as the underlying storage behaviour. This is the best we can possibly do from the perspective of the filesystem, and this is largely what Direct IO requires from the filesystem. Direct IO starts by delegating all responsibility for IO synchronisation, data coherency and integrity to userspace, and then follows up by requiring the filesystem and kernel to stay out of the IO path except where it is absolutely necessary to read or write data to/from the underlying storage hardware. Serving Direct IO from the page cache violates the second of these requirements. 
> If the file isn't shared, then > it works great, but as soon as you get mixed buffered and direct I/O, > everything is already terrible. Right, but that's *the rare exception* for applications using direct IO, not the common fast path. It is the slow path where -shit has already gone wrong on the production machine-, and it most definitely does not change the DIO requirements that the filesystem needs to provide userspace via the direct IO path. Optimising the slow exception path is fine if it does not affect the guarantees we try to provide through the common/fast path. If it does affect behaviour of the fast path, then we must be able to either turn it off or provide our own alternative implementation that does not have that cost. > Direct I/Os perform pagecache lookups > already, but instead of using the data that we found in the cache, we > (if it's dirty) write it back, wait for the write to complete, remove > the page from the pagecache and then perform another I/O to get the data > that we just wrote out! And then the app that's using buffered I/O has > to read it back in again. Yup, that's because we have a history going back 20+ years of people finding performance regressions in applications using direct IO when we incorrectly leave pages in the page cache rather than invalidating them and continuing to do direct IO. > Nobody's proposing changing Direct I/O to exclusively work through the > pagecache. The proposal is to behave less weirdly when there's already > data in the pagecache. No, the proposal is to make direct IO behave *less deterministically* if there is data in the page cache. e.g. Instead of having a predictable submission CPU overhead and read latency of 100us for your data, this proposal makes the claim that it is always better to burn 10x the IO submission CPU for a single IO to copy the data and give that specific IO 10x lower latency than it is to submit 10 async IOs to keep the IO pipeline full. 
What it fails to take into account is that in spending that CPU time to copy the data, we haven't submitted 10 other IOs and so the actual in-flight IO for the application has decreased. If performance comes from keeping the IO pipeline as close to 100% full as possible, then copying the data out of the page cache will cause performance regressions. i.e. Hit 5 page cache pages in 5 IOs in a row, and the IO queue depth craters because we've only fulfilled 5 complete IOs instead of submitting 50 entire IOs. This is the hidden cost of synchronous IO via CPU data copying vs async IO via hardware offload, and if we take that into account we must look at future hardware performance trends to determine if this cost is going to increase or decrease in future. That is: CPUs are not getting faster anytime soon. IO subsyst
Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails
On Tue, Jul 07, 2020 at 03:00:30PM +0200, Christoph Hellwig wrote: > On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote: > > On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote: > > > On 9:53 01/07, Christoph Hellwig wrote: > > > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote: > > > > > From: Goldwyn Rodrigues > > > > > > > > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to > > > > > indicate > > > > > that if the page invalidation fails, return back control to the > > > > > filesystem so it may fallback to buffered mode. > > > > > > > > > > Reviewed-by: Darrick J. Wong > > > > > Signed-off-by: Goldwyn Rodrigues > > > > > > > > I'd like to start a discussion of this shouldn't really be the > > > > default behavior. If we have page cache that can't be invalidated it > > > > actually makes a whole lot of sense to not do direct I/O, avoid the > > > > warnings, etc. > > > > > > > > Adding all the relevant lists. > > > > > > Since no one responded so far, let me see if I can stir the cauldron :) > > > > > > What error should be returned in case of such an error? I think the > > > > Christoph's message is ambiguous. I don't know if he means "fail the > > I/O with an error" or "satisfy the I/O through the page cache". I'm > > strongly in favour of the latter. > > Same here. Sorry if my previous mail was unclear. > > > Indeed, I'm in favour of not invalidating > > the page cache at all for direct I/O. For reads, I think the page cache > > should be used to satisfy any portion of the read which is currently > > cached. For writes, I think we should write into the page cache pages > > which currently exist, and then force those pages to be written back, > > but left in cache. > > Something like that, yes. So are we really willing to take the performance regression that occurs from copying out of the page cache consuming lots more CPU than an actual direct IO read? 
Or that direct IO writes suddenly serialise because there are page cache pages and now we have to do buffered IO? Direct IO should be a deterministic, zero-copy IO path to/from storage. Using the CPU to copy data during direct IO is the complete opposite of the intended functionality, not to mention the behaviour that many applications have been carefully designed and tuned for. Hence I think that forcing iomap to use cached pages for DIO is a non-starter. I have no problems with providing infrastructure that allows filesystems to -opt in- to using buffered IO for the direct IO path. However, the change in IO behaviour caused by unpredictably switching between direct IO and buffered IO (e.g. suddenly DIO writes serialise -all IO-) will cause unacceptable performance regressions for many applications and be -very difficult to diagnose- in production systems. IOWs, we need to let the individual filesystems decide how they want to use the page cache for direct IO. Just because we have new direct IO infrastructure (i.e. iomap) it does not mean we can just make wholesale changes to the direct IO path behaviour... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t > > > pos, loff_t length, > > > } > > > ret = iomap_readpage_actor(inode, pos + done, length - done, > > > ctx, iomap, srcmap); > > > + if (WARN_ON(ret == 0)) > > > + break; > > This error case now leaks ctx->cur_page > Yes ... and I see the consequence. I mean, this is a "shouldn't happen", > so do we want to put effort into cleanup here ... Well, the normal thing for XFS is that a production kernel cleans up and handles the error gracefully with a WARN_ON_ONCE, while a debug kernel build will chuck a tanty and burn the house down so as to make the developers aware that there is a "should not happen" situation occurring. > > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, > > > struct list_head *pages, > > > done: > > > if (ctx.bio) > > > submit_bio(ctx.bio); > > > - if (ctx.cur_page) { > > > - if (!ctx.cur_page_in_bio) > > > - unlock_page(ctx.cur_page); > > > - put_page(ctx.cur_page); > > > - } > > > + BUG_ON(ctx.cur_page); > > And so will now trigger both a warn and a bug > ... or do we just want to run slap bang into this bug? > > Option 1: Remove the check for 'ret == 0' altogether, as we had it before. > That puts us into endless loop territory for a failure mode, and it's not > parallel with iomap_readpage(). > > Option 2: Remove the WARN_ON from the check. Then we just hit the BUG_ON, > but we don't know why we did it. > > Option 3: Set cur_page to NULL. We'll hit the WARN_ON, avoid the BUG_ON, > might end up with a page in the page cache which is never unlocked. None of these are appealing. > Option 4: Do the unlock/put page dance before setting the cur_page to NULL. > We might double-unlock the page. why would we double unlock the page? 
Oh, the readahead cursor doesn't handle the case of partial page submission, which would result in IO completion unlocking the page. Ok, that's what the ctx.cur_page_in_bio check is used to detect i.e. if we've got a page that the readahead cursor points at, and we haven't actually added it to a bio, then we can leave it to the read_pages() to unlock and clean up. If it's in a bio, then IO completion will unlock it and so we only have to drop the submission reference and move the readahead cursor forwards so read_pages() doesn't try to unlock this page. i.e:

	/* clean up partial page submission failures */
	if (ctx.cur_page && ctx.cur_page_in_bio) {
		put_page(ctx.cur_page);
		readahead_next(rac);
	}

looks to me like it will handle the case of "ret == 0" in the actor function just fine. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote: > On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote: > > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > > > Latest version in your git tree: > > > > > > > > > > $ ▶ glo -n 5 willy/readahead > > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > > > $ > > > > > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > > > > > [2.423116] [ cut here ] > > > > > [2.424957] list_add double add: new=ea000efff4c8, > > > > > prev=8883bfffee60, next=ea000efff4c8. > > > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > > > __list_add_valid+0x67/0x70 > > > > > [2.457484] Call Trace: > > > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > > > [2.462787] lru_add_drain+0x18/0x20 > > > > > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > > > > > Yes, because it's the only version I've actually merged into my > > > working tree, compiled and tried to run. :P > > > > > > > version pushed to that git tree that did contain a list double-add > > > > (due to a mismerge when shuffling patches). I noticed it and fixed > > > > it, and 4be497096c04 doesn't have that problem. 
I also test with > > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > > > probabilistic because it'll depend on the timing between whatever other > > > > list is being used and the page actually being added to the LRU. > > > > > > I'll see if I can reproduce it. > > > > Just updated to a current TOT Linus kernel and your latest branch, > > and so far this is 100% reproducable. > > > > Not sure how I'm going to debug it yet, because it's init that is > > triggering it > > Eric found it ... Yeah, just saw that and am applying his patch to test it... > still not sure why I don't see it. No readahead configured on your device? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote: > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > > Latest version in your git tree: > > > > > > $ ▶ glo -n 5 willy/readahead > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > > ff63497fcb98 iomap: Convert from readpages to readahead > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > > 8115bcca7312 fuse: Convert from readpages to readahead > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > > $ > > > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > > > [2.423116] [ cut here ] > > > [2.424957] list_add double add: new=ea000efff4c8, > > > prev=8883bfffee60, next=ea000efff4c8. > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > > __list_add_valid+0x67/0x70 > > > [2.457484] Call Trace: > > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > > [2.462787] lru_add_drain+0x18/0x20 > > > > Are you sure that was 4be497096c04 ? I ask because there was a > > Yes, because it's the only version I've actually merged into my > working tree, compiled and tried to run. :P > > > version pushed to that git tree that did contain a list double-add > > (due to a mismerge when shuffling patches). I noticed it and fixed > > it, and 4be497096c04 doesn't have that problem. I also test with > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > > probabilistic because it'll depend on the timing between whatever other > > list is being used and the page actually being added to the LRU. > > I'll see if I can reproduce it. Just updated to a current TOT Linus kernel and your latest branch, and so far this is 100% reproducable. 
Not sure how I'm going to debug it yet, because it's init that is triggering it Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 18/19] iomap: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in iomap. Convert XFS and ZoneFS to > use it. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 91 +++--- > fs/iomap/trace.h | 2 +- > fs/xfs/xfs_aops.c | 13 +++--- > fs/zonefs/super.c | 7 ++-- > include/linux/iomap.h | 3 +- > 5 files changed, 42 insertions(+), 74 deletions(-) All pretty straight forward... > @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > break; > done += ret; > if (offset_in_page(pos + done) == 0) { > + readahead_next(ctx->rac); > if (!ctx->cur_page_in_bio) > unlock_page(ctx->cur_page); > put_page(ctx->cur_page); Though now I look at the addition here, this might be better restructured to mention how we handle partial page submission, such as:

		done += ret;
		/*
		 * Keep working on a partially complete page, otherwise ready
		 * the ctx for the next page to be acted on.
		 */
		if (offset_in_page(pos + done))
			continue;
		if (!ctx->cur_page_in_bio)
			unlock_page(ctx->cur_page);
		put_page(ctx->cur_page);
		ctx->cur_page = NULL;
		readahead_next(ctx->rac);
	}

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor
On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > By putting the 'have we reached the end of the page' condition at the end > of the loop instead of the beginning, we can remove the 'submit the last > page' code from iomap_readpages(). Also check that iomap_readpage_actor() > didn't return 0, which would lead to an endless loop. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/iomap/buffered-io.c | 25 - > 1 file changed, 12 insertions(+), 13 deletions(-) > > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c > index cb3511eb152a..44303f370b2d 100644 > --- a/fs/iomap/buffered-io.c > +++ b/fs/iomap/buffered-io.c > @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > void *data, struct iomap *iomap, struct iomap *srcmap) > { > struct iomap_readpage_ctx *ctx = data; > - loff_t done, ret; > + loff_t ret, done = 0; > > - for (done = 0; done < length; done += ret) { > - if (ctx->cur_page && offset_in_page(pos + done) == 0) { > - if (!ctx->cur_page_in_bio) > - unlock_page(ctx->cur_page); > - put_page(ctx->cur_page); > - ctx->cur_page = NULL; > - } > + while (done < length) { > if (!ctx->cur_page) { > ctx->cur_page = iomap_next_page(inode, ctx->pages, > pos, length, ); > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, > loff_t length, > } > ret = iomap_readpage_actor(inode, pos + done, length - done, > ctx, iomap, srcmap); > + if (WARN_ON(ret == 0)) > + break; This error case now leaks ctx->cur_page > + done += ret; > + if (offset_in_page(pos + done) == 0) { > + if (!ctx->cur_page_in_bio) > + unlock_page(ctx->cur_page); > + put_page(ctx->cur_page); > + ctx->cur_page = NULL; > + } > } > > return done; > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct > list_head *pages, > done: > if (ctx.bio) > submit_bio(ctx.bio); > - if (ctx.cur_page) { > - if (!ctx.cur_page_in_bio) > - unlock_page(ctx.cur_page); > - 
put_page(ctx.cur_page); > - } > + BUG_ON(ctx.cur_page); And so will now trigger both a warn and a bug Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 16/19] fuse: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in fuse. Switching away from the > read_cache_pages() helper gets rid of an implicit call to put_page(), > so we can get rid of the get_page() call in fuse_readpages_fill(). > > Signed-off-by: Matthew Wilcox (Oracle) Looks OK. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 14/19] ext4: Convert from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in ext4 > > Signed-off-by: Matthew Wilcox (Oracle) There's nothing I can see in this that would cause that list corruption I saw with ext4. I'll re-introduce the patch and see if it falls over again. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/zdata.c | 29 + > 1 file changed, 9 insertions(+), 20 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead
On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Use the new readahead operation in erofs > > Signed-off-by: Matthew Wilcox (Oracle) > --- > fs/erofs/data.c | 39 +--- > fs/erofs/zdata.c | 2 +- > include/trace/events/erofs.h | 6 +++--- > 3 files changed, 18 insertions(+), 29 deletions(-) Looks fine from the perspective of page iteration and error handling. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > - if (nr) { > > > - u64 contig_start = page_offset(pagepool[0]); > > > + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end); > > > > Ok, yes it does. :) > > > > I don't see how readahead_for_each_batch() guarantees that, though. > > I ... don't see how it doesn't? We start at rac->_start and iterate > through the consecutive pages in the page cache. readahead_for_each_batch() > does assume that __do_page_cache_readahead() has its current behaviour > of putting the pages in the page cache in order, and kicks off a new > call to ->readahead() every time it has to skip an index for whatever > reason (eg page already in page cache). And there is the comment I was looking for while reading readahead_for_each_batch() :) > > > > - if (bio) > > > - return submit_one_bio(bio, 0, bio_flags); > > > - return 0; > > > + if (bio) { > > > + if (submit_one_bio(bio, 0, bio_flags)) > > > + return; > > > + } > > > } > > > > Shouldn't that just be > > > > if (bio) > > submit_one_bio(bio, 0, bio_flags); > > It should, but some overzealous person decided to mark submit_one_bio() > as __must_check, so I have to work around that. /me looks at code Ngggh. I rather dislike functions that are named in a way that look like they belong to core kernel APIs but in reality are local static functions. I'd ask for this to be fixed if it was generic code, but it's btrfs specific code so they can deal with the ugliness of their own creation. :/ > > Confusing when put alongside rac->_batch_count counting the number > > of pages in the batch, and "batch" being the index into the page > > array, and they aren't the same counts > > Yes. Renamed to 'i'. 
> > > > + XA_STATE(xas, >mapping->i_pages, rac->_start); > > > + struct page *page; > > > + > > > + rac->_batch_count = 0; > > > + xas_for_each(, page, rac->_start + rac->_nr_pages - 1) { > > > > That just iterates pages in the start,end doesn't it? What > > guarantees that this fills the array with a contiguous page range? > > The behaviour of __do_page_cache_readahead(). Dave Howells also has a > usecase for xas_for_each_contig(), so I'm going to add that soon. > > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + VM_BUG_ON_PAGE(PageTail(page), page); > > > + array[batch++] = page; > > > + rac->_batch_count += hpage_nr_pages(page); > > > + if (PageHead(page)) > > > + xas_set(, rac->_start + rac->_batch_count); > > > > What on earth does this do? Comments please! > > /* >* The page cache isn't using multi-index entries yet, >* so xas_for_each() won't do the right thing for >* large pages. This can be removed once the page cache >* is converted. >*/ Oh, it's changing the internal xarray lookup cursor position to point at the correct next page index? Perhaps it's better to say that instead of "won't do the right thing"? > > > +#define readahead_for_each_batch(rac, array, size, nr) > > > \ > > > + for (; (nr = readahead_page_batch(rac, array, size)); \ > > > + readahead_next(rac)) > > > > I had to go look at the caller to work out what "size" refered to > > here. > > > > This is complex enough that it needs proper API documentation. > > How about just: > > -#define readahead_for_each_batch(rac, array, size, nr) \ > - for (; (nr = readahead_page_batch(rac, array, size)); \ > +#define readahead_for_each_batch(rac, array, array_sz, nr) \ > + for (; (nr = readahead_page_batch(rac, array, array_sz)); \ Yup, that's fine - now the macro documents itself. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > ext4 and f2fs have duplicated the guts of the readahead code so > > > they can read past i_size. Instead, separate out the guts of the > > > readahead code so they can call it directly. > > > > Gross and nasty (hosting non-stale data beyond EOF in the page > > cache, that is). > > I thought you meant sneaking changes into the VFS (that were rejected) by > copying VFS code and modifying it ... Well, now that you mention it... :P > > > +/** > > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > > > + * @mapping: File address space. > > > + * @file: This instance of the open file; used for authentication. > > > + * @offset: First page index to read. > > > + * @end_index: The maximum page index to read. > > > + * @nr_to_read: The number of pages to read. > > > + * @lookahead_size: Where to start the next readahead. > > > + * > > > + * This function is for filesystems to call when they want to start > > > + * readahead potentially beyond a file's stated i_size. If you want > > > + * to start readahead on a normal file, you probably want to call > > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > > > + * > > > + * Context: File is referenced by caller. Mutexes may be held by caller. > > > + * May sleep, but will not reenter filesystem to reclaim memory. > > > */ > > > -void __do_page_cache_readahead(struct address_space *mapping, > > > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > > > - unsigned long lookahead_size) > > > +void page_cache_readahead_limit(struct address_space *mapping, > > > > ... I don't think the function name conveys it's purpose. It's > > really a ranged readahead that ignores where i_size lies. 
i.e > > > > page_cache_readahead_range(mapping, start, end, nr_to_read) > > > > seems like a better API to me, and then you can drop the "start > > readahead beyond i_size" comments and replace it with "Range is not > > limited by the inode's i_size and hence can be used to read data > > stored beyond EOF into the page cache." > > I'm concerned that calling it 'range' implies "I want to read between > start and end" rather than "I want to read nr_to_read at start, oh but > don't go past end". > > Maybe the right way to do this is have the three callers cap nr_to_read. > Well, the one caller ... after all, f2fs and ext4 have no desire to > cap the length. Then we can call it page_cache_readahead_exceed() or > page_cache_readahead_dangerous() or something else like that to make it > clear that you shouldn't be calling it. Fair point. And in reading this, it occurred to me that what we are enabling is an "out of bounds" readahead function. so page_cache_readahead_OOB() or *_unbounded() might be a better name > * Like add_to_page_cache_locked, but used to add newly allocated pages: > diff --git a/mm/readahead.c b/mm/readahead.c > index 9dd431fa16c9..cad26287ad8b 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > blk_finish_plug(); > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_exceed - Start unchecked readahead. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @index: First page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. 
> + * > + * This function is for filesystems to call when they want to start > + * readahead beyond a file's stated i_size. This is almost certainly > + * not the function you want to call. Use page_cache_async_readahead() > + * or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. Yup, looks much better. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation
On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > This replaces ->readpages with a saner interface: > > > - Return void instead of an ignored error code. > > > - Pages are already in the page cache when ->readahead is called. > > > > Might read better as: > > > > - Page cache is already populated with locked pages when > >->readahead is called. > > Will do. > > > > - Implementation looks up the pages in the page cache instead of > > >having them passed in a linked list. > > > > Add: > > > > - cleanup of unused readahead handled by ->readahead caller, not > >the method implementation. > > The readpages caller does that cleanup too, so it's not an advantage > to the readahead interface. Right. I kind of read the list as "the reasons the new API is sane" not as "how readahead is different to readpages" > > > ``readpages`` > > > called by the VM to read pages associated with the address_space > > > object. This is essentially just a vector version of readpage. > > > Instead of just one page, several pages are requested. > > > readpages is only used for read-ahead, so read errors are > > > ignored. If anything goes wrong, feel free to give up. > > > + This interface is deprecated; implement readahead instead. > > > > What is the removal schedule for the deprecated interface? > > I mentioned that in the cover letter; once Dave Howells has the fscache > branch merged, I'll do the remaining filesystems. Should be within the > next couple of merge windows. Sure, but I like to see actual release tags with the deprecation notice so that it's obvious to the reader as to whether this is something new and/or when they can expect it to go away. 
> > > +/* The byte offset into the file of this readahead block */ > > > +static inline loff_t readahead_offset(struct readahead_control *rac) > > > +{ > > > + return (loff_t)rac->_start * PAGE_SIZE; > > > +} > > > > Urk. Didn't an early page use "offset" for the page index? That > > was was "mm: Remove 'page_offset' from readahead loop" did, right? > > > > That's just going to cause confusion to have different units for > > readahead "offsets" > > We are ... not consistent anywhere in the VM/VFS with our naming. > Unfortunately. > > $ grep -n offset mm/filemap.c > 391: * @start:offset in bytes where the range starts > ... > 815: pgoff_t offset = old->index; > ... > 2020: unsigned long offset; /* offset into pagecache page */ > ... > 2257: *ppos = ((loff_t)index << PAGE_SHIFT) + offset; > > That last one's my favourite. Not to mention the fine distinction you > and I discussed recently between offset_in_page() and page_offset(). > > Best of all, even our types encode the ambiguity of an 'offset'. We have > pgoff_t and loff_t (replacing the earlier off_t). > > So, new rule. 'pos' is the number of bytes into a file. 'index' is the > number of PAGE_SIZE pages into a file. We don't use the word 'offset' > at all. 'length' as a byte count and 'count' as a page count seem like > fine names to me. That sounds very reasonable to me. Another patchset in the making? :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > At allocation time, put the pages in the cache unless we're using > > > ->readpages. Add the readahead_for_each() iterator for the benefit of > > > the ->readpage fallback. This iterator supports huge pages, even though > > > none of the filesystems to be converted do yet. > > > > This could be better written - took me some time to get my head > > around it and the code. > > > > "When populating the page cache for readahead, mappings that don't > > use ->readpages need to have their pages added to the page cache > > before ->readpage is called. Do this insertion earlier so that the > > pages can be looked up immediately prior to ->readpage calls rather > > than passing them on a linked list. This early insert functionality > > is also required by the upcoming ->readahead method that will > > replace ->readpages. > > > > Optimise and simplify the readpage loop by adding a > > readahead_for_each() iterator to provide the pages we need to read. > > This iterator also supports huge pages, even though none of the > > filesystems have been converted to use them yet." > > Thanks, I'll use that. > > > > +static inline struct page *readahead_page(struct readahead_control *rac) > > > +{ > > > + struct page *page; > > > + > > > + if (!rac->_nr_pages) > > > + return NULL; > > > > H. > > > > > + > > > + page = xa_load(>mapping->i_pages, rac->_start); > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + rac->_batch_count = hpage_nr_pages(page); > > > > So we could have rac->_nr_pages = 2, and then we get an order 2 > > large page returned, and so rac->_batch_count = 4. > > Well, no, we couldn't. rac->_nr_pages is incremented by 4 when we add > an order-2 page to the readahead. I don't see any code that does that. :) i.e. 
we aren't actually putting high order pages into the page cache here (page_alloc() allocates order-0 pages), so there's nothing in the patch that tells me how rac->_nr_pages behaves when allocating large pages... IOWs, we have an undocumented assumption in the implementation... > I can put a > BUG_ON(rac->_batch_count > rac->_nr_pages) > in here to be sure to catch any logic error like that. Definitely necessary given that we don't insert large pages for readahead yet. A comment explaining the assumptions that the code makes for large pages is probably in order, too. > > > - page->index = offset; > > > - list_add(>lru, _pool); > > > + if (use_list) { > > > + page->index = offset; > > > + list_add(>lru, _pool); > > > + } else if (add_to_page_cache_lru(page, mapping, offset, > > > + gfp_mask) < 0) { > > > + put_page(page); > > > + goto read; > > > + } > > Ok, so that's why you put read code at the end of the loop. To turn > > the code into spaghetti :/ > > How much does this simplify down when we get rid of ->readpages and > > can restructure the loop? This really seems like you're trying to > > flatten two nested loops into one by the use of goto > I see it as having two failure cases in this loop. One for "page is > already present" (which already existed) and one for "allocated a page, > but failed to add it to the page cache" (which used to be done later). > I didn't want to duplicate the "call read_pages()" code. So I reshuffled > the code rather than add a nested loop. I don't think the nested loop > is easier to read (we'll be at 5 levels of indentation for some statements). > Could do it this way ... Can we move the update of @rac inside read_pages()? The next start offset^Windex we start at is rac._start + rac._nr_pages, right? 
so read_pages() could do:

	{
		if (readahead_count(rac)) {
			/* do readahead */
		}
		/* advance the readahead cursor */
		rac->_start += rac->_nr_pages;
		rac->_nr_pages = 0;
	}

and then we only need to call read_pages() in these cases, and so the need to duplicate code goes away... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 04/19] mm: Rearrange readahead loop
On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > > > From: "Matthew Wilcox (Oracle)" > > > > > > Move the declaration of 'page' to inside the loop and move the 'kick > > > off a fresh batch' code to the end of the function for easier use in > > > subsequent patches. > > > > Stale? the "kick off" code is moved to the tail of the loop, not the > > end of the function. > > Braino; I meant to write end of the loop. > > > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > page = xa_load(&mapping->i_pages, page_offset); > > > if (page && !xa_is_value(page)) { > > > /* > > > - * Page already present? Kick off the current batch of > > > - * contiguous pages before continuing with the next > > > - * batch. > > > + * Page already present? Kick off the current batch > > > + * of contiguous pages before continuing with the > > > + * next batch. This page may be the one we would > > > + * have intended to mark as Readahead, but we don't > > > + * have a stable reference to this page, and it's > > > + * not worth getting one just for that. > > >*/ > > > - if (readahead_count(&rac)) > > > - read_pages(&rac, &page_pool, gfp_mask); > > > - rac._nr_pages = 0; > > > - continue; > > > + goto read; > > > } > > > > > > page = __page_cache_alloc(gfp_mask); > > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > > > *mapping, > > > if (page_idx == nr_to_read - lookahead_size) > > > SetPageReadahead(page); > > > rac._nr_pages++; > > > + continue; > > > +read: > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > } > > > > Also, why? This adds a goto from branched code that continues, then > > adds a continue so the unbranched code doesn't execute the code the > > goto jumps to. 
In absence of any explanation, this isn't an > > improvement and doesn't make any sense... > > I thought I was explaining it ... "for easier use in subsequent patches". Sorry, my braino there. :) I commented on the problem with the first part of the sentence, then the rest of the sentence completely failed to sink in. -Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote: > > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > > > +static void read_pages(struct readahead_control *rac, struct list_head > > > *pages, > > > + gfp_t gfp) > > > { > > > + const struct address_space_operations *aops = rac->mapping->a_ops; > > > struct blk_plug plug; > > > unsigned page_idx; > > > > Splitting out the aops rather than the mapping here just looks > > weird, especially as you need the mapping later in the function. > > Using aops doesn't even reduce the code size > > It does in subsequent patches ... I agree it looks a little weird here, > but I think in the final form, it makes sense: Ok. Perhaps just an additional commit comment to say "read_pages() is changed to be aops centric as @rac abstracts away all other implementation details by the end of the patchset." > > > + if (readahead_count(&rac)) > > > + read_pages(&rac, &page_pool, gfp_mask); > > > + rac._nr_pages = 0; > > > > Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead > > control structure - if we have to pass the gfp_mask down all the > > way alongside the rac, then I think it makes sense to do that... > > So we end up removing it later on in this series, but I do wonder if > it would make sense anyway. By the end of the series, we still have > this in iomap: > > if (ctx->rac) /* same as readahead_gfp_mask */ > gfp |= __GFP_NORETRY | __GFP_NOWARN; > > and we could get rid of that by passing gfp flags down in the rac. On the > other hand, I don't know why it doesn't just use readahead_gfp_mask() > here anyway ... Christoph? mapping->gfp_mask is awful. Is it a mask, or is it a valid set of allocation flags? Or both? 
Some callers to mapping_gfp_constraint() use it as a mask, some callers to mapping_gfp_constraint() use it as base flags that context specific flags get masked out of, readahead_gfp_mask() callers use it as the entire set of gfp flags for allocation. That whole API sucks - undocumented as to what it's supposed to do and how it's supposed to be used. Hence it's difficult to use correctly or understand whether it's being used correctly. And reading callers only leads to more confusion and crazy code like in do_mpage_readpage() where readahead returns a mask that is used as base flags and normal reads return a masked set of base flags... The iomap code is obviously correct when it comes to gfp flag manipulation. We start with GFP_KERNEL context, then constrain it via the mask held in mapping->gfp_mask, then if it's readahead we allow the allocation to silently fail. Simple to read and understand code, versus having weird code that requires the reader to decipher an undocumented and inconsistent API to understand how the gfp flags have been calculated and are valid. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
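The distinction Dave draws - gfp_mask used as a constraining mask versus used as a complete set of allocation flags - can be illustrated with a small userspace model. The flag values below are arbitrary stand-ins, not the kernel's real gfp bits, and iomap_readpage_gfp() is a hypothetical name for the iomap pattern described above.

```c
#include <assert.h>

/* Stand-in flag bits for illustration only. */
#define GFP_KERNEL      0x01u
#define __GFP_NORETRY   0x02u
#define __GFP_NOWARN    0x04u
#define __GFP_FS        0x08u

typedef unsigned int gfp_t;

struct address_space { gfp_t gfp_mask; };

/* gfp_mask used as a mask: constrain @gfp to the bits the mapping allows. */
static gfp_t mapping_gfp_constraint(struct address_space *m, gfp_t gfp)
{
    return m->gfp_mask & gfp;
}

/* The pattern Dave calls "obviously correct": start from full GFP_KERNEL
 * context, constrain via the mapping's mask, then for readahead relax the
 * allocation so it may fail silently. */
static gfp_t iomap_readpage_gfp(struct address_space *m, int is_readahead)
{
    gfp_t gfp = mapping_gfp_constraint(m, GFP_KERNEL | __GFP_FS);

    if (is_readahead)
        gfp |= __GFP_NORETRY | __GFP_NOWARN; /* allow silent failure */
    return gfp;
}
```

The confusion Dave complains about arises when the same field is sometimes ANDed in (a mask) and sometimes used wholesale as the allocation flags.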
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote: > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote: > > Latest version in your git tree: > > > > $ ▶ glo -n 5 willy/readahead > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path > > ff63497fcb98 iomap: Convert from readpages to readahead > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor > > 8115bcca7312 fuse: Convert from readpages to readahead > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead > > $ > > > > merged into a 5.6-rc2 tree fails at boot on my test vm: > > > > [2.423116] [ cut here ] > > [2.424957] list_add double add: new=ea000efff4c8, > > prev=8883bfffee60, next=ea000efff4c8. > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 > > __list_add_valid+0x67/0x70 > > [2.457484] Call Trace: > > [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 > > [2.459376] pagevec_lru_move_fn+0x87/0xd0 > > [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 > > [2.461712] lru_add_drain_cpu+0x8d/0x160 > > [2.462787] lru_add_drain+0x18/0x20 > > Are you sure that was 4be497096c04 ? I ask because there was a Yes, because it's the only version I've actually merged into my working tree, compiled and tried to run. :P > version pushed to that git tree that did contain a list double-add > (due to a mismerge when shuffling patches). I noticed it and fixed > it, and 4be497096c04 doesn't have that problem. I also test with > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be > probabilistic because it'll depend on the timing between whatever other > list is being used and the page actually being added to the LRU. I'll see if I can reproduce it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 11/19] btrfs: Convert from readpages to readahead
VM_BUG_ON_PAGE(!PageLocked(page), page); > + VM_BUG_ON_PAGE(PageTail(page), page); > + array[batch++] = page; > + rac->_batch_count += hpage_nr_pages(page); > + if (PageHead(page)) > + xas_set(&xas, rac->_start + rac->_batch_count); What on earth does this do? Comments please! > + > + if (batch == size) > + break; > + } > + > + return batch; > +} Seems a bit big for an inline function. > + > +#define readahead_for_each_batch(rac, array, size, nr) \ > + for (; (nr = readahead_page_batch(rac, array, size)); \ > + readahead_next(rac)) I had to go look at the caller to work out what "size" referred to here. This is complex enough that it needs proper API documentation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
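The batch iterator under discussion can be sketched in miniature. The model below is userspace C with page indexes standing in for struct page, and it assumes order-0 pages (each page contributes one index), matching the no-large-pages caveat raised elsewhere in the thread. "size" is the capacity of the caller's array; the return value is how many entries were filled for this batch.

```c
#include <assert.h>

/* Userspace model of the batch iterator; field names follow the thread. */
struct readahead_control {
    unsigned long _start;      /* index of first page not yet consumed */
    unsigned int _nr_pages;    /* pages remaining in the readahead block */
    unsigned int _batch_count; /* pages handed out in the current batch */
};

/* Fill @array with up to @size page indexes from the current position.
 * Returns the number filled; 0 terminates readahead_for_each_batch(). */
static unsigned int readahead_page_batch(struct readahead_control *rac,
                                         unsigned long *array,
                                         unsigned int size)
{
    unsigned int batch = 0;

    rac->_batch_count = 0;
    while (batch < size && rac->_batch_count < rac->_nr_pages) {
        array[batch++] = rac->_start + rac->_batch_count;
        rac->_batch_count++; /* order-0: one index per page */
    }
    return batch;
}

/* Advance past the batch just handed out. */
static void readahead_next(struct readahead_control *rac)
{
    rac->_start += rac->_batch_count;
    rac->_nr_pages -= rac->_batch_count;
}
```

A 5-page block iterated with a 2-entry array yields batches of 2, 2 and 1, then terminates - which is the behaviour the macro's emptiness test relies on.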
Re: [Cluster-devel] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Implement the new readahead aop and convert all callers (block_dev, > exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6, > reiserfs & udf). The callers are all trivial except for GFS2 & OCFS2. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Junxiao Bi # ocfs2 > --- > drivers/staging/exfat/exfat_super.c | 7 +++--- > fs/block_dev.c | 7 +++--- > fs/ext2/inode.c | 10 +++- > fs/fat/inode.c | 7 +++--- > fs/gfs2/aops.c | 23 ++--- > fs/hpfs/file.c | 7 +++--- > fs/iomap/buffered-io.c | 2 +- > fs/isofs/inode.c| 7 +++--- > fs/jfs/inode.c | 7 +++--- > fs/mpage.c | 38 + > fs/nilfs2/inode.c | 15 +++- > fs/ocfs2/aops.c | 34 ++ > fs/omfs/file.c | 7 +++--- > fs/qnx6/inode.c | 7 +++--- > fs/reiserfs/inode.c | 8 +++--- > fs/udf/inode.c | 7 +++--- > include/linux/mpage.h | 4 +-- > mm/migrate.c| 2 +- > 18 files changed, 73 insertions(+), 126 deletions(-) That's actually pretty simple changeover. Nothing really scary there. :) Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 09/19] mm: Add page_cache_readahead_limit
On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > ext4 and f2fs have duplicated the guts of the readahead code so > they can read past i_size. Instead, separate out the guts of the > readahead code so they can call it directly. Gross and nasty (hosting non-stale data beyond EOF in the page cache, that is). Code is pretty simple, but... > } > > -/* > - * __do_page_cache_readahead() actually reads a chunk of disk. It allocates > - * the pages first, then submits them for I/O. This avoids the very bad > - * behaviour which would occur if page allocations are causing VM writeback. > - * We really don't want to intermingle reads and writes like that. > +/** > + * page_cache_readahead_limit - Start readahead beyond a file's i_size. > + * @mapping: File address space. > + * @file: This instance of the open file; used for authentication. > + * @offset: First page index to read. > + * @end_index: The maximum page index to read. > + * @nr_to_read: The number of pages to read. > + * @lookahead_size: Where to start the next readahead. > + * > + * This function is for filesystems to call when they want to start > + * readahead potentially beyond a file's stated i_size. If you want > + * to start readahead on a normal file, you probably want to call > + * page_cache_async_readahead() or page_cache_sync_readahead() instead. > + * > + * Context: File is referenced by caller. Mutexes may be held by caller. > + * May sleep, but will not reenter filesystem to reclaim memory. > */ > -void __do_page_cache_readahead(struct address_space *mapping, > - struct file *filp, pgoff_t offset, unsigned long nr_to_read, > - unsigned long lookahead_size) > +void page_cache_readahead_limit(struct address_space *mapping, ... I don't think the function name conveys its purpose. It's really a ranged readahead that ignores where i_size lies. 
i.e. page_cache_readahead_range(mapping, start, end, nr_to_read) seems like a better API to me, and then you can drop the "start readahead beyond i_size" comments and replace it with "Range is not limited by the inode's i_size and hence can be used to read data stored beyond EOF into the page cache." Also: "This is almost certainly not the function you want to call. Use page_cache_async_readahead() or page_cache_sync_readahead() instead." Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 08/19] mm: Add readahead address space operation
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index 3613154e79e4..bd4291f78f41 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -665,6 +665,24 @@ static inline void readahead_next(struct > readahead_control *rac) > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); readahead_next(rac)) > > +/* The byte offset into the file of this readahead block */ > +static inline loff_t readahead_offset(struct readahead_control *rac) > +{ > + return (loff_t)rac->_start * PAGE_SIZE; > +} Urk. Didn't an earlier patch use "offset" for the page index? That was what "mm: Remove 'page_offset' from readahead loop" did, right? That's just going to cause confusion to have different units for readahead "offsets" > + > +/* The number of bytes in this readahead block */ > +static inline loff_t readahead_length(struct readahead_control *rac) > +{ > + return (loff_t)rac->_nr_pages * PAGE_SIZE; > +} > + > +/* The index of the first page in this readahead block */ > +static inline unsigned int readahead_index(struct readahead_control *rac) > +{ > + return rac->_start; > +} Based on this, I suspect the earlier patch should use "index" rather than "offset" when walking the page cache indexes... > + > /* The number of pages in this readahead block */ > static inline unsigned int readahead_count(struct readahead_control *rac) > { > diff --git a/mm/readahead.c b/mm/readahead.c > index 9e430daae42f..975ff5e387be 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, > struct list_head *pages) > > blk_start_plug(&plug); > > - if (aops->readpages) { > + if (aops->readahead) { > + aops->readahead(rac); > + readahead_for_each(rac, page) { > + unlock_page(page); > + put_page(page); > + } This needs a comment to explain the unwinding that needs to be done here. I'm not going to remember in a year's time that this is just for the pages that weren't submitted by ->readahead Cheers, Dave. -- Dave Chinner da...@fromorbit.com
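The unit mismatch Dave flags - readahead_offset() in bytes, readahead_index() in pages, both derived from the same _start field - is easy to demonstrate with a standalone model of the accessors (userspace C, with PAGE_SIZE hardcoded to 4096 for illustration):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT) /* 4096 for this model */

typedef long long loff_t;

/* Stand-in for the struct in the quoted patch. */
struct readahead_control {
    unsigned long _start;   /* a PAGE index, not a byte offset */
    unsigned int _nr_pages;
};

/* Byte offset into the file: the page index scaled by the page size. */
static loff_t readahead_offset(struct readahead_control *rac)
{
    return (loff_t)rac->_start * PAGE_SIZE;
}

/* Length of the block in bytes. */
static loff_t readahead_length(struct readahead_control *rac)
{
    return (loff_t)rac->_nr_pages * PAGE_SIZE;
}

/* Page-cache index of the first page: the same quantity, unscaled. */
static unsigned long readahead_index(struct readahead_control *rac)
{
    return rac->_start;
}
```

Two accessors that differ only by a PAGE_SHIFT scaling but share the "offset" vocabulary are exactly the confusion being objected to.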
Re: [Cluster-devel] [PATCH v6 07/19] mm: Put readahead pages in cache earlier
page->index = offset; > - list_add(&page->lru, &page_pool); > + if (use_list) { > + page->index = offset; > + list_add(&page->lru, &page_pool); > + } else if (add_to_page_cache_lru(page, mapping, offset, > + gfp_mask) < 0) { > + put_page(page); > + goto read; > + } Ok, so that's why you put read code at the end of the loop. To turn the code into spaghetti :/ How much does this simplify down when we get rid of ->readpages and can restructure the loop? This really seems like you're trying to flatten two nested loops into one by the use of goto Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'
On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Change the type of page_idx to unsigned long, and rename it -- it's > just a loop counter, not a page index. > > Suggested-by: John Hubbard > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) Looks fine. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop
On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Eliminate the page_offset variable which was confusing with the > 'offset' parameter and record the start of each consecutive run of > pages in the readahead_control. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/readahead.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) Looks ok, but having the readahead dispatch out of line from the case that triggers it makes it hard to follow. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 04/19] mm: Rearrange readahead loop
On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > Move the declaration of 'page' to inside the loop and move the 'kick > off a fresh batch' code to the end of the function for easier use in > subsequent patches. Stale? the "kick off" code is moved to the tail of the loop, not the end of the function. > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space > *mapping, > page = xa_load(&mapping->i_pages, page_offset); > if (page && !xa_is_value(page)) { > /* > - * Page already present? Kick off the current batch of > - * contiguous pages before continuing with the next > - * batch. > + * Page already present? Kick off the current batch > + * of contiguous pages before continuing with the > + * next batch. This page may be the one we would > + * have intended to mark as Readahead, but we don't > + * have a stable reference to this page, and it's > + * not worth getting one just for that. >*/ > - if (readahead_count(&rac)) > - read_pages(&rac, &page_pool, gfp_mask); > - rac._nr_pages = 0; > - continue; > + goto read; > } > > page = __page_cache_alloc(gfp_mask); > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space > *mapping, > if (page_idx == nr_to_read - lookahead_size) > SetPageReadahead(page); > rac._nr_pages++; > + continue; > +read: > + if (readahead_count(&rac)) > + read_pages(&rac, &page_pool, gfp_mask); > + rac._nr_pages = 0; > } Also, why? This adds a goto from branched code that continues, then adds a continue so the unbranched code doesn't execute the code the goto jumps to. In the absence of any explanation, this isn't an improvement and doesn't make any sense... -- Dave Chinner da...@fromorbit.com
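Whichever control flow is used - the original duplicated dispatch or the goto/continue form being debated - the batching behaviour is identical. The standalone model below (not kernel code) captures that behaviour: contiguous runs of absent pages form a batch, a batch is dispatched whenever an already-present page interrupts the run, and any trailing batch is dispatched after the loop.

```c
#include <assert.h>

/* Model of the __do_page_cache_readahead() loop: walk @nr page slots,
 * where present[i] != 0 marks a page already in the cache. Returns the
 * number of read_pages() dispatches; *total gets the pages submitted. */
static int readahead_batches(const int *present, int nr, int *total)
{
    int nr_pages = 0, dispatches = 0;

    *total = 0;
    for (int i = 0; i < nr; i++) {
        if (present[i]) {
            /* "Kick off the current batch" - the code the goto jumps to */
            if (nr_pages) {
                dispatches++;
                *total += nr_pages;
            }
            nr_pages = 0;
            continue;
        }
        nr_pages++; /* page allocated and appended to the batch */
    }
    if (nr_pages) { /* trailing batch submitted after the loop */
        dispatches++;
        *total += nr_pages;
    }
    return dispatches;
}
```

The goto form and the duplicated form both reduce to this: the disagreement in the thread is purely about readability, not behaviour.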
Re: [Cluster-devel] [PATCH v6 03/19] mm: Use readahead_control to pass arguments
On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > In this patch, only between __do_page_cache_readahead() and > read_pages(), but it will be extended in upcoming patches. Also add > the readahead_count() accessor. > > Signed-off-by: Matthew Wilcox (Oracle) > --- > include/linux/pagemap.h | 17 + > mm/readahead.c | 36 +--- > 2 files changed, 38 insertions(+), 15 deletions(-) > > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h > index ccb14b6a16b5..982ecda2d4a2 100644 > --- a/include/linux/pagemap.h > +++ b/include/linux/pagemap.h > @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page, > return error; > } > > +/* > + * Readahead is of a block of consecutive pages. > + */ > +struct readahead_control { > + struct file *file; > + struct address_space *mapping; > +/* private: use the readahead_* accessors instead */ > + pgoff_t _start; > + unsigned int _nr_pages; > +}; > + > +/* The number of pages in this readahead block */ > +static inline unsigned int readahead_count(struct readahead_control *rac) > +{ > + return rac->_nr_pages; > +} > + > static inline unsigned long dir_pages(struct inode *inode) > { > return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >> > diff --git a/mm/readahead.c b/mm/readahead.c > index 12d13b7792da..15329309231f 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, > struct list_head *pages, > > EXPORT_SYMBOL(read_cache_pages); > > -static void read_pages(struct address_space *mapping, struct file *filp, > - struct list_head *pages, unsigned int nr_pages, gfp_t gfp) > +static void read_pages(struct readahead_control *rac, struct list_head > *pages, > + gfp_t gfp) > { > + const struct address_space_operations *aops = rac->mapping->a_ops; > struct blk_plug plug; > unsigned page_idx; Splitting out the aops rather than the mapping here just looks weird, especially as you need the 
mapping later in the function. Using aops doesn't even reduce the code size > > blk_start_plug(&plug); > > - if (mapping->a_ops->readpages) { > - mapping->a_ops->readpages(filp, mapping, pages, nr_pages); > + if (aops->readpages) { > + aops->readpages(rac->file, rac->mapping, pages, > + readahead_count(rac)); > /* Clean up the remaining pages */ > put_pages_list(pages); > goto out; > } > > - for (page_idx = 0; page_idx < nr_pages; page_idx++) { > + for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) { > struct page *page = lru_to_page(pages); > list_del(&page->lru); > - if (!add_to_page_cache_lru(page, mapping, page->index, gfp)) > - mapping->a_ops->readpage(filp, page); > + if (!add_to_page_cache_lru(page, rac->mapping, page->index, > + gfp)) > + aops->readpage(rac->file, page); ... it just makes this less readable by splitting the if() over two lines... > put_page(page); > } > > @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space > *mapping, > unsigned long end_index; /* The last page we want to read */ > LIST_HEAD(page_pool); > int page_idx; > - unsigned int nr_pages = 0; > loff_t isize = i_size_read(inode); > gfp_t gfp_mask = readahead_gfp_mask(mapping); > + struct readahead_control rac = { > + .mapping = mapping, > + .file = filp, > + ._nr_pages = 0, > + }; No need to initialise _nr_pages to zero, leaving it out will do the same thing. > > if (isize == 0) > return; > @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space > *mapping, >* contiguous pages before continuing with the next >* batch. >*/ > - if (nr_pages) > - read_pages(mapping, filp, &page_pool, nr_pages, > - gfp_mask); > - nr_pages = 0; > + if (readahead_count(&rac)) > + read_pages(&rac, &page_pool, gfp_mask); > + rac._nr_pages = 0; Hmmm. Wondering if it makes sense to move the gfp_mask to the readahead control structure - if we have to pass the gfp_mask down all the way alongside the rac, then I think it makes sense to do that... Cheers, Dave. 
-- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 00/19] Change readahead API
On Mon, Feb 17, 2020 at 10:45:41AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This series adds a readahead address_space operation to eventually > replace the readpages operation. The key difference is that > pages are added to the page cache as they are allocated (and > then looked up by the filesystem) instead of passing them on a > list to the readpages operation and having the filesystem add > them to the page cache. It's a net reduction in code for each > implementation, more efficient than walking a list, and solves > the direct-write vs buffered-read problem reported by yu kuai at > https://lore.kernel.org/linux-fsdevel/20200116063601.39201-1-yuku...@huawei.com/ > > The only unconverted filesystems are those which use fscache. > Their conversion is pending Dave Howells' rewrite which will make the > conversion substantially easier. Latest version in your git tree: $ ▶ glo -n 5 willy/readahead 4be497096c04 mm: Use memalloc_nofs_save in readahead path ff63497fcb98 iomap: Convert from readpages to readahead 26aee60e89b5 iomap: Restructure iomap_readpages_actor 8115bcca7312 fuse: Convert from readpages to readahead 3db3d10d9ea1 f2fs: Convert from readpages to readahead $ merged into a 5.6-rc2 tree fails at boot on my test vm: [2.423116] [ cut here ] [2.424957] list_add double add: new=ea000efff4c8, prev=8883bfffee60, next=ea000efff4c8. 
[2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 __list_add_valid+0x67/0x70 [2.430617] CPU: 4 PID: 1 Comm: sh Not tainted 5.6.0-rc2-dgc+ #1800 [2.432405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [2.434744] RIP: 0010:__list_add_valid+0x67/0x70 [2.436107] Code: c6 4c 89 ca 48 c7 c7 10 41 58 82 e8 55 29 89 ff 0f 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 60 41 58 82 e8 3b 29 89 ff <0f> 0b 31 c7 [2.441161] RSP: :c900018a3bb0 EFLAGS: 00010082 [2.442548] RAX: RBX: ea000efff4c0 RCX: 0256 [2.32] RDX: 0001 RSI: 0086 RDI: 8288a8b0 [2.446315] RBP: ea000efff4c8 R08: c900018a3a65 R09: 0256 [2.448199] R10: 0008 R11: c900018a3a65 R12: ea000efff4c8 [2.450072] R13: 8883bfffee60 R14: 0010 R15: 0001 [2.451959] FS: () GS:8883b9c0() knlGS: [2.454083] CS: 0010 DS: ES: CR0: 80050033 [2.455604] CR2: CR3: 0003b9a37002 CR4: 00060ee0 [2.457484] Call Trace: [2.458171] __pagevec_lru_add_fn+0x15f/0x2c0 [2.459376] pagevec_lru_move_fn+0x87/0xd0 [2.460500] ? pagevec_move_tail_fn+0x2d0/0x2d0 [2.461712] lru_add_drain_cpu+0x8d/0x160 [2.462787] lru_add_drain+0x18/0x20 [2.463757] shift_arg_pages+0xb8/0x180 [2.464789] ? vprintk_emit+0x101/0x1c0 [2.465813] ? printk+0x58/0x6f [2.466659] setup_arg_pages+0x205/0x240 [2.467716] load_elf_binary+0x34a/0x1560 [2.468789] ? get_user_pages_remote+0x159/0x280 [2.470024] ? selinux_inode_permission+0x10d/0x1e0 [2.471323] ? _raw_read_unlock+0xa/0x20 [2.472375] ? load_misc_binary+0x2b2/0x410 [2.473492] search_binary_handler+0x60/0xe0 [2.474634] __do_execve_file.isra.0+0x512/0x850 [2.475888] ? rest_init+0xc6/0xc6 [2.476801] do_execve+0x21/0x30 [2.477671] try_to_run_init_process+0x10/0x34 [2.478855] kernel_init+0xe2/0xfa [2.479776] ret_from_fork+0x1f/0x30 [2.480737] ---[ end trace e77079de9b22dc6a ]--- I just dropped the ext4 conversion from my local tree so I can boot the machine and test XFS. Might have some more info when that crashes and burns... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 02/19] mm: Ignore return value of ->readpages
On Mon, Feb 17, 2020 at 10:45:43AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > We used to assign the return value to a variable, which we then ignored. > Remove the pretence of caring. > > Signed-off-by: Matthew Wilcox (Oracle) > Reviewed-by: Christoph Hellwig > --- > mm/readahead.c | 8 ++-- > 1 file changed, 2 insertions(+), 6 deletions(-) Simple enough. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v6 01/19] mm: Return void from various readahead functions
On Mon, Feb 17, 2020 at 10:45:42AM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > ondemand_readahead has two callers, neither of which use the return value. > That means that both ra_submit and __do_page_cache_readahead() can return > void, and we don't need to worry that a present page in the readahead > window causes us to return a smaller nr_pages than we ought to have. > > Signed-off-by: Matthew Wilcox (Oracle) Looks good. Reviewed-by: Dave Chinner -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v5 04/13] mm: Add readahead address space operation
On Tue, Feb 11, 2020 at 04:54:13AM -0800, Matthew Wilcox wrote: > On Tue, Feb 11, 2020 at 03:52:30PM +1100, Dave Chinner wrote: > > > +struct readahead_control { > > > + struct file *file; > > > + struct address_space *mapping; > > > +/* private: use the readahead_* accessors instead */ > > > + pgoff_t start; > > > + unsigned int nr_pages; > > > + unsigned int batch_count; > > > +}; > > > + > > > +static inline struct page *readahead_page(struct readahead_control *rac) > > > +{ > > > + struct page *page; > > > + > > > + if (!rac->nr_pages) > > > + return NULL; > > > + > > > + page = xa_load(&rac->mapping->i_pages, rac->start); > > > + VM_BUG_ON_PAGE(!PageLocked(page), page); > > > + rac->batch_count = hpage_nr_pages(page); > > > + rac->start += rac->batch_count; > > > > There's no mention of large page support in the patch description > > and I don't recall this sort of large page batching in previous > > iterations. > > > > This seems like new functionality to me, not directly related to > > the initial ->readahead API change? What have I missed? > > I had a crisis of confidence when I was working on this -- the loop > originally looked like this: > > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); rac->nr_pages--) > > and then I started thinking about what I'd need to do to support large > pages, and that turned into > > #define readahead_for_each(rac, page) \ > for (; (page = readahead_page(rac)); \ > rac->nr_pages -= hpage_nr_pages(page)) > > but I realised that was potentially a use-after-free because 'page' has > certainly had put_page() called on it by then. I had a brief period > where I looked at moving put_page() away from being the filesystem's > responsibility and into the iterator, but that would introduce more > changes into the patchset, as well as causing problems for filesystems > that want to break out of the loop. 
> > By this point, I was also looking at the readahead_for_each_batch() > iterator that btrfs uses, and so we have the batch count anyway, and we > might as well use it to store the number of subpages of the large page. > And so it became easier to just put the whole ball of wax into the initial > patch set, rather than introduce the iterator now and then fix it up in > the patch set that I'm basing on this. > > So yes, there's a certain amount of excess functionality in this patch > set ... I can remove it for the next release. I'd say "Just document it" as that was the main reason I noticed it. Or perhaps add the batching function as a stand-alone patch so it's clear that the batch interface solves two problems at once - large pages and the btrfs page batching implementation... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v5 04/13] mm: Add readahead address space operation
On Mon, Feb 10, 2020 at 05:03:39PM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This replaces ->readpages with a saner interface: > - Return void instead of an ignored error code. > - Pages are already in the page cache when ->readahead is called. > - Implementation looks up the pages in the page cache instead of >having them passed in a linked list. > > Signed-off-by: Matthew Wilcox (Oracle) > > +/* > + * Readahead is of a block of consecutive pages. > + */ > +struct readahead_control { > + struct file *file; > + struct address_space *mapping; > +/* private: use the readahead_* accessors instead */ > + pgoff_t start; > + unsigned int nr_pages; > + unsigned int batch_count; > +}; > + > +static inline struct page *readahead_page(struct readahead_control *rac) > +{ > + struct page *page; > + > + if (!rac->nr_pages) > + return NULL; > + > + page = xa_load(&rac->mapping->i_pages, rac->start); > + VM_BUG_ON_PAGE(!PageLocked(page), page); > + rac->batch_count = hpage_nr_pages(page); > + rac->start += rac->batch_count; There's no mention of large page support in the patch description and I don't recall this sort of large page batching in previous iterations. This seems like new functionality to me, not directly related to the initial ->readahead API change? What have I missed? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: hold i_rwsem until aio completes
On Mon, Feb 03, 2020 at 06:46:41PM +0100, Christoph Hellwig wrote: > On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote: > > I think it's pretty gross, actually. It makes the same mistake made > > with locking in the old direct IO code - it encodes specific lock > > operations via flags into random locations in the DIO path. This is > > a very slippery slope, and IMO it is an layering violation to encode > > specific filesystem locking semantics into a layer that is supposed > > to be generic and completely filesystem agnostic. i.e. this > > mechanism breaks if a filesystem moves to a different type of lock > > (e.g. range locks), and history teaches us that we'll end up making > > a horrible, unmaintainable mess to support different locking > > mechanisms and contexts. > > > > I think that we should be moving to a model where the filesystem > > provides an unlock method in the iomap operations structure, and if > > the method is present in iomap_dio_complete() it gets called for the > > filesystem to unlock the inode at the appropriate point. This also > > allows the filesystem to provide a different method for read or > > write unlock, depending on what type of lock it held at submission. > > This gets rid of the need for the iomap code to know what type of > > lock the caller holds, too. > > I'd rather avoid yet another method. But I think with a little > tweaking we can move the unlock into the ->end_io method. That would work, too :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH 04/12] mm: Add readahead address space operation
On Fri, Jan 24, 2020 at 05:35:45PM -0800, Matthew Wilcox wrote: > From: "Matthew Wilcox (Oracle)" > > This replaces ->readpages with a saner interface: > - Return the number of pages not read instead of an ignored error code. > - Pages are already in the page cache when ->readahead is called. > - Implementation looks up the pages in the page cache instead of >having them passed in a linked list. > diff --git a/mm/readahead.c b/mm/readahead.c > index 5a6676640f20..6d65dae6dad0 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -121,7 +121,18 @@ static void read_pages(struct address_space *mapping, > struct file *filp, > > blk_start_plug(&plug); > > - if (mapping->a_ops->readpages) { > + if (mapping->a_ops->readahead) { > + unsigned left = mapping->a_ops->readahead(filp, mapping, > + start, nr_pages); > + > + while (left) { > + struct page *page = readahead_page(mapping, > + start + nr_pages - left - 1); Off by one? start = 2, nr_pages = 2, left = 1, this looks up the page at index 2, which is the one we issued IO on, not the one we "left behind" which is at index 3. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
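The suspected off-by-one can be checked with plain arithmetic. The helpers below isolate the index computation from the quoted hunk; leftover_index_expected() is a hypothetical corrected form for comparison, not code from the patch.

```c
#include <assert.h>

/* The computation as written in the patch: the "left behind" page is
 * looked up at start + nr_pages - left - 1. */
static unsigned long leftover_index_as_written(unsigned long start,
                unsigned long nr_pages, unsigned long left)
{
    return start + nr_pages - left - 1;
}

/* What the lookup presumably should be: with @left pages not yet read,
 * the first leftover page sits just past the submitted ones. */
static unsigned long leftover_index_expected(unsigned long start,
                unsigned long nr_pages, unsigned long left)
{
    return start + nr_pages - left;
}
```

With start = 2 and nr_pages = 2 the range covers indexes 2 and 3; for left = 1 the patch's expression lands on index 2 (already under I/O) while the page actually left behind is index 3 - exactly Dave's worked example.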
Re: [Cluster-devel] RFC: hold i_rwsem until aio completes
On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote: > Hi all, > > Asynchronous read/write operations currently use a rather magic locking > scheme, where access to file data is normally protected using a rw_semaphore, > but if we are doing aio where the syscall returns to userspace before the > I/O has completed we also use an atomic_t to track the outstanding aio > ops. This scheme has led to lots of subtle bugs in file systems where we > didn't wait for the count to reach zero, and due to its ad hoc nature also > means we have to serialize direct I/O writes that are smaller than the > file system block size. > > All this is solved by releasing i_rwsem only when the I/O has actually > completed, but doing so is against two mantras of Linux locking primitives: > > (1) no unlocking by another process than the one that acquired it > (2) no return to userspace with locks held > > As it happens, we have various places that work around this. A few > callers do non-owner unlocks of rwsems, which are pretty nasty for > PREEMPT_RT as the owner tracking doesn't work. OTOH the file system > freeze code has both problems and works around them a little better, > although in a somewhat awkward way, in that it releases the lockdep > object when returning to userspace, and reacquires it when done, and > also clears the rwsem owner when returning to userspace, and then sets > the new owner before unlocking. > > This series tries to follow that scheme, although it doesn't fully work. The > first issue is that the rwsem code has a bug where it doesn't properly > handle clearing the owner. This series has a patch to fix that, but it > is ugly and might not be correct so some help is needed. Second, I/O > completions often come from interrupt context, which means the re-acquire > is recorded as from irq context, leading to warnings about incorrect > contexts. I wonder if we could just have a bit in lockdep that says > returning to userspace is ok for this particular lock? 
That would also > clean up the fsfreeze situation a lot. > > Let me know what you think of all this. While I converted all the iomap > using file systems, only XFS is actually tested. I think it's pretty gross, actually. It makes the same mistake made with locking in the old direct IO code - it encodes specific lock operations via flags into random locations in the DIO path. This is a very slippery slope, and IMO it is a layering violation to encode specific filesystem locking semantics into a layer that is supposed to be generic and completely filesystem agnostic. i.e. this mechanism breaks if a filesystem moves to a different type of lock (e.g. range locks), and history teaches us that we'll end up making a horrible, unmaintainable mess to support different locking mechanisms and contexts. I think that we should be moving to a model where the filesystem provides an unlock method in the iomap operations structure, and if the method is present in iomap_dio_complete() it gets called for the filesystem to unlock the inode at the appropriate point. This also allows the filesystem to provide a different method for read or write unlock, depending on what type of lock it held at submission. This gets rid of the need for the iomap code to know what type of lock the caller holds, too. This way we always have a consistent point in the AIO/DIO completion model where the inode gets unlocked, it only gets called for the IO contexts where the filesystem wants/needs unlock on IO completion semantics, and it's completely filesystem IO-lock implementation agnostic. And for filesystems that use the inode i_rwsem, we can just provide simple helper functions for their read/write unlock methods. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?
On Thu, Oct 10, 2019 at 04:41:56PM +, Hayes, Bill wrote: > Has there been any discussion or interest in DAX support in OCFS2? > Is there interest from the OCFS2 development community to see DAX support > developed and put upstream? > > Has there been any discussion or interest in DAX support in GFS2? > Is there interest from the GFS2 development community to see DAX support > developed and put upstream? You're probably best from a DAX implementation POV to head down the GFS2 path, as FS-DAX requires the filesystem to use the fs/iomap/ extent mapping implementation. GFS2 is already partially ported to use the iomap infrastructure, though more work is needed to provide the iomap functionality DAX requires. OCFS2 would require a lot more work on this front. Cheers, Dave. -- Dave Chinner dchin...@redhat.com
Re: [Cluster-devel] [Q] gfs2: mmap write vs. punch_hole consistency
On Fri, Sep 06, 2019 at 11:48:31PM +0200, Andreas Gruenbacher wrote: > On Fri, Sep 6, 2019 at 11:28 PM Dave Chinner wrote: > > On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > > > Hi, > > > > > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > > > filesystems with a block size smaller than the page size [1]. > > > > > > It turns out that the same problem exists between mmap write and hole > > > punching, and since xfstests doesn't seem to cover that, > > > > AFAIA, fsx exercises it pretty often. Certainly it's found problems > > with XFS in the past w.r.t. these operations. > > > > > I've written a > > > new test [2]. > > > > I suspect that what we really want is a test that runs > > _test_generic_punch using mmap rather than pwrite... > > > > > Ext4 and xfs both pass that test; they both apparently > > > mark the pages that have a hole punched in them as read-only so that > > > page_mkwrite is called before those pages can be written to again. > > > > XFS invalidates the range being hole punched (see > > xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL), which means any > > attempt to fault that page back in will block on the MMAPLOCK until > > the hole punch finishes. > > This isn't about writes during the hole punching, this is about writes > once the hole is punched. How do you prevent this:

    hole punch process:          other process:
      clean page
      invalidate page
                                 write page fault
                                 instantiate new page
                                 page_mkwrite
                                 page dirty
      do hole punch
                                 overwrite hole

Firstly, you end up with a dirty page over a hole with no backing store, and second, you get no page fault on overwrite because the pages are already dirty. That's the problem the MMAPLOCK in XFS solves - it's integral to ensuring the page faults on mmap write are correctly serialised with both the page cache invalidation and the underlying extent manipulation that the hole punch operation then does. i.e. 
if you want a page fault /after/ a hole punch, you have to prevent it occurring during the hole punch after the page has already been marked clean and/or invalidated. > For example, the test case I've posted > creates the following file layout with 1k blocksize: > > > > Then it punches a hole like this: > > DDHH HHDD > > Then it fills the hole again with mwrite: > > > > As far as I can tell, that needs to trigger page faults on all three > pages. Yes, but only /after/ the hole has been punched. > I did get these on ext4; judging from the fact that xfs works, > the also seem to occur there; but on gfs2, page_mkwrite isn't called > for the two partially mapped pages, only for the page in the middle > that's entirely within the hole. And I don't see where those pages are > marked read-only; it appears like pagecache_isize_extended isn't > called on ext4 or xfs. So how does this work there? As I said in my last response, the answer is in xfs_flush_unmap_range(). That uses truncate_pagecache_range() to do the necessary page cache manipulation. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [Q] gfs2: mmap write vs. punch_hole consistency
On Fri, Sep 06, 2019 at 10:52:41PM +0200, Andreas Gruenbacher wrote: > Hi, > > I've just fixed a mmap write vs. truncate consistency issue on gfs on > filesystems with a block size smaller than the page size [1]. > > It turns out that the same problem exists between mmap write and hole > punching, and since xfstests doesn't seem to cover that, AFAIA, fsx exercises it pretty often. Certainly it's found problems with XFS in the past w.r.t. these operations. > I've written a > new test [2]. I suspect that what we really want is a test that runs _test_generic_punch using mmap rather than pwrite... > Ext4 and xfs both pass that test; they both apparently > mark the pages that have a hole punched in them as read-only so that > page_mkwrite is called before those pages can be written to again. XFS invalidates the range being hole punched (see xfs_flush_unmap_range() under XFS_MMAPLOCK_EXCL), which means any attempt to fault that page back in will block on the MMAPLOCK until the hole punch finishes. > gfs2 fails that: for some reason, the partially block-mapped pages are > not marked read-only on gfs2, and so page_mkwrite is not called for the > partially block-mapped pages, and the hole is not filled in correctly. > > The attached patch fixes the problem, but this really doesn't look right > as neither ext4 nor xfs require this kind of hack. So what am I > overlooking, how does this work on ext4 and xfs? XFS uses XFS_MMAPLOCK_* to serialise page faults against extent manipulations (shift, hole punch, truncate, swap, etc) and ext4 uses a similar locking mechanism to do the same thing. 
Fundamentally, the page cache does not provide the necessary mechanisms to detect and prevent invalidation races inside EOF. > > Signed-off-by: Andreas Gruenbacher > --- > fs/gfs2/bmap.c | 7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c > index 9ef543dd38e2..e677e813be4c 100644 > --- a/fs/gfs2/bmap.c > +++ b/fs/gfs2/bmap.c > @@ -2475,6 +2475,13 @@ int __gfs2_punch_hole(struct file *file, loff_t > offset, loff_t length) > if (error) > goto out; > } > + /* > + * If the first or last page partially lies in the hole, mark > + * the page read-only so that memory-mapped writes will trigger > + * page_mkwrite. > + */ > + pagecache_isize_extended(inode, offset, inode->i_size); > + pagecache_isize_extended(inode, offset + length, inode->i_size); See xfs_flush_unmap_range(), which is run under XFS_MMAPLOCK_EXCL to serialise against incoming page faults... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] RFC: use the iomap writepage path in gfs2
On Mon, Jul 01, 2019 at 11:54:24PM +0200, Christoph Hellwig wrote: > Hi all, > > in this straight from the jetplane edition I present the series to > convert gfs2 to full iomap usage for the ordered and writeback mode, > that is we use iomap_page everywhere and entirely get rid of > buffer_heads in the data path. This has only seen basic testing > which ensured neither 4k or 1k blocksize in ordered mode regressed > vs the xfstests baseline, although that baseline tends to look > pretty bleak. > > The series is to be applied on top of my "lift the xfs writepage code > into iomap v2" series. Ok, this doesn't look too bad from the iomap perspective, though it does raise more questions. :) gfs2 now has two iopaths, right? One that uses bufferheads for journalled data, and the other that uses iomap? That seems like it's only a partial conversion - what needs to be done to iomap and gfs2 to support the journalled data path so there's a single data IO path? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
On Tue, Jun 18, 2019 at 04:47:16PM +0200, Andreas Gruenbacher wrote: > Remove the mark_inode_dirty call from __generic_write_end and add it to > generic_write_end and the high-level iomap functions where necessary. > That way, large writes will only dirty inodes at the end instead of > dirtying them once per page. This fixes a performance bottleneck on > gfs2. > > Signed-off-by: Andreas Gruenbacher > --- > fs/buffer.c | 26 ++ > fs/iomap.c | 42 ++ > 2 files changed, 56 insertions(+), 12 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index 23ef63fd1669..9454568a7f5e 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -881,6 +881,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct > iov_iter *iter, > { > struct inode *inode = iocb->ki_filp->f_mapping->host; > loff_t pos = iocb->ki_pos, ret = 0, written = 0; > + loff_t old_size; > + > +/* > + * No need to use i_size_read() here, the i_size cannot change under us > + * because we hold i_rwsem. > + */ > + old_size = inode->i_size; > > while (iov_iter_count(iter)) { > ret = iomap_apply(inode, pos, iov_iter_count(iter), > @@ -891,6 +898,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct > iov_iter *iter, > written += ret; > } > > + if (old_size != inode->i_size) > + mark_inode_dirty(inode); > + > return written ? written : ret; > } > EXPORT_SYMBOL_GPL(iomap_file_buffered_write); > @@ -961,18 +971,30 @@ int > iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len, > const struct iomap_ops *ops) > { > + loff_t old_size; > loff_t ret; > > +/* > + * No need to use i_size_read() here, the i_size cannot change under us > + * because we hold i_rwsem. > + */ > + old_size = inode->i_size; > + > while (len) { > ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL, > iomap_dirty_actor); > if (ret <= 0) > - return ret; > + goto out; > pos += ret; > len -= ret; > } > + ret = 0; > > - return 0; > +out: > + if (old_size != inode->i_size) > + mark_inode_dirty(inode); I don't think we want to do this. 
The patches I have that add range locking for XFS allow buffered writes to run concurrently with operations that change the inode size as long as the ranges don't overlap. To do this, XFS will not hold the i_rwsem over any iomap call it makes in future - it will hold a range lock instead. Hence we can have writes and other IO operations occurring at the same time some other operation is changing the size of the file, and that means this code no longer does what you are intending it to do because the inode->i_size is no longer constant across these operations... Hence I think adding code that depends on i_rwsem to be held to function correctly is the wrong direction to be taking the iomap infrastructure. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v7 0/5] iomap and gfs2 fixes
On Mon, Apr 29, 2019 at 07:50:28PM -0700, Darrick J. Wong wrote: > On Tue, Apr 30, 2019 at 12:09:29AM +0200, Andreas Gruenbacher wrote: > > Here's another update of this patch queue, hopefully with all wrinkles > > ironed out now. > > > > Darrick, I think Linus would be unhappy seeing the first four patches in > > the gfs2 tree; could you put them into the xfs tree instead like we did > > some time ago already? > > Sure. When I'm done reviewing them I'll put them in the iomap tree, > though, since we now have a separate one. :) I'd just keep the iomap stuff in the xfs tree as a separate set of branches and merge them into the XFS for-next when composing it. That way it still gets plenty of test coverage from all the XFS devs and linux next without anyone having to think about it. You really only need to send separate pull requests for the iomap and XFS branches - IMO, there's no real need to have a complete new tree for it... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] gfs2 iomap dealock, IOMAP_F_UNBALANCED
On Thu, Mar 21, 2019 at 02:13:04PM +0100, Andreas Gruenbacher wrote: > Hi Christoph, > > we need your help fixing a gfs2 deadlock involving iomap. What's going > on is the following: > > * During iomap_file_buffered_write, gfs2_iomap_begin grabs the log flush > lock and keeps it until gfs2_iomap_end. It currently always does that > even though there is no point other than for journaled data writes. > > * iomap_file_buffered_write then calls balance_dirty_pages_ratelimited. > If that ends up calling gfs2_write_inode, gfs2 will try to grab the > log flush lock again and deadlock. > > We can stop gfs2_iomap_begin from keeping the log flush lock held for > non-journaled data writes, but that still leaves us with the deadlock > for journaled data writes: for them, we need to add the data pages to > the running transaction, so dropping the log flush lock wouldn't work. > > I've tried to work around the unwanted balance_dirty_pages_ratelimited > in gfs2_write_inode as originally suggested by Ross. That works to a > certain degree, but only if we always skip inodes in the WB_SYNC_NONE > case, and that means that gfs2_write_inode becomes quite ineffective. > > So it seems that it would be more reasonable to avoid the dirty page > balancing under the log flush lock in the first place. > > The attached patch changes gfs2_iomap_begin to only hold on to the log > flush lock for journaled data writes. In that case, we also make sure > to limit the write size to not overrun dirty limits using a similar > logic as in balance_dirty_pages_ratelimited; there is precedent for that > approach in btrfs_buffered_write. Then, we prevent iomap from balancing > dirty pages via the new IOMAP_F_UNBALANCED flag. 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c > index 58a768e59712e..542bd149c4aa3 100644 > --- a/fs/gfs2/file.c > +++ b/fs/gfs2/file.c > @@ -867,6 +867,8 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, > struct iov_iter *from) > iocb->ki_pos += ret; > } > > + balance_dirty_pages_ratelimited(file->f_mapping); > + > out2: > current->backing_dev_info = NULL; > out: The problem is calling balance_dirty_pages() inside the ->iomap_begin/->iomap_end calls and not that it is called by the iomap infrastructure itself, right? If so, I'd prefer to see this in iomap_apply() after the call to ops->iomap_end because iomap_file_buffered_write() can iterate and call iomap_apply() multiple times. This would keep the balancing to a per-iomap granularity, rather than a per-syscall granularity. i.e. if we do write(2GB), we want more than one balancing call during that syscall, so it would be up to the filesystem to a) limit the size of write mappings to something smaller (e.g. 1024 pages) so that there are still frequent balancing calls for large writes. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH v4 06/11] iomap: Add write_{begin, end} iomap operations
On Mon, May 14, 2018 at 05:36:19PM +0200, Andreas Gruenbacher wrote: > Add write_begin and write_end operations to struct iomap_ops to provide > a way of overriding the default behavior of iomap_write_begin and > iomap_write_end. This is needed for implementing data journaling: in > the data journaling case, pages are written into the journal before > being written back to their proper on-disk locations. I'm not sure adding a per-page filesystem callout abstraction is what we want in the iomap code. The commit message does not explain why gfs2 needs per-page callouts, nor why the per-page work cannot be done/avoided by running code in the ->iomap_begin callout (e.g. by unstuffing the inode so nothing needs to be done per-page). My main concern is that the callouts to the filesystem in iomap are - by design - per IO, not per page. The whole point of using iomap was to get away from doing per-page filesystem operations on multi-page IOs - it's highly inefficient when we only need to call into the filesystem on a per-extent basis, and we simplified the code a *lot* by avoiding such behaviours. And to that point: this change has serious conflicts with the next step for the iomap infrastructure: Christoph's recent "remove bufferheads from iomap" series. Apart from the obvious code conflicts, there's a fairly major architectural/functional conflict. https://marc.info/?l=linux-fsdevel=152585212000411=2 That is, Christoph's patchset pushes us further away from filesystems doing their own per-page processing of page state. It converts the iomap IO path to store it's own per-page state tracking information in page->private, greatly reducing memory overhead (16 bytes on 4k page instead of 104 bytes per bufferhead) and streamlining the code. This, however, essentially makes the buffered IO path through the iomap infrastructure incompatible with existing bufferhead based filesystems. 
Despite this, we can still provide limited support for buffered writes on bufferhead based filesystems through the iomap infrastructure, but I only see that as a mechanism for filesystems moving away from dependencies on bufferheads. It would require a different refactoring of the iomap code - I'd much prefer to keep bufferhead dependent IO paths completely separate (e.g. via a new actor function) to minimise impact and dependencies on the internal bufferhead free iomap IO path. Let's see what Christoph thinks first, though. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] fs: gfs2: Adding new return type vm_fault_t
On Mon, Apr 16, 2018 at 11:20:59PM +0530, Souptick Joarder wrote: > > Hi, > > > > This patch is straightforward enough, but there are a lot of other > > file systems that need similar patches. Shouldn't you do one big > > patch set that fixes several file systems at once and run it through > > Viro's kernel or Linus's kernel or something? > > Adding Viro and linux-fsdevel for more opinions. > > The plan for these patches is to introduce the typedef, initially just > as documentation ("These functions should return a VM_FAULT_ status"). > We'll trickle the patches to individual drivers/filesystems in through > the maintainers, as far as possible. Then we'll change the typedef to > an unsigned int and break the compilation of any unconverted > drivers/filesystems. > > We have already started sending out drivers/filesystems changes > to different maintainers. Yes, we can see that. The response you are getting is "this is not how we do cross-subsystem API changes. Why are you doing it this way?" i.e. the problem being pointed out is that your process has not followed the correct/normal process for proposing, reviewing and merging cross-subsystem API changes. Bob has raised the same questions as both Christoph and Darrick have asked in response to the XFS patch. I only implied these questions by asking about introducing useless typedefs with no context for reviewers... I'd really like to have Darrick's questions answered(*) in a constructive, non-abusive manner - I'll quote it here to get it all in one thread on -fsdevel: | ...hm, the original mm patch wasn't cc'd to fsdevel either, so that's | probably why I never heard of any of this until now. | | So, uh, why wasn't this whole series (all the mm changes and all the | required fs changes) sent out for review prior to the merge window? 
We're not asking for a description of what you are doing - we are asking why the normal processes for proposing and merging such a change is not being followed, and how you plan to rectify that. Cheers, Dave. https://marc.info/?l=linux-xfs=152389824107375=2 -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] [PATCH] gfs2: Fsync parent directories
On Tue, Feb 20, 2018 at 09:53:59PM +0100, Andreas Gruenbacher wrote: > On 20 February 2018 at 20:46, Christoph Hellwig <h...@infradead.org> wrote: > > On Tue, Feb 20, 2018 at 12:22:01AM +0100, Andreas Gruenbacher wrote: > >> When fsyncing a new file, also fsync the directory the file is in, > >> recursively. This is how Linux filesystems should behave nowadays, > >> even if not mandated by POSIX. > > > > I think that is bullshit. Maybe it is what google wants for ext4 > > non-journal mode which no one else uses anyway. but it certainly > > is anything but normal Linux semantics. > > Here's some code from xfstest generic/322: > > _mount_flakey > $XFS_IO_PROG -f -c "pwrite 0 1M" -c "fsync" $SCRATCH_MNT/foo \ > > $seqres.full 2>&1 || _fail "xfs_io failed" > mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > md5sum $SCRATCH_MNT/bar | _filter_scratch > > _flakey_drop_and_remount > > md5sum $SCRATCH_MNT/bar | _filter_scratch > _unmount_flakey > > Note that there is no fsync for the parent directory ($SCRATCH_MNT), > yet the test obviously expects the directory to be synced as well. > This isn't implemented as in this patch on all filesystems, but the > major ones all show this behavior. So where's the bullshit? This test is for filesystems that have strictly ordered metadata journalling. All the filesystems that fstests supports via _require_metadata_journalling() have strictly ordered metadata journalling/crash recovery semantics. (i.e. xfs, ext4, btrfs, and f2fs (IIRC)). IOWs, if the filesystem is designed with strictly ordered metadata, then fsync()ing a new file also implies that all references to the new file are also on stable storage because they happened before the fsync on the file was issued. i.e. the directory is fsync'd implicitly because it was modified by the same operation that created the file. Hence if the file creation is made stable, so must be the directory modification done during file creation. 
This has nothing to do with POSIX or what the "linux standard" is - this is testing whether the implementation of strictly ordered metadata journalling is correct or not. If gfs2 does not have strictly ordered metadata journalling, then it probably shouldn't run these tests. Cheers, Dave. -- Dave Chinner dchin...@redhat.com