Re: [f2fs-dev] [syzbot] [f2fs?] WARNING in rcu_sync_dtor

2024-08-01 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 29, 2024 at 03:27:21PM +0200, Jan Kara wrote:
> Also as the "filesystem shutdown" is spreading across multiple filesystems,
> I'm playing with the idea that maybe we could lift a flag like this to VFS
> so that we can check it in VFS paths and abort some operations early. 

I've been thinking the same thing since I saw what CIFS was doing a
couple of days ago with shutdowns. It's basically just stopping all
new incoming modification operations if the flag is set. i.e. it's
just a check in each filesystem method, and I suspect that many
other filesystems that support shutdown do the same thing.

It looks like exactly the same implementation as CIFS is about to be
added to exfat - stop all the incoming methods and check in the
writeback method - so having a generic superblock flag and generic
checks before calling into filesystem methods would make it real
easy for all filesystems to have basic ->shutdown support for when
block devices go away suddenly.

I also think that we should be lifting *IOC_SHUTDOWN to the VFS -
the same ioctl is now implemented in 4-5 filesystems and they
largely do the same thing - just set a bit in the internal
filesystem superblock flags. Yes, filesystems like XFS and ext4 do
special stuff with journals, but the generic VFS implemenation could
call the filesystem ->shutdown method to do that

> But
> so far I'm not convinced the gain is worth the need to iron out various
> subtle semantical differences of "shutdown" among filesystems.

I don't think we need to change how any filesystem behaves when it
is shut down. As long as filesystems follow at least the "no new
modifications when shutdown" behaviour, filesystems can implement
internal shutdown however they want...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call

2024-01-23 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Jan 23, 2024 at 09:39:02PM +0100, Mikulas Patocka wrote:
> 
> 
> On Tue, 23 Jan 2024, Johannes Thumshirn wrote:
> 
> > Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in
> > zonefs_zone_mgmt().
> > 
> > As as zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called
> > from a place that can recurse back into the filesystem on memory reclaim,
> > it is save to call blkdev_zone_mgmt() with GFP_KERNEL.
> > 
> > Signed-off-by: Johannes Thumshirn 
> > ---
> >  fs/zonefs/super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> > index 93971742613a..63fbac018c04 100644
> > --- a/fs/zonefs/super.c
> > +++ b/fs/zonefs/super.c
> > @@ -113,7 +113,7 @@ static int zonefs_zone_mgmt(struct super_block *sb,
> >  
> > trace_zonefs_zone_mgmt(sb, z, op);
> > ret = blkdev_zone_mgmt(sb->s_bdev, op, z->z_sector,
> > -  z->z_size >> SECTOR_SHIFT, GFP_NOFS);
> > +  z->z_size >> SECTOR_SHIFT, GFP_KERNEL);
> > if (ret) {
> > zonefs_err(sb,
> >"Zone management operation %s at %llu failed %d\n",
> > 
> > -- 
> > 2.43.0
> 
> zonefs_inode_zone_mgmt calls 
> lockdep_assert_held(&ZONEFS_I(inode)->i_truncate_mutex); - so, this 
> function is called with the mutex held - could it happen that the 
> GFP_KERNEL allocation recurses into the filesystem and attempts to take 
> i_truncate_mutex as well?
> 
> i.e. GFP_KERNEL -> iomap_do_writepage -> zonefs_write_map_blocks -> 
> zonefs_write_iomap_begin -> mutex_lock(&zi->i_truncate_mutex)

zonefs doesn't have a ->writepage method, so writeback can't be
called from memory reclaim like this.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-09-10 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Sep 08, 2023 at 01:29:55AM +0100, Pavel Begunkov wrote:
> On 9/3/23 23:30, Dave Chinner wrote:
> > On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> > > On 8/29/23 19:53, Matthew Wilcox wrote:
> > > > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > > > From: Hao Xu 
> > > > > > > 
> > > > > > > Add a boolean parameter for file_accessed() to support nowait 
> > > > > > > semantics.
> > > > > > > Currently it is true only with io_uring as its initial caller.
> > > > > > 
> > > > > > So why do we need to do this as part of this series?  Apparently it
> > > > > > hasn't caused any problems for filemap_read().
> > > > > > 
> > > > > 
> > > > > We need this parameter to indicate if nowait semantics should be 
> > > > > enforced in
> > > > > touch_atime(), There are locks and maybe IOs in it.
> > > > 
> > > > That's not my point.  We currently call file_accessed() and
> > > > touch_atime() for nowait reads and nowait writes.  You haven't done
> > > > anything to fix those.
> > > > 
> > > > I suspect you can trim this patchset down significantly by avoiding
> > > > fixing the file_accessed() problem.  And then come back with a later
> > > > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> > > 
> > > I'm ok to do that.
> > > 
> > > > first that fixes it for the existing nowait users, and then a second
> > > > series to do all the directory stuff.
> > > > 
> > > > I'd do the first thing.  Just ignore the problem.  Directory atime
> > > > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > > > everyone uses relatime or nodiratime.
> > > 
> > > Hi Matthew,
> > > The previous discussion shows this does cause issues in real
> > > producations: 
> > > https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> > > 
> > 
> > Then separate it out into it's own patch set so we can have a
> > discussion on the merits of requiring using noatime, relatime or
> > lazytime for really latency sensitive IO applications. Changing code
> > is not always the right solution...
> 
> Separation sounds reasonable, but it can hardly be said that only
> latency sensitive apps would care about >1s nowait/async submission
> delays. Presumably, btrfs can improve on that, but it still looks
> like it's perfectly legit for filesystems do heavy stuff in
> timestamping like waiting for IO. Right?

Yes, it is, no-one is denying that. And some filesystems are worse
than others, but none of that means it has to be fixed so getdents
can be converted to NOWAIT semantics.

ie. this patchset is about the getdents NOWAIT machinery, and
fiddling around with timestamps has much, much wider scope than just
NOWAIT getdents machinery. We'll have this discussion about NOWAIT
timestamp updates when a RFC is proposed to address the wider
problem of how timestamp updates should behave in NOWAIT context.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 02/11] xfs: add NOWAIT semantics for readdir

2023-09-03 Thread Dave Chinner via Linux-f2fs-devel
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Implement NOWAIT semantics for readdir. Return EAGAIN error to the
> caller if it would block, like failing to get locks, or going to
> do IO.
> 
> Co-developed-by: Dave Chinner 

Not really.

"Co-developed" implies equal development input between all the
parties, which is not the case here - this patch is based on
prototype I wrote, whilst you're doing the refining, testing and
correctness work.

In these cases with XFS code, we add a line in the commit message to
say:

"This is based on a patch originally written by Dave Chinner."


> Signed-off-by: Dave Chinner 
> Signed-off-by: Hao Xu 
> [fixes deadlock issue, tweak code style]

With a signoff chain like you already have.

In the end you'll also get a RVB from me, which seems rather wrong
to me if I've apparently been "co-developing" the code



> @@ -156,7 +157,9 @@ xfs_dir2_block_getdents(
>   if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk)
>   return 0;
>  
> - error = xfs_dir3_block_read(args->trans, dp, &bp);
> + if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> + flags |= XFS_DABUF_NOWAIT;
> + error = xfs_dir3_block_read(args->trans, dp, flags, &bp);
>   if (error)
>   return error;
>  

Given we do this same check in both block and leaf formats to set
XFS_DABUF_NOWAIT, and we do the DIR_CONTEXT_F_NOWAIT check in
xfs_readdir() as well, we should probably do this check once at the
higher level and pass flags down from there with XFS_DABUF_NOWAIT
already set.

> @@ -240,6 +243,7 @@ xfs_dir2_block_getdents(
>  STATIC int
>  xfs_dir2_leaf_readbuf(
>   struct xfs_da_args  *args,
> + struct dir_context  *ctx,
>   size_t  bufsize,
>   xfs_dir2_off_t  *cur_off,
>   xfs_dablk_t *ra_blk,
> @@ -258,10 +262,15 @@ xfs_dir2_leaf_readbuf(
>   struct xfs_iext_cursor  icur;
>   int ra_want;
>   int error = 0;
> -
> - error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> - if (error)
> - goto out;
> + unsigned intflags = 0;
> +
> + if (ctx->flags & DIR_CONTEXT_F_NOWAIT) {
> + flags |= XFS_DABUF_NOWAIT;
> + } else {
> + error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
> + if (error)
> + goto out;
> + }

Especially as, in hindsight, this doesn't make a whole lot of sense.
If XFS_DABUF_NOWAIT is set, we keep going until
xfs_ilock_data_map_shared_nowait() where we call
xfs_need_iread_extents() to see if we need to read the extents in
and abort at that point.

So, really, we shouldn't get this far with nowait semantics if
we haven't read the extents in yet - we're supposed to already have
the inode locked here and so we should have already checked this
condition before we bother locking the inode...

i.e. all we should be doing here is this:

if (!(flags & XFS_DABUF_NOWAIT)) {
error = xfs_iread_extents(args->trans, dp, XFS_DATA_FORK);
if (error)
goto out;
}

And then we don't need to pass the VFS dir_context down into low
level XFS functions unnecessarily.


>  
>   /*
>* Look for mapped directory blocks at or above the current offset.
> @@ -280,7 +289,7 @@ xfs_dir2_leaf_readbuf(
>   new_off = xfs_dir2_da_to_byte(geo, map.br_startoff);
>   if (new_off > *cur_off)
>   *cur_off = new_off;
> - error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp);
> + error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, flags, 
> &bp);
>   if (error)
>   goto out;
>  
> @@ -360,6 +369,7 @@ xfs_dir2_leaf_getdents(
>   int byteoff;/* offset in current block */
>   unsigned intoffset = 0;
>   int error = 0;  /* error return value */
> + int written = 0;
>  
>   /*
>* If the offset is at or past the largest allowed value,
> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>   bp = NULL;
>   }
>  
> - if (*lock_mode == 0)
> - *lock_mode = xfs_ilock_data_map_shared(dp);
> - error = xfs_dir2_leaf_readbuf(args, bufsize, &curoff,
> - &rablk, &bp);
> +   

Re: [f2fs-dev] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-09-03 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Aug 30, 2023 at 02:11:31PM +0800, Hao Xu wrote:
> On 8/29/23 19:53, Matthew Wilcox wrote:
> > On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> > > On 8/28/23 05:32, Matthew Wilcox wrote:
> > > > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > > > From: Hao Xu 
> > > > > 
> > > > > Add a boolean parameter for file_accessed() to support nowait 
> > > > > semantics.
> > > > > Currently it is true only with io_uring as its initial caller.
> > > > 
> > > > So why do we need to do this as part of this series?  Apparently it
> > > > hasn't caused any problems for filemap_read().
> > > > 
> > > 
> > > We need this parameter to indicate if nowait semantics should be enforced 
> > > in
> > > touch_atime(), There are locks and maybe IOs in it.
> > 
> > That's not my point.  We currently call file_accessed() and
> > touch_atime() for nowait reads and nowait writes.  You haven't done
> > anything to fix those.
> > 
> > I suspect you can trim this patchset down significantly by avoiding
> > fixing the file_accessed() problem.  And then come back with a later
> > patchset that fixes it for all nowait i/o.  Or do a separate prep series
> 
> I'm ok to do that.
> 
> > first that fixes it for the existing nowait users, and then a second
> > series to do all the directory stuff.
> > 
> > I'd do the first thing.  Just ignore the problem.  Directory atime
> > updates cause I/O so rarely that you can afford to ignore it.  Almost
> > everyone uses relatime or nodiratime.
> 
> Hi Matthew,
> The previous discussion shows this does cause issues in real
> producations: 
> https://lore.kernel.org/io-uring/2785f009-2ebb-028d-8250-d5f3a3051...@gmail.com/#:~:text=fwiw%2C%20we%27ve%20just%20recently%20had%20similar%20problems%20with%20io_uring%20read/write
> 

Then separate it out into it's own patch set so we can have a
discussion on the merits of requiring using noatime, relatime or
lazytime for really latency sensitive IO applications. Changing code
is not always the right solution...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH RFC v5 00/29] io_uring getdents

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This series introduce getdents64 to io_uring, the code logic is similar
> with the snychronized version's. It first try nowait issue, and offload
> it to io-wq threads if the first try fails.
> 
> Patch1 and Patch2 are some preparation
> Patch3 supports nowait for xfs getdents code
> Patch4-11 are vfs change, include adding helpers and trylock for locks
> Patch12-29 supports nowait for involved xfs journal stuff
> note, Patch24 and 27 are actually two questions, might be removed later.
> an xfs test may come later.

You need to drop all the XFS journal stuff. It's fundamentally
broken as it stands, and we cannot support non-blocking
transactional changes without first putting a massive investment in
transaction and intent chain rollback to allow correctly undoing
partially complete modifications.

Regardless, non-blocking transactions are completely unnecessary for
a non-blocking readdir implementation. readdir should only be
touching atime, and with relatime it should only occur once every 24
hours per inode. If that's a problem, then we have noatime mount
options. Hence I just don't see any point in worrying about having a
timestamp update block occasionally...

I also don't really don't see why you need to fiddle with xfs buffer
cache semantics - it already has the functionality "nowait" buffer
reads require (i.e.  XBF_INCORE|XBF_TRYLOCK).

However, the readahead IO that the xfs readdir code issues cannot
use your defined NOWAIT semantics - it must be able to allocate
memory and issue IO. Readahead already avoids blocking on memory
allocation and blocking on IO via the XBF_READ_AHEAD flag. This sets
__GFP_NORETRY for buffer allocation and REQ_RAHEAD for IO. Hence
readahead only needs the existing XBF_TRYLOCK flag to be set to be
compatible with the required NOWAIT semantics

As for the NOIO memory allocation restrictions io_uring requires,
that should be enforced at the io_uring layer before calling into
the VFS using memalloc_noio_save/restore.  At that point no memory
allocation will trigger IO and none of the code running under NOWAIT
conditions even needs to be aware that io_uring has a GFP_NOIO
restriction on memory allocation

Please go back to the simple "do non-blocking buffer IO"
implementation we started with and don't try to solve every little
blocking problem that might exist in the VFS and filesystems...

-Dave
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 25/29] xfs: support nowait for xfs_buf_item_init()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:27PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> support nowait for xfs_buf_item_init() and error out -EAGAIN to
> _xfs_trans_bjoin() when it would block.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_buf_item.c |  9 +++--
>  fs/xfs/xfs_buf_item.h |  2 +-
>  fs/xfs/xfs_buf_item_recover.c |  2 +-
>  fs/xfs/xfs_trans_buf.c| 16 +---
>  4 files changed, 22 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 023d4e0385dd..b1e63137d65b 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -827,7 +827,8 @@ xfs_buf_item_free_format(
>  int
>  xfs_buf_item_init(
>   struct xfs_buf  *bp,
> - struct xfs_mount *mp)
> + struct xfs_mount *mp,
> + bool   nowait)
>  {
>   struct xfs_buf_log_item *bip = bp->b_log_item;
>   int chunks;
> @@ -847,7 +848,11 @@ xfs_buf_item_init(
>   return 0;
>   }
>  
> - bip = kmem_cache_zalloc(xfs_buf_item_cache, GFP_KERNEL | __GFP_NOFAIL);
> + bip = kmem_cache_zalloc(xfs_buf_item_cache,
> + GFP_KERNEL | (nowait ? 0 : __GFP_NOFAIL));
> + if (!bip)
> + return -EAGAIN;
> +
>   xfs_log_item_init(mp, &bip->bli_item, XFS_LI_BUF, &xfs_buf_item_ops);
>   bip->bli_buf = bp;

I see filesystem shutdowns

> diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
> index 016371f58f26..a1e4f2e8629a 100644
> --- a/fs/xfs/xfs_trans_buf.c
> +++ b/fs/xfs/xfs_trans_buf.c
> @@ -57,13 +57,14 @@ xfs_trans_buf_item_match(
>   * If the buffer does not yet have a buf log item associated with it,
>   * then allocate one for it.  Then add the buf item to the transaction.
>   */
> -STATIC void
> +STATIC int
>  _xfs_trans_bjoin(
>   struct xfs_trans*tp,
>   struct xfs_buf  *bp,
>   int reset_recur)
>  {
>   struct xfs_buf_log_item *bip;
> + int ret;
>  
>   ASSERT(bp->b_transp == NULL);
>  
> @@ -72,7 +73,11 @@ _xfs_trans_bjoin(
>* it doesn't have one yet, then allocate one and initialize it.
>* The checks to see if one is there are in xfs_buf_item_init().
>*/
> - xfs_buf_item_init(bp, tp->t_mountp);
> + ret = xfs_buf_item_init(bp, tp->t_mountp,
> + tp->t_flags & XFS_TRANS_NOWAIT);
> + if (ret < 0)
> + return ret;
> +
>   bip = bp->b_log_item;
>   ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>   ASSERT(!(bip->__bli_format.blf_flags & XFS_BLF_CANCEL));
> @@ -92,6 +97,7 @@ _xfs_trans_bjoin(
>   xfs_trans_add_item(tp, &bip->bli_item);
>   bp->b_transp = tp;
>  
> + return 0;
>  }
>  
>  void
> @@ -309,7 +315,11 @@ xfs_trans_read_buf_map(
>   }
>  
>   if (tp) {
> - _xfs_trans_bjoin(tp, bp, 1);
> + error = _xfs_trans_bjoin(tp, bp, 1);
> + if (error) {
> + xfs_buf_relse(bp);
> + return error;
> + }
>   trace_xfs_trans_read_buf(bp->b_log_item);

So what happens at the callers when we have a dirty transaction and
joining a buffer fails with -EAGAIN?

Apart from the fact this may well propagate -EAGAIN up to userspace,
cancelling a dirty transaction at this point will result in a
filesystem shutdown

Indeed, this can happen in the "simple" timestamp update case that
this "nowait" semantic is being aimed at. We log the inode in the
timestamp update, which dirties the log item and registers a
precommit operation to be run. We commit the
transaction, which then runs xfs_inode_item_precommit() and that
may need to attach the inode to the inode cluster buffer. This
results in:

xfs_inode_item_precommit
  xfs_imap_to_bp
xfs_trans_read_buf_map
  _xfs_trans_bjoin
xfs_buf_item_init(XFS_TRANS_NOWAIT)
  kmem_cache_zalloc(GFP_NOFS)
  
  gets -EAGAIN error
propagates -EAGAIN
  fails due to -EAGAIN

And now xfs_trans_commit() fails with a dirty transaction and the
filesystem shuts down.

IOWs, XFS_TRANS_NOWAIT as it stands is fundamentally broken. Once we
dirty an item in a transaction, we *cannot* back out of the
transaction. We *must block* in every place that could fail -
locking, memory allocation and/or IO - until the transaction
completes because we cannot undo the changes we've already made to
the dirty items in the transaction

It's even worse than that - once we have committed intents, the
whole chain of intent processing must be run to completionr. Hence
we can't tolerate backing out of

Re: [f2fs-dev] [PATCH 24/29] xfs: support nowait for xfs_buf_read_map()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:26PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This causes xfstests generic/232 hung in umount process, waiting for ail
> push, so I comment it for now, need some hints from xfs folks.
> Not a real patch.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_buf.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index cdad80e1ae25..284962a9f31a 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -828,6 +828,13 @@ xfs_buf_read_map(
>   trace_xfs_buf_read(bp, flags, _RET_IP_);
>  
>   if (!(bp->b_flags & XBF_DONE)) {
> +//   /*
> +//* Let's bypass the _xfs_buf_read() for now
> +//*/
> +//   if (flags & XBF_NOWAIT) {
> +//   xfs_buf_relse(bp);
> +//   return -EAGAIN;
> +//   }

This is *fundamentally broken*, and apart from anything else breaks
readahead.

IF we asked for a read, we cannot instantiate the buffer and then
*not issue any IO on it* and release it. That leaves an
uninitialised buffer in memory, and there's every chance that
something then trips over it and bad things happen.

A buffer like this *must* be errored out and marked stale so that
the next access to it will then re-initialise the buffer state and
trigger any preparatory work that needs to be done for the new
operation.

This comes back to my first comments that XBF_TRYLOCK cannot simpy
be replaced with XBF_NOWAIT semantics...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 28/29] xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:30PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Apply trylock logic for xc_ctx_lock in xlog_cil_commit() in nowait
> case and error out -EAGAIN for xlog_cil_commit().

Again, fundamentally broken. Any error from xlog_cil_commit() will
result in a filesystem shutdown as we cannot back out from failure
with dirty log items gracefully at this point.

-Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 26/29] xfs: return -EAGAIN when nowait meets sync in transaction commit

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
On Fri, Aug 25, 2023 at 09:54:28PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> if the log transaction is a sync one, let's fail the nowait try and
> return -EAGAIN directly since sync transaction means blocked by IO.
> 
> Signed-off-by: Hao Xu 
> ---
>  fs/xfs/xfs_trans.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 7988b4c7f36e..f1f84a3dd456 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -968,12 +968,24 @@ __xfs_trans_commit(
>   xfs_csn_t   commit_seq = 0;
>   int error = 0;
>   int sync = tp->t_flags & XFS_TRANS_SYNC;
> + boolnowait = tp->t_flags & XFS_TRANS_NOWAIT;
> + boolperm_log = tp->t_flags & XFS_TRANS_PERM_LOG_RES;
>  
>   trace_xfs_trans_commit(tp, _RET_IP_);
>  
> + if (nowait && sync) {
> + /*
> +  * Currently nowait is only from xfs_vn_update_time()
> +  * so perm_log is always false here, but let's make
> +  * code general.
> +  */
> + if (perm_log)
> + xfs_defer_cancel(tp);
> + goto out_unreserve;
> + }

This is fundamentally broken.  We cannot about a transaction commit
with dirty items at this point with shutting down the filesystem.

This points to XFS_TRANS_NOWAIT being completely broken, too,
because we don't call xfs_trans_set_sync() until just before we
commit the transaction. At this point, it is -too late- for
nowait+sync to be handled gracefully, and it will *always* go bad.

IOWs, the whole transaction "nowait" semantics as designed and
implemented is not a workable solution

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 02/29] xfs: rename XBF_TRYLOCK to XBF_NOWAIT

2023-08-25 Thread Dave Chinner via Linux-f2fs-devel
42208..8cd307626939 100644
> --- a/fs/xfs/xfs_buf.h
> +++ b/fs/xfs/xfs_buf.h
> @@ -45,7 +45,7 @@ struct xfs_buf;
>  
>  /* flags used only as arguments to access routines */
>  #define XBF_INCORE(1u << 29)/* lookup only, return if found in cache */
> -#define XBF_TRYLOCK   (1u << 30)/* lock requested, but do not wait */
> +#define XBF_NOWAIT(1u << 30)/* mem/lock requested, but do not wait */

That's now a really poor comment. It doesn't describe the semantics
or constraints that NOWAIT might imply.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 46/48] mm: shrinker: make memcg slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
clear_bit(offset, unit->map);
> + if (unlikely(!shrinker || !shrinker_try_get(shrinker))) 
> {
> + clear_bit(offset, unit->map);
> + rcu_read_unlock();
>   continue;
>   }
> + rcu_read_unlock();
>  
>   /* Call non-slab shrinkers even though kmem is disabled 
> */
>   if (!memcg_kmem_online() &&
> @@ -523,15 +546,20 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   set_shrinker_bit(memcg, nid, 
> shrinker_id);
>   }
>   freed += ret;
> -
> - if (rwsem_is_contended(&shrinker_rwsem)) {
> - freed = freed ? : 1;
> - goto unlock;
> - }
> + shrinker_put(shrinker);

Ok, so why is this safe to call without holding the rcu read lock?
The global shrinker has to hold the rcu_read_lock() whilst calling
shrinker_put() to guarantee the validity of the list next pointer,
but we don't hold off RCU here so what guarantees a racing global
shrinker walk doesn't trip over this shrinker_put() call dropping
the refcount to zero and freeing occuring in a different context...


> +     /*
> +  * We have already exited the read-side of rcu critical section
> +  * before calling do_shrink_slab(), the shrinker_info may be
> +  * released in expand_one_shrinker_info(), so reacquire the
> +  * shrinker_info.
> +  */
> + index++;
> + goto again;

With that, what makes the use of shrinker_info in
xchg_nr_deferred_memcg() in do_shrink_slab() coherent and valid?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

Documentation, please. What does the refcount protect, what does the
completion provide, etc.

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This needs to be moved to the head of the function, and document
the whole list walk, get, put and completion parts of the algorithm
that make it safe. There's more to this than "we hold a reference
count", especially the tricky "we might see the shrinker before it
is fully initialised" case


.
>  void shrinker_free(struct shrinker *shrinker)
>  {
>   struct dentry *debugfs_entry = NULL;
> @@ -686,9 +712,18 @@ void shrinker_free(struct shrinker *shrinker)
>   if (!shrinker)
>   return;
>  
> + if (shrinker->flags & SHRINKER_REGISTERED) {
> + shrinker_put(shrinker);
> + wait_for_completion(&shrinker->done);
> + }

Needs a comment explaining why we need to wait here...
> +
>   down_write(&shrinker_rwsem);
>   if (shrinker->flags & SHRINKER_REGISTERED) {
> - list_del(&shrinker->list);
> + /*
> +  * Lookups on the shrinker are over and will f

Re: [f2fs-dev] [PATCH v4 44/48] mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
 atomic_long_add(nr, 
> &parent_unit->nr_deferred[offset]);
> + }
>   }
>   }
>   up_read(&shrinker_rwsem);
> @@ -407,7 +458,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>  {
>   struct shrinker_info *info;
>   unsigned long ret, freed = 0;
> - int i;
> + int offset, index = 0;
>  
>   if (!mem_cgroup_online(memcg))
>   return 0;
> @@ -419,56 +470,63 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, 
> int nid,
>   if (unlikely(!info))
>   goto unlock;
>  
> - for_each_set_bit(i, info->map, info->map_nr_max) {
> - struct shrink_control sc = {
> - .gfp_mask = gfp_mask,
> - .nid = nid,
> - .memcg = memcg,
> - };
> - struct shrinker *shrinker;
> + for (; index < shriner_id_to_index(info->map_nr_max); index++) {
> + struct shrinker_info_unit *unit;

This adds another layer of indent to shrink_slab_memcg(). Please
factor it first so that the code ends up being readable. Doing that
first as a separate patch will also make the actual algorithm
changes in this patch be much more obvious - this huge hunk of
diff is pretty much impossible to review...

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 45/48] mm: shrinker: make global slab shrink lockless

2023-08-07 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Aug 07, 2023 at 07:09:33PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.

> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.

> ---
>  include/linux/shrinker.h | 17 ++
>  mm/shrinker.c| 70 +---
>  2 files changed, 68 insertions(+), 19 deletions(-)

There's no documentation in the code explaining how the lockless
shrinker algorithm works. It's left to the reader to work out how
this all goes together

> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index eb342994675a..f06225f18531 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,8 @@
>  
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  #define SHRINKER_UNIT_BITS   BITS_PER_LONG
>  
> @@ -87,6 +89,10 @@ struct shrinker {
>   int seeks;  /* seeks to recreate an obj */
>   unsigned flags;
>  
> + refcount_t refcount;
> + struct completion done;
> + struct rcu_head rcu;

What does the refcount protect, why do we need the completion, etc?

> +
>   void *private_data;
>  
>   /* These are for internal use */
> @@ -120,6 +126,17 @@ struct shrinker *shrinker_alloc(unsigned int flags, 
> const char *fmt, ...);
>  void shrinker_register(struct shrinker *shrinker);
>  void shrinker_free(struct shrinker *shrinker);
>  
> +static inline bool shrinker_try_get(struct shrinker *shrinker)
> +{
> + return refcount_inc_not_zero(&shrinker->refcount);
> +}
> +
> +static inline void shrinker_put(struct shrinker *shrinker)
> +{
> + if (refcount_dec_and_test(&shrinker->refcount))
> + complete(&shrinker->done);
> +}
> +
>  #ifdef CONFIG_SHRINKER_DEBUG
>  extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker,
> const char *fmt, ...);
> diff --git a/mm/shrinker.c b/mm/shrinker.c
> index 1911c06b8af5..d318f5621862 100644
> --- a/mm/shrinker.c
> +++ b/mm/shrinker.c
> @@ -2,6 +2,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "internal.h"
> @@ -577,33 +578,42 @@ unsigned long shrink_slab(gfp_t gfp_mask, int nid, 
> struct mem_cgroup *memcg,
>   if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>   return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>  
> - if (!down_read_trylock(&shrinker_rwsem))
> - goto out;
> -
> - list_for_each_entry(shrinker, &shrinker_list, list) {
> + rcu_read_lock();
> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>   struct shrink_control sc = {
>   .gfp_mask = gfp_mask,
>   .nid = nid,
>   .memcg = memcg,
>   };
>  
> + if (!shrinker_try_get(shrinker))
> + continue;
> +
> + /*
> +  * We can safely unlock the RCU lock here since we already
> +  * hold the refcount of the shrinker.
> +  */
> + rcu_read_unlock();
> +
>   ret = do_shrink_slab(&sc, shrinker, priority);
>   if (ret == SHRINK_EMPTY)
>   ret = 0;
>   freed += ret;
> +
>   /*
> -  * Bail out if someone want to register a new shrinker to
> -  * prevent the registration from being stalled for long periods
> -  * by parallel ongoing shrinking.
> +  * This shrinker may be deleted from shrinker_list and freed
> +  * after the shrinker_put() below, but this shrinker is still
> +  * used for the next traversal. So it is necessary to hold the
> +  * RCU lock first to prevent this shrinker from being freed,
> +  * which also ensures that the next shrinker that is traversed
> +  * will not be freed (even if it is deleted from shrinker_list
> +  * at the same time).
>*/

This comment really should be at the head of the function,
describing the algorithm used within the function itself. i.e. how
reference counts are used w.r.t. the rcu_read_lock() usage to
guarantee existence of the shrinker and the validity of the list
walk.

I'm not going

Re: [f2fs-dev] [PATCH v3 28/49] dm zoned: dynamically allocate the dm-zoned-meta shrinker

2023-07-27 Thread Dave Chinner via Linux-f2fs-devel
On Thu, Jul 27, 2023 at 07:20:46PM +0900, Damien Le Moal wrote:
> On 7/27/23 17:55, Qi Zheng wrote:
> >>>   goto err;
> >>>   }
> >>>   +    zmd->mblk_shrinker->count_objects = dmz_mblock_shrinker_count;
> >>> +    zmd->mblk_shrinker->scan_objects = dmz_mblock_shrinker_scan;
> >>> +    zmd->mblk_shrinker->seeks = DEFAULT_SEEKS;
> >>> +    zmd->mblk_shrinker->private_data = zmd;
> >>> +
> >>> +    shrinker_register(zmd->mblk_shrinker);
> >>
> >> I fail to see how this new shrinker API is better... Why isn't there a
> >> shrinker_alloc_and_register() function ? That would avoid adding all this 
> >> code
> >> all over the place as the new API call would be very similar to the current
> >> shrinker_register() call with static allocation.
> > 
> > In some registration scenarios, memory needs to be allocated in advance.
> > So we continue to use the previous prealloc/register_prepared()
> > algorithm. The shrinker_alloc_and_register() is just a helper function
> > that combines the two, and this increases the number of APIs that
> > shrinker exposes to the outside, so I choose not to add this helper.
> 
> And that results in more code in many places instead of less code + a simple
> inline helper in the shrinker header file...

It's not just a "simple helper" - it's a function that has to take 6
or 7 parameters with a return value that must be checked and
handled.

This was done in the first versions of the patch set - the amount of
code in each caller does not go down and, IMO, was much harder to
read and determine "this is obviously correct" that what we have
now.

> So not adding that super simple
> helper is not exactly the best choice in my opinion.

Each to their own - I much prefer the existing style/API over having
to go look up a helper function every time I want to check some
random shrinker has been set up correctly

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Jul 26, 2023 at 05:14:09PM +0800, Qi Zheng wrote:
> On 2023/7/26 16:08, Dave Chinner wrote:
> > On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> > > @@ -122,6 +126,13 @@ void shrinker_free_non_registered(struct shrinker 
> > > *shrinker);
> > >   void shrinker_register(struct shrinker *shrinker);
> > >   void shrinker_unregister(struct shrinker *shrinker);
> > > +static inline bool shrinker_try_get(struct shrinker *shrinker)
> > > +{
> > > + return READ_ONCE(shrinker->registered) &&
> > > +refcount_inc_not_zero(&shrinker->refcount);
> > > +}
> > 
> > Why do we care about shrinker->registered here? If we don't set
> > the refcount to 1 until we have fully initialised everything, then
> > the shrinker code can key entirely off the reference count and
> > none of the lookup code needs to care about whether the shrinker is
> > registered or not.
> 
> The purpose of checking shrinker->registered here is to stop running
> shrinker after calling shrinker_free(), which can prevent the following
> situations from happening:
> 
> CPU 0 CPU 1
> 
> shrinker_try_get()
> 
>shrinker_try_get()
> 
> shrinker_put()
> shrinker_try_get()
>shrinker_put()

I don't see any race here? What is wrong with having multiple active
users at once?

> > 
> > This should use a completion, then it is always safe under
> > rcu_read_lock().  This also gets rid of the shrinker_lock spin lock,
> > which only exists because we can't take a blocking lock under
> > rcu_read_lock(). i.e:
> > 
> > 
> > void shrinker_put(struct shrinker *shrinker)
> > {
> > if (refcount_dec_and_test(&shrinker->refcount))
> > complete(&shrinker->done);
> > }
> > 
> > void shrinker_free()
> > {
> > .
> > refcount_dec(&shrinker->refcount);
> 
> I guess what you mean is shrinker_put(), because here may be the last
> refcount.

Yes, I did.

> > wait_for_completion(&shrinker->done);
> > /*
> >  * lookups on the shrinker will now all fail as refcount has
> >  * fallen to zero. We can now remove it from the lists and
> >  * free it.
> >  */
> > down_write(shrinker_rwsem);
> > list_del_rcu(&shrinker->list);
> > up_write(&shrinker_rwsem);
> > call_rcu(shrinker->rcu_head, shrinker_free_rcu_cb);
> > }
> > 
> > 
> > 
> > > @@ -686,11 +711,14 @@ EXPORT_SYMBOL(shrinker_free_non_registered);
> > >   void shrinker_register(struct shrinker *shrinker)
> > >   {
> > > - down_write(&shrinker_rwsem);
> > > - list_add_tail(&shrinker->list, &shrinker_list);
> > > - shrinker->flags |= SHRINKER_REGISTERED;
> > > + refcount_set(&shrinker->refcount, 1);
> > > +
> > > + spin_lock(&shrinker_lock);
> > > + list_add_tail_rcu(&shrinker->list, &shrinker_list);
> > > + spin_unlock(&shrinker_lock);
> > > +
> > >   shrinker_debugfs_add(shrinker);
> > > - up_write(&shrinker_rwsem);
> > > + WRITE_ONCE(shrinker->registered, true);
> > >   }
> > >   EXPORT_SYMBOL(shrinker_register);
> > 
> > This just looks wrong - you are trying to use WRITE_ONCE() as a
> > release barrier to indicate that the shrinker is now set up fully.
> > That's not necessary - the refcount is an atomic and along with the
> > rcu locks they should provides all the barriers we need. i.e.
> 
> The reason I used WRITE_ONCE() here is because the shrinker->registered
> will be read and written concurrently (read in shrinker_try_get() and
> written in shrinker_free()), which is why I added shrinker::registered
> field instead of using SHRINKER_REGISTERED flag (this can reduce the
> addition of WRITE_ONCE()/READ_ONCE()).

Using WRITE_ONCE/READ_ONCE doesn't provide memory barriers needed to
use the field like this. You need release/acquire memory ordering
here. i.e. smp_store_release()/smp_load_acquire().

As it is, the refcount_inc_not_zero() provides a control dependency,
as documented in include/linux/refcount.h, refcount_dec_and_test()
provides release memory ordering. The only thing I think we may need
is a write barrier before refcount_set(), such that if
refcount_inc_not_zero() sees a non-zero value, it is guaranteed to
see an initialised structure...

i.e. refcounts provide all the existence and initialisation
guarantees. Hence I don't see the need to use shr

Re: [f2fs-dev] [PATCH v2 44/47] mm: shrinker: make global slab shrink lockless

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 24, 2023 at 05:43:51PM +0800, Qi Zheng wrote:
> The shrinker_rwsem is a global read-write lock in shrinkers subsystem,
> which protects most operations such as slab shrink, registration and
> unregistration of shrinkers, etc. This can easily cause problems in the
> following cases.
> 
> 1) When the memory pressure is high and there are many filesystems
>mounted or unmounted at the same time, slab shrink will be affected
>(down_read_trylock() failed).
> 
>Such as the real workload mentioned by Kirill Tkhai:
> 
>```
>One of the real workloads from my experience is start
>of an overcommitted node containing many starting
>containers after node crash (or many resuming containers
>after reboot for kernel update). In these cases memory
>pressure is huge, and the node goes round in long reclaim.
>```
> 
> 2) If a shrinker is blocked (such as the case mentioned
>in [1]) and a writer comes in (such as mount a fs),
>then this writer will be blocked and cause all
>subsequent shrinker-related operations to be blocked.
> 
> Even if there is no competitor when shrinking slab, there may still be a
> problem. The down_read_trylock() may become a perf hotspot with frequent
> calls to shrink_slab(). Because of the poor multicore scalability of
> atomic operations, this can lead to a significant drop in IPC
> (instructions per cycle).
> 
> We used to implement the lockless slab shrink with SRCU [2], but then
> kernel test robot reported -88.8% regression in
> stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].
> 
> This commit uses the refcount+RCU method [5] proposed by Dave Chinner
> to re-implement the lockless global slab shrink. The memcg slab shrink is
> handled in the subsequent patch.
> 
> For now, all shrinker instances are converted to dynamically allocated and
> will be freed by kfree_rcu(). So we can use rcu_read_{lock,unlock}() to
> ensure that the shrinker instance is valid.
> 
> And the shrinker instance will not be run again after unregistration. So
> the structure that records the pointer of shrinker instance can be safely
> freed without waiting for the RCU read-side critical section.
> 
> In this way, while we implement the lockless slab shrink, we don't need to
> be blocked in unregister_shrinker().
> 
> The following are the test results:
> 
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> 
> 1) Before applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s bogo 
> ops/s
>   (secs)(secs)(secs)   (real time) (usr+sys 
> time)
> ramfs735238 60.00 12.37363.70 12253.05
> 1955.08
> for a 60.01s run time:
>1440.27s available CPU time
>  12.36s user time   (  0.86%)
> 363.70s system time ( 25.25%)
> 376.06s total time  ( 26.11%)
> load average: 10.79 4.47 1.69
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> 2) After applying this patchset:
> 
> setting to a 60 second run per stressor
> dispatching hogs: 9 ramfs
> stressor   bogo ops real time  usr time  sys time   bogo ops/s bogo 
> ops/s
>   (secs)(secs)(secs)   (real time) (usr+sys 
> time)
> ramfs746677 60.00 12.22367.75 12443.70
> 1965.13
> for a 60.01s run time:
>1440.26s available CPU time
>  12.21s user time   (  0.85%)
> 367.75s system time ( 25.53%)
> 379.96s total time  ( 26.38%)
> load average: 8.37 2.48 0.86
> passed: 9: ramfs (9)
> failed: 0
> skipped: 0
> successful run completed in 60.01s (1 min, 0.01 secs)
> 
> We can see that the ops/s has hardly changed.
> 
> [1]. 
> https://lore.kernel.org/lkml/20191129214541.3110-1-ptikhomi...@virtuozzo.com/
> [2]. 
> https://lore.kernel.org/lkml/20230313112819.38938-1-zhengqi.a...@bytedance.com/
> [3]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie@intel.com/
> [4]. https://lore.kernel.org/all/20230609081518.3039120-1-qi.zh...@linux.dev/
> [5]. https://lore.kernel.org/lkml/zijhou1d55d4h...@dread.disaster.area/
> 
> Signed-off-by: Qi Zheng 
> ---
>  include/linux/shrinker.h | 19 +++---
>  mm/shrinker.c| 75 ++--
>  mm/shrinker_debug.c  | 52 +---
>  3 files changed, 104 insertions(+), 42 deletions(-)
> 
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 36977a70bebb..335da93cccee 100644
> ---

Re: [f2fs-dev] [PATCH v2 03/47] mm: shrinker: add infrastructure for dynamically allocating shrinker

2023-07-26 Thread Dave Chinner via Linux-f2fs-devel
On Mon, Jul 24, 2023 at 05:43:10PM +0800, Qi Zheng wrote:
> Currently, the shrinker instances can be divided into the following three
> types:
> 
> a) global shrinker instance statically defined in the kernel, such as
>workingset_shadow_shrinker.
> 
> b) global shrinker instance statically defined in the kernel modules, such
>as mmu_shrinker in x86.
> 
> c) shrinker instance embedded in other structures.
> 
> For case a, the memory of shrinker instance is never freed. For case b,
> the memory of shrinker instance will be freed after synchronize_rcu() when
> the module is unloaded. For case c, the memory of shrinker instance will
> be freed along with the structure it is embedded in.
> 
> In preparation for implementing lockless slab shrink, we need to
> dynamically allocate those shrinker instances in case c, then the memory
> can be dynamically freed alone by calling kfree_rcu().
> 
> So this commit adds the following new APIs for dynamically allocating
> shrinker, and add a private_data field to struct shrinker to record and
> get the original embedded structure.
> 
> 1. shrinker_alloc()
> 
> Used to allocate shrinker instance itself and related memory, it will
> return a pointer to the shrinker instance on success and NULL on failure.
> 
> 2. shrinker_free_non_registered()
> 
> Used to destroy the non-registered shrinker instance.

This is a bit nasty

> 
> 3. shrinker_register()
> 
> Used to register the shrinker instance, which is same as the current
> register_shrinker_prepared().
> 
> 4. shrinker_unregister()

rename this "shrinker_free()" and key the two different freeing
cases on the SHRINKER_REGISTERED bit rather than mostly duplicating
the two.

void shrinker_free(struct shrinker *shrinker)
{
struct dentry *debugfs_entry = NULL;
int debugfs_id;

if (!shrinker)
return;

down_write(&shrinker_rwsem);
if (shrinker->flags & SHRINKER_REGISTERED) {
list_del(&shrinker->list);
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
} else if (IS_ENABLED(CONFIG_SHRINKER_DEBUG)) {
kfree_const(shrinker->name);
}

if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);

if (debugfs_entry)
    shrinker_debugfs_remove(debugfs_entry, debugfs_id);

kfree(shrinker->nr_deferred);
kfree(shrinker);
}
EXPORT_SYMBOL_GPL(shrinker_free);

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v5 6/8] xfs: switch to multigrain timestamps

2023-07-18 Thread Dave Chinner via Linux-f2fs-devel
On Thu, Jul 13, 2023 at 07:00:55PM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not
> set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
>  fs/xfs/xfs_iops.c   | 4 ++--
>  fs/xfs/xfs_super.c  | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 0c9df8df6d4a..86f5ffce2d89 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG));

Make that an ASSERT(flags & XFS_ICHGTIME_CHG), please. There's no
need to verify this at runtime on production kernels.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 16/17] block: use iomap for writes to block devices

2023-05-23 Thread Dave Chinner via Linux-f2fs-devel
On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> On 4/24/23 07:49, Christoph Hellwig wrote:
> > Use iomap in buffer_head compat mode to write to block devices.
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >   block/Kconfig |  1 +
> >   block/fops.c  | 33 +
> >   2 files changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 941b2dca70db73..672b08f0096ab4 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -5,6 +5,7 @@
> >   menuconfig BLOCK
> >  bool "Enable the block layer" if EXPERT
> >  default y
> > +   select IOMAP
> >  select SBITMAP
> >  help
> >  Provide block layer support for the kernel.
> > diff --git a/block/fops.c b/block/fops.c
> > index 318247832a7bcf..7910636f8df33b 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -15,6 +15,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include "blk.h"
> > @@ -386,6 +387,27 @@ static ssize_t blkdev_direct_IO(struct kiocb *iocb, 
> > struct iov_iter *iter)
> > return __blkdev_direct_IO(iocb, iter, bio_max_segs(nr_pages));
> >   }
> > +static int blkdev_iomap_begin(struct inode *inode, loff_t offset, loff_t 
> > length,
> > +   unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> > +{
> > +   struct block_device *bdev = I_BDEV(inode);
> > +   loff_t isize = i_size_read(inode);
> > +
> > +   iomap->bdev = bdev;
> > +   iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> > +   if (WARN_ON_ONCE(iomap->offset >= isize))
> > +   return -EIO;
> 
> I'm hitting this during booting:
> [5.016324]  
> [5.030256]  iomap_iter+0x11a/0x350
> [5.030264]  iomap_readahead+0x1eb/0x2c0
> [5.030272]  read_pages+0x5d/0x220
> [5.030279]  page_cache_ra_unbounded+0x131/0x180
> [5.030284]  filemap_get_pages+0xff/0x5a0

Why is filemap_get_pages() using unbounded readahead? Surely
readahead should be limited to reading within EOF

> [5.030292]  filemap_read+0xca/0x320
> [5.030296]  ? aa_file_perm+0x126/0x500
> [5.040216]  ? touch_atime+0xc8/0x150
> [5.040224]  blkdev_read_iter+0xb0/0x150
> [5.040228]  vfs_read+0x226/0x2d0
> [5.040234]  ksys_read+0xa5/0xe0
> [5.040238]  do_syscall_64+0x5b/0x80
> 
> Maybe we should consider this patch:
> 
> diff --git a/block/fops.c b/block/fops.c
> index 524b8a828aad..d202fb663f25 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -386,10 +386,13 @@ static int blkdev_iomap_begin(struct inode *inode,
> loff_t offset, loff_t length,
> 
> iomap->bdev = bdev;
> iomap->offset = ALIGN_DOWN(offset, bdev_logical_block_size(bdev));
> -   if (WARN_ON_ONCE(iomap->offset >= isize))
> -   return -EIO;
> -   iomap->type = IOMAP_MAPPED;
> -   iomap->addr = iomap->offset;
> +   if (WARN_ON_ONCE(iomap->offset >= isize)) {
> +   iomap->type = IOMAP_HOLE;
> +   iomap->addr = IOMAP_NULL_ADDR;
> +   } else {
> +   iomap->type = IOMAP_MAPPED;
> +   iomap->addr = iomap->offset;
> +   }

I think Christoph's code is correct. IMO, any attempt to read beyond
the end of the device should throw out a warning and return an
error, not silently return zeros.

If readahead is trying to read beyond the end of the device, then it
really seems to me like the problem here is readahead, not the iomap
code detecting the OOB read request

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 00/23] fs-verity support for XFS

2023-04-11 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 11, 2023 at 07:33:19PM -0700, Eric Biggers wrote:
> On Mon, Apr 10, 2023 at 10:19:46PM -0700, Christoph Hellwig wrote:
> > Dave is going to hate me for this, but..
> > 
> > I've been looking over some of the interfaces here, and I'm starting
> > to very seriously questioning the design decisions of storing the
> > fsverity hashes in xattrs.
> > 
> > Yes, storing them beyond i_size in the file is a bit of a hack, but
> > it allows to reuse a lot of the existing infrastructure, and much
> > of fsverity is based around it.  So storing them in an xattrs causes
> > a lot of churn in the interface.  And the XFS side with special
> > casing xattr indices also seems not exactly nice.
> 
> It seems it's really just the Merkle tree caching interface that is causing
> problems, as it's currently too closely tied to the page cache?  That is just 
> an
> implementation detail that could be reworked along the lines of what is being
> discussed.
> 
> But anyway, it is up to the XFS folks.  Keep in mind there is also the option 
> of
> doing what btrfs is doing, where it stores the Merkle tree separately from the
> file data stream, but caches it past i_size in the page cache at runtime.

Right. It's not entirely simple to store metadata on disk beyond EOF
in XFS because of all the assumptions throughout the IO path and
allocator interfaces that it can allocate space beyond EOF at will
and something else will clean it up later if it is not needed. This
impacts on truncate, delayed allocation, writeback, IO completion,
EOF block removal on file close, background garbage collection,
ENOSPC/EDQUOT driven space freeing, etc.  Some of these things cross
over into iomap infrastructure, too.

AFAIC, it's far more intricate, complex and risky to try to store
merkle tree data beyond EOF than it is to put it in an xattr
namespace because IO path EOF handling bugs result in user data
corruption. This happens over and over again, no matter how careful
we are about these aspects of user data handling.

OTOH, putting the merkle tree data in a different namespace avoids
these issues completely. Yes, we now have to solve an API mismatch,
but we aren't risking the addition of IO path data corruption bugs
to every non-fsverity filesystem in production...

Hence I think copying the btrfs approach (i.e. only caching the
merkle tree data in the page cache beyond EOF) would be as far as I
think we'd want to go. Realistically, there would be little
practical difference between btrfs storing the merkle tree blocks in
a separate internal btree and XFS storing them in an internal
private xattr btree namespace.

I would, however, prefer not to have to do this at all if we could
simply map the blocks directly out of the xattr buffers as we
already do internally for all the XFS code...

> I guess there is also the issue of encryption, which hasn't come up yet since
> we're talking about fsverity support only.  The Merkle tree (including the
> fsverity_descriptor) is supposed to be encrypted, just like the file contents
> are.  Having it be stored after the file contents accomplishes that easily...
> Of course, it doesn't have to be that way; a separate key could be derived, or
> the Merkle tree blocks could be encrypted with the file contents key using
> indices past i_size, without them physically being stored in the data stream.

I'm expecting that fscrypt for XFS will include encryption of the
xattr names and values (just like we will need to do for directory
names) except for the xattrs that hold the encryption keys
themselves. That means the merkle tree blocks should get encrypted
without any extra work needing to be done anywhere.  This will
simply require the fscrypt keys to be held in a private internal
xattr namespace that isn't encrypted, but that's realtively trivial
to do...

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 10:54:06PM +, Eric Biggers wrote:
> On Thu, Apr 06, 2023 at 08:26:46AM +1000, Dave Chinner wrote:
> > > We could certainly think about moving to a design where fs/verity/ asks 
> > > the
> > > filesystem to just *read* a Merkle tree block, without adding it to a 
> > > cache, and
> > > then fs/verity/ implements the caching itself.  That would require some 
> > > large
> > > changes to each filesystem, though, unless we were to double-cache the 
> > > Merkle
> > > tree blocks which would be inefficient.
> > 
> > No, that's unnecessary.
> > 
> > All we need if for fsverity to require filesystems to pass it byte
> > addressable data buffers that are externally reference counted. The
> > filesystem can take a page reference before mapping the page and
> > passing the kaddr to fsverity, then unmap and drop the reference
> > when the merkle tree walk is done as per Andrey's new drop callout.
> > 
> > fsverity doesn't need to care what the buffer is made from, how it
> > is cached, what it's life cycle is, etc. The caching mechanism and
> > reference counting is entirely controlled by the filesystem callout
> > implementations, and fsverity only needs to deal with memory buffers
> > that are guaranteed to live for the entire walk of the merkle
> > tree
> 
> Sure.  Just a couple notes:
> 
> First, fs/verity/ does still need to be able to tell whether the buffer is 
> newly
> instantiated or not.

Boolean flag from the caller.

> Second, fs/verity/ uses the ahash API to do the hashing.  ahash is a
> scatterlist-based API.  Virtual addresses can still be used (see 
> sg_set_buf()),
> but the memory cannot be vmalloc'ed memory, since virt_to_page() needs to 
> work.
> Does XFS use vmalloc'ed memory for these buffers?

Not vmalloc'ed, but vmapped. we allocate the pages individually, but
then call vm_map_page() to present the higher level code with a
single contiguous memory range if it is a multi-page buffer.

We do have the backing info held in the buffer, and that's what we
use for IO. If fsverity needs a page based scatter/gather list
for hardware offload, it could ask the filesystem to provide it
for that given buffer...

> BTW, converting fs/verity/ from ahash to shash is an option; I've really never
> been a fan of the scatterlist-based crypto APIs!  The disadvantage of doing
> this, though, would be that it would remove support for all the hardware 
> crypto
> drivers.
>
> That *might* actually be okay, as that approach to crypto acceleration
> has mostly fallen out of favor, in favor of CPU-based acceleration.  But I do
> worry about e.g. someone coming out of the woodwork and saying they need to 
> use
> fsverity on a low-powered ARM board that has a crypto accelerator like CAAM, 
> and
> they MUST use their crypto accelerator to get acceptable performance.

True, but we are very unlikely to be using XFS on such small
systems and I don't think we really care about XFS performance on
android sized systems, either.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
gt; > every
> > time the cache is accessed.  The problem with what you have now is that 
> > every
> > time a single 32-byte hash is accessed, a full page (potentially 64KB!) 
> > will be
> > allocated and filled.  That's not very efficient.  The need to allocate the
> > temporary page can also cause ENOMEM (which will get reported as EIO).
> > 
> > Did you consider alternatives that would work more efficiently?  I think it
> > would be worth designing something that works properly with how XFS is 
> > planned
> > to cache the Merkle tree, instead of designing a workaround.
> > ->read_merkle_tree_page was not really designed for what you are doing here.
> > 
> > How about replacing ->read_merkle_tree_page with a function that takes in a
> > Merkle tree block index (not a page index!) and hands back a (page, offset) 
> > pair
> > that identifies where the Merkle tree block's data is located?  Or (folio,
> > offset), I suppose.

{kaddr, len}, please.

> > 
> > With that, would it be possible to directly return the XFS cache?
> > 
> > - Eric
> > 
> 
> Yeah, I also don't like it, I didn't want to change fs-verity much
> so went with this workaround. But as it's ok, I will look into trying
> to pass xfs buffers to fs-verity without direct use of
> ->read_merkle_tree_page(). I think it's possible with (folio,
> offset), the xfs buffers aren't xattr value align so the 4k merkle
> tree block is stored in two pages.

I don't think this is necessary to actually merge the code. We want
it to work correctly as the primary concern, performance is a
secondary concern.

Regardless, as you mention, the xfs_buf is not made up of contiguous
pages so the merkle tree block data will be split across two
(or more) pages.  AFAICT, the fsverity code doesn't work with data
structures that span multiple disjoint pages...

Another problem is that the xfs-buf might be backed by heap memory
(e.g. 4kB fs block size on 64kB PAGE_SIZE) and so it cannot be
treated like a page cache page by the fsverity merkle tree code. We
most definitely do not want to be passing pages containing heap
memory to functions expecting to be passed lru-resident page cache
pages

That said, xfs-bufs do have a stable method of addressing the data
in the buffers, and all the XFS code uses this to access and
manipulate data directly in the buffers.

That is, xfs_buf_offset() returns a mapped kaddr that points to the
contiguous memory region containing the metadata in the buffer.  If
the xfs_buf spans multiple pages, it will return a kaddr pointing
into the contiguous vmapped memory address that maps the entire
buffer data range. If it is heap memory, it simply returns a pointer
into that heap memory. If it's a single page, then it returns the
kaddr for the data within the page.

If you move all the assumptions about how the merkle tree data is
managed out of fsverity and require the fielsystems to do the
mapping to kaddrs and reference counting to guarantee life times,
then the need for multiple different methods for reading merkle tree
data go away...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 06:02:47PM +, Eric Biggers wrote:
> And I really hope that you don't want to do DIO to the *Merkle tree*, as that

Not for XFS - the merkle tree is not held as file data.

That said, the merkle tree in XFS is not page cache or block aligned
metadata either, so we really want the same memory buffer based
interface for the merkle tree reading so that the merkle tree code
can read directly from the xfs-buf rather than requiring us to copy
it out of the xfsbuf into temporary pages...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 06:16:00PM +, Eric Biggers wrote:
> On Wed, Apr 05, 2023 at 09:38:47AM -0700, Darrick J. Wong wrote:
> > > The merkle tree pages are dropped after verification. When page is
> > > dropped xfs_buf is marked as verified. If fs-verity wants to
> > > verify again it will get the same verified buffer. If buffer is
> > > evicted it won't have verified state.
> > > 
> > > So, with enough memory pressure buffers will be dropped and need to
> > > be reverified.
> > 
> > Please excuse me if this was discussed and rejected long ago, but
> > perhaps fsverity should try to hang on to the merkle tree pages that
> > this function returns for as long as possible until reclaim comes for
> > them?
> > 
> > With the merkle tree page lifetimes extended, you then don't need to
> > attach the xfs_buf to page->private, nor does xfs have to extend the
> > buffer cache to stash XBF_VERITY_CHECKED.
> 
> Well, all the other filesystems that support fsverity (ext4, f2fs, and btrfs)
> just cache the Merkle tree pages in the inode's page cache.  It's an approach
> that I know some people aren't a fan of, but it's efficient and it works.

Which puts pages beyond EOF in the page cache. Given that XFS also
allows persistent block allocation beyond EOF, having both data in the page
cache and blocks beyond EOF that contain unrelated information is a
Real Bad Idea.

Just because putting metadata in the file data address space works
for one filesystem, it doesn't me it's a good idea or that it works
for every filesystem.

> We could certainly think about moving to a design where fs/verity/ asks the
> filesystem to just *read* a Merkle tree block, without adding it to a cache, 
> and
> then fs/verity/ implements the caching itself.  That would require some large
> changes to each filesystem, though, unless we were to double-cache the Merkle
> tree blocks which would be inefficient.

No, that's unnecessary.

All we need if for fsverity to require filesystems to pass it byte
addressable data buffers that are externally reference counted. The
filesystem can take a page reference before mapping the page and
passing the kaddr to fsverity, then unmap and drop the reference
when the merkle tree walk is done as per Andrey's new drop callout.

fsverity doesn't need to care what the buffer is made from, how it
is cached, what it's life cycle is, etc. The caching mechanism and
reference counting is entirely controlled by the filesystem callout
implementations, and fsverity only needs to deal with memory buffers
that are guaranteed to live for the entire walk of the merkle
tree

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Dave Chinner via Linux-f2fs-devel
On Wed, Apr 05, 2023 at 08:09:27AM -0700, Darrick J. Wong wrote:
> On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote:
> > On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote:
> > > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> > > > The direct path is not supported on verity files. Attempts to use direct
> > > > I/O path on such files should fall back to buffered I/O path.
> > > > 
> > > > Signed-off-by: Andrey Albershteyn 
> > > > ---
> > > >  fs/xfs/xfs_file.c | 14 +++---
> > > >  1 file changed, 11 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > index 947b5c436172..9e072e82f6c1 100644
> > > > --- a/fs/xfs/xfs_file.c
> > > > +++ b/fs/xfs/xfs_file.c
> > > > @@ -244,7 +244,8 @@ xfs_file_dax_read(
> > > > struct kiocb*iocb,
> > > > struct iov_iter *to)
> > > >  {
> > > > -   struct xfs_inode*ip = 
> > > > XFS_I(iocb->ki_filp->f_mapping->host);
> > > > +   struct inode*inode = iocb->ki_filp->f_mapping->host;
> > > > +   struct xfs_inode*ip = XFS_I(inode);
> > > > ssize_t ret = 0;
> > > >  
> > > > trace_xfs_file_dax_read(iocb, to);
> > > > @@ -297,10 +298,17 @@ xfs_file_read_iter(
> > > >  
> > > > if (IS_DAX(inode))
> > > > ret = xfs_file_dax_read(iocb, to);
> > > > -   else if (iocb->ki_flags & IOCB_DIRECT)
> > > > +   else if (iocb->ki_flags & IOCB_DIRECT && 
> > > > !fsverity_active(inode))
> > > > ret = xfs_file_dio_read(iocb, to);
> > > > -   else
> > > > +   else {
> > > > +   /*
> > > > +* In case fs-verity is enabled, we also fallback to the
> > > > +* buffered read from the direct read path. Therefore,
> > > > +* IOCB_DIRECT is set and need to be cleared
> > > > +*/
> > > > +   iocb->ki_flags &= ~IOCB_DIRECT;
> > > > ret = xfs_file_buffered_read(iocb, to);
> > > 
> > > XFS doesn't usually allow directio fallback to the pagecache. Why
> > > would fsverity be any different?
> > 
> > Didn't know that, this is what happens on ext4 so I did the same.
> > Then it probably make sense to just error on DIRECT on verity
> > sealed file.
> 
> Thinking about this a little more -- I suppose we shouldn't just go
> breaking directio reads from a verity file if we can help it.  Is there
> a way to ask fsverity to perform its validation against some arbitrary
> memory buffer that happens to be fs-block aligned?

The memory buffer doesn't even need to be fs-block aligned - it just
needs to be a pointer to memory the kernel can read...

We also need fsverity to be able to handle being passed mapped
kernel memory rather than pages/folios for the merkle tree
interfaces. That way we can just pass it the mapped buffer memory
straight from the xfs-buf and we don't have to do the whacky "copy
from xattr xfs_bufs into pages so fsverity can take temporary
reference counts on what it thinks are page cache pages" as it walks
the merkle tree.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 16/23] xfs: add inode on-disk VERITY flag

2023-04-04 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 04, 2023 at 03:41:23PM -0700, Eric Biggers wrote:
> Hi Andrey,
> 
> On Tue, Apr 04, 2023 at 04:53:12PM +0200, Andrey Albershteyn wrote:
> > Add flag to mark inodes which have fs-verity enabled on them (i.e.
> > descriptor exist and tree is built).
> > 
> > Signed-off-by: Andrey Albershteyn 
> > ---
> >  fs/ioctl.c | 4 
> >  fs/xfs/libxfs/xfs_format.h | 4 +++-
> >  fs/xfs/xfs_inode.c | 2 ++
> >  fs/xfs/xfs_iops.c  | 2 ++
> >  include/uapi/linux/fs.h| 1 +
> >  5 files changed, 12 insertions(+), 1 deletion(-)
> [...]
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index b7b56871029c..5172a2eb902c 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -140,6 +140,7 @@ struct fsxattr {
> >  #define FS_XFLAG_FILESTREAM0x4000  /* use filestream 
> > allocator */
> >  #define FS_XFLAG_DAX   0x8000  /* use DAX for IO */
> >  #define FS_XFLAG_COWEXTSIZE0x0001  /* CoW extent size 
> > allocator hint */
> > +#define FS_XFLAG_VERITY0x0002  /* fs-verity sealed 
> > inode */
> >  #define FS_XFLAG_HASATTR   0x8000  /* no DIFLAG for this   */
> >  
> 
> I don't think "xfs: add inode on-disk VERITY flag" is an accurate description 
> of
> a patch that involves adding something to the UAPI.

Well it does that, but it also adds the UAPI for querying the
on-disk flag via the FS_IOC_FSGETXATTR interface as well.  It
probably should be split up into two patches.

> Should the other filesystems support this new flag too?

I think they should get it automatically now that it has been
defined for FS_IOC_FSGETXATTR and added to the generic fileattr flag
fill functions in fs/ioctl.c.

> I'd also like all ways of getting the verity flag to continue to be mentioned 
> in
> Documentation/filesystems/fsverity.rst.  The existing methods (FS_IOC_GETFLAGS
> and statx) are already mentioned there.

*nod*

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 06/23] fsverity: add drop_page() callout

2023-04-04 Thread Dave Chinner via Linux-f2fs-devel
On Tue, Apr 04, 2023 at 04:53:02PM +0200, Andrey Albershteyn wrote:
> Allow filesystem to make additional processing on verified pages
> instead of just dropping a reference. This will be used by XFS for
> internal buffer cache manipulation in further patches. The btrfs,
> ext4, and f2fs just drop the reference.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/btrfs/verity.c | 12 
>  fs/ext4/verity.c  |  6 ++
>  fs/f2fs/verity.c  |  6 ++
>  fs/verity/read_metadata.c |  4 ++--
>  fs/verity/verify.c|  6 +++---
>  include/linux/fsverity.h  | 10 ++
>  6 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/verity.c b/fs/btrfs/verity.c
> index c5ff16f9e9fa..4c2c09204bb4 100644
> --- a/fs/btrfs/verity.c
> +++ b/fs/btrfs/verity.c
> @@ -804,10 +804,22 @@ static int btrfs_write_merkle_tree_block(struct inode 
> *inode, const void *buf,
>  pos, buf, size);
>  }
>  
> +/*
> + * fsverity op that releases the reference obtained by 
> ->read_merkle_tree_page()
> + *
> + * @page:  reference to the page which can be released
> + *
> + */
> +static void btrfs_drop_page(struct page *page)
> +{
> + put_page(page);
> +}
> +
>  const struct fsverity_operations btrfs_verityops = {
>   .begin_enable_verity = btrfs_begin_enable_verity,
>   .end_enable_verity   = btrfs_end_enable_verity,
>   .get_verity_descriptor   = btrfs_get_verity_descriptor,
>   .read_merkle_tree_page   = btrfs_read_merkle_tree_page,
>   .write_merkle_tree_block = btrfs_write_merkle_tree_block,
> + .drop_page   = &btrfs_drop_page,
>  };

Ok, that's a generic put_page() call.


> diff --git a/fs/verity/verify.c b/fs/verity/verify.c
> index f50e3b5b52c9..c2fc4c86af34 100644
> --- a/fs/verity/verify.c
> +++ b/fs/verity/verify.c
> @@ -210,7 +210,7 @@ verify_data_block(struct inode *inode, struct 
> fsverity_info *vi,
>   if (is_hash_block_verified(vi, hpage, hblock_idx)) {
>   memcpy_from_page(_want_hash, hpage, hoffset, hsize);
>   want_hash = _want_hash;
> - put_page(hpage);
> + inode->i_sb->s_vop->drop_page(hpage);
>   goto descend;

fsverity_drop_page(hpage);

static inline void
fsverity_drop_page(struct inode *inode, struct page *page)
{
if (inode->i_sb->s_vop->drop_page)
inode->i_sb->s_vop->drop_page(page);
else
put_page(page);
}

And then you don't need to add the functions to each of the
filesystems nor make an indirect call just to run put_page().

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 0/4] fsverity cleanups

2022-12-14 Thread Dave Chinner
On Wed, Dec 14, 2022 at 02:43:00PM -0800, Eric Biggers wrote:
> This series implements a few cleanups that have been suggested 
> in the thread "[RFC PATCH 00/11] fs-verity support for XFS"
> (https://lore.kernel.org/linux-fsdevel/20221213172935.680971-1-aalbe...@redhat.com/T/#u).
> 
> This applies to mainline (commit 93761c93e9da).
> 
> Eric Biggers (4):
>   fsverity: optimize fsverity_file_open() on non-verity files
>   fsverity: optimize fsverity_prepare_setattr() on non-verity files
>   fsverity: optimize fsverity_cleanup_inode() on non-verity files
>   fsverity: pass pos and size to ->write_merkle_tree_block
> 
>  fs/btrfs/verity.c| 19 ---
>  fs/ext4/verity.c |  6 ++--
>  fs/f2fs/verity.c |  6 ++--
>  fs/verity/enable.c   |  4 +--
>  fs/verity/open.c | 46 -
>  include/linux/fsverity.h | 74 +---
>  6 files changed, 84 insertions(+), 71 deletions(-)

The whole series looks fairly sane to me.

Acked-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-03 Thread Dave Chinner
On Thu, Nov 03, 2022 at 07:45:01PM -0700, Darrick J. Wong wrote:
> On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote:
> > On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote:
> > > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> > > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > > > > - BUG_ON(PageWriteback(page));
> > > > > - if (!clear_page_dirty_for_io(page))
> > > > > + BUG_ON(folio_test_writeback(folio));
> > > > > + if (!folio_clear_dirty_for_io(folio))
> > > > >   goto continue_unlock;
> > > > >  
> > > > >   trace_wbc_writepage(wbc, 
> > > > > inode_to_bdi(mapping->host));
> > > > > - error = (*writepage)(page, wbc, data);
> > > > > + error = writepage(&folio->page, wbc, data);
> > > > 
> > > > Yet, IIUC, this treats all folios as if they are single page folios.
> > > > i.e. it passes the head page of a multi-page folio to a callback
> > > > that will treat it as a single PAGE_SIZE page, because that's all
> > > > the writepage callbacks are currently expected to be passed...
> > > > 
> > > > So won't this break writeback of dirty multipage folios?
> > > 
> > > Yes, it appears it would. But it wouldn't because its already 'broken'.
> > 
> > It is? Then why isn't XFS broken on existing kernels? Oh, we don't
> > know because it hasn't been tested?
> > 
> > Seriously - if this really is broken, and this patchset further
> > propagating the brokeness, then somebody needs to explain to me why
> > this is not corrupting data in XFS.
> 
> It looks like iomap_do_writepage finds the folio size correctly
> 
>   end_pos = folio_pos(folio) + folio_size(folio);
> 
> and iomap_writpage_map will map out the correct number of blocks
> 
>   unsigned nblocks = i_blocks_per_folio(inode, folio);
> 
>   for (i = 0; i < nblocks && pos < end_pos; i++, pos += len) {
> 
> right?

Yup, that's how I read it, too.

But my recent experience with folios involved being repeatedly
burnt by edge case corruptions due to multipage folios showing up
when and where I least expected them.

Hence doing a 1:1 conversion of page based code to folio based code
and just assuming large folios will work without any testing seems
akin to playing russian roulette with loose cannons that have been
doused with napalm and then set on fire by an air-dropped barrel
bomb...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-03 Thread Dave Chinner
On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote:
> On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > > Converted function to use folios throughout. This is in preparation for
> > > the removal of find_get_pages_range_tag().
> > > 
> > > Signed-off-by: Vishal Moola (Oracle) 
> > > ---
> > >  mm/page-writeback.c | 44 +++-
> > >  1 file changed, 23 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index 032a7bf8d259..087165357a5a 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space 
> > > *mapping,
> > >   int ret = 0;
> > >   int done = 0;
> > >   int error;
> > > - struct pagevec pvec;
> > > - int nr_pages;
> > > + struct folio_batch fbatch;
> > > + int nr_folios;
> > >   pgoff_t index;
> > >   pgoff_t end;/* Inclusive */
> > >   pgoff_t done_index;
> > >   int range_whole = 0;
> > >   xa_mark_t tag;
> > >  
> > > - pagevec_init(&pvec);
> > > + folio_batch_init(&fbatch);
> > >   if (wbc->range_cyclic) {
> > >   index = mapping->writeback_index; /* prev offset */
> > >   end = -1;
> > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space 
> > > *mapping,
> > >   while (!done && (index <= end)) {
> > >   int i;
> > >  
> > > - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> > > - tag);
> > > - if (nr_pages == 0)
> > > + nr_folios = filemap_get_folios_tag(mapping, &index, end,
> > > + tag, &fbatch);
> > 
> > This can find and return dirty multi-page folios if the filesystem
> > enables them in the mapping at instantiation time, right?
> 
> Yup, it will.
> 
> > > +
> > > + if (nr_folios == 0)
> > >   break;
> > >  
> > > - for (i = 0; i < nr_pages; i++) {
> > > - struct page *page = pvec.pages[i];
> > > + for (i = 0; i < nr_folios; i++) {
> > > + struct folio *folio = fbatch.folios[i];
> > >  
> > > - done_index = page->index;
> > > + done_index = folio->index;
> > >  
> > > - lock_page(page);
> > > + folio_lock(folio);
> > >  
> > >   /*
> > >* Page truncated or invalidated. We can freely skip it
> > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space 
> > > *mapping,
> > >* even if there is now a new, dirty page at the same
> > >* pagecache address.
> > >*/
> > > - if (unlikely(page->mapping != mapping)) {
> > > + if (unlikely(folio->mapping != mapping)) {
> > >  continue_unlock:
> > > - unlock_page(page);
> > > + folio_unlock(folio);
> > >   continue;
> > >   }
> > >  
> > > - if (!PageDirty(page)) {
> > > + if (!folio_test_dirty(folio)) {
> > >   /* someone wrote it for us */
> > >   goto continue_unlock;
> > >   }
> > >  
> > > - if (PageWriteback(page)) {
> > > + if (folio_test_writeback(folio)) {
> > >   if (wbc->sync_mode != WB_SYNC_NONE)
> > > - wait_on_page_writeback(page);
> > > + folio_wait_writeback(folio);
> > >   else
> > >   goto continue_unlock;
> > >   }
> > >  
> > > - BUG_ON(PageWriteback(page));
> > > - if (!clear_page_dirty_for_io(page))
> > > + BUG_ON(folio_test_writeback(folio));
> > > + if (!folio_clear_dirty_for_io(folio))
> > >

Re: [f2fs-dev] [PATCH v4 00/23] Convert to filemap_get_folios_tag()

2022-11-03 Thread Dave Chinner
On Thu, Nov 03, 2022 at 09:38:48AM -0700, Vishal Moola wrote:
> On Thu, Nov 3, 2022 at 12:08 AM Dave Chinner  wrote:
> >
> > On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote:
> > > This patch series replaces find_get_pages_range_tag() with
> > > filemap_get_folios_tag(). This also allows the removal of multiple
> > > calls to compound_head() throughout.
> > > It also makes a good chunk of the straightforward conversions to folios,
> > > and takes the opportunity to introduce a function that grabs a folio
> > > from the pagecache.
> > >
> > > F2fs and Ceph have quite a lot of work to be done regarding folios, so
> > > for now those patches only have the changes necessary for the removal of
> > > find_get_pages_range_tag(), and only support folios of size 1 (which is
> > > all they use right now anyways).
> > >
> > > I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may 
> > > be
> > > beneficial. The page-writeback and filemap changes implicitly work. 
> > > Testing
> > > and review of the other changes (afs, ceph, cifs, gfs2) would be 
> > > appreciated.
> >
> > Same question as last time: have you tested this with multipage
> > folios enabled? If you haven't tested XFS, then I'm guessing the
> > answer is no, and you haven't fixed the bug I pointed out in
> > the write_cache_pages() implementation
> >
> 
> I haven't tested the series with multipage folios or XFS.
> 
> I don't seem to have gotten your earlier comments, and I
> can't seem to find them on the mailing lists. Could you
> please send them again so I can take a look?

They are in the lore -fsdevel archive - no idea why you couldn't
find them

https://lore.kernel.org/linux-fsdevel/20221018210152.gh2703...@dread.disaster.area/
https://lore.kernel.org/linux-fsdevel/20221018214544.gi2703...@dread.disaster.area/

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 00/23] Convert to filemap_get_folios_tag()

2022-11-03 Thread Dave Chinner
On Wed, Nov 02, 2022 at 09:10:08AM -0700, Vishal Moola (Oracle) wrote:
> This patch series replaces find_get_pages_range_tag() with
> filemap_get_folios_tag(). This also allows the removal of multiple
> calls to compound_head() throughout.
> It also makes a good chunk of the straightforward conversions to folios,
> and takes the opportunity to introduce a function that grabs a folio
> from the pagecache.
> 
> F2fs and Ceph have quite a lot of work to be done regarding folios, so
> for now those patches only have the changes necessary for the removal of
> find_get_pages_range_tag(), and only support folios of size 1 (which is
> all they use right now anyways).
> 
> I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be
> beneficial. The page-writeback and filemap changes implicitly work. Testing
> and review of the other changes (afs, ceph, cifs, gfs2) would be appreciated.

Same question as last time: have you tested this with multipage
folios enabled? If you haven't tested XFS, then I'm guessing the
answer is no, and you haven't fixed the bug I pointed out in
the write_cache_pages() implementation

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 00/23] Convert to filemap_get_folios_tag()

2022-10-18 Thread Dave Chinner
On Thu, Sep 01, 2022 at 03:01:15PM -0700, Vishal Moola (Oracle) wrote:
> This patch series replaces find_get_pages_range_tag() with
> filemap_get_folios_tag(). This also allows the removal of multiple
> calls to compound_head() throughout.
> It also makes a good chunk of the straightforward conversions to folios,
> and takes the opportunity to introduce a function that grabs a folio
> from the pagecache.
> 
> F2fs and Ceph have quite alot of work to be done regarding folios, so
> for now those patches only have the changes necessary for the removal of
> find_get_pages_range_tag(), and only support folios of size 1 (which is
> all they use right now anyways).
> 
> I've run xfstests on btrfs, ext4, f2fs, and nilfs2, but more testing may be
> beneficial.

Well, that answers my question about how filesystems that enable
multi-page folios were tested: they weren't. 

I'd suggest that anyone working on further extending the
filemap/folio infrastructure really needs to be testing XFS as a
first priority, and then other filesystems as a secondary concern.

That's because XFS (via the fs/iomap infrastructure) is one of only
3 filesystems in the kernel (AFS and tmpfs are the others) that
interact with the page cache and page cache "pages" solely via folio
interfaces. As such they are able to support multi-page folios in
the page cache. All of the tested filesystems still use the fixed
PAGE_SIZE page interfaces to interact with the page cache, so they
don't actually exercise interactions with multi-page folios at all.

Hence if you are converting generic infrastructure that looks up
pages in the page cache to look up folios in the page cache, the
code that processes the returned folios also needs to be updated and
validated to ensure that it correctly handles multi-page folios. And
the only way you can do that fully at this point in time is via
testing XFS or AFS...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-10-18 Thread Dave Chinner
On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> Converted function to use folios throughout. This is in preparation for
> the removal of find_get_pages_range_tag().
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  mm/page-writeback.c | 44 +++-
>  1 file changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 032a7bf8d259..087165357a5a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space *mapping,
>   int ret = 0;
>   int done = 0;
>   int error;
> - struct pagevec pvec;
> - int nr_pages;
> + struct folio_batch fbatch;
> + int nr_folios;
>   pgoff_t index;
>   pgoff_t end;/* Inclusive */
>   pgoff_t done_index;
>   int range_whole = 0;
>   xa_mark_t tag;
>  
> - pagevec_init(&pvec);
> + folio_batch_init(&fbatch);
>   if (wbc->range_cyclic) {
>   index = mapping->writeback_index; /* prev offset */
>   end = -1;
> @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping,
>   while (!done && (index <= end)) {
>   int i;
>  
> - nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> - tag);
> - if (nr_pages == 0)
> + nr_folios = filemap_get_folios_tag(mapping, &index, end,
> + tag, &fbatch);

This can find and return dirty multi-page folios if the filesystem
enables them in the mapping at instantiation time, right?

> +
> + if (nr_folios == 0)
>   break;
>  
> - for (i = 0; i < nr_pages; i++) {
> - struct page *page = pvec.pages[i];
> + for (i = 0; i < nr_folios; i++) {
> + struct folio *folio = fbatch.folios[i];
>  
> - done_index = page->index;
> + done_index = folio->index;
>  
> - lock_page(page);
> + folio_lock(folio);
>  
>   /*
>* Page truncated or invalidated. We can freely skip it
> @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping,
>* even if there is now a new, dirty page at the same
>* pagecache address.
>*/
> - if (unlikely(page->mapping != mapping)) {
> + if (unlikely(folio->mapping != mapping)) {
>  continue_unlock:
> - unlock_page(page);
> + folio_unlock(folio);
>   continue;
>   }
>  
> - if (!PageDirty(page)) {
> + if (!folio_test_dirty(folio)) {
>   /* someone wrote it for us */
>   goto continue_unlock;
>   }
>  
> - if (PageWriteback(page)) {
> + if (folio_test_writeback(folio)) {
>   if (wbc->sync_mode != WB_SYNC_NONE)
> - wait_on_page_writeback(page);
> + folio_wait_writeback(folio);
>   else
>   goto continue_unlock;
>   }
>  
> - BUG_ON(PageWriteback(page));
> - if (!clear_page_dirty_for_io(page))
> + BUG_ON(folio_test_writeback(folio));
> + if (!folio_clear_dirty_for_io(folio))
>   goto continue_unlock;
>  
>   trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> - error = (*writepage)(page, wbc, data);
> + error = writepage(&folio->page, wbc, data);

Yet, IIUC, this treats all folios as if they are single page folios.
i.e. it passes the head page of a multi-page folio to a callback
that will treat it as a single PAGE_SIZE page, because that's all
the writepage callbacks are currently expected to be passed...

So won't this break writeback of dirty multipage folios?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 6/9] f2fs: don't allow DIO reads but not DIO writes

2022-08-16 Thread Dave Chinner
On Mon, Aug 15, 2022 at 05:55:45PM -0700, Eric Biggers wrote:
> On Sat, Jul 30, 2022 at 08:08:26PM -0700, Jaegeuk Kim wrote:
> > On 07/25, Eric Biggers wrote:
> > > On Sat, Jul 23, 2022 at 07:01:59PM -0700, Jaegeuk Kim wrote:
> > > > On 07/22, Eric Biggers wrote:
> > > > > From: Eric Biggers 
> > > > > 
> > > > > Currently, if an f2fs filesystem is mounted with the mode=lfs and
> > > > > io_bits mount options, DIO reads are allowed but DIO writes are not.
> > > > > Allowing DIO reads but not DIO writes is an unusual restriction, which
> > > > > is likely to be surprising to applications, namely any application 
> > > > > that
> > > > > both reads and writes from a file (using O_DIRECT).  This behavior is
> > > > > also incompatible with the proposed STATX_DIOALIGN extension to statx.
> > > > > Given this, let's drop the support for DIO reads in this 
> > > > > configuration.
> > > > 
> > > > IIRC, we allowed DIO reads since applications complained a lower 
> > > > performance.
> > > > So, I'm afraid this change will make another confusion to users. Could
> > > > you please apply the new bahavior only for STATX_DIOALIGN?
> > > > 
> > > 
> > > Well, the issue is that the proposed STATX_DIOALIGN fields cannot 
> > > represent this
> > > weird case where DIO reads are allowed but not DIO writes.  So the 
> > > question is
> > > whether this case actually matters, in which case we should make 
> > > STATX_DIOALIGN
> > > distinguish between DIO reads and DIO writes, or whether it's some odd 
> > > edge case
> > > that doesn't really matter, in which case we could just fix it or make
> > > STATX_DIOALIGN report that DIO is unsupported.  I was hoping that you had 
> > > some
> > > insight here.  What sort of applications want DIO reads but not DIO 
> > > writes?
> > > Is this common at all?
> > 
> > I think there's no specific application to use the LFS mode at this
> > moment, but I'd like to allow DIO read for zoned device which will be
> > used for Android devices.
> > 
> 
> So if the zoned device feature becomes widely adopted, then STATX_DIOALIGN 
> will
> be useless on all Android devices?  That sounds undesirable.  Are you sure 
> that
> supporting DIO reads but not DIO writes actually works?  Does it not cause
> problems for existing applications?

What purpose does DIO in only one direction actually serve? All it
means is that we're forcibly mixing buffered and direct IO to the
same file and that simply never ends well from a data coherency POV.

Hence I'd suggest that mixing DIO reads and buffered writes like
this ends up exposing uses to the worst of both worlds - all of the
problems with none of the benefits...

> What we need to do is make a decision about whether this means we should build
> in a stx_dio_direction field (indicating no support / readonly support /
> writeonly support / readwrite support) into the API from the beginning.  If we
> don't do that, then I don't think we could simply add such a field later, as 
> the
> statx_dio_*_align fields will have already been assigned their meaning.  I 
> think
> we'd instead have to "duplicate" the API, with STATX_DIOROALIGN and
> statx_dio_ro_*_align fields.  That seems uglier than building a directional
> indicator into the API from the beginning.  On the other hand, requiring all
> programs to check stx_dio_direction would add complexity to using the API.
> 
> Any thoughts on this?

Decide whether partial, single direction DIO serves a useful purpose
before trying to work out what is needed in the API to indicate that
this sort of crazy will be supported

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] f2fs: add sysfs entry to avoid FUA

2022-05-27 Thread Dave Chinner
On Fri, May 27, 2022 at 05:26:32PM -0700, Jaegeuk Kim wrote:
> On 05/28, Dave Chinner wrote:
> > On Fri, May 27, 2022 at 09:33:55PM +, Eric Biggers wrote:
> > > [+Cc linux-block for FUA, and linux-xfs for iomap]
> > 
> > linux-fsdevel should really be used for iomap stuff...
> > 
> > > 
> > > On Fri, May 27, 2022 at 01:59:55PM -0700, Jaegeuk Kim wrote:
> > > > Some UFS storage gives slower performance on FUA than write+cache_flush.
> > > > Let's give a way to manage it.
> > > > 
> > > > Signed-off-by: Jaegeuk Kim 
> > > 
> > > Should the driver even be saying that it has FUA support in this case?  
> > > If the
> > > driver didn't claim FUA support, that would also solve this problem.
> > 
> > Agreed, this is a hardware problem that need to addressed with a
> > driver quirk to stop it advertising FUA support. The high level
> > fs/iomap code should always issue FUA writes where possible and
> > the lower layers tell the block layer whether to issue the FUA as
> > a FUA or write+cache flush pair.
> 
> I was thinking to turn off FUA in driver side quickly tho, one concern
> was the bandwidth vs. latency. What if the device can support FUA having
> short latency while giving low bandwidth?

Seriously, how is a user supposed to know this sort of thing about
the hardware they are using? They don't, and to expect them to not
only know about the existing of a weird sysfs knob, let alone how it
applies to their hardware and their workload is totally
unreasonable.

If the hardware has non-deterministic FUA write performance, or
requires very careful switch over between cache flushes and FUA to
get the most out of the hardware, then that's not something we can
tune or optimise for - that's just broken hardware and the drive
should quirk the brokeness away so nobody has to care about it. Tell
the hardware manufacturer to fix their hardware, don't try to hack
around it in software and then expect the user to know how to tune
for that broken hardware.

> In that case, we still have
> a room to utilize FUA for small-sized  writes such as filesystem metadata
> writes, but avoid DIO w/ FUA for sequential write stream.

Strawman.

We don't use FUA for normal DIO writes - they only get used for
O_DSYNC writes, in which case we either use FUA if the device
supports it, or we do a normal write followed by a cache flush.
If there are metadata updates that the O_DSYNC needs to also flush,
we don't use FUA by let the fileystem issue a cache flush in the
most optimal possible after the write completes.

Either way, using O_DSYNC DIO writes for streaming, sequential data
is a really poor choice for an application to make. Normal DIO
writes followed by fdatasync() to flush the metadata and caches once
will be much faster and far more efficient than a metadata and cache
flush after every single data write, FUA or not.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] f2fs: add sysfs entry to avoid FUA

2022-05-27 Thread Dave Chinner
On Fri, May 27, 2022 at 09:33:55PM +, Eric Biggers wrote:
> [+Cc linux-block for FUA, and linux-xfs for iomap]

linux-fsdevel should really be used for iomap stuff...

> 
> On Fri, May 27, 2022 at 01:59:55PM -0700, Jaegeuk Kim wrote:
> > Some UFS storage gives slower performance on FUA than write+cache_flush.
> > Let's give a way to manage it.
> > 
> > Signed-off-by: Jaegeuk Kim 
> 
> Should the driver even be saying that it has FUA support in this case?  If the
> driver didn't claim FUA support, that would also solve this problem.

Agreed, this is a hardware problem that need to addressed with a
driver quirk to stop it advertising FUA support. The high level
fs/iomap code should always issue FUA writes where possible and
the lower layers tell the block layer whether to issue the FUA as
a FUA or write+cache flush pair.

And, quite frankly, exposing this sort of "hardware needs help" knob
as a sysfs variable is exactly the sort of thing we should never do.

Users have no idea how to tune stuff like this correctly (even if
they knew it existed!), yet we know exactly what hardware has this
problem and the kernel already has mechanisms that would allow it to
just Do The Right Thing. IOWs, we can fix this without the user even
having to know that they have garbage hardware that needs special
help

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [RFC PATCH v2 1/7] statx: add I/O alignment information

2022-05-19 Thread Dave Chinner
On Thu, May 19, 2022 at 04:06:05PM -0700, Darrick J. Wong wrote:
> On Wed, May 18, 2022 at 04:50:05PM -0700, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > Traditionally, the conditions for when DIO (direct I/O) is supported
> > were fairly simple: filesystems either supported DIO aligned to the
> > block device's logical block size, or didn't support DIO at all.
> > 
> > However, due to filesystem features that have been added over time (e.g,
> > data journalling, inline data, encryption, verity, compression,
> > checkpoint disabling, log-structured mode), the conditions for when DIO
> > is allowed on a file have gotten increasingly complex.  Whether a
> > particular file supports DIO, and with what alignment, can depend on
> > various file attributes and filesystem mount options, as well as which
> > block device(s) the file's data is located on.
> > 
> > XFS has an ioctl XFS_IOC_DIOINFO which exposes this information to
> > applications.  However, as discussed
> > (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebigg...@kernel.org/T/#u),
> > this ioctl is rarely used and not known to be used outside of
> > XFS-specific code.  It also was never intended to indicate when a file
> > doesn't support DIO at all, and it only exposes the minimum I/O
> > alignment, not the optimal I/O alignment which has been requested too.
> > 
> > Therefore, let's expose this information via statx().  Add the
> > STATX_IOALIGN flag and three fields associated with it:
> > 
> > * stx_mem_align_dio: the alignment (in bytes) required for user memory
> >   buffers for DIO, or 0 if DIO is not supported on the file.
> > 
> > * stx_offset_align_dio: the alignment (in bytes) required for file
> >   offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
> >   on the file.  This will only be nonzero if stx_mem_align_dio is
> >   nonzero, and vice versa.
> > 
> > * stx_offset_align_optimal: the alignment (in bytes) suggested for file
> >   offsets and I/O segment lengths to get optimal performance.  This
> >   applies to both DIO and buffered I/O.  It differs from stx_blocksize
> >   in that stx_offset_align_optimal will contain the real optimum I/O
> >   size, which may be a large value.  In contrast, for compatibility
> >   reasons stx_blocksize is the minimum size needed to avoid page cache
> >   read/write/modify cycles, which may be much smaller than the optimum
> >   I/O size.  For more details about the motivation for this field, see
> >   https://lore.kernel.org/r/20220210040304.gm59...@dread.disaster.area
> 
> Hmm.  So I guess this is supposed to be the filesystem's best guess at
> the IO size that will minimize RMW cycles in the entire stack?  i.e. if
> the user does not want RMW of pagecache pages, of file allocation units
> (if COW is enabled), of RAID stripes, or in the storage itself, then it
> should ensure that all IOs are aligned to this value?
> 
> I guess that means for XFS it's effectively max(pagesize, i_blocksize,
> bdev io_opt, sb_width, and (pretend XFS can reflink the realtime volume)
> the rt extent size)?  I didn't see a manpage update for statx(2) but
> that's mostly what I'm interested in. :)

Yup, xfs_stat_blksize() should give a good idea of what we should
do. It will end up being pretty much that, except without the need
to a mount option to turn on the sunit/swidth return, and always
taking into consideration extent size hints rather than just doing
that for RT inodes...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 1/8] fs: move sgid strip operation from inode_init_owner into inode_sgid_strip

2022-04-20 Thread Dave Chinner
On Wed, Apr 20, 2022 at 01:27:39AM +, xuyang2018...@fujitsu.com wrote:
> on 2022/4/19 22:05, Christian Brauner wrote:
> > On Tue, Apr 19, 2022 at 07:47:07PM +0800, Yang Xu wrote:
> >> This has no functional change. Just create and export inode_sgid_strip api 
> >> for
> >> the subsequent patch. This function is used to strip S_ISGID mode when init
> >> a new inode.
> >>
> >> Acked-by: Christian Brauner (Microsoft)
> >> Signed-off-by: Yang Xu
> >> ---
> >>   fs/inode.c | 22 ++
> >>   include/linux/fs.h |  3 ++-
> >>   2 files changed, 20 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/inode.c b/fs/inode.c
> >> index 9d9b422504d1..3215e61a0021 100644
> >> --- a/fs/inode.c
> >> +++ b/fs/inode.c
> >> @@ -2246,10 +2246,8 @@ void inode_init_owner(struct user_namespace 
> >> *mnt_userns, struct inode *inode,
> >>/* Directories are special, and always inherit S_ISGID */
> >>if (S_ISDIR(mode))
> >>mode |= S_ISGID;
> >> -  else if ((mode&  (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)&&
> >> -   !in_group_p(i_gid_into_mnt(mnt_userns, dir))&&
> >> -   !capable_wrt_inode_uidgid(mnt_userns, dir, CAP_FSETID))
> >> -  mode&= ~S_ISGID;
> >> +  else
> >> +  inode_sgid_strip(mnt_userns, dir,&mode);
> >>} else
> >>inode_fsgid_set(inode, mnt_userns);
> >>inode->i_mode = mode;
> >> @@ -2405,3 +2403,19 @@ struct timespec64 current_time(struct inode *inode)
> >>return timestamp_truncate(now, inode);
> >>   }
> >>   EXPORT_SYMBOL(current_time);
> >> +
> >> +void inode_sgid_strip(struct user_namespace *mnt_userns,
> >> +const struct inode *dir, umode_t *mode)
> >> +{
> >
> > I think with Willy agreeing in an earlier version with me and you
> > needing to resend anyway I'd say have this return umode_t instead of
> > passing a pointer.
> 
> IMO, I am fine with your and Willy way. But I need a reason otherwise
> I can't convince myself why not use mode pointer directly.

You should listen to experienced developers like Willy and Christian
when they say "follow existing coding conventions".  Indeed, Darrick
has also mentioned he'd prefer it to return the new mode, and I'd
also prefer that it returns the new mode.

> I have asked you and Willy before why return umode_t value is better, 
> why not modify mode pointer directly? Since we have use mode as 
> argument, why not modify mode pointer directly in function?

If the function had mulitple return status (e.g. an error or a mode)
the convention is to pass the mode output variable by reference and
return the error status. But there is only one return value from
this function - the mode - and hence it should be returned in the
return value, not passed by reference.

Passing by reference unnecessarily makes the code more complex and
less mainatainable.  Code that returns a single value is easy to
understand, is more flexible in the way callers can use it and it's
simpler to maintain.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] generic/066: attr1 is still there after log replay on f2fs

2022-03-09 Thread Dave Chinner
On Wed, Mar 09, 2022 at 03:34:27PM +0800, Chao Yu wrote:
> On 2022/3/9 14:22, Dave Chinner wrote:
> > On Wed, Mar 09, 2022 at 12:31:17PM +0800, Chao Yu wrote:
> > > On 2022/2/28 11:57, Sun Ke via Linux-f2fs-devel wrote:
> > > > The test fail on f2fs:
> > > >xattr names and values after second fsync log replay:
> > > ># file: SCRATCH_MNT/foobar
> > > >   +user.attr1="val1"
> > > >user.attr3="val3"
> > > > 
> > > > attr1 is still there after log replay.
> > > > I guess it is f2fs's special feature to improve the performance.
> > > > 
> > > > Signed-off-by: Sun Ke 
> > > > ---
> > > > 
> > > > Is it a BUG on f2fs?
> > > 
> > > I don't think so, it fails due to f2fs doesn't follow recovery rule which
> > > btrfs/ext4/xfs does, but it doesn't mean f2fs has break posix semantics of
> > > fsync().
> > 
> > I disagree.  A failure in this test is indicative of non-conformance
> > with the Linux definition of fsync() behaviour.
> > 
> > You are right in that it does not break POSIX fsync semantics, but
> > POSIX allows "do nothing" as a valid implementation. However,
> > because of this loophole, the POSIX definition is complete garbage
> > and we do not use it.
> > 
> > That behaviour that Linux filesytsems are supposed to implement is
> > defined in the Linux fsync() man page, and it goes way beyond what
> > POSIX requires:
> > 
> > $ man fsync
> > 
> > DESCRIPTION
> >  fsync() transfers ("flushes") all modified in-core data of
> >  (i.e., modified buffer cache pages for) the file referred to by
> >  the file descriptor fd to the disk device (or other permanent
> >  storage device) so that all changed information can be retrieved
> >  even if the  system  crashes  or  is rebooted.  This includes
> >  writing through or flushing a disk cache if present.  The call
> >  blocks until the device reports that the transfer has
> >  completed.
> > 
> >  As well as flushing the file data, fsync() also flushes the
> >  metadata information associated with the file (see inode(7)).
> > 
> > 
> > IOWs, fsync() on Linux is supposed to persist all data and
> > metadata associated with the inode to stable storage such that it
> > can be retreived after a crash or reboot. "metadata information"
> > includes xattrs attached to the inode that is being fsync()d.
> 
> Quoted from g/066:
> 
> echo "hello world!" >> $SCRATCH_MNT/foobar
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
> $SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar
> ln $SCRATCH_MNT/foobar $SCRATCH_MNT/foobar_link
> touch $SCRATCH_MNT/qwerty
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/qwerty
> 
> IIUC, to match what Linux fsync() manual restricts, if we want to persist the
> xattr removal, we should call fsync() on $SCRATCH_MNT/foobar after
> "$SETFATTR_PROG -x user.attr1 $SCRATCH_MNT/foobar"? rather than calling 
> fsync()
> on unrelated $SCRATCH_MNT/qwerty.

It might look that way, but it's not that straight forward: there's
a carefully constructed object dependency chain in this test that
defines what the correct behaviour should be here.

What's the link count of $SCRATCH_MNT/foobar when
$SCRATCH_MNT/qwerty is present after recovery? Is it 1 or 2?  Does
$SCRATCH_MNT/foobar_link exist?  And if $SCRATCH_MNT/foobar_link
exists, and the link count is 2. The test doesn't even look at these
things, but if user.attr1 is not present, it means that foobar_link
and qwerty are present, $SCRATCH_MNT has a link count of 5 and
foobar has a link count of 2 because that's the dependency chain
that leads to the user.attr1 removal being recovered correctly.

So what does SCRATCH_MNT actually contain when f2fs fails this test?

These depedencies exist because we can't just randomly re-order
recovery of modifications to individual inodes and certain
operations create atomic change dependencies between inodes. It's
those atomic change dependencies that this test is actually
exercising.  i.e. the link count changes tie directory modifications
to inode modifications and this creates cross-object ordering
dependencies down the line that fsync then exposes. That's what the
second part of this test is actually exercising

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] generic/066: attr1 is still there after log replay on f2fs

2022-03-08 Thread Dave Chinner
On Wed, Mar 09, 2022 at 12:31:17PM +0800, Chao Yu wrote:
> On 2022/2/28 11:57, Sun Ke via Linux-f2fs-devel wrote:
> > The test fail on f2fs:
> >   xattr names and values after second fsync log replay:
> >   # file: SCRATCH_MNT/foobar
> >  +user.attr1="val1"
> >   user.attr3="val3"
> > 
> > attr1 is still there after log replay.
> > I guess it is f2fs's special feature to improve the performance.
> > 
> > Signed-off-by: Sun Ke 
> > ---
> > 
> > Is it a BUG on f2fs?
> 
> I don't think so, it fails due to f2fs doesn't follow recovery rule which
> btrfs/ext4/xfs does, but it doesn't mean f2fs has break posix semantics of
> fsync().

I disagree.  A failure in this test is indicative of non-conformance
with the Linux definition of fsync() behaviour.

You are right in that it does not break POSIX fsync semantics, but
POSIX allows "do nothing" as a valid implementation. However,
because of this loophole, the POSIX definition is complete garbage
and we do not use it.

That behaviour that Linux filesytsems are supposed to implement is
defined in the Linux fsync() man page, and it goes way beyond what
POSIX requires:

$ man fsync

DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of
(i.e., modified buffer cache pages for) the file referred to by
the file descriptor fd to the disk device (or other permanent
storage device) so that all changed information can be retrieved
even if the  system  crashes  or  is rebooted.  This includes
writing through or flushing a disk cache if present.  The call
blocks until the device reports that the transfer has
completed.

As well as flushing the file data, fsync() also flushes the
metadata information associated with the file (see inode(7)).


IOWs, fsync() on Linux is supposed to persist all data and
metadata associated with the inode to stable storage such that it
can be retreived after a crash or reboot. "metadata information"
includes xattrs attached to the inode that is being fsync()d.

*fdatasync()* does not require xattrs to be persisted unless
they are needed to retreive data, but that's not what g/066 is
exercising.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-02-09 Thread Dave Chinner
On Tue, Feb 08, 2022 at 05:10:03PM -0800, Eric Biggers wrote:
> On Mon, Jan 24, 2022 at 10:03:32AM +1100, Dave Chinner wrote:
> > > 
> > >   /* 0xa0 */
> > > 
> > >   /* File range alignment needed for best performance, in bytes. */
> > >   __u32   stx_dio_fpos_align_opt;
> > 
> > This is a common property of both DIO and buffered IO, so no need
> > for it to be dio-only property.
> > 
> > __u32   stx_offset_align_optimal;
> > 
> 
> Looking at this more closely: will stx_offset_align_optimal actually be 
> useful,
> given that st[x]_blksize already exists?

Yes, because

> From the stat(2) and statx(2) man pages:
> 
>   st_blksize
>   This field  gives  the  "preferred"  block  size  for  efficient
>   filesystem I/O.
> 
>   stx_blksize
>   The "preferred" block size for efficient filesystem I/O.  (Writ‐
>   ing  to  a file in smaller chunks may cause an inefficient read-
>   modify-rewrite.)

... historically speaking, this is intended to avoid RMW cycles for
sub-block and/or sub-PAGE_SIZE write() IOs. i.e. the practical
definition of st_blksize is the *minimum* IO size the needed to
avoid page cache RMW cycles.

However, XFS has a "-o largeio" mount option, that sets this value
to internal optimal filesytsem alignment values such as stripe unit
or even stripe width (-o largeio,swalloc). THis means it can be up
to 2GB (maybe larger?) in size.

THe problem with this is that many applications are not prepared to
see a value of, say, 16MB in st_blksize rather than 4096 bytes. An
example of such problems are applications sizing their IO buffers as
a multiple of st_blksize - we've had applications fail because they
try to use multi-GB sized IO buffers as a result of setting
st_blksize to the filesystem/storage idea of optimal IO size rather
than PAGE_SIZE.

Hence, we can't really change the value of st_blksize without
risking random breakage in userspace. hence the practical definition
of st_blksize is the *minimum* IO size that avoids RMW cycles for an
individual write() syscall, not the most efficient IO size.

> File offsets aren't explicitly mentioned, but I think it's implied they should
> be a multiple of st[x]_blksize, just like the I/O size.  Otherwise, the I/O
> would obviously require reading/writing partial blocks.

Of course it implies aligned file offsets - block aligned IO is
absolutely necessary for effcient filesystem IO. It has for pretty
much the entire of unix history...

> So, the proposed stx_offset_align_optimal field sounds like the same thing to
> me.  Is there anything I'm misunderstanding?
>
> Putting stx_offset_align_optimal behind the STATX_DIRECTIO flag would also be
> confusing if it would apply to both direct and buffered I/O.

So just name the flag STATX_IOALIGN so that it can cover generic,
buffered specific and DIO specific parameters in one hit. Simple,
yes?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-23 Thread Dave Chinner
On Thu, Jan 20, 2022 at 06:36:03PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 21, 2022 at 10:57:55AM +1100, Dave Chinner wrote:
> Sure.  How's this?  I couldn't think of a real case of directio
> requiring different alignments for pos and bytecount, so the only real
> addition here is the alignment requirements for best performance.
> 
> struct statx {
> ...
>   /* 0x90 */
>   __u64   stx_mnt_id;
> 
>   /* Memory buffer alignment required for directio, in bytes. */
>   __u32   stx_dio_mem_align;

__32stx_mem_align_dio;

(for consistency with suggestions below)

> 
>   /* File range alignment required for directio, in bytes. */
>   __u32   stx_dio_fpos_align_min;

"fpos" is not really a user term - "offset" is the userspace term for
file position, and it's much less of a random letter salad if it's
named that way. Also, we don't need "min" in the name; the
description of the field in the man page can give all the gory
details about it being the minimum required alignment.

__u32   stx_offset_align_dio;

> 
>   /* 0xa0 */
> 
>   /* File range alignment needed for best performance, in bytes. */
>   __u32   stx_dio_fpos_align_opt;

This is a common property of both DIO and buffered IO, so no need
for it to be dio-only property.

__u32   stx_offset_align_optimal;

> 
>   /* Maximum size of a directio request, in bytes. */
>   __u32   stx_dio_max_iosize;

Unnecessary, it will always be the syscall max IO size, because the
internal DIO code will slice and dice it down to the max sizes the
hardware supports.

> #define STATX_DIRECTIO0x00001000U /* Want/got directio geometry */
> 
> How about that?

Mostly seems reasonable at a first look.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-20 Thread Dave Chinner
On Thu, Jan 20, 2022 at 02:48:52PM -0800, Eric Biggers wrote:
> On Fri, Jan 21, 2022 at 09:04:14AM +1100, Dave Chinner wrote:
> > On Thu, Jan 20, 2022 at 01:00:27PM -0800, Darrick J. Wong wrote:
> > > On Thu, Jan 20, 2022 at 12:39:14PM -0800, Eric Biggers wrote:
> > > > On Thu, Jan 20, 2022 at 09:10:27AM -0800, Darrick J. Wong wrote:
> > > > > On Thu, Jan 20, 2022 at 12:30:23AM -0800, Christoph Hellwig wrote:
> > > > > > On Wed, Jan 19, 2022 at 11:12:10PM -0800, Eric Biggers wrote:
> > > > > > > 
> > > > > > > Given the above, as far as I know the only remaining objection to 
> > > > > > > this
> > > > > > > patchset would be that DIO constraints aren't sufficiently 
> > > > > > > discoverable
> > > > > > > by userspace.  Now, to put this in context, this is a 
> > > > > > > longstanding issue
> > > > > > > with all Linux filesystems, except XFS which has XFS_IOC_DIOINFO. 
> > > > > > >  It's
> > > > > > > not specific to this feature, and it doesn't actually seem to be 
> > > > > > > too
> > > > > > > important in practice; many other filesystem features place 
> > > > > > > constraints
> > > > > > > on DIO, and f2fs even *only* allows fully FS block size aligned 
> > > > > > > DIO.
> > > > > > > (And for better or worse, many systems using fscrypt already have
> > > > > > > out-of-tree patches that enable DIO support, and people don't 
> > > > > > > seem to
> > > > > > > have trouble with the FS block size alignment requirement.)
> > > > > > 
> > > > > > It might make sense to use this as an opportunity to implement
> > > > > > XFS_IOC_DIOINFO for ext4 and f2fs.
> > > > > 
> > > > > Hmm.  A potential problem with DIOINFO is that it doesn't explicitly
> > > > > list the /file/ position alignment requirement:
> > > > > 
> > > > > struct dioattr {
> > > > >   __u32   d_mem;  /* data buffer memory alignment 
> > > > > */
> > > > >   __u32   d_miniosz;  /* min xfer size
> > > > > */
> > > > >   __u32   d_maxiosz;  /* max xfer size
> > > > > */
> > > > > };
> > > > 
> > > > Well, the comment above struct dioattr says:
> > > > 
> > > > /*
> > > >  * Direct I/O attribute record used with XFS_IOC_DIOINFO
> > > >  * d_miniosz is the min xfer size, xfer size multiple and file 
> > > > seek offset
> > > >  * alignment.
> > > >  */
> > > > 
> > > > So d_miniosz serves that purpose already.
> > > > 
> > > > > 
> > > > > Since I /think/ fscrypt requires that directio writes be aligned to 
> > > > > file
> > > > > block size, right?
> > > > 
> > > > The file position must be a multiple of the filesystem block size, yes.
> > > > Likewise for the "minimum xfer size" and "xfer size multiple", and the 
> > > > "data
> > > > buffer memory alignment" for that matter.  So I think XFS_IOC_DIOINFO 
> > > > would be
> > > > good enough for the fscrypt direct I/O case.
> > > 
> > > Oh, ok then.  In that case, just hoist XFS_IOC_DIOINFO to the VFS and
> > > add a couple of implementations for ext4 and f2fs, and I think that'll
> > > be enough to get the fscrypt patchset moving again.
> > 
> > On the contrary, I'd much prefer to see this information added to
> > statx(). The file offset alignment info is a property of the current
> > file (e.g. XFS can have different per-file requirements depending on
> > whether the file data is hosted on the data or RT device, etc) and
> > so it's not a fixed property of the filesystem.
> > 
> > statx() was designed to be extended with per-file property
> > information, and we already have stuff like filesystem block size in
> > that syscall. Hence I would much prefer that we extend it with the
> > DIO properties we need to support rather than "create" a new VFS
> > ioctl to extract this information. We already have statx()

Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-20 Thread Dave Chinner
eph 16.2 (seastar database support?)
-diod contains a copy of fsstress
-e2fsprogs contains a copy of fsstress
-openmpi (under ifdef SGIMPI)
-partclone - actually, that has a complete copy of the xfsprogs
 libxfs/ iand include/ directory in it, so it's using
 the old libxfs_device_alignment() call that uses
 XFS_IOC_DIOINFOD, and only when builing the xfsclone
 binary.

Yup, I can count them on one 6 fingered hand, and their only use is
when XFS filesystems are specifically discovered. :)

Hence I think it would be much more useful to application developers
to include the IO alignment information in statx(), not to lift an
ioctl that is pretty much unused and unknown outside the core XFS
development environment

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] MM: introduce memalloc_retry_wait()

2021-11-16 Thread Dave Chinner
On Wed, Nov 17, 2021 at 03:28:10PM +1100, NeilBrown wrote:
> 
> Various places in the kernel - largely in filesystems - respond to a
> memory allocation failure by looping around and re-trying.
.
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index aca874d33fe6..f2f2a5b28808 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -214,6 +214,27 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { }
>  static inline void fs_reclaim_release(gfp_t gfp_mask) { }
>  #endif
>  
> +/* Any memory-allocation retry loop should use
> + * memalloc_retry_wait(), and pass the flags for the most
> + * constrained allocation attempt that might have failed.
> + * This provides useful documentation of where loops are,
> + * and a central place to fine tune the waiting as the MM
> + * implementation changes.
> + */
> +static inline void memalloc_retry_wait(gfp_t gfp_flags)
> +{
> + gfp_flags = current_gfp_context(gfp_flags);
> + if ((gfp_flags & __GFP_DIRECT_RECLAIM) &&
> + !(gfp_flags & __GFP_NORETRY))
> + /* Probably waited already, no need for much more */
> + schedule_timeout_uninterruptible(1);
> + else
> + /* Probably didn't wait, and has now released a lock,
> +  * so now is a good time to wait
> +  */
> + schedule_timeout_uninterruptible(HZ/50);
> +}

The existing congestion_wait() calls io_schedule_timeout() under
TASK_UNINTERRUPTIBLE conditions.

Does changing all these calls just to a plain
schedule_timeout_uninterruptible() make any difference to behaviour?
At least process accounting will appear different (uninterruptible
sleep instead of IO wait), and I suspect that the block plug
flushing in io_schedule() might be a good idea to retain for all the
filesystems that call this function from IO-related routines.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4] fs: forbid invalid project ID

2021-07-13 Thread Dave Chinner
On Sat, Jul 10, 2021 at 10:39:59PM +0800, Wang Shilong wrote:
> From: Wang Shilong 
> 
> fileattr_set_prepare() should check if project ID
> is valid, otherwise dqget() will return NULL for
> such project ID quota.
> 
> Signed-off-by: Wang Shilong 
> ---
> v3->v3:
> only check project Id if caller is allowed
> to change and being changed.
> 
> v2->v3: move check before @fsx_projid is accessed
> and use make_kprojid() helper.
> 
> v1->v2: try to fix in the VFS
>  fs/ioctl.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 1e2204fa9963..d4fabb5421cd 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -817,6 +817,14 @@ static int fileattr_set_prepare(struct inode *inode,
>   if ((old_ma->fsx_xflags ^ fa->fsx_xflags) &
>   FS_XFLAG_PROJINHERIT)
>   return -EINVAL;
> + } else {
> + /*
> +  * Caller is allowed to change the project ID. If it is being
> +  * changed, make sure that the new value is valid.
> +  */
> + if (old_ma->fsx_projid != fa->fsx_projid &&
> + !projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid)))
> +         return -EINVAL;
>   }
>  
>   /* Check extent size hints. */

Looks good. Thanks!

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v3] fs: forbid invalid project ID

2021-07-06 Thread Dave Chinner
On Fri, Jul 02, 2021 at 10:02:43AM -0400, Wang Shilong wrote:
> fileattr_set_prepare() should check if project ID
> is valid, otherwise dqget() will return NULL for
> such project ID quota.
> 
> Signed-off-by: Wang Shilong 
> ---
> v2->v3: move check before @fsx_projid is accessed
> and use make_kprojid() helper.
> 
> v1->v2: try to fix in the VFS
> ---
>  fs/ioctl.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 1e2204fa9963..d7edc92df473 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -806,6 +806,8 @@ static int fileattr_set_prepare(struct inode *inode,
>   if (err)
>   return err;
>  
> + if (!projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid)))
> + return -EINVAL;
>   /*
>* Project Quota ID state is only allowed to change from within the init
>* namespace. Enforce that restriction only if we are trying to change

I still don't think this is correct - read the comment directly
below where you put this. That was the code block I was referring to
early:

/*
 * Project Quota ID state is only allowed to change from within the init
 * namespace. Enforce that restriction only if we are trying to change
 * the quota ID state. Everything else is allowed in user namespaces.
 */
if (current_user_ns() != &init_user_ns) {
if (old_ma->fsx_projid != fa->fsx_projid)
return -EINVAL;
if ((old_ma->fsx_xflags ^ fa->fsx_xflags) &
FS_XFLAG_PROJINHERIT)
return -EINVAL;
}

IOWs: if we are not changing the projid, then we should not be
checking it for validity because of the way the whole get/set
interface works.

The reason for this is that this interface is a get/set pair, where
you first have to get the values from the filesystem, then modify
them and send the set back to the filesystem. If the filesystem
sends out fsx_projid = -1 (for whatever reason), the caller must
send that same value back into the filesystem if they are not
modifying the project ID. Hence we have to accept any projid from
the caller that matches the current filesystem value, regardless of
whether it is an invalid value or not.

Therefore, we should only be checking if fa->fsx_projid is valid if
it is different to the current filesystem value *and* we are allowed
to change it. So:

/*
 * Project Quota ID state is only allowed to change from within the init
 * namespace. Enforce that restriction only if we are trying to change
 * the quota ID state. Everything else is allowed in user namespaces.
 */
if (current_user_ns() != &init_user_ns) {
if (old_ma->fsx_projid != fa->fsx_projid)
return -EINVAL;
if ((old_ma->fsx_xflags ^ fa->fsx_xflags) &
FS_XFLAG_PROJINHERIT)
return -EINVAL;
} else {

/*
 * Caller is allowed to change the project ID. If it is being
 * changed, make sure that the new value is valid.
 */
if (old_ma->fsx_projid != fa->fsx_projid &&
    !projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid)))
return -EINVAL;
}

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2] fs: forbid invalid project ID

2021-06-28 Thread Dave Chinner
On Mon, Jun 28, 2021 at 08:38:01AM -0400, Wang Shilong wrote:
> fileattr_set_prepare() should check if project ID
> is valid, otherwise dqget() will return NULL for
> such project ID quota.
> 
> Signed-off-by: Wang Shilong 
> ---
> v1->v2: try to fix in the VFS
> ---
>  fs/ioctl.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 1e2204fa9963..5db5b218637b 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -845,6 +845,9 @@ static int fileattr_set_prepare(struct inode *inode,
>   if (fa->fsx_cowextsize == 0)
>   fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
>  
> + if (!projid_valid(KPROJIDT_INIT(fa->fsx_projid)))
> + return -EINVAL;

This needs to go further up in this function in the section where
project IDs passed into this function are validated. Projids are
only allowed to be changed when current_user_ns() == &init_user_ns,
so this needs to be associated with that verification context.

This check should also use make_kprojid(), please, not open code
KPROJIDT_INIT.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-08 Thread Dave Chinner
On Mon, Jun 07, 2021 at 04:52:17PM +0200, Jan Kara wrote:
> From: Pavel Reichl 
> 
> Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> __xfs_rwsem_islocked() is a helper function which encapsulates checking
> state of rw_semaphores hold by inode.
> 
> Signed-off-by: Pavel Reichl 
> Suggested-by: Dave Chinner 
> Suggested-by: Eric Sandeen 
> Suggested-by: Darrick J. Wong 
> Reviewed-by: Darrick J. Wong 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jan Kara 
> ---
>  fs/xfs/xfs_inode.c | 39 +++
>  fs/xfs/xfs_inode.h | 21 ++---
>  2 files changed, 45 insertions(+), 15 deletions(-)

As a standalone patch, this is overly elaborate and way more complex
than it needs to be. It's not really just a refactor, either,
because of the unnecessary shifting games it adds.

> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0369eb22c1bb..6247977870bd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -342,9 +342,34 @@ xfs_ilock_demote(
>  }
>  
>  #if defined(DEBUG) || defined(XFS_WARN)
> -int
> +static inline bool
> +__xfs_rwsem_islocked(
> + struct rw_semaphore *rwsem,
> + int lock_flags,
> + int shift)
> +{
> + lock_flags >>= shift;
> +
> + if (!debug_locks)
> + return rwsem_is_locked(rwsem);
> + /*
> +  * If the shared flag is not set, pass 0 to explicitly check for
> +  * exclusive access to the lock. If the shared flag is set, we typically
> +  * want to make sure the lock is at least held in shared mode
> +  * (i.e., shared | excl) but we don't necessarily care that it might
> +  * actually be held exclusive. Therefore, pass -1 to check whether the
> +  * lock is held in any mode rather than one of the explicit shared mode
> +  * values (1 or 2)."
> +  */
> + if (lock_flags & (1 << XFS_SHARED_LOCK_SHIFT)) {
> + return lockdep_is_held_type(rwsem, -1);
> + }
> + return lockdep_is_held_type(rwsem, 0);
> +}

Pass in a boolean value for shared/exclusive and
you can get rid of passing in the lock flags as well.

static bool
__xfs_rwsem_islocked(
struct rw_semaphore *rwsem,
boolshared)
{
if (!debug_locks)
return rwsem_is_locked(rwsem);

if (!shared)
return lockdep_is_held_type(rwsem, 0);

/*
 * We are checking that the lock is held at least in shared
 * mode but don't care that it might be held exclusively
 * (i.e. shared | excl). Hence we check if the lock is held
 * in any mode rather than an explicit shared mode.
 */
return lockdep_is_held_type(rwsem, -1);
}

> +
> +bool
>  xfs_isilocked(
> - xfs_inode_t *ip,
> + struct xfs_inode*ip,
>   uintlock_flags)
>  {
>   if (lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) {
> @@ -359,15 +384,13 @@ xfs_isilocked(
>   return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
>   }
>  
> - if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
> - if (!(lock_flags & XFS_IOLOCK_SHARED))
> - return !debug_locks ||
> - lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
> - return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
> + if (lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) {
> + return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem, lock_flags,
> + XFS_IOLOCK_FLAG_SHIFT);

Then this is simply:

return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem,
(lock_flags & XFS_IOLOCK_SHARED));

And the conversion for the MMAPLOCK in the next patch is equally
simple.


> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ca826cfba91c..1c0e15c480bc 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -262,12 +262,19 @@ static inline bool xfs_inode_has_bigtime(struct 
> xfs_inode *ip)
>   * Bit ranges:   1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
>   *   1<<16 - 1<<32-1 -- lockdep annotation (integers)
>   */
> -#define  XFS_IOLOCK_EXCL (1<<0)
> -#define  XFS_IOLOCK_SHARED   (1<<1)
> -#define  XFS_ILOCK_EXCL  (1<<2)
> -#define  XFS_ILOCK_SHARED(1<<3)
> -#define  XFS_MMAPLOCK_EXCL   (1<<4)
> -#define  XFS_MMAPLOCK_SHARED (1<<5)
> +
> +#define XFS_IOLOCK_FLAG_SHIFT0
> +#def

Re: [f2fs-dev] [PATCH 07/13] xfs: Convert to use invalidate_lock

2021-05-25 Thread Dave Chinner
On Tue, May 25, 2021 at 03:50:44PM +0200, Jan Kara wrote:
> Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
> purpose of invalidate_lock is exactly the same. Note that the locking in
> __xfs_filemap_fault() slightly changes as filemap_fault() already takes
> invalidate_lock.
> 
> Reviewed-by: Christoph Hellwig 
> CC: 
> CC: "Darrick J. Wong" 
> Signed-off-by: Jan Kara 
> ---
>  fs/xfs/xfs_file.c  | 12 ++-
>  fs/xfs/xfs_inode.c | 52 ++
>  fs/xfs/xfs_inode.h |  1 -
>  fs/xfs/xfs_super.c |  2 --
>  4 files changed, 36 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..dc9cb5c20549 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1282,7 +1282,7 @@ xfs_file_llseek(
>   *
>   * mmap_lock (MM)
>   *   sb_start_pagefault(vfs, freeze)
> - * i_mmaplock (XFS - truncate serialisation)
> + * invalidate_lock (vfs/XFS_MMAPLOCK - truncate serialisation)
>   *   page_lock (MM)
>   * i_lock (XFS - extent map serialisation)
>   */
> @@ -1303,24 +1303,26 @@ __xfs_filemap_fault(
>   file_update_time(vmf->vma->vm_file);
>   }
>  
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   if (IS_DAX(inode)) {
>   pfn_t pfn;
>  
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
>&xfs_direct_write_iomap_ops :
>&xfs_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   } else {
> - if (write_fault)
> + if (write_fault) {
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = iomap_page_mkwrite(vmf,
>   &xfs_buffered_write_iomap_ops);
> - else
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + } else
>   ret = filemap_fault(vmf);
>   }
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

This seems kinda messy. filemap_fault() basically takes the
invalidate lock around the entire operation, it runs, so maybe it
would be cleaner to implement it as:

filemap_fault_locked(vmf)
{
/* does the filemap fault work */
}

filemap_fault(vmf)
{
filemap_invalidate_down_read(...)
ret = filemap_fault_locked(vmf)
filemap_invalidate_up_read(...)
return ret;
}

And that means XFS could just call filemap_fault_locked() and not 
have to do all this messy locking just to avoid holding the lock
that filemap_fault has now internalised.

> @@ -355,8 +358,11 @@ xfs_isilocked(
>  
>   if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) {
>   if (!(lock_flags & XFS_MMAPLOCK_SHARED))
> - return !!ip->i_mmaplock.mr_writer;
> - return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
> + return !debug_locks ||
> + lockdep_is_held_type(
> + &VFS_I(ip)->i_mapping->invalidate_lock,
> + 0);
> + return rwsem_is_locked(&VFS_I(ip)->i_mapping->invalidate_lock);
>   }



And so here we are again, losing more of our read vs write debug
checks on debug kernels when lockdep is not enabled

Can we please add rwsem_is_locked_read() and rwsem_is_locked_write()
wrappers that just look at the rwsem counter value to determine how
the lock is held? Then the mrlock_t can go away entirely

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-18 Thread Dave Chinner
On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote:
> On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > > +->fallocate implementation must be really careful to maintain page 
> > > > > > cache
> > > > > > +consistency when punching holes or performing other operations 
> > > > > > that invalidate
> > > > > > +page cache contents. Usually the filesystem needs to call
> > > > > > +truncate_inode_pages_range() to invalidate relevant range of the 
> > > > > > page cache.
> > > > > > +However the filesystem usually also needs to update its internal 
> > > > > > (and on disk)
> > > > > > +view of file offset -> disk block mapping. Until this update is 
> > > > > > finished, the
> > > > > > +filesystem needs to block page faults and reads from reloading 
> > > > > > now-stale page
> > > > > > +cache contents from the disk. VFS provides 
> > > > > > mapping->invalidate_lock for this
> > > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem 
> > > > > > is
> > > > > > +responsible for taking this lock in its fallocate implementation 
> > > > > > and generally
> > > > > > +whenever the page cache contents needs to be invalidated because a 
> > > > > > block is
> > > > > > +moving from under a page.
> > > > > > +
> > > > > > +->copy_file_range and ->remap_file_range implementations need to 
> > > > > > serialize
> > > > > > +against modifications of file data while the operation is running. 
> > > > > > For blocking
> > > > > > +changes through write(2) and similar operations inode->i_rwsem can 
> > > > > > be used. For
> > > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > > +mapping->invalidate_lock provided it also acquires it in its 
> > > > > > ->page_mkwrite
> > > > > > +implementation.
> > > > > 
> > > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > > of two different files?  Is it the same as i_rwsem (increasing order 
> > > > > of
> > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that 
> > > > > is
> > > > > being hoisted here (increasing order of i_ino)?
> > > > > 
> > > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > > calls in xfs_ilock2_io_mmap in this series.
> > > > 
> > > > Good question. Technically, I don't think there's real need to 
> > > > establish a
> > > > single ordering because locks among different filesystems are never 
> > > > going
> > > > to be acquired together (effectively each lock type is local per sb and 
> > > > we
> > > > are free to define an ordering for each lock type differently). But to
> > > > maintain some sanity I guess having the same locking order for 
> > > > doublelock
> > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS 
> > > > uses
> > > > by-ino ordering? So that we don't have to consider two different orders 
> > > > in
> > > > xfs_lock_two_inodes()...
> > > 
> > > I imagine Dave will chime in on this, but I suspect the reason is
> > > hysterical raisins^Wreasons.
> > 
> > It's the locking rules that XFS has used pretty much forever.
> > Locking by inode number always guarantees the same locking order of
> > two inodes in the same filesystem, regardless of the specific
> > in-memory instances of the two inodes.
> > 
> > e.g. if we lock based on the inode structure address, in one
> > instancex, we could get A -> B, then B gets recycled and
>

Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Dave Chinner
On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > +->fallocate implementation must be really careful to maintain page 
> > > > cache
> > > > +consistency when punching holes or performing other operations that 
> > > > invalidate
> > > > +page cache contents. Usually the filesystem needs to call
> > > > +truncate_inode_pages_range() to invalidate relevant range of the page 
> > > > cache.
> > > > +However the filesystem usually also needs to update its internal (and 
> > > > on disk)
> > > > +view of file offset -> disk block mapping. Until this update is 
> > > > finished, the
> > > > +filesystem needs to block page faults and reads from reloading 
> > > > now-stale page
> > > > +cache contents from the disk. VFS provides mapping->invalidate_lock 
> > > > for this
> > > > +and acquires it in shared mode in paths loading pages from disk
> > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > +responsible for taking this lock in its fallocate implementation and 
> > > > generally
> > > > +whenever the page cache contents needs to be invalidated because a 
> > > > block is
> > > > +moving from under a page.
> > > > +
> > > > +->copy_file_range and ->remap_file_range implementations need to 
> > > > serialize
> > > > +against modifications of file data while the operation is running. For 
> > > > blocking
> > > > +changes through write(2) and similar operations inode->i_rwsem can be 
> > > > used. For
> > > > +blocking changes through memory mapping, the filesystem can use
> > > > +mapping->invalidate_lock provided it also acquires it in its 
> > > > ->page_mkwrite
> > > > +implementation.
> > > 
> > > Question: What is the locking order when acquiring the invalidate_lock
> > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > being hoisted here (increasing order of i_ino)?
> > > 
> > > The reason I ask is that remap_file_range has to do that, but I don't
> > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > calls in xfs_ilock2_io_mmap in this series.
> > 
> > Good question. Technically, I don't think there's real need to establish a
> > single ordering because locks among different filesystems are never going
> > to be acquired together (effectively each lock type is local per sb and we
> > are free to define an ordering for each lock type differently). But to
> > maintain some sanity I guess having the same locking order for doublelock
> > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > by-ino ordering? So that we don't have to consider two different orders in
> > xfs_lock_two_inodes()...
> 
> I imagine Dave will chime in on this, but I suspect the reason is
> hysterical raisins^Wreasons.

It's the locking rules that XFS has used pretty much forever.
Locking by inode number always guarantees the same locking order of
two inodes in the same filesystem, regardless of the specific
in-memory instances of the two inodes.

e.g. if we lock based on the inode structure address, in one
instancex, we could get A -> B, then B gets recycled and
reallocated, then we get B -> A as the locking order for the same
two inodes.

That, IMNSHO, is utterly crazy because with non-deterministic inode
lock ordered like this you can't make consistent locking rules for
locking the physical inode cluster buffers underlying the inodes in
the situation where they also need to be locked.

We've been down this path before more than a decade ago when the
powers that be decreed that inode locking order is to be "by
structure address" rather than inode number, because "inode number
is not unique across multiple superblocks".

I'm not sure that there is anywhere that locks multiple inodes
across different superblocks, but here we are again

> It might simply be time to convert all
> three XFS inode locks to use the same ordering rules.

Careful, there lie dragons along that path because of things like
how the inode cluster buffer operations work - they all assume
ascending inode number traversal within and across inode cluster
buffers and hence we do have locking order constraints based on
inode number...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 0/12 v4] fs: Hole punch vs page cache filling races

2021-04-25 Thread Dave Chinner
ew pages in the page cache and looking up their corresponding
> block mapping. We also replace EXT4_I(inode)->i_mmap_sem with this new rwsem
> which provides necessary serialization with hole punching for ext4.
> 
>   Honza
> 
> [1] 
> https://lore.kernel.org/linux-fsdevel/caoq4uxjqnmxqmta_vbyw0su9rkrk2zobjmahcyeaevofkvq...@mail.gmail.com/
> 
> Previous versions:
> Link: 
> https://lore.kernel.org/linux-fsdevel/20210208163918.7871-1-j...@suse.cz/
> Link: http://lore.kernel.org/r/20210413105205.3093-1-j...@suse.cz
> 
> CC: ceph-de...@vger.kernel.org
> CC: Chao Yu 
> CC: Damien Le Moal 
> CC: "Darrick J. Wong" 
> CC: Hugh Dickins 
> CC: Jaegeuk Kim 
> CC: Jeff Layton 
> CC: Johannes Thumshirn 
> CC: linux-c...@vger.kernel.org
> CC: 
> CC: linux-f2fs-devel@lists.sourceforge.net
> CC: 
> CC: 
> CC: 
> CC: Miklos Szeredi 
> CC: Steve French 
> CC: Ted Tso 
> 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 02/12] mm: Protect operations adding pages to page cache with invalidate_lock

2021-04-25 Thread Dave Chinner
On Fri, Apr 23, 2021 at 07:29:31PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)   read / fault / ..
>   truncate_inode_pages_range()
>   cache here>
>   
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
.
> diff --git a/fs/inode.c b/fs/inode.c
> index a047ab306f9a..43596dd8b61e 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -191,6 +191,9 @@ int inode_init_always(struct super_block *sb, struct 
> inode *inode)
>   mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
>   mapping->private_data = NULL;
>   mapping->writeback_index = 0;
> + init_rwsem(&mapping->invalidate_lock);
> + lockdep_set_class(&mapping->invalidate_lock,
> +   &sb->s_type->invalidate_lock_key);
>   inode->i_private = NULL;
>   inode->i_mapping = mapping;
>   INIT_HLIST_HEAD(&inode->i_dentry);  /* buggered by rcu freeing */

Oh, lockdep. That might be a problem here.

The XFS_MMAPLOCK has non-trivial lockdep annotations so that it is
tracked as nesting properly against the IOLOCK and the ILOCK. When
you end up using xfs_ilock(XFS_MMAPLOCK..) to lock this, XFS will
add subclass annotations to the lock and they are going to be
different to the locking that the VFS does.

We'll see this from xfs_lock_two_inodes() (e.g. in
xfs_swap_extents()) and xfs_ilock2_io_mmap() during reflink
oper.

Oh. The page cache copy done when breaking a shared extent needs
to lock out page faults on both the source and destination, but it
still needs to be able to populate the page cache of both the source
and destination file.

 and vfs_dedupe_file_range_compare() has to be able to read
pages from both the source and destination file to determine that
the contents are identical and that's done while we hold the
XFS_MMAPLOCK exclusively so the compare is atomic w.r.t. all other
user data modification operations being run

I now have many doubts that this "serialise page faults by locking
out page cache instantiation" method actually works as a generic
mechanism. It's not just page cache invalidation that relies on
being able to lock out page faults: copy-on-write and deduplication
both require the ability to populate the page cache with source data
while page faults are locked out so the data can be compared/copied
atomically with the extent level manipulations and so user data
modifications cannot occur until the physical extent manipulation
operation has completed.

Having only just realised this is a problem, no solution has
immediately popped into my mind. I'll chew on it over the weekend,
but I'm not hopeful at this point...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v2 04/12] fat: only specify I_DIRTY_TIME when needed in fat_update_time()

2021-01-11 Thread Dave Chinner
On Mon, Jan 11, 2021 at 11:50:27AM -0800, Eric Biggers wrote:
> On Mon, Jan 11, 2021 at 11:52:01AM +0100, Christoph Hellwig wrote:
> > On Fri, Jan 08, 2021 at 11:58:55PM -0800, Eric Biggers wrote:
> > > + if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > > + dirty_flags |= I_DIRTY_SYNC;
> > 
> > fat does not support i_version updates, so this bit can be skipped.
> 
> Is that really the case?  Any filesystem (including fat) can be mounted with
> "iversion", which causes SB_I_VERSION to be set.

That's a bug. Filesystems taht don't support persistent i_version on
disk need to clear SB_I_VERSION in their mount and remount paths
because the VFS iversion mount option was a complete screwup from
the start.

> A lot of filesystems (including fat) don't store i_version to disk, but it 
> looks
> like it will still get updated in-memory.  Could anything be relying on that?

If they do, then they are broken by definition. i_version as
reported to observers is defined as monotonically increasing with
every change to the inode. i.e. it never goes backwards. Which, of
course, it will do if you crash or even just unmount/mount a
filesystem that doesn't persist it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support

2020-07-26 Thread Dave Chinner
On Sun, Jul 26, 2020 at 07:59:46PM -0700, Eric Biggers wrote:
> On Mon, Jul 27, 2020 at 10:58:48AM +1000, Dave Chinner wrote:
> > On Sat, Jul 25, 2020 at 07:49:20PM -0700, Eric Biggers wrote:
> > > On Sat, Jul 25, 2020 at 10:14:41AM +1000, Dave Chinner wrote:
> > > > > +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
> > > > > +{
> > > > > + const struct inode *inode = file_inode(iocb->ki_filp);
> > > > > + const unsigned int blocksize = i_blocksize(inode);
> > > > > +
> > > > > + /* If the file is unencrypted, no veto from us. */
> > > > > + if (!fscrypt_needs_contents_encryption(inode))
> > > > > + return true;
> > > > > +
> > > > > + /* We only support direct I/O with inline crypto, not fs-layer 
> > > > > crypto */
> > > > > + if (!fscrypt_inode_uses_inline_crypto(inode))
> > > > > + return false;
> > > > > +
> > > > > + /*
> > > > > +  * Since the granularity of encryption is filesystem blocks, 
> > > > > the I/O
> > > > > +  * must be block aligned -- not just disk sector aligned.
> > > > > +  */
> > > > > + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), 
> > > > > blocksize))
> > > > > + return false;
> > > > 
> > > > Doesn't this force user buffers to be filesystem block size aligned,
> > > > instead of 512 byte aligned as is typical for direct IO?
> > > > 
> > > > That's going to cause applications that work fine on normal
> > > > filesystems becaues the memalign() buffers to 512 bytes or logical
> > > > block device sector sizes (as per the open(2) man page) to fail on
> > > > encrypted volumes, and it's not going to be obvious to users as to
> > > > why this happens.
> > > 
> > > The status quo is that direct I/O on encrypted files falls back to 
> > > buffered I/O.
> > 
> > Largely irrelevant.
> > 
> > You claimed in another thread that performance is a key feature that
> > inline encryption + DIO provides. Now you're implying that failing
> > to provide that performance doesn't really matter at all.
> > 
> > > So this patch is strictly an improvement; it's making direct I/O work in 
> > > a case
> > > where previously it didn't work.
> > 
> > Improvements still need to follow longstanding conventions. And,
> > IMO, it's not an improvement if the feature results in 
> > unpredictable performance for userspace applications.
.

> The open() man page does mention that O_DIRECT I/O typically needs to be 
> aligned
> to logical_block_size; however it also says "In Linux alignment restrictions
> vary by filesystem and kernel version and might be absent entirely."

Now you are just language-laywering. I'll quote from the next
paragraph in the man page:

"Since Linux 2.6.0, alignment to the logical block size of
the underlying storage (typically 512 bytes) suffices. The
logi cal block size can be determined using the ioctl(2)
BLKSSZGET operation [...]"

There's the longstanding convention I've been talking about, clearly
splet out. Applications that follow this convention (and there are
lots) should just work. Having code that works correctly on one file
but not on another -on the same filesystem- despite doing all the
right things is not at all user friendly.

What I really care about is that new functionality works without
requiring applications to be rewritten to cater specifically for
some whacky foilble in a new feature.

fscrypt is generic functionality and hardware acceleration of crypt
functions are only going to get more common in future. Hence the
combination of the two needs to *play nicely* with the vast library
of existing userspace software that is already out there.

We need to get this stuff correct right from the start, otherwise
we're just leaving a huge mountain of technical debt for our future
selfs to have to clean up.

> The other examples of falling back to buffered I/O are relevant, since they 
> show
> that similar issues are already being dealt with in the (rare) use cases of
> O_DIRECT.  So I don't think the convention is as strong as you think it is...

The convention is there so that applications that *expect* to be
using direct IO can do so -reliably-. Breaking conventions that
people have become accustomed to just because it is convenient for

Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support

2020-07-26 Thread Dave Chinner
On Sat, Jul 25, 2020 at 07:49:20PM -0700, Eric Biggers wrote:
> On Sat, Jul 25, 2020 at 10:14:41AM +1000, Dave Chinner wrote:
> > > +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
> > > +{
> > > + const struct inode *inode = file_inode(iocb->ki_filp);
> > > + const unsigned int blocksize = i_blocksize(inode);
> > > +
> > > + /* If the file is unencrypted, no veto from us. */
> > > + if (!fscrypt_needs_contents_encryption(inode))
> > > + return true;
> > > +
> > > + /* We only support direct I/O with inline crypto, not fs-layer crypto */
> > > + if (!fscrypt_inode_uses_inline_crypto(inode))
> > > + return false;
> > > +
> > > + /*
> > > +  * Since the granularity of encryption is filesystem blocks, the I/O
> > > +  * must be block aligned -- not just disk sector aligned.
> > > +  */
> > > + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), blocksize))
> > > + return false;
> > 
> > Doesn't this force user buffers to be filesystem block size aligned,
> > instead of 512 byte aligned as is typical for direct IO?
> > 
> > That's going to cause applications that work fine on normal
> > filesystems becaues the memalign() buffers to 512 bytes or logical
> > block device sector sizes (as per the open(2) man page) to fail on
> > encrypted volumes, and it's not going to be obvious to users as to
> > why this happens.
> 
> The status quo is that direct I/O on encrypted files falls back to buffered 
> I/O.

Largely irrelevant.

You claimed in another thread that performance is a key feature that
inline encryption + DIO provides. Now you're implying that failing
to provide that performance doesn't really matter at all.

> So this patch is strictly an improvement; it's making direct I/O work in a 
> case
> where previously it didn't work.

Improvements still need to follow longstanding conventions. And,
IMO, it's not an improvement if the feature results in 
unpredictable performance for userspace applications.

i.e. there is no point in enabling direct IO if it is unpredictably
going to fall back to the buffered IO path when applications are
coded to the guidelines the man page said they should use. Such
problems are an utter PITA to diagnose in the field, and on those
grounds alone the current implementation gets a NACK.

> Note that there are lots of other cases where ext4 and f2fs fall back to
> buffered I/O; see ext4_dio_supported() and f2fs_force_buffered_io().  So this
> isn't a new problem.

No shit, sherlock. But that's also irrelevant to the discussion at
hand - claiming "we can fall back to buffered IO" doesn't address
the problem I've raised. It's just an excuse for not fixing it.

Indeed, the problem is easy to fix - fscrypt only cares that the
user IO offset and length is DUN aligned.  fscrypt does not care
that the user memory buffer is filesystem block aligned - user
memory buffer alignment is an underlying hardware DMA constraint -
and so fscrypt_dio_supported() needs to relax or remove the user
memroy buffer alignment constraint so that it follows existing
conventions

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-25 Thread Dave Chinner
On Sun, Jul 26, 2020 at 09:47:51AM +1000, Dave Chinner wrote:
> On Fri, Jul 24, 2020 at 10:41:32AM -0700, Eric Biggers wrote:
> > But again, as far as I can tell, fs/iomap/direct-io.c currently *does* 
> > guarantee
> > that *if* the input is fully filesystem-block-aligned and if blocksize <=
> > PAGE_SIZE, then the issued I/O is also filesystem-block-aligned.
> 
> Please listen to what I'm saying, Eric.
> 
> The -current iomap implementation- may provide that behaviour. That
> doesn't mean we guarantee that behaviour. i.e. the iomap -design-
> does not guaranteee that behaviour, and we don't guarantee such
> behaviour into the future. And we won't guarantee this behaviour -
> even though the current implementation may provide it - because the
> rest of the IO stack below iomap does not provide iomap with that
> guarantee.
> 
> Hence if iomap cannot get a guarantee that IO it issues won't get
> split at some arbitrary boundary, it cannot provide filesystems with
> that guarantee.

BTW, if you want iomap_dio_rw() to provide an arbitrary bio
alignment guarantee at the iomap layer, then it should be returned
in the iomap along with the extent mapping. That could then be used
instead of the bdev logical block size. That won't guarantee the
behaviour of the rest of the stack, but it would provide a defined
IO submission behaviour that iomap would have to guarantee into the
future...

That would also remove the need to duplicate the alignment checks
in the filesystem for fscrypt DIO...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-25 Thread Dave Chinner
On Fri, Jul 24, 2020 at 10:41:32AM -0700, Eric Biggers wrote:
> On Fri, Jul 24, 2020 at 03:31:30PM +1000, Dave Chinner wrote:
> > On Thu, Jul 23, 2020 at 08:46:28PM -0700, Eric Biggers wrote:
> > > On Fri, Jul 24, 2020 at 11:39:10AM +1000, Dave Chinner wrote:
> > > > fscrypt_inode_uses_inline_crypto() ends up being:
> > > > 
> > > > if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) &&
> > > > inode->i_crypt_info->ci_inlinecrypt)
> > > > 
> > > > I note there are no checks for inode->i_crypt_info being non-null,
> > > > and I note that S_ENCRYPTED is set on the inode when the on-disk
> > > > encrypted flag is encountered, not when inode->i_crypt_info is set.
> > > > 
> > > 
> > > ->i_crypt_info is set when the file is opened, so it's guaranteed to be 
> > > set for
> > > any I/O.  So the case you're concerned about just doesn't happen.
> > 
> > Ok. The connection is not obvious to someone who doesn't know the
> > fscrypt code inside out.
> > 
> > > > > Note that currently, I don't think iomap_dio_bio_actor() would handle 
> > > > > an
> > > > > encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could 
> > > > > be split
> > > > > in the middle of a filesystem block (even after the filesystem 
> > > > > ensures that
> > > > > direct I/O on encrypted files is fully filesystem-block-aligned, 
> > > > > which we do ---
> > > > > see the rest of this patchset), which isn't allowed on encrypted 
> > > > > files.
> > > > 
> > > > That can already happen unless you've specifically restricted DIO
> > > > alignments in the filesystem code. i.e. Direct IO already supports
> > > > sub-block ranges and alignment, and we can already do user DIO on
> > > > sub-block, sector aligned ranges just fine. And the filesystem can
> > > > already split the iomap on sub-block alignments and ranges if it
> > > > needs to because the iomap uses byte range addressing, not sector or
> > > > block based addressing.
> > > > 
> > > > So either you already have a situation where the 2^32 offset can
> > > > land *inside* a filesystem block, or the offset is guaranteed to be
> > > > filesystem block aligned and so you'll never get this "break an IO
> > > > on sub-block alignment" problem regardless of the filesystem block
> > > > size...
> > > > 
> > > > Either way, it's not an iomap problem - it's a filesystem mapping
> > > > problem...
> > > > 
> > > 
> > > I think you're missing the point here.  Currently, the granularity of 
> > > encryption
> > > (a.k.a. "data unit size") is always filesystem blocks, so that's the 
> > > minimum we
> > > can directly read or write to an encrypted file.  This has nothing to do 
> > > with
> > > the IV wraparound case also being discussed.
> > 
> > So when you change the subject, please make it *really obvious* so
> > that people don't think you are still talking about the same issue.
> > 
> > > For example, changing a single bit in the plaintext of a filesystem block 
> > > may
> > > result in the entire block's ciphertext changing.  (The exact behavior 
> > > depends
> > > on the cryptographic algorithm that is used.)
> > > 
> > > That's why this patchset makes ext4 only allow direct I/O on encrypted 
> > > files if
> > > the I/O is fully filesystem-block-aligned.  Note that this might be a more
> > > strict alignment requirement than the bdev_logical_block_size().
> > > 
> > > As long as the iomap code only issues filesystem-block-aligned bios, 
> > > *given
> > > fully filesystem-block-aligned inputs*, we're fine.  That appears to be 
> > > the case
> > > currently.
> > 
> > The actual size and shape of the bios issued by direct IO (both old
> > code and newer iomap code) is determined by the user supplied iov,
> > the size of the biovec array allocated in the bio, and the IO
> > constraints of the underlying hardware.  Hence direct IO does not
> > guarantee alignment to anything larger than the underlying block
> > device logical sector size because there's no guarantee when or
> > where a bio will fi

Re: [f2fs-dev] [PATCH v6 1/7] fscrypt: Add functions for direct I/O support

2020-07-24 Thread Dave Chinner
On Fri, Jul 24, 2020 at 06:44:55PM +, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Introduce fscrypt_dio_supported() to check whether a direct I/O request
> is unsupported due to encryption constraints.
> 
> Also introduce fscrypt_limit_io_blocks() to limit how many blocks can be
> added to a bio being prepared for direct I/O. This is needed for
> filesystems that use the iomap direct I/O implementation to avoid DUN
> wraparound in the middle of a bio (which is possible with the
> IV_INO_LBLK_32 IV generation method). Elsewhere fscrypt_mergeable_bio()
> is used for this, but iomap operates on logical ranges directly, so
> filesystems using iomap won't have a chance to call fscrypt_mergeable_bio()
> on every block added to a bio. So we need this function which limits a
> logical range in one go.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> ---
>  fs/crypto/crypto.c   |  8 +
>  fs/crypto/inline_crypt.c | 74 
>  include/linux/fscrypt.h  | 18 ++
>  3 files changed, 100 insertions(+)
> 
> diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
> index 9212325763b0..f72f22a718b2 100644
> --- a/fs/crypto/crypto.c
> +++ b/fs/crypto/crypto.c
> @@ -69,6 +69,14 @@ void fscrypt_free_bounce_page(struct page *bounce_page)
>  }
>  EXPORT_SYMBOL(fscrypt_free_bounce_page);
>  
> +/*
> + * Generate the IV for the given logical block number within the given file.
> + * For filenames encryption, lblk_num == 0.
> + *
> + * Keep this in sync with fscrypt_limit_io_blocks().  
> fscrypt_limit_io_blocks()
> + * needs to know about any IV generation methods where the low bits of IV 
> don't
> + * simply contain the lblk_num (e.g., IV_INO_LBLK_32).
> + */
>  void fscrypt_generate_iv(union fscrypt_iv *iv, u64 lblk_num,
>const struct fscrypt_info *ci)
>  {
> diff --git a/fs/crypto/inline_crypt.c b/fs/crypto/inline_crypt.c
> index d7aecadf33c1..4cdf807b89b9 100644
> --- a/fs/crypto/inline_crypt.c
> +++ b/fs/crypto/inline_crypt.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "fscrypt_private.h"
>  
> @@ -362,3 +363,76 @@ bool fscrypt_mergeable_bio_bh(struct bio *bio,
>   return fscrypt_mergeable_bio(bio, inode, next_lblk);
>  }
>  EXPORT_SYMBOL_GPL(fscrypt_mergeable_bio_bh);
> +
> +/**
> + * fscrypt_dio_supported() - check whether a direct I/O request is 
> unsupported
> + *due to encryption constraints
> + * @iocb: the file and position the I/O is targeting
> + * @iter: the I/O data segment(s)
> + *
> + * Return: true if direct I/O is supported
> + */
> +bool fscrypt_dio_supported(struct kiocb *iocb, struct iov_iter *iter)
> +{
> + const struct inode *inode = file_inode(iocb->ki_filp);
> + const unsigned int blocksize = i_blocksize(inode);
> +
> + /* If the file is unencrypted, no veto from us. */
> + if (!fscrypt_needs_contents_encryption(inode))
> + return true;
> +
> + /* We only support direct I/O with inline crypto, not fs-layer crypto */
> + if (!fscrypt_inode_uses_inline_crypto(inode))
> + return false;
> +
> + /*
> +  * Since the granularity of encryption is filesystem blocks, the I/O
> +  * must be block aligned -- not just disk sector aligned.
> +  */
> + if (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(iter), blocksize))
> + return false;

Doesn't this force user buffers to be filesystem block size aligned,
instead of 512 byte aligned as is typical for direct IO?

That's going to cause applications that work fine on normal
filesystems becaues the memalign() buffers to 512 bytes or logical
block device sector sizes (as per the open(2) man page) to fail on
encrypted volumes, and it's not going to be obvious to users as to
why this happens.

XFS has XFS_IOC_DIOINFO to expose exactly this information to
userspace on a per-file basis. Other filesystem and VFS developers
have said for the past 15 years "we don't need no stinking DIOINFO".
The same people shot down adding optional IO alignment
constraint fields to statx() a few years ago, too.

Yet here were are again, with alignment of DIO buffers being an
issue that userspace needs to know about

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-23 Thread Dave Chinner
On Thu, Jul 23, 2020 at 08:46:28PM -0700, Eric Biggers wrote:
> On Fri, Jul 24, 2020 at 11:39:10AM +1000, Dave Chinner wrote:
> > fscrypt_inode_uses_inline_crypto() ends up being:
> > 
> > if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) &&
> > inode->i_crypt_info->ci_inlinecrypt)
> > 
> > I note there are no checks for inode->i_crypt_info being non-null,
> > and I note that S_ENCRYPTED is set on the inode when the on-disk
> > encrypted flag is encountered, not when inode->i_crypt_info is set.
> > 
> 
> ->i_crypt_info is set when the file is opened, so it's guaranteed to be set 
> for
> any I/O.  So the case you're concerned about just doesn't happen.

Ok. The connection is not obvious to someone who doesn't know the
fscrypt code inside out.

> > > Note that currently, I don't think iomap_dio_bio_actor() would handle an
> > > encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could be 
> > > split
> > > in the middle of a filesystem block (even after the filesystem ensures 
> > > that
> > > direct I/O on encrypted files is fully filesystem-block-aligned, which we 
> > > do ---
> > > see the rest of this patchset), which isn't allowed on encrypted files.
> > 
> > That can already happen unless you've specifically restricted DIO
> > alignments in the filesystem code. i.e. Direct IO already supports
> > sub-block ranges and alignment, and we can already do user DIO on
> > sub-block, sector aligned ranges just fine. And the filesystem can
> > already split the iomap on sub-block alignments and ranges if it
> > needs to because the iomap uses byte range addressing, not sector or
> > block based addressing.
> > 
> > So either you already have a situation where the 2^32 offset can
> > land *inside* a filesystem block, or the offset is guaranteed to be
> > filesystem block aligned and so you'll never get this "break an IO
> > on sub-block alignment" problem regardless of the filesystem block
> > size...
> > 
> > Either way, it's not an iomap problem - it's a filesystem mapping
> > problem...
> > 
> 
> I think you're missing the point here.  Currently, the granularity of 
> encryption
> (a.k.a. "data unit size") is always filesystem blocks, so that's the minimum 
> we
> can directly read or write to an encrypted file.  This has nothing to do with
> the IV wraparound case also being discussed.

So when you change the subject, please make it *really obvious* so
that people don't think you are still talking about the same issue.

> For example, changing a single bit in the plaintext of a filesystem block may
> result in the entire block's ciphertext changing.  (The exact behavior depends
> on the cryptographic algorithm that is used.)
> 
> That's why this patchset makes ext4 only allow direct I/O on encrypted files 
> if
> the I/O is fully filesystem-block-aligned.  Note that this might be a more
> strict alignment requirement than the bdev_logical_block_size().
> 
> As long as the iomap code only issues filesystem-block-aligned bios, *given
> fully filesystem-block-aligned inputs*, we're fine.  That appears to be the 
> case
> currently.

The actual size and shape of the bios issued by direct IO (both old
code and newer iomap code) is determined by the user supplied iov,
the size of the biovec array allocated in the bio, and the IO
constraints of the underlying hardware.  Hence direct IO does not
guarantee alignment to anything larger than the underlying block
device logical sector size because there's no guarantee when or
where a bio will fill up.

To guarantee alignment of what ends up at the hardware, you have to
set the block device parameters (e.g. logical sector size)
appropriately all the way down the stack. You also need to ensure
that the filesystem is correctly aligned on the block device so that
filesystem blocks don't overlap things like RAID stripe boundaires,
linear concat boundaries, etc.

IOWs, to constrain alignment in the IO path, you need to configure
you system correct so that the information provided to iomap for IO
alignment matches your requirements. This is not somethign iomap can
do itself; everything from above needs to be constrained by the
filesystem using iomap, everything sent below by iomap is
constrained by the block device config.

> (It's possible that in the future we'll support other encryption data unit
> sizes, perhaps powers of 2 from 512 to filesystem block size.  But for now the
> filesystem block size has been good enough for everyone,

Not the case. fscrypt use in enterprise environments needs support
for block size < page size so that it can be deployed on 64kB page
size machines without requiring 64kB filesystem block sizes.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-23 Thread Dave Chinner
On Thu, Jul 23, 2020 at 04:03:45PM -0700, Eric Biggers wrote:
> Hi Dave,
> 
> On Fri, Jul 24, 2020 at 08:07:52AM +1000, Dave Chinner wrote:
> > > > > @@ -183,11 +184,16 @@ static void
> > > > >  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t 
> > > > > pos,
> > > > >   unsigned len)
> > > > >  {
> > > > > + struct inode *inode = file_inode(dio->iocb->ki_filp);
> > > > >   struct page *page = ZERO_PAGE(0);
> > > > >   int flags = REQ_SYNC | REQ_IDLE;
> > > > >   struct bio *bio;
> > > > >  
> > > > >   bio = bio_alloc(GFP_KERNEL, 1);
> > > > > +
> > > > > + /* encrypted direct I/O is guaranteed to be fs-block aligned */
> > > > > + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode));
> > > > 
> > > > Which means you are now placing a new constraint on this code in
> > > > that we cannot ever, in future, zero entire blocks here.
> > > > 
> > > > This code can issue arbitrary sized zeroing bios - multiple entire fs 
> > > > blocks
> > > > blocks if necessary - so I think constraining it to only support
> > > > partial block zeroing by adding a warning like this is no correct.
> > > 
> > > In v3 and earlier this instead had the code to set an encryption context:
> > > 
> > >   fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> > > GFP_KERNEL);
> > > 
> > > Would you prefer that, even though the call to 
> > > fscrypt_set_bio_crypt_ctx() would
> > 
> > Actually, I have no idea what that function does. It's not in a
> > 5.8-rc6 kernel, and it's not in this patchset
> 
> The cover letter mentions that this patchset is based on fscrypt/master.

Which the reader is left to work out where it exists. If you are
going to say "based on", then a pointer to the actual tree like
this:

> That is, "master" of https://git.kernel.org/pub/scm/fs/fscrypt/fscrypt.git

is somewhat helpful.

> fscrypt_set_bio_crypt_ctx() was introduced by
> "fscrypt: add inline encryption support" on that branch.

Way too much static inline function abstraction.

fscrypt_inode_uses_inline_crypto() ends up being:

if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode) &&
inode->i_crypt_info->ci_inlinecrypt)

I note there are no checks for inode->i_crypt_info being non-null,
and I note that S_ENCRYPTED is set on the inode when the on-disk
encrypted flag is encountered, not when inode->i_crypt_info is set.

Further, I note that the local variable for ci is fetched before
fscrypt_inode_uses_inline_crypto() is run, so leaving a landmine for
people who try to access ci before checking if inline crypto is
enabled. Or, indeed, if S_ENCRYPTED is set and the crypt_info has
not been set up, fscrypt_set_bio_crypt_ctx() will oops in the DIO
path.

> > > always be a no-op currently (since for now, iomap_dio_zero() will never be
> > > called with an encrypted file) and thus wouldn't be properly tested?
> > 
> > Same can be said for this WARN_ON_ONCE() code :)
> > 
> > But, in the interests of not leaving landmines, if a fscrypt context
> > is needed to be attached to the bio for data IO in direct IO, it
> > should be attached to all bios that are allocated in the dio path
> > rather than leave a landmine for people in future to trip over.
> 
> My concern is that if we were to pass the wrong 'lblk' to
> fscrypt_set_bio_crypt_ctx(), we wouldn't catch it because it's not tested.
> Passing the wrong 'lblk' would cause the data to be encrypted/decrypted
> incorrectly.

When you implement sub-block DIO alignment for fscrypt enabled
filesystems, fsx will tell you pretty quickly if you screwed up

> Note that currently, I don't think iomap_dio_bio_actor() would handle an
> encrypted file with blocksize > PAGE_SIZE correctly, as the I/O could be split
> in the middle of a filesystem block (even after the filesystem ensures that
> direct I/O on encrypted files is fully filesystem-block-aligned, which we do 
> ---
> see the rest of this patchset), which isn't allowed on encrypted files.

That can already happen unless you've specifically restricted DIO
alignments in the filesystem code. i.e. Direct IO already supports
sub-block ranges and alignment, and we can already do user DIO on
sub-block, sector aligned ranges just fine. And the filesystem can
already split the iomap on sub-block 

Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-23 Thread Dave Chinner
On Wed, Jul 22, 2020 at 03:34:04PM -0700, Eric Biggers wrote:
> On Thu, Jul 23, 2020 at 07:16:29AM +1000, Dave Chinner wrote:
> > On Mon, Jul 20, 2020 at 11:37:35PM +, Satya Tangirala wrote:
> > > From: Eric Biggers 
> > > 
> > > Wire up iomap direct I/O with the fscrypt additions for direct I/O.
> > > This allows ext4 to support direct I/O on encrypted files when inline
> > > encryption is enabled.
> > > 
> > > This change consists of two parts:
> > > 
> > > - Set a bio_crypt_ctx on bios for encrypted files, so that the file
> > >   contents get encrypted (or decrypted).
> > > 
> > > - Ensure that encryption data unit numbers (DUNs) are contiguous within
> > >   each bio.  Use the new function fscrypt_limit_io_pages() for this,
> > >   since the iomap code works directly with logical ranges and thus
> > >   doesn't have a chance to call fscrypt_mergeable_bio() on each page.
> > > 
> > > Note that fscrypt_limit_io_pages() is normally a no-op, as normally the
> > > DUNs simply increment along with the logical blocks.  But it's needed to
> > > handle an edge case in one of the fscrypt IV generation methods.
> > > 
> > > Signed-off-by: Eric Biggers 
> > > Co-developed-by: Satya Tangirala 
> > > Signed-off-by: Satya Tangirala 
> > > ---
> > >  fs/iomap/direct-io.c | 12 +++-
> > >  1 file changed, 11 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ec7b78e6feca..12064daa3e3d 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -6,6 +6,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -183,11 +184,16 @@ static void
> > >  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
> > >   unsigned len)
> > >  {
> > > + struct inode *inode = file_inode(dio->iocb->ki_filp);
> > >   struct page *page = ZERO_PAGE(0);
> > >   int flags = REQ_SYNC | REQ_IDLE;
> > >   struct bio *bio;
> > >  
> > >   bio = bio_alloc(GFP_KERNEL, 1);
> > > +
> > > + /* encrypted direct I/O is guaranteed to be fs-block aligned */
> > > + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode));
> > 
> > Which means you are now placing a new constraint on this code in
> > that we cannot ever, in future, zero entire blocks here.
> > 
> > This code can issue arbitrary sized zeroing bios - multiple entire fs blocks
> > blocks if necessary - so I think constraining it to only support
> > partial block zeroing by adding a warning like this is no correct.
> 
> In v3 and earlier this instead had the code to set an encryption context:
> 
>   fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> GFP_KERNEL);
> 
> Would you prefer that, even though the call to fscrypt_set_bio_crypt_ctx() 
> would

Actually, I have no idea what that function does. It's not in a
5.8-rc6 kernel, and it's not in this patchset

> always be a no-op currently (since for now, iomap_dio_zero() will never be
> called with an encrypted file) and thus wouldn't be properly tested?

Same can be said for this WARN_ON_ONCE() code :)

But, in the interests of not leaving landmines, if a fscrypt context
is needed to be attached to the bio for data IO in direct IO, it
should be attached to all bios that are allocated in the dio path
rather than leave a landmine for people in future to trip over.

> BTW, iomap_dio_zero() is actually limited to one page, so it's not quite
> "arbitrary sizes".

Yup, but that's an implentation detail, not a design constraint.
i.e. I typically review/talk about how stuff functions at a
design/architecture level, not how it's been implemented in the
code.

e.g. block size > page size patches in progress make use of the
"arbitrary length" capability of the design:

https://lore.kernel.org/linux-xfs/20181107063127.3902-7-da...@fromorbit.com/

> iomap is used for other filesystem operations too, so we need to consider when
> to actually do the limiting.  I don't think we should break up the extents
> returned FS_IOC_FIEMAP, for example.  FIEMAP already has a defined behavior.
> Also, it would be weird for the list of extents that FIEMAP returns to change
> depending on whether the filesystem is mounted with '-o inlinecrypt' or not.

We don't need to care

Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-22 Thread Dave Chinner
On Mon, Jul 20, 2020 at 11:37:35PM +, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Wire up iomap direct I/O with the fscrypt additions for direct I/O.
> This allows ext4 to support direct I/O on encrypted files when inline
> encryption is enabled.
> 
> This change consists of two parts:
> 
> - Set a bio_crypt_ctx on bios for encrypted files, so that the file
>   contents get encrypted (or decrypted).
> 
> - Ensure that encryption data unit numbers (DUNs) are contiguous within
>   each bio.  Use the new function fscrypt_limit_io_pages() for this,
>   since the iomap code works directly with logical ranges and thus
>   doesn't have a chance to call fscrypt_mergeable_bio() on each page.
> 
> Note that fscrypt_limit_io_pages() is normally a no-op, as normally the
> DUNs simply increment along with the logical blocks.  But it's needed to
> handle an edge case in one of the fscrypt IV generation methods.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> ---
>  fs/iomap/direct-io.c | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ec7b78e6feca..12064daa3e3d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -183,11 +184,16 @@ static void
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>   unsigned len)
>  {
> + struct inode *inode = file_inode(dio->iocb->ki_filp);
>   struct page *page = ZERO_PAGE(0);
>   int flags = REQ_SYNC | REQ_IDLE;
>   struct bio *bio;
>  
>   bio = bio_alloc(GFP_KERNEL, 1);
> +
> + /* encrypted direct I/O is guaranteed to be fs-block aligned */
> + WARN_ON_ONCE(fscrypt_needs_contents_encryption(inode));

Which means you are now placing a new constraint on this code in
that we cannot ever, in future, zero entire blocks here.

This code can issue arbitrary sized zeroing bios - multiple entire fs blocks
blocks if necessary - so I think constraining it to only support
partial block zeroing by adding a warning like this is no correct.

> @@ -253,6 +259,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   ret = nr_pages;
>   goto out;
>   }
> + nr_pages = fscrypt_limit_io_pages(inode, pos, nr_pages);

So if "pos" overlaps a 2^32 offset when a certain mode is used, we
have to break up the IO?

If true, I'm not sure that this belongs here. Limiting the size of
the IOs because of filesytem contraints belongs in the filesystem
extent mapping code. That's the point where the IO is broken up into
maximally sized chunks the filesystem can issue as a contiguous
range. If the fscrypt code is breaking that "contiguous IO range"
because of the mode being used, the fs mapping code should break
the mapping at the boundery where the IO needs to be broken.

Hence the dio mapping code here will never build IOs that cross this
-filesystem- encryption limitation, and we don't need this fscrypt
code in the direct IO path at all.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT

2020-06-23 Thread Dave Chinner
On Mon, Jun 22, 2020 at 06:50:17PM -0700, Eric Biggers wrote:
> On Tue, Jun 23, 2020 at 10:46:36AM +1000, Dave Chinner wrote:
> > On Wed, Jun 17, 2020 at 08:19:35PM -0700, Eric Biggers wrote:
> > > Are you objecting to the use of a SB_* flag, or just to showing the flag 
> > > in
> > > show_sb_opts() instead of in the individual filesystems?  Note that the 
> > > SB_*
> > > flag was requested by Christoph
> > > (https://lkml.kernel.org/r/20191031183217.gf23...@infradead.org/,
> > > https://lkml.kernel.org/r/20191031212103.ga6...@infradead.org/).  We 
> > > originally
> > > used a function fscrypt_operations::inline_crypt_enabled() instead.
> > 
> > I'm objecting to the layering violations of having the filesystem
> > control the mount option parsing and superblock feature flags, but
> > then having no control over whether features that the filesystem has
> > indicated to the VFS it is using get emitted as a mount option or
> > not, and then having the VFS code unconditionally override the
> > functionality that the filesystem uses because it thinks it's a
> > mount option the filesystem supports
> > 
> > For example, the current mess that has just come to light:
> > filesystems like btrfs and XFS v5 which set SB_IVERSION
> > unconditionally (i.e. it's not a mount option!) end up having that
> > functionality turned off on remount because the VFS conflates
> > MS_IVERSION with SB_IVERSION and so unconditionally clears
> > SB_IVERSION because MS_IVERSION is not set on remount by userspace.
> > Which userspace will never set be because the filesystems don't put
> > "iversion" in their mount option strings because -its not a mount
> > option- for those filesystems.
> > 
> > See the problem?  MS_IVERSION should be passed to the filesystem to
> > deal with as a mount option, not treated as a flag to directly
> > change SB_IVERSION in the superblock.
> > 
> > We really need to stop with the "global mount options for everyone
> > at the VFS" and instead pass everything down to the filesystems to
> > parse appropriately. Yes, provide generic helper functions to deal
> > with the common flags that everything supports, but the filesystems
> > should be masking off mount options they doesn't support changing
> > before changing their superblock feature support mask
> 
> I think the MS_* flags are best saved for mount options that are applicable to
> many/most filesystems and are mostly/entirely implementable at the VFS level.

That's the theory, but so far it's caused nothing but pain.

In reality, I think ithe only sane way forward if to stop mount
option parsing in userspace (i.e. no new MS_* flags) for any new
functionality as it only leads to future pain. i.e. all new mount
options should be parsed entirely in the kernel by the filesystem
parsing code

> I don't think "inlinecrypt" qualifies, since while it will be shared by the
> block device-based filesystems that support fscrypt, that is only 2 
> filesystems
> currently; and while some of the code needed to implement it is shared in
> fs/crypto/, there are still substantial filesystem-specific hooks needed.

Right. I wasn't suggesting this patchset should use an MS_ flag -
it was pointing out the problem with the VFS code using SB_ flags to
indicate enabled filesystem functionality unconditionally as a mount
option that can be changed by userspace.

> Hence this patchset intentionally does *not* allocate an MS_INLINECRYPT flag.
> 
> I believe that already addresses half of your concern, as it means
> SB_INLINECRYPT can only be set/cleared by the filesystem itself, not by the 
> VFS.
> (But the commit message could use an explanation of this.)
> 
> The other half would be addressed by the following change, right?

Yes, it does. Thanks, Eric!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT

2020-06-22 Thread Dave Chinner
On Wed, Jun 17, 2020 at 08:19:35PM -0700, Eric Biggers wrote:
> On Thu, Jun 18, 2020 at 11:19:12AM +1000, Dave Chinner wrote:
> > On Wed, Jun 17, 2020 at 07:57:29AM +, Satya Tangirala wrote:
> > > Introduce SB_INLINECRYPT, which is set by filesystems that wish to use
> > > blk-crypto for file content en/decryption. This flag maps to the
> > > '-o inlinecrypt' mount option which multiple filesystems will implement,
> > > and code in fs/crypto/ needs to be able to check for this mount option
> > > in a filesystem-independent way.
> > > 
> > > Signed-off-by: Satya Tangirala 
> > > ---
> > >  fs/proc_namespace.c | 1 +
> > >  include/linux/fs.h  | 1 +
> > >  2 files changed, 2 insertions(+)
> > > 
> > > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> > > index 3059a9394c2d..e0ff1f6ac8f1 100644
> > > --- a/fs/proc_namespace.c
> > > +++ b/fs/proc_namespace.c
> > > @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct 
> > > super_block *sb)
> > >   { SB_DIRSYNC, ",dirsync" },
> > >   { SB_MANDLOCK, ",mand" },
> > >   { SB_LAZYTIME, ",lazytime" },
> > > + { SB_INLINECRYPT, ",inlinecrypt" },
> > >   { 0, NULL }
> > >   };
> > >   const struct proc_fs_opts *fs_infop;
> > 
> > NACK.
> > 
> > SB_* flgs are for functionality enabled on the superblock, not for
> > indicating mount options that have been set by the user.
> 
> That's an interesting claim, given that most SB_* flags are for mount options.
> E.g.:
> 
>   ro => SB_RDONLY
>   nosuid => SB_NOSUID
>   nodev => SB_NODEV
>   noexec => SB_NOEXEC
>   sync => SB_SYNCHRONOUS
>   mand => SB_MANDLOCK
>   noatime => SB_NOATIME
>   nodiratime => SB_NODIRATIME
>   lazytime => SB_LAZYTIME

Yes, they *reflect* options set by mount options, but this is all so
screwed up because the split of superblock functionality from the
mount option API (i.e. the MS_* flag introduction to avoid having
the superblock feature flags being directly defined by the userspace
mount API) was never followed through to properly separate the
implementation of *active superblock feature flags* from the *user
specified mount API flags*.

Yes, the UAPI definitions were separated, but the rest of the
interface wasn't and only works because of the "MS* flag exactly
equal to the SB* flag" hack that was used. So now people have no
idea when to use one or the other and we're repeatedly ending up
with broken mount option parsing because SB flags are used where MS
flags should be used and vice versa.

We've made a damn mess of mount options, and the fscontext stuff
hasn't fixed any of this ... mess. It's just stirred it around and so
nobody really knows what they are supposed to with mount options
right now.

> > If the mount options are directly parsed by the filesystem option
> > parser (as is done later in this patchset), then the mount option
> > setting should be emitted by the filesystem's ->show_options
> > function, not a generic function.
> > 
> > The option string must match what the filesystem defines, not
> > require separate per-filesystem and VFS definitions of the same
> > option that people could potentially get wrong (*cough* i_version vs
> > iversion *cough*)
> 
> Are you objecting to the use of a SB_* flag, or just to showing the flag in
> show_sb_opts() instead of in the individual filesystems?  Note that the SB_*
> flag was requested by Christoph
> (https://lkml.kernel.org/r/20191031183217.gf23...@infradead.org/,
> https://lkml.kernel.org/r/20191031212103.ga6...@infradead.org/).  We 
> originally
> used a function fscrypt_operations::inline_crypt_enabled() instead.

I'm objecting to the layering violations of having the filesystem
control the mount option parsing and superblock feature flags, but
then having no control over whether features that the filesystem has
indicated to the VFS it is using get emitted as a mount option or
not, and then having the VFS code unconditionally override the
functionality that the filesystem uses because it thinks it's a
mount option the filesystem supports

For example, the current mess that has just come to light:
filesystems like btrfs and XFS v5 which set SB_IVERSION
unconditionally (i.e. it's not a mount option!) end up having that
functionality turned off on remount because the VFS conflates
MS_IVERSION with SB_IVERSION and so unconditionally clears
SB_IVERSION because MS_IVERSION is not set on remount by userspace.
Which user

Re: [f2fs-dev] [PATCH 1/4] fs: introduce SB_INLINECRYPT

2020-06-17 Thread Dave Chinner
On Wed, Jun 17, 2020 at 07:57:29AM +, Satya Tangirala wrote:
> Introduce SB_INLINECRYPT, which is set by filesystems that wish to use
> blk-crypto for file content en/decryption. This flag maps to the
> '-o inlinecrypt' mount option which multiple filesystems will implement,
> and code in fs/crypto/ needs to be able to check for this mount option
> in a filesystem-independent way.
> 
> Signed-off-by: Satya Tangirala 
> ---
>  fs/proc_namespace.c | 1 +
>  include/linux/fs.h  | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 3059a9394c2d..e0ff1f6ac8f1 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct 
> super_block *sb)
>   { SB_DIRSYNC, ",dirsync" },
>   { SB_MANDLOCK, ",mand" },
>   { SB_LAZYTIME, ",lazytime" },
> + { SB_INLINECRYPT, ",inlinecrypt" },
>   { 0, NULL }
>   };
>   const struct proc_fs_opts *fs_infop;

NACK.

SB_* flgs are for functionality enabled on the superblock, not for
indicating mount options that have been set by the user.

If the mount options are directly parsed by the filesystem option
parser (as is done later in this patchset), then the mount option
setting should be emitted by the filesystem's ->show_options
function, not a generic function.

The option string must match what the filesystem defines, not
require separate per-filesystem and VFS definitions of the same
option that people could potentially get wrong (*cough* i_version vs
iversion *cough*)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH 2/4] fs: avoid double-writing the inode on a lazytime expiration

2020-03-25 Thread Dave Chinner
On Wed, Mar 25, 2020 at 01:28:23PM +0100, Christoph Hellwig wrote:
> In the case that an inode has dirty timestamp for longer than the
> lazytime expiration timeout (or if all such inodes are being flushed
> out due to a sync or syncfs system call), we need to inform the file
> system that the inode is dirty so that the inode's timestamps can be
> copied out to the on-disk data structures.  That's because if the file
> system supports lazytime, it will have ignored the dirty_inode(inode,
> I_DIRTY_TIME) notification when the timestamp was modified in memory.q
> Previously, this was accomplished by calling mark_inode_dirty_sync(),
> but that has the unfortunate side effect of also putting the inode the
> writeback list, and that's not necessary in this case, since we will
> immediately call write_inode() afterwards.  Replace the call to
> mark_inode_dirty_sync() with a new lazytime_expired method to clearly
> separate out this case.


hmmm. Doesn't this cause issues with both iput() and
vfs_fsync_range() because they call mark_inode_dirty_sync() on
I_DIRTY_TIME inodes to move them onto the writeback list so they are
appropriately expired when the inode is written back.

i.e.:


> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 2094386af8ac..e5aafd40dd0f 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -612,19 +612,13 @@ xfs_fs_destroy_inode(
>  }
>  
>  static void
> -xfs_fs_dirty_inode(
> - struct inode*inode,
> - int flag)
> +xfs_fs_lazytime_expired(
> + struct inode*inode)
>  {
>   struct xfs_inode*ip = XFS_I(inode);
>   struct xfs_mount*mp = ip->i_mount;
>   struct xfs_trans*tp;
>  
> - if (!(inode->i_sb->s_flags & SB_LAZYTIME))
> - return;
> - if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> - return;
> -
>   if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
>   return;
>   xfs_ilock(ip, XFS_ILOCK_EXCL);
> @@ -1053,7 +1047,7 @@ xfs_fs_free_cached_objects(
>  static const struct super_operations xfs_super_operations = {
>   .alloc_inode= xfs_fs_alloc_inode,
>   .destroy_inode  = xfs_fs_destroy_inode,
> - .dirty_inode= xfs_fs_dirty_inode,
> + .lazytime_expired   = xfs_fs_lazytime_expired,
>   .drop_inode = xfs_fs_drop_inode,
>   .put_super  = xfs_fs_put_super,
>   .sync_fs= xfs_fs_sync_fs,

This means XFS no longer updates/logs the current timestamp because
->dirty_inode(I_DIRTY_SYNC) is no longer called for XFS) before
->fsync flushes the inode data and metadata changes to the journal.
Hence the current in-memory timestamps are not present in the log
before the fsync is run as so we violate the fsync guarantees
lazytime gives for timestamp updates

I haven't quite got it straight in my head if the iput() case has
similar problems, but the fsync case definitely looks broken.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration

2020-03-12 Thread Dave Chinner
On Thu, Mar 12, 2020 at 07:34:45AM -0700, Christoph Hellwig wrote:
> On Thu, Mar 12, 2020 at 11:07:17AM +1100, Dave Chinner wrote:
> > > That's true, but when the timestamps were originally modified,
> > > dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> > > *not* be a no-op; which is to say, XFS will force the timestamps to be
> > > updated on disk when the timestamps are first dirtied, because it
> > > doesn't support I_DIRTY_TIME.
> > 
> > We log the initial timestamp change, and then ignore timestamp
> > updates until the dirty time expires and the inode is set
> > I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
> > time stamps that may be 24 hours out of date in memory, and they
> > still need to be flushed to the journal.
> > 
> > However, your change does not mark the inode dirtying on expiry
> > anymore, so...
> > 
> > > So I think we're fine.
> > 
> > ... we're not fine. This breaks XFS and any other filesystem that
> > relies on a I_DIRTY_SYNC notification to handle dirty time expiry
> > correctly.
> 
> I haven't seen the original mail this replies to,

The original problem was calling mark_inode_dirty_sync() on expiry
during inode writeback was causing the inode to be put back on the
dirty inode list and so ext4 was flushing it twice - once on expiry
and once 5 seconds later on the next background writeback pass.

This is a problem that XFS does not have because it does not
implement ->write_inode...

> but if we could
> get the lazytime expirty by some other means (e.g. an explicit
> callback), XFS could opt out of all the VFS inode tracking again,
> which would simplify a few things.

Yes, that would definitely make things simpler for XFS, and it would
also solve the problem that the generic lazytime expiry code has

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration

2020-03-11 Thread Dave Chinner
On Wed, Mar 11, 2020 at 08:57:49AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> > Thanks Ted!  This fixes the fscrypt test failure.
> > 
> > However, are you sure this works correctly on all filesystems?  I'm not sure
> > about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and 
> > in its
> > ->dirty_inode() it does:
>   ...
> > if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
> > return;
> 
> That's true, but when the timestamps were originally modified,
> dirty_inode() will be called with flag == I_DIRTY_TIME, which will
> *not* be a no-op; which is to say, XFS will force the timestamps to be
> updated on disk when the timestamps are first dirtied, because it
> doesn't support I_DIRTY_TIME.

We log the initial timestamp change, and then ignore timestamp
updates until the dirty time expires and the inode is set
I_DIRTY_SYNC via __mark_inode_dirty_sync(). IOWs, on expiry, we have
time stamps that may be 24 hours out of date in memory, and they
still need to be flushed to the journal.

However, your change does not mark the inode dirtying on expiry
anymore, so...

> So I think we're fine.

... we're not fine. This breaks XFS and any other filesystem that
relies on a I_DIRTY_SYNC notification to handle dirty time expiry
correctly.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH] writeback: avoid double-writing the inode on a lazytime expiration

2020-03-11 Thread Dave Chinner
On Tue, Mar 10, 2020 at 08:20:09PM -0700, Eric Biggers wrote:
> On Fri, Mar 06, 2020 at 09:00:43PM -0500, Theodore Ts'o wrote:
> > In the case that an inode has dirty timestamp for longer than the
> > lazytime expiration timeout (or if all such inodes are being flushed
> > out due to a sync or syncfs system call), we need to inform the file
> > system that the inode is dirty so that the inode's timestamps can be
> > copied out to the on-disk data structures.  That's because if the file
> > system supports lazytime, it will have ignored the dirty_inode(inode,
> > I_DIRTY_TIME) notification when the timestamp was modified in memory.q
> > 
> > Previously, this was accomplished by calling mark_inode_dirty_sync(),
> > but that has the unfortunate side effect of also putting the inode the
> > writeback list, and that's not necessary in this case, since we will
> > immediately call write_inode() afterwards.
> > 
> > Eric Biggers noticed that this was causing problems for fscrypt after
> > the key was removed[1].
> > 
> > [1] https://lore.kernel.org/r/20200306004555.gb225...@gmail.com
> > 
> > Reported-by: Eric Biggers 
> > Signed-off-by: Theodore Ts'o 
> > ---
> >  fs/fs-writeback.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 76ac9c7d32ec..32101349ba97 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -1504,8 +1504,9 @@ __writeback_single_inode(struct inode *inode, struct 
> > writeback_control *wbc)
> >  
> > spin_unlock(&inode->i_lock);
> >  
> > -   if (dirty & I_DIRTY_TIME)
> > -   mark_inode_dirty_sync(inode);
> > +   /* This was a lazytime expiration; we need to tell the file system */
> > +   if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
> > +   inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_TIME_EXPIRED);
> > /* Don't write the inode if only I_DIRTY_PAGES was set */
> > if (dirty & ~I_DIRTY_PAGES) {
> > int err = write_inode(inode, wbc);
> > -- 
> 
> Thanks Ted!  This fixes the fscrypt test failure.
> 
> However, are you sure this works correctly on all filesystems?  I'm not sure
> about XFS.  XFS only implements ->dirty_inode(), not ->write_inode(), and in 
> its
> ->dirty_inode() it does:
> 
>   static void
>   xfs_fs_dirty_inode(
>   struct inode*inode,
>   int flag)
>   {
>   struct xfs_inode*ip = XFS_I(inode);
>   struct xfs_mount*mp = ip->i_mount;
>   struct xfs_trans*tp;
> 
>   if (!(inode->i_sb->s_flags & SB_LAZYTIME))
>   return;
>   if (flag != I_DIRTY_SYNC || !(inode->i_state & I_DIRTY_TIME))
>   return;
> 
>   if (xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp))
>   return;
>   xfs_ilock(ip, XFS_ILOCK_EXCL);
>   xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>   xfs_trans_log_inode(tp, ip, XFS_ILOG_TIMESTAMP);
>   xfs_trans_commit(tp);
>   }
> 
> 
> So flag=I_DIRTY_TIME_EXPIRED will be a no-op.
> 
> Maybe you should be using I_DIRTY_SYNC instead?  Or perhaps XFS should be
> checking for either I_DIRTY_TIME_EXPIRED or I_DIRTY_SYNC?

Right, XFS does not use the VFS inode writeback code at all - we
track all metadata changes internally via journalling. The VFS uses
I_DIRTY_SYNC to indicate and inode is metadata dirty and a writeback
candidate. Hence if we need to mark an inode dirty for integrity
purposes for _any reason_, then I_DIRTY_SYNC is the correct flag to
be passing to ->dirty_inode.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 10:04:15PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:29:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> > > @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t 
> > > pos, loff_t length,
> > >   }
> > >   ret = iomap_readpage_actor(inode, pos + done, length - done,
> > >   ctx, iomap, srcmap);
> > > + if (WARN_ON(ret == 0))
> > > + break;
> > 
> > This error case now leaks ctx->cur_page
> 
> Yes ... and I see the consequence.  I mean, this is a "shouldn't happen",
> so do we want to put effort into cleanup here ...

Well, the normal thing for XFS is that a production kernel cleans up
and handles the error gracefully with a WARN_ON_ONCE, while a debug
kernel build will chuck a tanty and burn the house down so to make
the developers aware that there is a "should not happen" situation
occurring

> > > @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, 
> > > struct list_head *pages,
> > >  done:
> > >   if (ctx.bio)
> > >   submit_bio(ctx.bio);
> > > - if (ctx.cur_page) {
> > > - if (!ctx.cur_page_in_bio)
> > > - unlock_page(ctx.cur_page);
> > > - put_page(ctx.cur_page);
> > > - }
> > > + BUG_ON(ctx.cur_page);
> > 
> > And so will now trigger both a warn and a bug
> 
> ... or do we just want to run slap bang into this bug?
> 
> Option 1: Remove the check for 'ret == 0' altogether, as we had it before.
> That puts us into endless loop territory for a failure mode, and it's not
> parallel with iomap_readpage().
> 
> Option 2: Remove the WARN_ON from the check.  Then we just hit the BUG_ON,
> but we don't know why we did it.
> 
> Option 3: Set cur_page to NULL.  We'll hit the WARN_ON, avoid the BUG_ON,
> might end up with a page in the page cache which is never unlocked.

None of these are appealing.

> Option 4: Do the unlock/put page dance before setting the cur_page to NULL.
> We might double-unlock the page.

why would we double unlock the page?

Oh, the readahead cursor doesn't handle the case of partial page
submission, which would result in IO completion unlocking the page.

Ok, that's what the ctx.cur_page_in_bio check is used to detect i.e.
if we've got a page that the readahead cursor points at, and we
haven't actually added it to a bio, then we can leave it to the
read_pages() to unlock and clean up. If it's in a bio, then IO
completion will unlock it and so we only have to drop the submission
reference and move the readahead cursor forwards so read_pages()
doesn't try to unlock this page. i.e:

/* clean up partial page submission failures */
if (ctx.cur_page && ctx.cur_page_in_bio) {
put_page(ctx.cur_page);
readahead_next(rac);
}

looks to me like it will handle the case of "ret == 0" in the actor
function just fine.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 07:48:32PM -0800, Matthew Wilcox wrote:
> On Wed, Feb 19, 2020 at 02:45:25PM +1100, Dave Chinner wrote:
> > On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> > > On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > > > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > > > Latest version in your git tree:
> > > > > 
> > > > > $ ▶ glo -n 5 willy/readahead
> > > > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > > > $
> > > > > 
> > > > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > > > 
> > > > > [2.423116] [ cut here ]
> > > > > [2.424957] list_add double add: new=ea000efff4c8, 
> > > > > prev=8883bfffee60, next=ea000efff4c8.
> > > > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 
> > > > > __list_add_valid+0x67/0x70
> > > > > [2.457484] Call Trace:
> > > > > [2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > > > [2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > > > [2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > > > [2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > > > [2.462787]  lru_add_drain+0x18/0x20
> > > > 
> > > > Are you sure that was 4be497096c04 ?  I ask because there was a
> > > 
> > > Yes, because it's the only version I've actually merged into my
> > > working tree, compiled and tried to run. :P
> > > 
> > > > version pushed to that git tree that did contain a list double-add
> > > > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > > > it, and 4be497096c04 doesn't have that problem.  I also test with
> > > > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > > > probabilistic because it'll depend on the timing between whatever other
> > > > list is being used and the page actually being added to the LRU.
> > > 
> > > I'll see if I can reproduce it.
> > 
> > Just updated to a current TOT Linus kernel and your latest branch,
> > and so far this is 100% reproducable.
> > 
> > Not sure how I'm going to debug it yet, because it's init that is
> > triggering it
> 
> Eric found it ...

Yeah, just saw that and am applying his patch to test it...

> still not sure why I don't see it.

No readahead configured on your device?


Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API

2020-02-18 Thread Dave Chinner
On Wed, Feb 19, 2020 at 08:26:52AM +1100, Dave Chinner wrote:
> On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> > On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > > Latest version in your git tree:
> > > 
> > > $ ▶ glo -n 5 willy/readahead
> > > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > > ff63497fcb98 iomap: Convert from readpages to readahead
> > > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > > 8115bcca7312 fuse: Convert from readpages to readahead
> > > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > > $
> > > 
> > > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > > 
> > > [2.423116] [ cut here ]
> > > [2.424957] list_add double add: new=ea000efff4c8, 
> > > prev=8883bfffee60, next=ea000efff4c8.
> > > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 
> > > __list_add_valid+0x67/0x70
> > > [2.457484] Call Trace:
> > > [2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > > [2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > > [2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > > [2.461712]  lru_add_drain_cpu+0x8d/0x160
> > > [2.462787]  lru_add_drain+0x18/0x20
> > 
> > Are you sure that was 4be497096c04 ?  I ask because there was a
> 
> Yes, because it's the only version I've actually merged into my
> working tree, compiled and tried to run. :P
> 
> > version pushed to that git tree that did contain a list double-add
> > (due to a mismerge when shuffling patches).  I noticed it and fixed
> > it, and 4be497096c04 doesn't have that problem.  I also test with
> > CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> > probabilistic because it'll depend on the timing between whatever other
> > list is being used and the page actually being added to the LRU.
> 
> I'll see if I can reproduce it.

Just updated to a current TOT Linus kernel and your latest branch,
and so far this is 100% reproducable.

Not sure how I'm going to debug it yet, because it's init that is
triggering it

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 19/19] mm: Use memalloc_nofs_save in readahead path

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:13AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Ensure that memory allocations in the readahead path do not attempt to
> reclaim file-backed pages, which could lead to a deadlock.  It is
> possible, though unlikely this is the root cause of a problem observed
> by Cong Wang.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reported-by: Cong Wang 
> Suggested-by: Michal Hocko 
> ---
>  mm/readahead.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 94d499cfb657..8f9c0dba24e7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"
>  
> @@ -174,6 +175,18 @@ void page_cache_readahead_limit(struct address_space 
> *mapping,
>   ._nr_pages = 0,
>   };
>  
> + /*
> +  * Partway through the readahead operation, we will have added
> +  * locked pages to the page cache, but will not yet have submitted
> +  * them for I/O.  Adding another page may need to allocate memory,
> +  * which can trigger memory reclaim.  Telling the VM we're in
> +  * the middle of a filesystem operation will cause it to not
> +  * touch file-backed pages, preventing a deadlock.  Most (all?)
> +  * filesystems already specify __GFP_NOFS in their mapping's
> +  * gfp_mask, but let's be explicit here.
> +  */
> + unsigned int nofs = memalloc_nofs_save();
> +

So doesn't this largely remove the need for all the gfp flag futzing
in the readahead path? i.e. almost all readahead allocations are now
going to be GFP_NOFS | GFP_NORETRY | GFP_NOWARN ?

If so, shouldn't we just strip all the gfp flags and masking out of
the readahead path altogether?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 18/19] iomap: Convert from readpages to readahead

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:12AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
> use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 91 +++---
>  fs/iomap/trace.h   |  2 +-
>  fs/xfs/xfs_aops.c  | 13 +++---
>  fs/zonefs/super.c  |  7 ++--
>  include/linux/iomap.h  |  3 +-
>  5 files changed, 42 insertions(+), 74 deletions(-)

All pretty straight forward...

> @@ -416,6 +384,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   break;
>   done += ret;
>   if (offset_in_page(pos + done) == 0) {
> + readahead_next(ctx->rac);
>   if (!ctx->cur_page_in_bio)
>   unlock_page(ctx->cur_page);
>   put_page(ctx->cur_page);

Though now I look at the addition here, this might be better
restructured to mention how we handle partial page submission such as:

done += ret;

/*
 * Keep working on a partially complete page, otherwise ready
 * the ctx for the next page to be acted on.
 */
if (offset_in_page(pos + done))
continue;

if (!ctx->cur_page_in_bio)
unlock_page(ctx->cur_page);
put_page(ctx->cur_page);
    ctx->cur_page = NULL;
readahead_next(ctx->rac);
}

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 17/19] iomap: Restructure iomap_readpages_actor

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:11AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 25 -
>  1 file changed, 12 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..44303f370b2d 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct iomap_readpage_ctx *ctx = data;
> - loff_t done, ret;
> + loff_t ret, done = 0;
>  
> - for (done = 0; done < length; done += ret) {
> - if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> - if (!ctx->cur_page_in_bio)
> - unlock_page(ctx->cur_page);
> - put_page(ctx->cur_page);
> - ctx->cur_page = NULL;
> - }
> + while (done < length) {
>   if (!ctx->cur_page) {
>   ctx->cur_page = iomap_next_page(inode, ctx->pages,
>   pos, length, &done);
> @@ -418,6 +412,15 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   }
>   ret = iomap_readpage_actor(inode, pos + done, length - done,
>   ctx, iomap, srcmap);
> + if (WARN_ON(ret == 0))
> + break;

This error case now leaks ctx->cur_page

> + done += ret;
> + if (offset_in_page(pos + done) == 0) {
> + if (!ctx->cur_page_in_bio)
> + unlock_page(ctx->cur_page);
> + put_page(ctx->cur_page);
> + ctx->cur_page = NULL;
> + }
>   }
>  
>   return done;
> @@ -451,11 +454,7 @@ iomap_readpages(struct address_space *mapping, struct 
> list_head *pages,
>  done:
>   if (ctx.bio)
>   submit_bio(ctx.bio);
> -     if (ctx.cur_page) {
> - if (!ctx.cur_page_in_bio)
> - unlock_page(ctx.cur_page);
> - put_page(ctx.cur_page);
> - }
> + BUG_ON(ctx.cur_page);

And so will now trigger both a warn and a bug

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 16/19] fuse: Convert from readpages to readahead

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:09AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in fuse.  Switching away from the
> read_cache_pages() helper gets rid of an implicit call to put_page(),
> so we can get rid of the get_page() call in fuse_readpages_fill().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Looks OK.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 14/19] ext4: Convert from readpages to readahead

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:05AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in ext4
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

There's nothing I can see in this that would cause that list
corruption I saw with ext4.

I'll re-introduce the patch and see if it falls over again.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 13/19] erofs: Convert compressed files from readpages to readahead

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:03AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in erofs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/erofs/zdata.c | 29 +
>  1 file changed, 9 insertions(+), 20 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 12/19] erofs: Convert uncompressed files from readpages to readahead

2020-02-18 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:46:01AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in erofs
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/erofs/data.c  | 39 +---
>  fs/erofs/zdata.c |  2 +-
>  include/trace/events/erofs.h |  6 +++---
>  3 files changed, 18 insertions(+), 29 deletions(-)

Looks fine from the perspective of page iteration and error handling.

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 11/19] btrfs: Convert from readpages to readahead

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 01:12:28PM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:57:58PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:59AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 

> > >  
> > > - if (nr) {
> > > - u64 contig_start = page_offset(pagepool[0]);
> > > + ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
> > 
> > Ok, yes it does. :)
> > 
> > I don't see how readahead_for_each_batch() guarantees that, though.
> 
> I ... don't see how it doesn't?  We start at rac->_start and iterate
> through the consecutive pages in the page cache.  readahead_for_each_batch()
> does assume that __do_page_cache_readahead() has its current behaviour
> of putting the pages in the page cache in order, and kicks off a new
> call to ->readahead() every time it has to skip an index for whatever
> reason (eg page already in page cache).

And there is the comment I was looking for while reading
readahead_for_each_batch() :)

> 
> > > - if (bio)
> > > - return submit_one_bio(bio, 0, bio_flags);
> > > - return 0;
> > > + if (bio) {
> > > + if (submit_one_bio(bio, 0, bio_flags))
> > > + return;
> > > + }
> > >  }
> > 
> > Shouldn't that just be
> > 
> > if (bio)
> > submit_one_bio(bio, 0, bio_flags);
> 
> It should, but some overzealous person decided to mark submit_one_bio()
> as __must_check, so I have to work around that.

/me looks at code

Ngggh.

I rather dislike functions that are named in a way that look like
they belong to core kernel APIs but in reality are local static
functions.

I'd ask for this to be fixed if it was generic code, but it's btrfs
specific code so they can deal with the ugliness of their own
creation. :/

> > Confusing when put alongside rac->_batch_count counting the number
> > of pages in the batch, and "batch" being the index into the page
> > array, and they aren't the same counts
> 
> Yes.  Renamed to 'i'.
> 
> > > + XA_STATE(xas, &rac->mapping->i_pages, rac->_start);
> > > + struct page *page;
> > > +
> > > + rac->_batch_count = 0;
> > > + xas_for_each(&xas, page, rac->_start + rac->_nr_pages - 1) {
> > 
> > That just iterates pages in the start,end doesn't it? What
> > guarantees that this fills the array with a contiguous page range?
> 
> The behaviour of __do_page_cache_readahead().  Dave Howells also has a
> usecase for xas_for_each_contig(), so I'm going to add that soon.
> 
> > > + VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > + VM_BUG_ON_PAGE(PageTail(page), page);
> > > + array[batch++] = page;
> > > + rac->_batch_count += hpage_nr_pages(page);
> > > + if (PageHead(page))
> > > + xas_set(&xas, rac->_start + rac->_batch_count);
> > 
> > What on earth does this do? Comments please!
> 
>   /*
>* The page cache isn't using multi-index entries yet,
>* so xas_for_each() won't do the right thing for
>* large pages.  This can be removed once the page cache
>* is converted.
>*/

Oh, it's changing the internal xarray lookup cursor position to
point at the correct next page index? Perhaps it's better to say
that instead of "won't do the right thing"?

> > > +#define readahead_for_each_batch(rac, array, size, nr)   
> > > \
> > > + for (; (nr = readahead_page_batch(rac, array, size));   \
> > > +         readahead_next(rac))
> > 
> > I had to go look at the caller to work out what "size" refered to
> > here.
> > 
> > This is complex enough that it needs proper API documentation.
> 
> How about just:
> 
> -#define readahead_for_each_batch(rac, array, size, nr) \
> -   for (; (nr = readahead_page_batch(rac, array, size));   \
> +#define readahead_for_each_batch(rac, array, array_sz, nr) \
> +   for (; (nr = readahead_page_batch(rac, array, array_sz));   \

Yup, that's fine - now the macro documents itself.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 09/19] mm: Add page_cache_readahead_limit

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 11:54:04AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:31:10PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 
> > > 
> > > ext4 and f2fs have duplicated the guts of the readahead code so
> > > they can read past i_size.  Instead, separate out the guts of the
> > > readahead code so they can call it directly.
> > 
> > Gross and nasty (hosting non-stale data beyond EOF in the page
> > cache, that is).
> 
> I thought you meant sneaking changes into the VFS (that were rejected) by
> copying VFS code and modifying it ...

Well, now that you mention it... :P

> > > +/**
> > > + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> > > + * @mapping: File address space.
> > > + * @file: This instance of the open file; used for authentication.
> > > + * @offset: First page index to read.
> > > + * @end_index: The maximum page index to read.
> > > + * @nr_to_read: The number of pages to read.
> > > + * @lookahead_size: Where to start the next readahead.
> > > + *
> > > + * This function is for filesystems to call when they want to start
> > > + * readahead potentially beyond a file's stated i_size.  If you want
> > > + * to start readahead on a normal file, you probably want to call
> > > + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> > > + *
> > > + * Context: File is referenced by caller.  Mutexes may be held by caller.
> > > + * May sleep, but will not reenter filesystem to reclaim memory.
> > >   */
> > > -void __do_page_cache_readahead(struct address_space *mapping,
> > > - struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> > > - unsigned long lookahead_size)
> > > +void page_cache_readahead_limit(struct address_space *mapping,
> > 
> > ... I don't think the function name conveys it's purpose. It's
> > really a ranged readahead that ignores where i_size lies. i.e
> > 
> > page_cache_readahead_range(mapping, start, end, nr_to_read)
> > 
> > seems like a better API to me, and then you can drop the "start
> > readahead beyond i_size" comments and replace it with "Range is not
> > limited by the inode's i_size and hence can be used to read data
> > stored beyond EOF into the page cache."
> 
> I'm concerned that calling it 'range' implies "I want to read between
> start and end" rather than "I want to read nr_to_read at start, oh but
> don't go past end".
> 
> Maybe the right way to do this is have the three callers cap nr_to_read.
> Well, the one caller ... after all, f2fs and ext4 have no desire to
> cap the length.  Then we can call it page_cache_readahead_exceed() or
> page_cache_readahead_dangerous() or something else like that to make it
> clear that you shouldn't be calling it.

Fair point.

And in reading this, it occurred to me that what we are enabling is
an "out of bounds" readahead function. so
page_cache_readahead_OOB() or *_unbounded() might be a better name

>   * Like add_to_page_cache_locked, but used to add newly allocated pages:
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9dd431fa16c9..cad26287ad8b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -142,45 +142,43 @@ static void read_pages(struct readahead_control *rac, 
> struct list_head *pages)
>   blk_finish_plug(&plug);
>  }
>  
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_exceed - Start unchecked readahead.
> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @index: First page index to read.
> + * @nr_to_read: The number of pages to read.
> + * @lookahead_size: Where to start the next readahead.
> + *
> + * This function is for filesystems to call when they want to start
> + * readahead beyond a file's stated i_size.  This is almost certainly
> + * not the function you want to call.  Use page_cache_async_readahead()
> + * or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller.  Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.

Yup, looks much better.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 08:10:04AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:21:47PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:54AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 
> > > 
> > > This replaces ->readpages with a saner interface:
> > >  - Return void instead of an ignored error code.
> > >  - Pages are already in the page cache when ->readahead is called.
> > 
> > Might read better as:
> > 
> >  - Page cache is already populates with locked pages when
> >->readahead is called.
> 
> Will do.
> 
> > >  - Implementation looks up the pages in the page cache instead of
> > >having them passed in a linked list.
> > 
> > Add:
> > 
> >  - cleanup of unused readahead handled by ->readahead caller, not
> >the method implementation.
> 
> The readpages caller does that cleanup too, so it's not an advantage
> to the readahead interface.

Right. I kinda of read the list as "the reasons the new API is sane"
not as "how readahead is different to readpages"

> > >  ``readpages``
> > >   called by the VM to read pages associated with the address_space
> > >   object.  This is essentially just a vector version of readpage.
> > >   Instead of just one page, several pages are requested.
> > >   readpages is only used for read-ahead, so read errors are
> > >   ignored.  If anything goes wrong, feel free to give up.
> > > + This interface is deprecated; implement readahead instead.
> > 
> > What is the removal schedule for the deprecated interface? 
> 
> I mentioned that in the cover letter; once Dave Howells has the fscache
> branch merged, I'll do the remaining filesystems.  Should be within the
> next couple of merge windows.

Sure, but I like to see actual release tags with the deprecation
notice so that it's obvious to the reader as to whether this is
something new and/or when they can expect it to go away.

> > > +/* The byte offset into the file of this readahead block */
> > > +static inline loff_t readahead_offset(struct readahead_control *rac)
> > > +{
> > > + return (loff_t)rac->_start * PAGE_SIZE;
> > > +}
> > 
> > Urk. Didn't an early page use "offset" for the page index? That
> > was was "mm: Remove 'page_offset' from readahead loop" did, right?
> > 
> > That's just going to cause confusion to have different units for
> > readahead "offsets"
> 
> We are ... not consistent anywhere in the VM/VFS with our naming.
> Unfortunately.
> 
> $ grep -n offset mm/filemap.c 
> 391: * @start:offset in bytes where the range starts
> ...
> 815:  pgoff_t offset = old->index;
> ...
> 2020: unsigned long offset;  /* offset into pagecache page */
> ...
> 2257: *ppos = ((loff_t)index << PAGE_SHIFT) + offset;
> 
> That last one's my favourite.  Not to mention the fine distinction you
> and I discussed recently between offset_in_page() and page_offset().
> 
> Best of all, even our types encode the ambiguity of an 'offset'.  We have
> pgoff_t and loff_t (replacing the earlier off_t).
> 
> So, new rule.  'pos' is the number of bytes into a file.  'index' is the
> number of PAGE_SIZE pages into a file.  We don't use the word 'offset'
> at all.  'length' as a byte count and 'count' as a page count seem like
> fine names to me.

That sounds very reasonable to me. Another patchset in the making? :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 07:42:22AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 05:14:59PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:52AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 
> > > 
> > > At allocation time, put the pages in the cache unless we're using
> > > ->readpages.  Add the readahead_for_each() iterator for the benefit of
> > > the ->readpage fallback.  This iterator supports huge pages, even though
> > > none of the filesystems to be converted do yet.
> > 
> > This could be better written - took me some time to get my head
> > around it and the code.
> > 
> > "When populating the page cache for readahead, mappings that don't
> > use ->readpages need to have their pages added to the page cache
> > before ->readpage is called. Do this insertion earlier so that the
> > pages can be looked up immediately prior to ->readpage calls rather
> > than passing them on a linked list. This early insert functionality
> > is also required by the upcoming ->readahead method that will
> > replace ->readpages.
> > 
> > Optimise and simplify the readpage loop by adding a
> > readahead_for_each() iterator to provide the pages we need to read.
> > This iterator also supports huge pages, even though none of the
> > filesystems have been converted to use them yet."
> 
> Thanks, I'll use that.
> 
> > > +static inline struct page *readahead_page(struct readahead_control *rac)
> > > +{
> > > + struct page *page;
> > > +
> > > + if (!rac->_nr_pages)
> > > + return NULL;
> > 
> > H.
> > 
> > > +
> > > + page = xa_load(&rac->mapping->i_pages, rac->_start);
> > > + VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > + rac->_batch_count = hpage_nr_pages(page);
> > 
> > So we could have rac->_nr_pages = 2, and then we get an order 2
> > large page returned, and so rac->_batch_count = 4.
> 
> Well, no, we couldn't.  rac->_nr_pages is incremented by 4 when we add
> an order-2 page to the readahead.

I don't see any code that does that. :)

i.e. we aren't actually putting high order pages into the page
cache here - page_alloc() allocates order-0 pages) - so there's
nothing in the patch that tells me how rac->_nr_pages behaves
when allocating large pages...

IOWs, we have an undocumented assumption in the implementation...

> I can put a
>   BUG_ON(rac->_batch_count > rac->_nr_pages)
> in here to be sure to catch any logic error like that.

Definitely necessary given that we don't insert large pages for
readahead yet. A comment explaining the assumptions that the
code makes for large pages is probably in order, too.

> > > - page->index = offset;
> > > - list_add(&page->lru, &page_pool);
> > > + if (use_list) {
> > > + page->index = offset;
> > > + list_add(&page->lru, &page_pool);
> > > + } else if (add_to_page_cache_lru(page, mapping, offset,
> > > + gfp_mask) < 0) {
> > > + put_page(page);
> > > + goto read;
> > > + }
> > 
> > Ok, so that's why you put read code at the end of the loop. To turn
> > the code into spaghetti :/
> > 
> > How much does this simplify down when we get rid of ->readpages and
> > can restructure the loop? This really seems like you're trying to
> > flatten two nested loops into one by the use of goto
> 
> I see it as having two failure cases in this loop.  One for "page is
> already present" (which already existed) and one for "allocated a page,
> but failed to add it to the page cache" (which used to be done later).
> I didn't want to duplicate the "call read_pages()" code.  So I reshuffled
> the code rather than add a nested loop.  I don't think the nested loop
> is easier to read (we'll be at 5 levels of indentation for some statements).
> Could do it this way ...

Can we move the update of @rac inside read_pages()? The next
start offset^Windex we start at is rac._start + rac._nr_pages, right?

so read_pages() could do:

{
if (readahead_count(rac)) {
/* do readahead */
}

/* advance the readahead cursor */
rac->_start += rac->_nr_pages;
rac._nr_pages = 0;
}

and then we only need to call read_pages() in these cases and so
the requirement for avoiding duplicating code is avoided...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 04/19] mm: Rearrange readahead loop

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 05:57:36AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:08:24PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> > > From: "Matthew Wilcox (Oracle)" 
> > > 
> > > Move the declaration of 'page' to inside the loop and move the 'kick
> > > off a fresh batch' code to the end of the function for easier use in
> > > subsequent patches.
> > 
> > Stale? the "kick off" code is moved to the tail of the loop, not the
> > end of the function.
> 
> Braino; I meant to write end of the loop.
> 
> > > @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space 
> > > *mapping,
> > >   page = xa_load(&mapping->i_pages, page_offset);
> > >   if (page && !xa_is_value(page)) {
> > >   /*
> > > -  * Page already present?  Kick off the current batch of
> > > -  * contiguous pages before continuing with the next
> > > -  * batch.
> > > +  * Page already present?  Kick off the current batch
> > > +  * of contiguous pages before continuing with the
> > > +  * next batch.  This page may be the one we would
> > > +  * have intended to mark as Readahead, but we don't
> > > +  * have a stable reference to this page, and it's
> > > +  * not worth getting one just for that.
> > >*/
> > > - if (readahead_count(&rac))
> > > - read_pages(&rac, &page_pool, gfp_mask);
> > > - rac._nr_pages = 0;
> > > - continue;
> > > + goto read;
> > >   }
> > >  
> > >   page = __page_cache_alloc(gfp_mask);
> > > @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space 
> > > *mapping,
> > >   if (page_idx == nr_to_read - lookahead_size)
> > >   SetPageReadahead(page);
> > >   rac._nr_pages++;
> > > + continue;
> > > +read:
> > > + if (readahead_count(&rac))
> > > + read_pages(&rac, &page_pool, gfp_mask);
> > > + rac._nr_pages = 0;
> > >   }
> > 
> > Also, why? This adds a goto from branched code that continues, then
> > adds a continue so the unbranched code doesn't execute the code the
> > goto jumps to. In absence of any explanation, this isn't an
> > improvement and doesn't make any sense...
> 
> I thought I was explaining it ... "for easier use in subsequent patches".

Sorry, my braino there. :) I commented on the problem with the first
part of the sentence, then the rest of the sentence completely
failed to sink in.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 03/19] mm: Use readahead_control to pass arguments

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 05:56:18AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 04:03:00PM +1100, Dave Chinner wrote:
> > On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> > > +static void read_pages(struct readahead_control *rac, struct list_head 
> > > *pages,
> > > + gfp_t gfp)
> > >  {
> > > + const struct address_space_operations *aops = rac->mapping->a_ops;
> > >   struct blk_plug plug;
> > >   unsigned page_idx;
> > 
> > Splitting out the aops rather than the mapping here just looks
> > weird, especially as you need the mapping later in the function.
> > Using aops doesn't even reduce the code side
> 
> It does in subsequent patches ... I agree it looks a little weird here,
> but I think in the final form, it makes sense:

Ok. Perhaps just an additional commit comment to say "read_pages() is
changed to be aops centric as @rac abstracts away all other
implementation details by the end of the patchset."

> > > + if (readahead_count(&rac))
> > > + read_pages(&rac, &page_pool, gfp_mask);
> > > + rac._nr_pages = 0;
> > 
> > Hmmm. Wondering ig it make sense to move the gfp_mask to the readahead
> > control structure - if we have to pass the gfp_mask down all the
> > way along side the rac, then I think it makes sense to do that...
> 
> So we end up removing it later on in this series, but I do wonder if
> it would make sense anyway.  By the end of the series, we still have
> this in iomap:
> 
> if (ctx->rac) /* same as readahead_gfp_mask */
> gfp |= __GFP_NORETRY | __GFP_NOWARN;
> 
> and we could get rid of that by passing gfp flags down in the rac.  On the
> other hand, I don't know why it doesn't just use readahead_gfp_mask()
> here anyway ... Christoph?

mapping->gfp_mask is awful. Is it a mask, or is it a valid set of
allocation flags? Or both?  Some callers to mapping_gfp_constraint()
uses it as a mask, some callers to mapping_gfp_constraint() use it
as base flags that context specific flags get masked out of,
readahead_gfp_mask() callers use it as the entire set of gfp flags
for allocation.

That whole API sucks - undocumented as to what it's suposed to do
and how it's supposed to be used. Hence it's difficult to use
correctly or understand whether it's being used correctly. And
reading callers only leads to more confusion and crazy code like in
do_mpage_readpage() where readahead returns a mask that are used as
base flags and normal reads return a masked set of base flags...

The iomap code is obviously correct when it comes to gfp flag
manipulation. We start with GFP_KERNEL context, then constrain it
via the mask held in mapping->gfp_mask, then if it's readahead we
allow the allocation to silently fail.

Simple to read and understand code, versus having weird code that
requires the reader to decipher an undocumented and inconsistent API
to understand how the gfp flags have been calculated and are valid.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 00/19] Change readahead API

2020-02-18 Thread Dave Chinner
On Tue, Feb 18, 2020 at 05:42:30AM -0800, Matthew Wilcox wrote:
> On Tue, Feb 18, 2020 at 03:56:33PM +1100, Dave Chinner wrote:
> > Latest version in your git tree:
> > 
> > $ ▶ glo -n 5 willy/readahead
> > 4be497096c04 mm: Use memalloc_nofs_save in readahead path
> > ff63497fcb98 iomap: Convert from readpages to readahead
> > 26aee60e89b5 iomap: Restructure iomap_readpages_actor
> > 8115bcca7312 fuse: Convert from readpages to readahead
> > 3db3d10d9ea1 f2fs: Convert from readpages to readahead
> > $
> > 
> > merged into a 5.6-rc2 tree fails at boot on my test vm:
> > 
> > [2.423116] [ cut here ]
> > [2.424957] list_add double add: new=ea000efff4c8, 
> > prev=8883bfffee60, next=ea000efff4c8.
> > [2.428259] WARNING: CPU: 4 PID: 1 at lib/list_debug.c:29 
> > __list_add_valid+0x67/0x70
> > [2.457484] Call Trace:
> > [2.458171]  __pagevec_lru_add_fn+0x15f/0x2c0
> > [2.459376]  pagevec_lru_move_fn+0x87/0xd0
> > [2.460500]  ? pagevec_move_tail_fn+0x2d0/0x2d0
> > [2.461712]  lru_add_drain_cpu+0x8d/0x160
> > [2.462787]  lru_add_drain+0x18/0x20
> 
> Are you sure that was 4be497096c04 ?  I ask because there was a

Yes, because it's the only version I've actually merged into my
working tree, compiled and tried to run. :P

> version pushed to that git tree that did contain a list double-add
> (due to a mismerge when shuffling patches).  I noticed it and fixed
> it, and 4be497096c04 doesn't have that problem.  I also test with
> CONFIG_DEBUG_LIST turned on, but this problem you hit is going to be
> probabilistic because it'll depend on the timing between whatever other
> list is being used and the page actually being added to the LRU.

I'll see if I can reproduce it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 11/19] btrfs: Convert from readpages to readahead

2020-02-17 Thread Dave Chinner
the start,end doesn't it? What
guarantees that this fills the array with a contiguous page range?

> + VM_BUG_ON_PAGE(!PageLocked(page), page);
> + VM_BUG_ON_PAGE(PageTail(page), page);
> + array[batch++] = page;
> + rac->_batch_count += hpage_nr_pages(page);
> + if (PageHead(page))
> + xas_set(&xas, rac->_start + rac->_batch_count);

What on earth does this do? Comments please!

> +
> + if (batch == size)
> + break;
> + }
> +
> + return batch;
> +}

Seems a bit big for an inline function.

> +
> +#define readahead_for_each_batch(rac, array, size, nr)   
> \
> + for (; (nr = readahead_page_batch(rac, array, size));   \
> + readahead_next(rac))

I had to go look at the caller to work out what "size" refered to
here.

This is complex enough that it needs proper API documentation.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 10/19] fs: Convert mpage_readpages to mpage_readahead

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:58AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Implement the new readahead aop and convert all callers (block_dev,
> exfat, ext2, fat, gfs2, hpfs, isofs, jfs, nilfs2, ocfs2, omfs, qnx6,
> reiserfs & udf).  The callers are all trivial except for GFS2 & OCFS2.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Junxiao Bi  # ocfs2
> ---
>  drivers/staging/exfat/exfat_super.c |  7 +++---
>  fs/block_dev.c  |  7 +++---
>  fs/ext2/inode.c | 10 +++-
>  fs/fat/inode.c  |  7 +++---
>  fs/gfs2/aops.c  | 23 ++---
>  fs/hpfs/file.c  |  7 +++---
>  fs/iomap/buffered-io.c  |  2 +-
>  fs/isofs/inode.c|  7 +++---
>  fs/jfs/inode.c  |  7 +++---
>  fs/mpage.c  | 38 +
>  fs/nilfs2/inode.c   | 15 +++-
>  fs/ocfs2/aops.c | 34 ++
>  fs/omfs/file.c  |  7 +++---
>  fs/qnx6/inode.c |  7 +++---
>  fs/reiserfs/inode.c |  8 +++---
>  fs/udf/inode.c  |  7 +++---
>  include/linux/mpage.h   |  4 +--
>  mm/migrate.c|  2 +-
>  18 files changed, 73 insertions(+), 126 deletions(-)

That's actually pretty simple changeover. Nothing really scary
there. :)

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 09/19] mm: Add page_cache_readahead_limit

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:56AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> ext4 and f2fs have duplicated the guts of the readahead code so
> they can read past i_size.  Instead, separate out the guts of the
> readahead code so they can call it directly.

Gross and nasty (hosting non-stale data beyond EOF in the page
cache, that is).

Code is pretty simple, but...

>  }
>  
> -/*
> - * __do_page_cache_readahead() actually reads a chunk of disk.  It allocates
> - * the pages first, then submits them for I/O. This avoids the very bad
> - * behaviour which would occur if page allocations are causing VM writeback.
> - * We really don't want to intermingle reads and writes like that.
> +/**
> + * page_cache_readahead_limit - Start readahead beyond a file's i_size.
> + * @mapping: File address space.
> + * @file: This instance of the open file; used for authentication.
> + * @offset: First page index to read.
> + * @end_index: The maximum page index to read.
> + * @nr_to_read: The number of pages to read.
> + * @lookahead_size: Where to start the next readahead.
> + *
> + * This function is for filesystems to call when they want to start
> + * readahead potentially beyond a file's stated i_size.  If you want
> + * to start readahead on a normal file, you probably want to call
> + * page_cache_async_readahead() or page_cache_sync_readahead() instead.
> + *
> + * Context: File is referenced by caller.  Mutexes may be held by caller.
> + * May sleep, but will not reenter filesystem to reclaim memory.
>   */
> -void __do_page_cache_readahead(struct address_space *mapping,
> - struct file *filp, pgoff_t offset, unsigned long nr_to_read,
> - unsigned long lookahead_size)
> +void page_cache_readahead_limit(struct address_space *mapping,

... I don't think the function name conveys it's purpose. It's
really a ranged readahead that ignores where i_size lies. i.e

page_cache_readahead_range(mapping, start, end, nr_to_read)

seems like a better API to me, and then you can drop the "start
readahead beyond i_size" comments and replace it with "Range is not
limited by the inode's i_size and hence can be used to read data
stored beyond EOF into the page cache."

Also: "This is almost certainly not the function you want to call.
Use page_cache_async_readahead or page_cache_sync_readahead()
instead."

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 08/19] mm: Add readahead address space operation

2020-02-17 Thread Dave Chinner
agemap.h
> index 3613154e79e4..bd4291f78f41 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -665,6 +665,24 @@ static inline void readahead_next(struct 
> readahead_control *rac)
>  #define readahead_for_each(rac, page)
> \
>   for (; (page = readahead_page(rac)); readahead_next(rac))
>  
> +/* The byte offset into the file of this readahead block */
> +static inline loff_t readahead_offset(struct readahead_control *rac)
> +{
> + return (loff_t)rac->_start * PAGE_SIZE;
> +}

Urk. Didn't an early page use "offset" for the page index? That
was was "mm: Remove 'page_offset' from readahead loop" did, right?

That's just going to cause confusion to have different units for
readahead "offsets"

> +
> +/* The number of bytes in this readahead block */
> +static inline loff_t readahead_length(struct readahead_control *rac)
> +{
> + return (loff_t)rac->_nr_pages * PAGE_SIZE;
> +}
> +
> +/* The index of the first page in this readahead block */
> +static inline unsigned int readahead_index(struct readahead_control *rac)
> +{
> + return rac->_start;
> +}

Based on this, I suspect the earlier patch should use "index" rather
than "offset" when walking the page cache indexes...

> +
>  /* The number of pages in this readahead block */
>  static inline unsigned int readahead_count(struct readahead_control *rac)
>  {
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 9e430daae42f..975ff5e387be 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -121,7 +121,13 @@ static void read_pages(struct readahead_control *rac, 
> struct list_head *pages)
>  
>   blk_start_plug(&plug);
>  
> - if (aops->readpages) {
> + if (aops->readahead) {
> + aops->readahead(rac);
> + readahead_for_each(rac, page) {
> + unlock_page(page);
> + put_page(page);
> + }

This needs a comment to explain the unwinding that needs to be done
here. I'm not going to remember in a year's time that this is just
for the pages that weren't submitted by ->readahead

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 07/19] mm: Put readahead pages in cache earlier

2020-02-17 Thread Dave Chinner
break;
> - page->index = offset;
> - list_add(&page->lru, &page_pool);
> + if (use_list) {
> + page->index = offset;
> + list_add(&page->lru, &page_pool);
> + } else if (add_to_page_cache_lru(page, mapping, offset,
> + gfp_mask) < 0) {
> + put_page(page);
> + goto read;
> + }

Ok, so that's why you put read code at the end of the loop. To turn
the code into spaghetti :/

How much does this simplify down when we get rid of ->readpages and
can restructure the loop? This really seems like you're trying to
flatten two nested loops into one by the use of goto

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 06/19] mm: rename readahead loop variable to 'i'

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:50AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Change the type of page_idx to unsigned long, and rename it -- it's
> just a loop counter, not a page index.
> 
> Suggested-by: John Hubbard 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Looks fine.

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 05/19] mm: Remove 'page_offset' from readahead loop

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:48AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Eliminate the page_offset variable which was confusing with the
> 'offset' parameter and record the start of each consecutive run of
> pages in the readahead_control.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/readahead.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)

Looks ok, but having the readahead dispatch out of line from the
case that triggers it makes it hard to follow.

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 04/19] mm: Rearrange readahead loop

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:45AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Move the declaration of 'page' to inside the loop and move the 'kick
> off a fresh batch' code to the end of the function for easier use in
> subsequent patches.

Stale? the "kick off" code is moved to the tail of the loop, not the
end of the function.

> @@ -183,14 +183,14 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   page = xa_load(&mapping->i_pages, page_offset);
>   if (page && !xa_is_value(page)) {
>   /*
> -  * Page already present?  Kick off the current batch of
> -  * contiguous pages before continuing with the next
> -  * batch.
> +  * Page already present?  Kick off the current batch
> +  * of contiguous pages before continuing with the
> +  * next batch.  This page may be the one we would
> +  * have intended to mark as Readahead, but we don't
> +  * have a stable reference to this page, and it's
> +  * not worth getting one just for that.
>*/
> - if (readahead_count(&rac))
> - read_pages(&rac, &page_pool, gfp_mask);
> - rac._nr_pages = 0;
> - continue;
> + goto read;
>   }
>  
>   page = __page_cache_alloc(gfp_mask);
> @@ -201,6 +201,11 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   if (page_idx == nr_to_read - lookahead_size)
>   SetPageReadahead(page);
>   rac._nr_pages++;
> + continue;
> +read:
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
> + rac._nr_pages = 0;
>   }

Also, why? This adds a goto from branched code that continues, then
adds a continue so the unbranched code doesn't execute the code the
goto jumps to. In absence of any explanation, this isn't an
improvement and doesn't make any sense...

-- 
Dave Chinner
da...@fromorbit.com


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] [PATCH v6 03/19] mm: Use readahead_control to pass arguments

2020-02-17 Thread Dave Chinner
On Mon, Feb 17, 2020 at 10:45:44AM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> In this patch, only between __do_page_cache_readahead() and
> read_pages(), but it will be extended in upcoming patches.  Also add
> the readahead_count() accessor.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  include/linux/pagemap.h | 17 +
>  mm/readahead.c  | 36 +---
>  2 files changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ccb14b6a16b5..982ecda2d4a2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -630,6 +630,23 @@ static inline int add_to_page_cache(struct page *page,
>   return error;
>  }
>  
> +/*
> + * Readahead is of a block of consecutive pages.
> + */
> +struct readahead_control {
> + struct file *file;
> + struct address_space *mapping;
> +/* private: use the readahead_* accessors instead */
> + pgoff_t _start;
> + unsigned int _nr_pages;
> +};
> +
> +/* The number of pages in this readahead block */
> +static inline unsigned int readahead_count(struct readahead_control *rac)
> +{
> + return rac->_nr_pages;
> +}
> +
>  static inline unsigned long dir_pages(struct inode *inode)
>  {
>   return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 12d13b7792da..15329309231f 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -113,26 +113,29 @@ int read_cache_pages(struct address_space *mapping, 
> struct list_head *pages,
>  
>  EXPORT_SYMBOL(read_cache_pages);
>  
> -static void read_pages(struct address_space *mapping, struct file *filp,
> - struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
> +static void read_pages(struct readahead_control *rac, struct list_head 
> *pages,
> + gfp_t gfp)
>  {
> + const struct address_space_operations *aops = rac->mapping->a_ops;
>   struct blk_plug plug;
>   unsigned page_idx;

Splitting out the aops rather than the mapping here just looks
weird, especially as you need the mapping later in the function.
Using aops doesn't even reduce the code side

>  
>   blk_start_plug(&plug);
>  
> - if (mapping->a_ops->readpages) {
> - mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
> + if (aops->readpages) {
> + aops->readpages(rac->file, rac->mapping, pages,
> + readahead_count(rac));
>   /* Clean up the remaining pages */
>   put_pages_list(pages);
>   goto out;
>   }
>  
> - for (page_idx = 0; page_idx < nr_pages; page_idx++) {
> + for (page_idx = 0; page_idx < readahead_count(rac); page_idx++) {
>   struct page *page = lru_to_page(pages);
>   list_del(&page->lru);
> - if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
> - mapping->a_ops->readpage(filp, page);
> + if (!add_to_page_cache_lru(page, rac->mapping, page->index,
> + gfp))
> + aops->readpage(rac->file, page);

... it just makes this less readable by splitting the if() over two
lines...

>   put_page(page);
>   }
>  
> @@ -155,9 +158,13 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>   unsigned long end_index;/* The last page we want to read */
>   LIST_HEAD(page_pool);
>   int page_idx;
> - unsigned int nr_pages = 0;
>   loff_t isize = i_size_read(inode);
>   gfp_t gfp_mask = readahead_gfp_mask(mapping);
> + struct readahead_control rac = {
> + .mapping = mapping,
> + .file = filp,
> + ._nr_pages = 0,
> + };

No need to initialise _nr_pages to zero, leaving it out will do the
same thing.

>  
>   if (isize == 0)
>   return;
> @@ -180,10 +187,9 @@ void __do_page_cache_readahead(struct address_space 
> *mapping,
>* contiguous pages before continuing with the next
>* batch.
>*/
> - if (nr_pages)
> - read_pages(mapping, filp, &page_pool, nr_pages,
> - gfp_mask);
> -     nr_pages = 0;
> + if (readahead_count(&rac))
> + read_pages(&rac, &page_pool, gfp_mask);
> + rac._

  1   2   3   >