from:"Darrick J. Wong"

Re: [f2fs-dev] [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

2024-06-21 Thread Darrick J. Wong

On Thu, Jun 13, 2024 at 11:31:35AM +0100, John Garry wrote:
> On 12/06/2024 22:32, Darrick J. Wong wrote:
> > > unsigned int fs_block_size = i_blocksize(inode), pad;
> > > + u64 io_block_size = iomap->io_block_size;
> > I wonder, should iomap be nice and not require filesystems to set
> > io_block_size themselves unless they really need it?
> 
> That's what I had in v3, like:
> 
> if (iomap->io_block_size)
>   io_block_size = iomap->io_block_size;
> else
>   io_block_size = i_block_size(inode)
> 
> but it was suggested to change that (to like what I have here).

oh, ok.  Ignore that comment, then. :)

> > Anyone working on
> > an iomap port while this patchset is in progress may or may not remember
> > to add this bit if they get their port merged after atomicwrites is
> > merged; and you might not remember to prevent the bitrot if the reverse
> > order happens.
> 
> Sure, I get your point.
> 
> However, OTOH, if we check xfs_bmbt_to_iomap(), it does set all or close to
> all members of struct iomap, so we are just continuing that trend, i.e. it
> is the job of the FS callback to set all these members.
> 
> > 
> > u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
> > 
> > >   loff_t length = iomap_length(iter);
> > >   loff_t pos = iter->pos;
> > >   blk_opf_t bio_opf;
> > > @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct 
> > > iomap_iter *iter,
> > >   int nr_pages, ret = 0;
> > >   size_t copied = 0;
> > >   size_t orig_count;
> > > + unsigned int pad;
> > >   if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) 
> > > ||
> > >   !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct 
> > > iomap_iter *iter,
> > >   if (need_zeroout) {
> > >   /* zero out from the start of the block to the write 
> > > offset */
> > > - pad = pos & (fs_block_size - 1);
> > > + if (is_power_of_2(io_block_size)) {
> > > + pad = pos & (io_block_size - 1);
> > > + } else {
> > > + loff_t _pos = pos;
> > > +
> > > + pad = do_div(_pos, io_block_size);
> > > + }
> > Please don't opencode this twice.
> > 
> > static unsigned int offset_in_block(loff_t pos, u64 blocksize)
> > {
> > if (likely(is_power_of_2(blocksize)))
> > return pos & (blocksize - 1);
> > return do_div(pos, blocksize);
> > }
> 
> ok, fine
> 
> > 
> > pad = offset_in_block(pos, io_block_size);
> > if (pad)
> > ...
> > 
> > Also, what happens if pos-pad points to a byte before the mapping?
> 
> It's the job of the FS to map in something aligned to io_block_size. Having
> said that, I don't think we are doing that for XFS (which sets io_block_size
> > i_block_size(inode)), so I need to check that.

You can only play with the mapping that the fs gave you.
If xfs doesn't give you a big enough mapping, then that's a programming
bug to WARN_ON_ONCE about and return EIO.

I hadn't realized that the ->iomap_begin function is required to
provide mappings that are aligned to io_block_size.

> 
> > 
> > > +
> > >   if (pad)
> > >   iomap_dio_zero(iter, dio, pos - pad, pad);
> > >   }
> > > @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct 
> > > iomap_iter *iter,
> > >   if (need_zeroout ||
> > >   ((dio->flags & IOMAP_DIO_WRITE) && pos >= 
> > > i_size_read(inode))) {
> > >   /* zero out from the end of the write to the end of the 
> > > block */
> > > - pad = pos & (fs_block_size - 1);
> > > + if (is_power_of_2(io_block_size)) {
> > > + pad = pos & (io_block_size - 1);
> > > + } else {
> > > + loff_t _pos = pos;
> > > +
> > > + pad = do_div(_pos, io_block_size);
> > > + }
> > > +
> > >   if (pad)
> > > - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
> > > + iomap_dio_zero(iter, dio, pos, io_block_size - pad);
> > What if pos + io_block_s

Re: [f2fs-dev] [PATCH v4 01/22] fs: Add generic_atomic_write_valid_size()

2024-06-20 Thread Darrick J. Wong

On Thu, Jun 13, 2024 at 08:35:53AM +0100, John Garry wrote:
> On 12/06/2024 22:10, Darrick J. Wong wrote:
> > On Fri, Jun 07, 2024 at 02:38:58PM +, John Garry wrote:
> > > Add a generic helper for FSes to validate that an atomic write is
> > > appropriately sized (along with the other checks).
> > > 
> > > Signed-off-by: John Garry 
> > > ---
> > >   include/linux/fs.h | 12 
> > >   1 file changed, 12 insertions(+)
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 069cbab62700..e13d34f8c24e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3645,4 +3645,16 @@ bool generic_atomic_write_valid(loff_t pos, struct 
> > > iov_iter *iter)
> > >   return true;
> > >   }
> > > +static inline
> > > +bool generic_atomic_write_valid_size(loff_t pos, struct iov_iter *iter,
> > > + unsigned int unit_min, unsigned int unit_max)
> > > +{
> > > + size_t len = iov_iter_count(iter);
> > > +
> > > + if (len < unit_min || len > unit_max)
> > > + return false;
> > > +
> > > + return generic_atomic_write_valid(pos, iter);
> > > +}
> > 
> > Now that I look back at "fs: Initial atomic write support" I wonder why
> > not pass the iocb and the iov_iter instead of pos and the iov_iter?
> 
> The original user of generic_atomic_write_valid() [blkdev_dio_unaligned() or
> blkdev_dio_invalid() with the rename] used these same args, so I just went
> with that.

Don't let the parameter types of static blockdev helpers determine the
VFS API that filesystems need to implement untorn writes.

In the block layer enablement patch, this could easily be:

bool generic_atomic_write_valid(const struct kiocb *iocb,
				const struct iov_iter *iter)
{
	size_t len = iov_iter_count(iter);

	if (!iter_is_ubuf(iter))
		return false;

	if (!is_power_of_2(len))
		return false;

	if (!IS_ALIGNED(iocb->ki_pos, len))
		return false;

	return true;
}

Then this becomes:

bool generic_atomic_write_valid_size(const struct kiocb *iocb,
				     const struct iov_iter *iter,
				     unsigned int unit_min,
				     unsigned int unit_max)
{
	size_t len = iov_iter_count(iter);

	if (len < unit_min || len > unit_max)
		return false;

	return generic_atomic_write_valid(iocb, iter);
}

Yes, that means you have to rearrange the calling conventions of
blkdev_dio_invalid a little bit, but the first two arguments match
->read_iter and ->write_iter.  Filesystem writers can see that the first
two arguments are the first two parameters to foofs_write_iter() and
focus on the hard part, which is figuring out unit_{min,max}.

static ssize_t
xfs_file_dio_write(
	struct kiocb		*iocb,
	struct iov_iter		*from)
{
...
	if ((iocb->ki_flags & IOCB_ATOMIC) &&
	    !generic_atomic_write_valid_size(iocb, from,
			i_blocksize(inode),
			XFS_FSB_TO_B(mp, ip->i_extsize)))
		return -EINVAL;
}
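
For anyone who wants to poke at the two checks outside the kernel, here is a
minimal userspace sketch of the same layering. The sketch_kiocb and
sketch_iter structs and the sketch_* function names are hypothetical stand-ins
invented for illustration; they are not kernel types, and the real helpers
operate on struct kiocb and struct iov_iter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel's kiocb and iov_iter. */
struct sketch_kiocb {
	int64_t ki_pos;		/* file offset of the write */
};

struct sketch_iter {
	size_t count;		/* total bytes in the write */
	bool is_ubuf;		/* single user buffer, as iter_is_ubuf() would report */
};

static bool is_pow2(size_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Generic checks: one user buffer, power-of-2 length, naturally aligned. */
static bool sketch_atomic_write_valid(const struct sketch_kiocb *iocb,
				      const struct sketch_iter *iter)
{
	size_t len = iter->count;

	if (!iter->is_ubuf)
		return false;
	if (!is_pow2(len))
		return false;
	/* len is a power of two here, so masking tests alignment */
	if ((uint64_t)iocb->ki_pos & (len - 1))
		return false;
	return true;
}

/* Size bounds first, then the generic checks. */
static bool sketch_atomic_write_valid_size(const struct sketch_kiocb *iocb,
					   const struct sketch_iter *iter,
					   unsigned int unit_min,
					   unsigned int unit_max)
{
	size_t len = iter->count;

	if (len < unit_min || len > unit_max)
		return false;
	return sketch_atomic_write_valid(iocb, iter);
}
```

With unit_min = 4096 and unit_max = 16384, an 8192-byte write at offset 8192
passes, while the same write at offset 4096 fails the alignment check and a
12288-byte write fails the power-of-2 check.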


> > And can these be collapsed into a single generic_atomic_write_checks()
> > function?
> 
> bdev file operations would then need to use
> generic_atomic_write_valid_size(), and there is no unit_min and unit_max
> size there, apart from bdev awu min and max. And if I checked them, we would
> be duplicating checks (of awu min and max) in the block layer.

Fair enough, I concede this point.

--D

> 
> Cheers,
> John
> 


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH v4 03/22] xfs: Use extent size granularity for iomap->io_block_size

2024-06-12 Thread Darrick J. Wong

On Fri, Jun 07, 2024 at 02:39:00PM +, John Garry wrote:
> Currently iomap->io_block_size is set to the i_blocksize() value for the
> inode.
> 
> Expand the sub-fs block size zeroing to now cover RT extents, by calling
> setting iomap->io_block_size as xfs_inode_alloc_unitsize().
> 
> In xfs_iomap_write_unwritten(), update the unwritten range fsb to cover
> this extent granularity.
> 
> In xfs_file_dio_write(), handle a write which is not aligned to extent
> size granularity as unaligned. Since the extent size granularity need not
> be a power-of-2, handle this also.
> 
> Signed-off-by: John Garry 
> ---
>  fs/xfs/xfs_file.c  | 24 +++-
>  fs/xfs/xfs_inode.c | 17 +++--
>  fs/xfs/xfs_inode.h |  1 +
>  fs/xfs/xfs_iomap.c |  8 +++-
>  4 files changed, 38 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index b240ea5241dc..24fe3c2e03da 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -601,7 +601,7 @@ xfs_file_dio_write_aligned(
>  }
>  
>  /*
> - * Handle block unaligned direct I/O writes
> + * Handle unaligned direct IO writes.
>   *
>   * In most cases direct I/O writes will be done holding IOLOCK_SHARED, 
> allowing
>   * them to be done in parallel with reads and other direct I/O writes.  
> However,
> @@ -630,9 +630,9 @@ xfs_file_dio_write_unaligned(
>   ssize_t ret;
>  
>   /*
> -  * Extending writes need exclusivity because of the sub-block zeroing
> -  * that the DIO code always does for partial tail blocks beyond EOF, so
> -  * don't even bother trying the fast path in this case.
> +  * Extending writes need exclusivity because of the sub-block/extent
> +  * zeroing that the DIO code always does for partial tail blocks
> +  * beyond EOF, so don't even bother trying the fast path in this case.

Hummm.  So let's say the fsblock size is 4k, the rt extent size is 16k,
and you want to write bytes 8192-12287 of a file.  Currently we'd use
xfs_file_dio_write_aligned for that, but now we'd use
xfs_file_dio_write_unaligned?  Even though we don't need zeroing or any
of that stuff?
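
To make the arithmetic in that example concrete, here is a small userspace
sketch. range_is_aligned() is a hypothetical helper written for this
illustration, not an XFS function; it uses plain modulo so that it also covers
units that are not a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if both the start offset and the byte count
 * are multiples of unit.  Modulo works for any non-zero unit, including
 * sizes that are not a power of two.
 */
static bool range_is_aligned(uint64_t pos, uint64_t count, uint64_t unit)
{
	return pos % unit == 0 && count % unit == 0;
}
```

Bytes 8192-12287 are pos = 8192 and count = 4096.  Against the 4k fsblock
size, range_is_aligned(8192, 4096, 4096) is true; against the 16k rt extent
size, range_is_aligned(8192, 4096, 16384) is false, which is why the write
would change paths even though it covers a whole fsblock.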

>*/
>   if (iocb->ki_pos > isize || iocb->ki_pos + count >= isize) {
>   if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -698,11 +698,25 @@ xfs_file_dio_write(
>   struct xfs_inode*ip = XFS_I(file_inode(iocb->ki_filp));
>   struct xfs_buftarg  *target = xfs_inode_buftarg(ip);
>   size_t  count = iov_iter_count(from);
> + bool unaligned;
> + u64 unitsize;
>  
>   /* direct I/O must be aligned to device logical sector size */
>   if ((iocb->ki_pos | count) & target->bt_logical_sectormask)
>   return -EINVAL;
> - if ((iocb->ki_pos | count) & ip->i_mount->m_blockmask)
> +
> + unitsize = xfs_inode_alloc_unitsize(ip);
> + if (!is_power_of_2(unitsize)) {
> + if (isaligned_64(iocb->ki_pos, unitsize) &&
> + isaligned_64(count, unitsize))
> + unaligned = false;
> + else
> + unaligned = true;
> + } else {
> + unaligned = (iocb->ki_pos | count) & (unitsize - 1);
> + }

Didn't I already write this?

> + if (unaligned)

if (!xfs_is_falloc_aligned(ip, iocb->ki_pos, count))

>   return xfs_file_dio_write_unaligned(ip, iocb, from);
>   return xfs_file_dio_write_aligned(ip, iocb, from);
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 58fb7a5062e1..93ad442f399b 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -4264,15 +4264,20 @@ xfs_break_layouts(
>   return error;
>  }
>  
> -/* Returns the size of fundamental allocation unit for a file, in bytes. */

Don't delete the comment, it has useful return type information.

/*
 * Returns the size of fundamental allocation unit for a file, in
 * fsblocks.
 */

>  unsigned int
> -xfs_inode_alloc_unitsize(
> +xfs_inode_alloc_unitsize_fsb(
>   struct xfs_inode*ip)
>  {
> - unsigned int blocks = 1;
> -
>   if (XFS_IS_REALTIME_INODE(ip))
> - blocks = ip->i_mount->m_sb.sb_rextsize;
> + return ip->i_mount->m_sb.sb_rextsize;
> +
> + return 1;
> +}
>  
> - return XFS_FSB_TO_B(ip->i_mount, blocks);
> +/* Returns the size of fundamental allocation unit for a file, in bytes. */
> +unsigned int
> +xfs_inode_alloc_unitsize(
> + struct xfs_inode*ip)
> +{
> + return XFS_FSB_TO_B(ip->i_mount, xfs_inode_alloc_unitsize_fsb(ip));
>  }
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 292b90b5f2ac..90d2fa837117 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -643,6 +643,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip);
>  bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork);
>  void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>

Re: [f2fs-dev] [PATCH v4 02/22] iomap: Allow filesystems set IO block zeroing size

2024-06-12 Thread Darrick J. Wong

On Fri, Jun 07, 2024 at 02:38:59PM +, John Garry wrote:
> Allow filesystems to set the io_block_size for sub-fs block size zeroing,
> as in future we will want to extend this feature to support zeroing of
> block sizes of larger than the inode block size.
> 
> The value in io_block_size does not have to be a power-of-2, so fix up
> zeroing code to handle that.
> 
> Signed-off-by: John Garry 
> ---
>  block/fops.c  |  1 +
>  fs/btrfs/inode.c  |  1 +
>  fs/erofs/data.c   |  1 +
>  fs/erofs/zmap.c   |  1 +
>  fs/ext2/inode.c   |  1 +
>  fs/ext4/extents.c |  1 +
>  fs/ext4/inode.c   |  1 +
>  fs/f2fs/data.c|  1 +
>  fs/fuse/dax.c |  1 +
>  fs/gfs2/bmap.c|  1 +
>  fs/hpfs/file.c|  1 +
>  fs/iomap/direct-io.c  | 23 +++
>  fs/xfs/xfs_iomap.c|  1 +
>  fs/zonefs/file.c  |  2 ++
>  include/linux/iomap.h |  2 ++
>  15 files changed, 35 insertions(+), 4 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 9d6d86ebefb9..020443078630 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -402,6 +402,7 @@ static int blkdev_iomap_begin(struct inode *inode, loff_t 
> offset, loff_t length,
>   iomap->addr = iomap->offset;
>   iomap->length = isize - iomap->offset;
>   iomap->flags |= IOMAP_F_BUFFER_HEAD; /* noop for !CONFIG_BUFFER_HEAD */
> + iomap->io_block_size = i_blocksize(inode);
>   return 0;
>  }
>  



> diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
> index 1bb8d97cd9ae..5d2718faf520 100644
> --- a/fs/hpfs/file.c
> +++ b/fs/hpfs/file.c
> @@ -149,6 +149,7 @@ static int hpfs_iomap_begin(struct inode *inode, loff_t 
> offset, loff_t length,
>   iomap->addr = IOMAP_NULL_ADDR;
>   iomap->length = 1 << blkbits;
>   }
> + iomap->io_block_size = i_blocksize(inode);

HPFS does iomap now?  Yikes.

>  
>   hpfs_unlock(sb);
>   return 0;
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..5be8d886ab4a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -277,7 +277,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter 
> *iter,
>  {
>   const struct iomap *iomap = &iter->iomap;
>   struct inode *inode = iter->inode;
> - unsigned int fs_block_size = i_blocksize(inode), pad;
> + u64 io_block_size = iomap->io_block_size;

I wonder, should iomap be nice and not require filesystems to set
io_block_size themselves unless they really need it?  Anyone working on
an iomap port while this patchset is in progress may or may not remember
to add this bit if they get their port merged after atomicwrites is
merged; and you might not remember to prevent the bitrot if the reverse
order happens.

u64 io_block_size = iomap->io_block_size ?: i_blocksize(inode);
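
As a standalone illustration of that fallback, here is a userspace sketch.
struct sketch_iomap and effective_io_block_size() are hypothetical names made
up for this example, and the conditional is spelled out in standard C rather
than with the GNU "?:" shorthand used in the line above.

```c
#include <stdint.h>

/* Hypothetical stand-in carrying only the field under discussion. */
struct sketch_iomap {
	uint64_t io_block_size;	/* 0 means the filesystem did not set it */
};

/*
 * Use the filesystem-provided zeroing granularity if there is one,
 * otherwise fall back to the inode block size.
 */
static uint64_t effective_io_block_size(const struct sketch_iomap *iomap,
					unsigned int inode_blocksize)
{
	return iomap->io_block_size ? iomap->io_block_size : inode_blocksize;
}
```

A port that never touches io_block_size leaves it at zero and gets the inode
block size; a filesystem that sets 16384 gets 16384.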

>   loff_t length = iomap_length(iter);
>   loff_t pos = iter->pos;
>   blk_opf_t bio_opf;
> @@ -287,6 +287,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter 
> *iter,
>   int nr_pages, ret = 0;
>   size_t copied = 0;
>   size_t orig_count;
> + unsigned int pad;
>  
>   if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>   !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> @@ -355,7 +356,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter 
> *iter,
>  
>   if (need_zeroout) {
>   /* zero out from the start of the block to the write offset */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }

Please don't opencode this twice.

static unsigned int offset_in_block(loff_t pos, u64 blocksize)
{
	if (likely(is_power_of_2(blocksize)))
		return pos & (blocksize - 1);
	return do_div(pos, blocksize);
}

pad = offset_in_block(pos, io_block_size);
if (pad)
...

Also, what happens if pos-pad points to a byte before the mapping?
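
A userspace sketch of both points, for experimenting with the numbers.  It
uses the C "%" operator where the kernel helper above uses do_div(), which
divides its first argument in place, returns the remainder, and takes a 32-bit
divisor.  head_zero_within_mapping() and its map_offset parameter are
hypothetical, added only to show the "byte before the mapping" case; nothing
in the patch defines them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset of pos within its block; blocksize need not be a power of two. */
static uint64_t offset_in_block(uint64_t pos, uint64_t blocksize)
{
	assert(blocksize != 0);
	if ((blocksize & (blocksize - 1)) == 0)
		return pos & (blocksize - 1);
	return pos % blocksize;
}

/*
 * Hypothetical check: head zeroing starts at pos - pad.  Return true if
 * that byte is still inside a mapping that begins at map_offset.
 */
static bool head_zero_within_mapping(uint64_t pos, uint64_t blocksize,
				     uint64_t map_offset)
{
	uint64_t pad = offset_in_block(pos, blocksize);

	/* pad <= pos always holds, so the subtraction cannot wrap */
	return pos - pad >= map_offset;
}
```

With a 16384-byte block and pos = 20480, pad is 4096 and zeroing would start
at 16384.  That is fine if the mapping starts at 16384, but if the mapping
starts at 20480 the zeroing range begins before it.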

> +
>   if (pad)
>   iomap_dio_zero(iter, dio, pos - pad, pad);
>   }
> @@ -429,9 +437,16 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter 
> *iter,
>   if (need_zeroout ||
>   ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>   /* zero out from the end of the write to the end of the block */
> - pad = pos & (fs_block_size - 1);
> + if (is_power_of_2(io_block_size)) {
> + pad = pos & (io_block_size - 1);
> + } else {
> + loff_t _pos = pos;
> +
> + pad = do_div(_pos, io_block_size);
> + }
> +
>   if (pad)
> - iomap_dio_zero(iter, dio, pos, fs_block_size - pad);
> +

Re: [f2fs-dev] [PATCH v4 01/22] fs: Add generic_atomic_write_valid_size()

2024-06-12 Thread Darrick J. Wong

On Fri, Jun 07, 2024 at 02:38:58PM +, John Garry wrote:
> Add a generic helper for FSes to validate that an atomic write is
> appropriately sized (along with the other checks).
> 
> Signed-off-by: John Garry 
> ---
>  include/linux/fs.h | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 069cbab62700..e13d34f8c24e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3645,4 +3645,16 @@ bool generic_atomic_write_valid(loff_t pos, struct 
> iov_iter *iter)
>   return true;
>  }
>  
> +static inline
> +bool generic_atomic_write_valid_size(loff_t pos, struct iov_iter *iter,
> + unsigned int unit_min, unsigned int unit_max)
> +{
> + size_t len = iov_iter_count(iter);
> +
> + if (len < unit_min || len > unit_max)
> + return false;
> +
> + return generic_atomic_write_valid(pos, iter);
> +}

Now that I look back at "fs: Initial atomic write support" I wonder why
not pass the iocb and the iov_iter instead of pos and the iov_iter?
And can these be collapsed into a single generic_atomic_write_checks()
function?

--D

> +
>  #endif /* _LINUX_FS_H */
> -- 
> 2.31.1
> 
> 



Re: [f2fs-dev] [PATCH] tracing/treewide: Remove second parameter of __assign_str()

2024-05-17 Thread Darrick J. Wong

On Thu, May 16, 2024 at 01:34:54PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> [
>This is a treewide change. I will likely re-create this patch again in
>the second week of the merge window of v6.10 and submit it then. Hoping
>to keep the conflicts that it will cause to a minimum.
> ]
> 
> With the rework of how the __string() handles dynamic strings where it
> saves off the source string in field in the helper structure[1], the
> assignment of that value to the trace event field is stored in the helper
> value and does not need to be passed in again.
> 
> This means that with:
> 
>   __string(field, mystring)
> 
> Which used to be assigned with __assign_str(field, mystring), no longer
> needs the second parameter and it is unused. With this, __assign_str()
> will now only get a single parameter.
> 
> There's over 700 users of __assign_str() and because coccinelle does not
> handle the TRACE_EVENT() macro I ended up using the following sed script:
> 
>   git grep -l __assign_str | while read a ; do
>   sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
>   mv /tmp/test-file $a;
>   done
> 
> I then searched for __assign_str() that did not end with ';' as those
> were multi line assignments that the sed script above would fail to catch.
> 
> Note, the same updates will need to be done for:
> 
>   __assign_str_len()
>   __assign_rel_str()
>   __assign_rel_str_len()
> 
> I tested this with both an allmodconfig and an allyesconfig (build only for 
> both).
> 
> [1] 
> https://lore.kernel.org/linux-trace-kernel/2024011442.634192...@goodmis.org/
> 
> Cc: Masami Hiramatsu 
> Cc: Mathieu Desnoyers 
> Cc: Linus Torvalds 
> Cc: Julia Lawall 
> Signed-off-by: Steven Rostedt (Google) 

/me finds this pretty magical, but such is the way of macros.
Thanks for being much smarter about them than me. :)

Acked-by: Darrick J. Wong # xfs

--D



Re: [f2fs-dev] [PATCH v3 01/13] fs: fiemap: add physical_length field to extents

2024-04-09 Thread Darrick J. Wong

On Wed, Apr 03, 2024 at 03:22:42AM -0400, Sweet Tea Dorminy wrote:
> Some filesystems support compressed extents which have a larger logical
> size than physical, and for those filesystems, it can be useful for
> userspace to know how much space those extents actually use. For
> instance, the compsize [1] tool for btrfs currently uses btrfs-internal,
> root-only ioctl to find the actual disk space used by a file; it would
> be better and more useful for this information to require fewer
> privileges and to be usable on more filesystems. Therefore, use one of
> the padding u64s in the fiemap extent structure to return the actual
> physical length; and, for now, return this as equal to the logical
> length.
> 
> [1] https://github.com/kilobyte/compsize
> 
> Signed-off-by: Sweet Tea Dorminy 
> ---
>  Documentation/filesystems/fiemap.rst | 28 +---
>  fs/ioctl.c   |  3 ++-
>  include/uapi/linux/fiemap.h  | 32 ++--
>  3 files changed, 47 insertions(+), 16 deletions(-)
> 
> diff --git a/Documentation/filesystems/fiemap.rst 
> b/Documentation/filesystems/fiemap.rst
> index 93fc96f760aa..c2bfa107c8d7 100644
> --- a/Documentation/filesystems/fiemap.rst
> +++ b/Documentation/filesystems/fiemap.rst
> @@ -80,14 +80,24 @@ Each extent is described by a single fiemap_extent 
> structure as
>  returned in fm_extents::
>  
>  struct fiemap_extent {
> - __u64   fe_logical;  /* logical offset in bytes for the start of
> - * the extent */
> - __u64   fe_physical; /* physical offset in bytes for the start
> - * of the extent */
> - __u64   fe_length;   /* length in bytes for the extent */
> - __u64   fe_reserved64[2];
> - __u32   fe_flags;/* FIEMAP_EXTENT_* flags for this extent */
> - __u32   fe_reserved[3];
> +/*
> + * logical offset in bytes for the start of
> + * the extent from the beginning of the file
> + */
> +__u64 fe_logical;
> +/*
> + * physical offset in bytes for the start
> + * of the extent from the beginning of the disk
> + */
> +__u64 fe_physical;
> +/* logical length in bytes for this extent */
> +__u64 fe_logical_length;
> +/* physical length in bytes for this extent */
> +__u64 fe_physical_length;
> +__u64 fe_reserved64[1];
> +/* FIEMAP_EXTENT_* flags for this extent */
> +__u32 fe_flags;
> +__u32 fe_reserved[3];
>  };
>  
>  All offsets and lengths are in bytes and mirror those on disk.  It is valid
> @@ -175,6 +185,8 @@ FIEMAP_EXTENT_MERGED
>userspace would be highly inefficient, the kernel will try to merge most
>adjacent blocks into 'extents'.
>  
> +FIEMAP_EXTENT_HAS_PHYS_LEN
> +  This will be set if the file system populated the physical length field.

Just out of curiosity, should filesystems set this flag and
fe_physical_length if fe_physical_length == fe_logical_length?
Or just leave both blank?

>  VFS -> File System Implementation
>  -
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 661b46125669..8afd32e1a27a 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -138,7 +138,8 @@ int fiemap_fill_next_extent(struct fiemap_extent_info 
> *fieinfo, u64 logical,
>   memset(&extent, 0, sizeof(extent));
>   extent.fe_logical = logical;
>   extent.fe_physical = phys;
> - extent.fe_length = len;
> + extent.fe_logical_length = len;
> + extent.fe_physical_length = len;
>   extent.fe_flags = flags;
>  
>   dest += fieinfo->fi_extents_mapped;
> diff --git a/include/uapi/linux/fiemap.h b/include/uapi/linux/fiemap.h
> index 24ca0c00cae3..3079159b8e94 100644
> --- a/include/uapi/linux/fiemap.h
> +++ b/include/uapi/linux/fiemap.h
> @@ -14,14 +14,30 @@
>  
>  #include <linux/types.h>
>  
> +/*
> + * For backward compatibility, where the member of the struct was called
> + * fe_length instead of fe_logical_length.
> + */
> +#define fe_length fe_logical_length

This #define has global scope; are you sure this isn't going to cause a
weird build problem downstream with some program that declares an
unrelated fe_length symbol?
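
A small sketch of how that could bite.  Everything here except the #define
itself is hypothetical: struct frame_entry stands in for some unrelated
program's type, and the SKETCH_* macros exist only to make the preprocessor's
rewrite visible.

```c
#include <string.h>

/* The compatibility macro from the patch. */
#define fe_length fe_logical_length

#define SKETCH_STR(x) #x
#define SKETCH_XSTR(x) SKETCH_STR(x)

/* Returns 1 if the preprocessor rewrites the bare identifier fe_length. */
static int name_was_rewritten(void)
{
	return strcmp(SKETCH_XSTR(fe_length), "fe_logical_length") == 0;
}

/* An unrelated struct declared after the header is silently renamed too. */
struct frame_entry {
	unsigned int fe_length;		/* actually declares fe_logical_length */
};

static unsigned int frame_entry_len(const struct frame_entry *fe)
{
	/* this member name exists only because of the macro above */
	return fe->fe_logical_length;
}
```

This version happens to compile because the declaration and the use are both
rewritten.  The trouble would start when one side is not, for example a
struct or function named fe_length that was declared before the header was
included, or one that lives in a library built without it.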

> +
>  struct fiemap_extent {
> - __u64 fe_logical;  /* logical offset in bytes for the start of
> - * the extent from the beginning of the file */
> - __u64 fe_physical; /* physical offset in bytes for the start
> - * of the extent from the beginning of the disk */
> - __u64 fe_length;   /* length in bytes for this extent */
> - __u64 fe_reserved64[2];
> - __u32 fe_flags;/* FIEMAP_EXTENT_* flags for this extent */
> + /*
> +  * logical offset in bytes for the start of
> +  * the extent from the beginning of the file
> +  */
> + __u64

Re: [f2fs-dev] [PATCH v3 00/13] fiemap extension for more physical information

2024-04-03 Thread Darrick J. Wong

On Wed, Apr 03, 2024 at 02:17:26PM -0400, Kent Overstreet wrote:
> On Wed, Apr 03, 2024 at 03:22:41AM -0400, Sweet Tea Dorminy wrote:
> > For many years, various btrfs users have written programs to discover
> > the actual disk space used by files, using root-only interfaces.
> > However, this information is a great fit for fiemap: it is inherently
> > tied to extent information, all filesystems can use it, and the
> > capabilities required for FIEMAP make sense for this additional
> > information also.
> > 
> > Hence, this patchset adds various additional information to fiemap,
> > and extends filesystems (but not iomap) to return it.  This uses some of
> > the reserved padding in the fiemap extent structure, so programs unaware
> > of the changes will be unaffected.
> > 
> > This is based on next-20240403. I've tested the btrfs part of this with
> > the standard btrfs testing matrix locally and manually, and done minimal
> > testing of the non-btrfs parts.
> > 
> > I'm unsure whether btrfs should be returning the entire physical extent
> > referenced by a particular logical range, or just the part of the
> > physical extent referenced by that range. The v2 thread has a discussion
> > of this.
> 
> I believe there was some talk of using the padding for a device ID, so
> that fiemap could properly support multi device filesystems. Are we sure
> this is the best use of those bytes?

We still have 5x u32 of empty space in struct fiemap after this series,
so I don't think adding the physical length is going to prohibit future
expansion.
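
The count can be checked with a userspace mirror of the extent layout from
patch 01.  This assumes the remark refers to the per-extent structure;
sketch_fiemap_extent is a hypothetical copy using stdint types in place of
__u64 and __u32, not the uapi definition itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the proposed layout. */
struct sketch_fiemap_extent {
	uint64_t fe_logical;
	uint64_t fe_physical;
	uint64_t fe_logical_length;
	uint64_t fe_physical_length;
	uint64_t fe_reserved64[1];
	uint32_t fe_flags;
	uint32_t fe_reserved[3];
};

/* Same 56-byte size as the layout before the series. */
_Static_assert(sizeof(struct sketch_fiemap_extent) == 56,
	       "extent record size must not change");

/* Remaining reserved space, counted in 32-bit words. */
static size_t reserved_u32_slots(void)
{
	return (sizeof(((struct sketch_fiemap_extent *)0)->fe_reserved64) +
		sizeof(((struct sketch_fiemap_extent *)0)->fe_reserved)) /
	       sizeof(uint32_t);
}
```

One reserved u64 (two u32 words) plus three reserved u32 words gives the five
u32 slots mentioned above.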

--D

> > 
> > Changelog:
> > 
> > v3: 
> >  - Adapted all the direct users of fiemap, except iomap, to emit
> >the new fiemap information, as far as I understand the other
> >filesystems.
> > 
> > v2:
> >  - Adopted PHYS_LEN flag and COMPRESSED flag from the previous version,
> >as per Andreas Dilger' comment.
> >
> > https://patchwork.ozlabs.org/project/linux-ext4/patch/4f8d5dc5b51a43efaf16c39398c23a6276e40a30.1386778303.git.dste...@suse.cz/
> >  - 
> > https://lore.kernel.org/linux-fsdevel/cover.1711588701.git.sweettea-ker...@dorminy.me/T/#t
> > 
> > v1: 
> > https://lore.kernel.org/linux-fsdevel/20240315030334.GQ6184@frogsfrogsfrogs/T/#t
> > 
> > Sweet Tea Dorminy (13):
> >   fs: fiemap: add physical_length field to extents
> >   fs: fiemap: update fiemap_fill_next_extent() signature
> >   fs: fiemap: add new COMPRESSED extent state
> >   btrfs: fiemap: emit new COMPRESSED state.
> >   btrfs: fiemap: return extent physical size
> >   nilfs2: fiemap: return correct extent physical length
> >   ext4: fiemap: return correct extent physical length
> >   f2fs: fiemap: add physical length to trace_f2fs_fiemap
> >   f2fs: fiemap: return correct extent physical length
> >   ocfs2: fiemap: return correct extent physical length
> >   bcachefs: fiemap: return correct extent physical length
> >   f2fs: fiemap: emit new COMPRESSED state
> >   bcachefs: fiemap: emit new COMPRESSED state
> > 
> >  Documentation/filesystems/fiemap.rst | 35 ++
> >  fs/bcachefs/fs.c | 17 +--
> >  fs/btrfs/extent_io.c | 72 ++--
> >  fs/ext4/extents.c|  3 +-
> >  fs/f2fs/data.c   | 36 +-
> >  fs/f2fs/inline.c |  7 +--
> >  fs/ioctl.c   | 11 +++--
> >  fs/iomap/fiemap.c|  2 +-
> >  fs/nilfs2/inode.c| 18 ---
> >  fs/ntfs3/frecord.c   |  7 +--
> >  fs/ocfs2/extent_map.c| 10 ++--
> >  fs/smb/client/smb2ops.c  |  1 +
> >  include/linux/fiemap.h   |  2 +-
> >  include/trace/events/f2fs.h  | 10 ++--
> >  include/uapi/linux/fiemap.h  | 34 ++---
> >  15 files changed, 178 insertions(+), 87 deletions(-)
> > 
> > 
> > base-commit: 75e31f66adc4c8d049e8aac1f079c1639294cd65
> > -- 
> > 2.43.0
> > 
> 



Re: [f2fs-dev] [PATCH 86/87] fs: switch timespec64 fields in inode to discrete integers

2023-09-28 Thread Darrick J. Wong

On Thu, Sep 28, 2023 at 01:06:03PM -0400, Jeff Layton wrote:
> On Thu, 2023-09-28 at 11:48 -0400, Arnd Bergmann wrote:
> > On Thu, Sep 28, 2023, at 07:05, Jeff Layton wrote:
> > > This shaves 8 bytes off struct inode, according to pahole.
> > > 
> > > Signed-off-by: Jeff Layton 
> > 
> > FWIW, this is similar to the approach that Deepa suggested
> > back in 2016:
> > 
> > https://lore.kernel.org/lkml/1452144972-15802-3-git-send-email-deepa.ker...@gmail.com/
> > 
> > It was NaKed at the time because of the added complexity,
> > though it would have been much easier to do it then,
> > as we had to touch all the timespec references anyway.
> > 
> > The approach still seems ok to me, but I'm not sure it's worth
> > doing it now if we didn't do it then.
> > 
> 
> I remember seeing those patches go by. I don't remember that change
> being NaK'ed, but I wasn't paying close attention at the time 
> 
> Looking at it objectively now, I think it's worth it to recover 8 bytes
> per inode and open a 4 byte hole that Amir can use to grow the
> i_fsnotify_mask. We might even be able to shave off another 12 bytes
> eventually if we can move to a single 64-bit word per timestamp. 

I don't think you can, since btrfs timestamps utilize s64 seconds
counting in both directions from the Unix epoch.  They also support ns
resolution:

struct btrfs_timespec {
	__le64 sec;
	__le32 nsec;
} __attribute__ ((__packed__));
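
A rough bit budget shows why a single 64-bit word falls short.  This is a
userspace sketch: sketch_btrfs_timespec uses host-endian stdint types in place
of __le64 and __le32, and bits_needed() and seconds_bits_left_in_u64() are
hypothetical helpers for the arithmetic.

```c
#include <stdint.h>

/* Host-endian mirror of the on-disk timestamp above. */
struct sketch_btrfs_timespec {
	int64_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/* Number of bits needed to represent every value in 0..max. */
static unsigned int bits_needed(uint64_t max)
{
	unsigned int bits = 0;

	while (max) {
		bits++;
		max >>= 1;
	}
	return bits;
}

/* Bits left for seconds if nanoseconds share one 64-bit word. */
static unsigned int seconds_bits_left_in_u64(void)
{
	return 64 - bits_needed(999999999ULL);
}
```

Nanoseconds need 30 bits, which leaves 34 bits for seconds.  A signed 34-bit
count covers roughly 272 years either side of the epoch, far short of the full
s64 range that the 12-byte structure can hold.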

--D

> It is a lot of churn though.
> -- 
> Jeff Layton 



Re: [f2fs-dev] [PATCH RFC v5 00/29] io_uring getdents

2023-08-25 Thread Darrick J. Wong

On Fri, Aug 25, 2023 at 09:54:02PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> This series introduces getdents64 to io_uring; the code logic is similar
> to the synchronous version's. It first tries a nowait issue, and offloads
> it to io-wq threads if the first try fails.

NAK on the entire series until Jens actually writes down what NOWAIT
does, so that we can check that the *existing* nowait code branches
actually behave how he says it should.

https://lore.kernel.org/all/e2d8e5f1-f794-38eb-cecf-ed30c5712...@kernel.dk/

--D

> 
> Patch1 and Patch2 are some preparation
> Patch3 supports nowait for xfs getdents code
> Patch4-11 are vfs change, include adding helpers and trylock for locks
> Patch12-29 supports nowait for involved xfs journal stuff
> note, Patch24 and 27 are actually two questions, might be removed later.
> an xfs test may come later.
> 
> Tests I've done:
> a liburing test case for functional test:
> https://github.com/HowHsu/liburing/commit/39dc9a8e19c06a8cebf8c2301b85320eb45c061e?diff=unified
> 
> xfstests:
> test/generic: 1 fails and 171 not run
> test/xfs: 72 fails and 156 not run
> run the code before without this patchset, same result.
> I'll try to make the environment more right to run more tests here.
> 
> 
> Tested it with a liburing performance test:
> https://github.com/HowHsu/liburing/blob/getdents/test/getdents2.c
> 
> The test is controlled by the script below[2], which runs getdents2.t 100
> times and calculates the average.
> The results show that the io_uring version is about 2.6% faster:
> 
> note:
> [1] the number of getdents call/request in io_uring and normal sync version
> are made sure to be same beforehand.
> 
> [2] run_getdents.py
> 
> ```python3
> 
> import subprocess
> 
> N = 100
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "sync"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of sync:", average)
> 
> sum = 0.0
> args = ["/data/home/howeyxu/tmpdir", "iouring"]
> 
> for i in range(N):
>     output = subprocess.check_output(["./liburing/test/getdents2.t"] + args)
>     sum += float(output)
> 
> average = sum / N
> print("Average of iouring:", average)
> 
> ```
> 
> v4->v5:
>  - move atime update to the beginning of getdents operation
>  - trylock for i_rwsem
>  - nowait semantics for involved xfs journal stuff
> 
> v3->v4:
>  - add Dave's xfs nowait code and fix a deadlock problem, with some code
>    style tweak.
>  - disable fixed file to avoid a race problem for now
>  - add a test program.
> 
> v2->v3:
>  - removed the kernfs patches
>  - add f_pos_lock logic
>  - remove the "reduce last EOF getdents try" optimization since
>    Dominique reports that it doesn't make a difference
>  - remove the rewind logic, I think the right way is to introduce lseek
>    to io_uring, not to patch this logic into getdents.
>  - add Signed-off-by of Stefan Roesch for patch 1 since checkpatch
>    complained that Co-developed-by someone should be accompanied with
>    Signed-off-by same person, I can remove them if Stefan thinks that's
>    not proper.
> 
> 
> Dominique Martinet (1):
>   fs: split off vfs_getdents function of getdents64 syscall
> 
> Hao Xu (28):
>   xfs: rename XBF_TRYLOCK to XBF_NOWAIT
>   xfs: add NOWAIT semantics for readdir
>   vfs: add nowait flag for struct dir_context
>   vfs: add a vfs helper for io_uring file pos lock
>   vfs: add file_pos_unlock() for io_uring usage
>   vfs: add a nowait parameter for touch_atime()
>   vfs: add nowait parameter for file_accessed()
>   vfs: move file_accessed() to the beginning of iterate_dir()
>   vfs: add S_NOWAIT for nowait time update
>   vfs: trylock inode->i_rwsem in iterate_dir() to support nowait
>   xfs: enforce GFP_NOIO implicitly during nowait time update
>   xfs: make xfs_trans_alloc() support nowait semantics
>   xfs: support nowait for xfs_log_reserve()
>   xfs: don't wait for free space in xlog_grant_head_check() in nowait
> case
>   xfs: add nowait parameter for xfs_inode_item_init()
>   xfs: make xfs_trans_ijoin() error out -EAGAIN
>   xfs: set XBF_NOWAIT for xfs_buf_read_map if necessary
>   xfs: support nowait memory allocation in _xfs_buf_alloc()
>   xfs: distinguish error type of memory allocation failure for nowait
> case
>   xfs: return -EAGAIN when bulk memory allocation fails in nowait case
>   xfs: comment page allocation for nowait case in xfs_buf_find_insert()
>   xfs: don't print warn info for -EAGAIN error in  xfs_buf_get_map()
>   xfs: support nowait for xfs_buf_read_map()
>   xfs: support nowait for xfs_buf_item_init()
>   xfs: return -EAGAIN when nowait meets sync in transaction commit
>   xfs: add a comment for xlog_kvmalloc()
>   xfs: support nowait semantics for xc_ctx_lock in xlog_cil_commit()
>   io_uring: add support for getdents
> 
>  arch/s390/hypfs/inode.c |  2 +-
>  block/fops.c|  2 +-
>  fs/btrfs/file.c |  2 +-

Re: [f2fs-dev] [PATCH v7 07/13] xfs: have xfs_vn_update_time gets its own timestamp

2023-08-09 Thread Darrick J. Wong

On Mon, Aug 07, 2023 at 03:38:38PM -0400, Jeff Layton wrote:
> In later patches we're going to drop the "now" parameter from the
> update_time operation. Prepare XFS for this by reworking how it fetches
> timestamps and sets them in the inode. Ensure that we update the ctime
> even if only S_MTIME is set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/xfs_iops.c | 12 
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 731f45391baa..72d18e7840f5 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1037,6 +1037,7 @@ xfs_vn_update_time(
>   int log_flags = XFS_ILOG_TIMESTAMP;
>   struct xfs_trans*tp;
>   int error;
> + struct timespec64   now = current_time(inode);
>  
>   trace_xfs_update_time(ip);
>  
> @@ -1056,12 +1057,15 @@ xfs_vn_update_time(
>   return error;
>  
>   xfs_ilock(ip, XFS_ILOCK_EXCL);
> - if (flags & S_CTIME)
> - inode_set_ctime_to_ts(inode, *now);
> + if (flags & (S_CTIME|S_MTIME))

Minor nit: spaces around the '|' operator, i.e. (S_CTIME | S_MTIME).

Otherwise looks ok to me...
Acked-by: Darrick J. Wong 

--D
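
As a rough illustration of the rule this patch enforces (ctime is bumped
whenever mtime is), here is a minimal user-space sketch. The toy_times
struct, the TOY_S_* flags and toy_update_time() are hypothetical stand-ins,
not the kernel's inode or S_* definitions, and the sketch ignores locking,
transactions and multigrain timestamps entirely.

```c
/* Hypothetical flag bits for this sketch only; not the kernel's S_* flags. */
#define TOY_S_ATIME 0x1
#define TOY_S_MTIME 0x2
#define TOY_S_CTIME 0x4

/* Hypothetical stand-in for the three inode timestamps. */
struct toy_times {
	long atime;
	long mtime;
	long ctime;
};

/* Apply "now" to the timestamps selected by flags. */
void toy_update_time(struct toy_times *t, int flags, long now)
{
	/* A change to mtime always implies a change to ctime. */
	if (flags & (TOY_S_CTIME | TOY_S_MTIME))
		t->ctime = now;
	if (flags & TOY_S_MTIME)
		t->mtime = now;
	if (flags & TOY_S_ATIME)
		t->atime = now;
}
```

Calling toy_update_time() with only TOY_S_MTIME set updates both mtime and
ctime, while TOY_S_ATIME alone leaves ctime untouched.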

> + now = inode_set_ctime_current(inode);
> + else
> + now = current_time(inode);
> +
>   if (flags & S_MTIME)
> - inode->i_mtime = *now;
> + inode->i_mtime = now;
>   if (flags & S_ATIME)
> - inode->i_atime = *now;
> + inode->i_atime = now;
>  
>   xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>   xfs_trans_log_inode(tp, ip, log_flags);
> 
> -- 
> 2.41.0
> 



Re: [f2fs-dev] [PATCH 11/12] xfs: drop s_umount over opening the log and RT devices

2023-08-05 Thread Darrick J. Wong

On Sat, Aug 05, 2023 at 10:32:39AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 02, 2023 at 09:32:19AM -0700, Darrick J. Wong wrote:
> > > + /* see get_tree_bdev why this is needed and safe */
> > 
> > Which part of get_tree_bdev?  Is it this?
> > 
> > /*
> >  * s_umount nests inside open_mutex during
> >  * __invalidate_device().  blkdev_put() acquires
> >  * open_mutex and can't be called under s_umount.  Drop
> >  * s_umount temporarily.  This is safe as we're
> >  * holding an active reference.
> >  */
> > up_write(&s->s_umount);
> > blkdev_put(bdev, fc->fs_type);
> > down_write(&s->s_umount);
> 
> Yes.  With the refactoring earlier in the series get_tree_bdev should
> be trivial enough to not need a more specific reference.  If you
> think there's a better way to refer to it I can update the comment,
> though.

How about:

/*
 * blkdev_put can't be called under s_umount, see the comment in
 * get_tree_bdev for more details
 */

with that and the label name change,
Reviewed-by: Darrick J. Wong 

--D


> > >   mp->m_logdev_targp = mp->m_ddev_targp;
> > >   }
> > >  
> > > - return 0;
> > > + error = 0;
> > > +out_unlock:
> > > + down_write(&sb->s_umount);
> > 
> > Isn't down_write taking s_umount?  I think the label should be
> > out_relock or something less misleading.
> 
> Agreed.  Christian, can you just change this in your branch, or should
> I send an incremental patch?
> 



Re: [f2fs-dev] [PATCH v6 5/7] xfs: switch to multigrain timestamps

2023-08-02 Thread Darrick J. Wong

On Tue, Jul 25, 2023 at 10:58:18AM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> always set.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
>  fs/xfs/xfs_iops.c   | 4 ++--
>  fs/xfs/xfs_super.c  | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 6b2296ff248a..ad22656376d3 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + ASSERT(flags & XFS_ICHGTIME_CHG);
>  
> + tv = inode_set_ctime_current(inode);
>   if (flags & XFS_ICHGTIME_MOD)
>   inode->i_mtime = tv;
> - if (flags & XFS_ICHGTIME_CHG)
> - inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 3a9363953ef2..3f89ef5a2820 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -573,10 +573,10 @@ xfs_vn_getattr(
>   stat->gid = vfsgid_into_kgid(vfsgid);
>   stat->ino = ip->i_ino;
>   stat->atime = inode->i_atime;
> - stat->mtime = inode->i_mtime;
> - stat->ctime = inode_get_ctime(inode);
>   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>  
> + fill_mg_cmtime(request_mask, inode, stat);

Huh.  I would've thought @stat would come first since that's what we're
acting upon, but ... eh. :)

If everyone else is ok with the fill_mg_cmtime signature,
Acked-by: Darrick J. Wong 

--D

> +
>   if (xfs_has_v3inodes(mp)) {
>   if (request_mask & STATX_BTIME) {
>   stat->result_mask |= STATX_BTIME;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 818510243130..4b10edb2c972 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2009,7 +2009,7 @@ static struct file_system_type xfs_fs_type = {
>   .init_fs_context= xfs_init_fs_context,
>   .parameters = xfs_fs_parameters,
>   .kill_sb= kill_block_super,
> - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
>  };
>  MODULE_ALIAS_FS("xfs");
>  
> 
> -- 
> 2.41.0
> 



Re: [f2fs-dev] [PATCH 12/12] xfs use fs_holder_ops for the log and RT devices

2023-08-02 Thread Darrick J. Wong

On Wed, Aug 02, 2023 at 05:41:31PM +0200, Christoph Hellwig wrote:
> Use the generic fs_holder_ops to shut down the file system when the
> log or RT device goes away instead of duplicating the logic.
> 
> Signed-off-by: Christoph Hellwig 

Nice cleanup,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_super.c | 17 +++--
>  1 file changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d5042419ed9997..338eba71ff8667 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -377,17 +377,6 @@ xfs_setup_dax_always(
>   return 0;
>  }
>  
> -static void
> -xfs_bdev_mark_dead(
> - struct block_device *bdev)
> -{
> - xfs_force_shutdown(bdev->bd_holder, SHUTDOWN_DEVICE_REMOVED);
> -}
> -
> -static const struct blk_holder_ops xfs_holder_ops = {
> - .mark_dead  = xfs_bdev_mark_dead,
> -};
> -
>  STATIC int
>  xfs_blkdev_get(
>   xfs_mount_t *mp,
> @@ -396,8 +385,8 @@ xfs_blkdev_get(
>  {
>   int error = 0;
>  
> - *bdevp = blkdev_get_by_path(name, BLK_OPEN_READ | BLK_OPEN_WRITE, mp,
> - &xfs_holder_ops);
> + *bdevp = blkdev_get_by_path(name, BLK_OPEN_READ | BLK_OPEN_WRITE,
> + mp->m_super, &fs_holder_ops);
>   if (IS_ERR(*bdevp)) {
>   error = PTR_ERR(*bdevp);
>   xfs_warn(mp, "Invalid device [%s], error=%d", name, error);
> @@ -412,7 +401,7 @@ xfs_blkdev_put(
>   struct block_device *bdev)
>  {
>   if (bdev)
> - blkdev_put(bdev, mp);
> + blkdev_put(bdev, mp->m_super);
>  }
>  
>  STATIC void
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 11/12] xfs: drop s_umount over opening the log and RT devices

2023-08-02 Thread Darrick J. Wong

On Wed, Aug 02, 2023 at 05:41:30PM +0200, Christoph Hellwig wrote:
> Just like get_tree_bdev needs to drop s_umount when opening the main
> device, we need to do the same for the xfs log and RT devices to avoid a
> potential lock order reversal with s_umount for the mark_dead path.
> 
> It might be preferable to just drop s_umount over ->fill_super entirely,
> but that will require a fairly massive audit first, so we'll do the easy
> version here first.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_super.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 8185102431301d..d5042419ed9997 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -448,17 +448,21 @@ STATIC int
>  xfs_open_devices(
>   struct xfs_mount*mp)
>  {
> - struct block_device *ddev = mp->m_super->s_bdev;
> + struct super_block  *sb = mp->m_super;
> + struct block_device *ddev = sb->s_bdev;
>   struct block_device *logdev = NULL, *rtdev = NULL;
>   int error;
>  
> + /* see get_tree_bdev why this is needed and safe */

Which part of get_tree_bdev?  Is it this?

/*
 * s_umount nests inside open_mutex during
 * __invalidate_device().  blkdev_put() acquires
 * open_mutex and can't be called under s_umount.  Drop
 * s_umount temporarily.  This is safe as we're
 * holding an active reference.
 */
up_write(&s->s_umount);
blkdev_put(bdev, fc->fs_type);
down_write(&s->s_umount);



> + up_write(&sb->s_umount);
> +
>   /*
>* Open real time and log devices - order is important.
>*/
>   if (mp->m_logname) {
>   error = xfs_blkdev_get(mp, mp->m_logname, &logdev);
>   if (error)
> - return error;
> + goto out_unlock;
>   }
>  
>   if (mp->m_rtname) {
> @@ -496,7 +500,10 @@ xfs_open_devices(
>   mp->m_logdev_targp = mp->m_ddev_targp;
>   }
>  
> - return 0;
> + error = 0;
> +out_unlock:
> + down_write(&sb->s_umount);

Isn't down_write taking s_umount?  I think the label should be
out_relock or something less misleading.

--D
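
The naming point is easier to see in a stripped-down form: the label is
reached on every exit path and retakes the lock, so "relock" describes it
better than "unlock". The following is only a user-space sketch under that
reading; toy_lock, toy_step_ok(), toy_step_fail() and toy_open_devices() are
hypothetical names, and a boolean stands in for the real rwsem.

```c
#include <stdbool.h>

/* Hypothetical stand-in for s_umount; only tracks whether it is held. */
struct toy_lock {
	bool held;
};

/* Example steps for the sketch: one that succeeds, one that fails. */
int toy_step_ok(void)
{
	return 0;
}

int toy_step_fail(void)
{
	return -5;
}

/*
 * Called with the lock held. Drops it around the two open steps and
 * retakes it on every exit path, so the caller always gets the lock back.
 */
int toy_open_devices(struct toy_lock *lock, int (*open_log)(void),
		     int (*open_rt)(void))
{
	int error;

	lock->held = false;		/* plays the role of up_write() */

	error = open_log();
	if (error)
		goto out_relock;

	error = open_rt();
	if (error)
		goto out_relock;

	error = 0;
out_relock:
	lock->held = true;		/* plays the role of down_write() */
	return error;
}
```

Whether a step fails or not, the caller observes lock->held == true after the
call returns, which is the invariant the label name should advertise.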

> + return error;
>  
>   out_free_rtdev_targ:
>   if (mp->m_rtdev_targp)
> @@ -508,7 +515,7 @@ xfs_open_devices(
>   out_close_logdev:
>   if (logdev && logdev != ddev)
>   xfs_blkdev_put(mp, logdev);
> - return error;
> + goto out_unlock;
>  }
>  
>  /*
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 09/12] fs: factor out a direct_write_fallback helper

2023-06-05 Thread Darrick J. Wong

On Thu, Jun 01, 2023 at 04:59:01PM +0200, Christoph Hellwig wrote:
> Add a helper dealing with handling the syncing of a buffered write fallback
> for direct I/O.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Damien Le Moal 
> Reviewed-by: Miklos Szeredi 

Looks good to me; whose tree do you want this to go through?

Reviewed-by: Darrick J. Wong 

--D
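
For readers skimming the quoted helper, its return-value policy can be
modelled in a few lines of user-space C. This is only a sketch: the function
name fallback_result() is hypothetical, and the sync_err parameter is an
assumed stand-in for the result of filemap_write_and_wait_range(); no page
cache is involved.

```c
#include <sys/types.h>	/* ssize_t */

/*
 * Hypothetical user-space model of the return-value policy of the quoted
 * direct_write_fallback(). Not kernel code.
 */
ssize_t fallback_result(ssize_t direct_written, ssize_t buffered_written,
			int sync_err)
{
	/* Buffered fallback failed: report direct bytes if any, else the error. */
	if (buffered_written < 0)
		return direct_written ? direct_written : buffered_written;

	/* Writeback failed: only the direct-written bytes are known good. */
	if (sync_err < 0)
		return direct_written ? direct_written : (ssize_t)sync_err;

	/* Both phases succeeded: report the combined byte count. */
	return direct_written + buffered_written;
}
```

In this model an error is only returned when no bytes at all were written by
direct I/O; otherwise the caller sees a short write.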

> ---
>  fs/libfs.c | 41 
>  include/linux/fs.h |  2 ++
>  mm/filemap.c   | 66 +++---
>  3 files changed, 58 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 89cf614a327158..5b851315eeed03 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1613,3 +1613,44 @@ u64 inode_query_iversion(struct inode *inode)
>   return cur >> I_VERSION_QUERIED_SHIFT;
>  }
>  EXPORT_SYMBOL(inode_query_iversion);
> +
> +ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
> + ssize_t direct_written, ssize_t buffered_written)
> +{
> + struct address_space *mapping = iocb->ki_filp->f_mapping;
> + loff_t pos = iocb->ki_pos - buffered_written;
> + loff_t end = iocb->ki_pos - 1;
> + int err;
> +
> + /*
> +  * If the buffered write fallback returned an error, we want to return
> +  * the number of bytes which were written by direct I/O, or the error
> +  * code if that was zero.
> +  *
> +  * Note that this differs from normal direct-io semantics, which will
> +  * return -EFOO even if some bytes were written.
> +  */
> + if (unlikely(buffered_written < 0)) {
> + if (direct_written)
> + return direct_written;
> + return buffered_written;
> + }
> +
> + /*
> +  * We need to ensure that the page cache pages are written to disk and
> +  * invalidated to preserve the expected O_DIRECT semantics.
> +  */
> + err = filemap_write_and_wait_range(mapping, pos, end);
> + if (err < 0) {
> + /*
> +  * We don't know how much we wrote, so just return the number of
> +  * bytes which were direct-written
> +  */
> + if (direct_written)
> + return direct_written;
> + return err;
> + }
> + invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> + return direct_written + buffered_written;
> +}
> +EXPORT_SYMBOL_GPL(direct_write_fallback);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 91021b4e1f6f48..6af25137543824 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2738,6 +2738,8 @@ extern ssize_t __generic_file_write_iter(struct kiocb 
> *, struct iov_iter *);
>  extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
>  extern ssize_t generic_file_direct_write(struct kiocb *, struct iov_iter *);
>  ssize_t generic_perform_write(struct kiocb *, struct iov_iter *);
> +ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
> + ssize_t direct_written, ssize_t buffered_written);
>  
>  ssize_t vfs_iter_read(struct file *file, struct iov_iter *iter, loff_t *ppos,
>   rwf_t flags);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ddb6f8aa86d6ca..137508da5525b6 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4006,23 +4006,19 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>  {
>   struct file *file = iocb->ki_filp;
>   struct address_space *mapping = file->f_mapping;
> - struct inode*inode = mapping->host;
> - ssize_t written = 0;
> - ssize_t err;
> - ssize_t status;
> + struct inode *inode = mapping->host;
> + ssize_t ret;
>  
> - err = file_remove_privs(file);
> - if (err)
> - goto out;
> + ret = file_remove_privs(file);
> + if (ret)
> + return ret;
>  
> - err = file_update_time(file);
> - if (err)
> - goto out;
> + ret = file_update_time(file);
> + if (ret)
> + return ret;
>  
>   if (iocb->ki_flags & IOCB_DIRECT) {
> - loff_t pos, endbyte;
> -
> - written = generic_file_direct_write(iocb, from);
> + ret = generic_file_direct_write(iocb, from);
>   /*
>* If the write stopped short of completing, fall back to
>* buffered writes.  Some filesystems do this for writes to
> @@ -4030,45 +4026,13 @@ ssize_t __generic_file_write_it

Re: [f2fs-dev] [PATCH 01/11] backing_dev: remove current->backing_dev_info

2023-05-24 Thread Darrick J. Wong

On Wed, May 24, 2023 at 08:38:00AM +0200, Christoph Hellwig wrote:
> The last user of current->backing_dev_info disappeared in commit
> b9b1335e6403 ("remove bdi_congested() and wb_congested() and related
> functions").  Remove the field and all assignments to it.
> 
> Signed-off-by: Christoph Hellwig 

Yay code removal!!!! :)

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/btrfs/file.c   | 6 +-
>  fs/ceph/file.c| 4 
>  fs/ext4/file.c| 2 --
>  fs/f2fs/file.c| 2 --
>  fs/fuse/file.c| 4 
>  fs/gfs2/file.c| 2 --
>  fs/nfs/file.c | 5 +
>  fs/ntfs/file.c| 2 --
>  fs/ntfs3/file.c   | 3 ---
>  fs/xfs/xfs_file.c | 4 
>  include/linux/sched.h | 3 ---
>  mm/filemap.c  | 3 ---
>  12 files changed, 2 insertions(+), 38 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index f649647392e0e4..ecd43ab66fa6c7 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1145,7 +1145,6 @@ static int btrfs_write_check(struct kiocb *iocb, struct 
> iov_iter *from,
>   !(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW | 
> BTRFS_INODE_PREALLOC)))
>   return -EAGAIN;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = file_remove_privs(file);
>   if (ret)
>   return ret;
> @@ -1165,10 +1164,8 @@ static int btrfs_write_check(struct kiocb *iocb, 
> struct iov_iter *from,
>   loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
>  
>   ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
> - if (ret) {
> - current->backing_dev_info = NULL;
> + if (ret)
>   return ret;
> - }
>   }
>  
>   return 0;
> @@ -1689,7 +1686,6 @@ ssize_t btrfs_do_write_iter(struct kiocb *iocb, struct 
> iov_iter *from,
>   if (sync)
>   atomic_dec(>sync_writers);
>  
> - current->backing_dev_info = NULL;
>   return num_written;
>  }
>  
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f4d8bf7dec88a8..c8ef72f723badd 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1791,9 +1791,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   else
>   ceph_start_io_write(inode);
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   if (iocb->ki_flags & IOCB_APPEND) {
>   err = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
>   if (err < 0)
> @@ -1940,7 +1937,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>   ceph_end_io_write(inode);
>  out_unlocked:
>   ceph_free_cap_flush(prealloc_cf);
> - current->backing_dev_info = NULL;
>   return written ? written : err;
>  }
>  
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d101b3b0c7dad8..bc430270c23c19 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -285,9 +285,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb 
> *iocb,
>   if (ret <= 0)
>   goto out;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = generic_perform_write(iocb, from);
> - current->backing_dev_info = NULL;
>  
>  out:
>   inode_unlock(inode);
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5ac53d2627d20d..4f423d367a44b9 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4517,9 +4517,7 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb 
> *iocb,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   return -EOPNOTSUPP;
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   ret = generic_perform_write(iocb, from);
> - current->backing_dev_info = NULL;
>  
>   if (ret > 0) {
>   iocb->ki_pos += ret;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 89d97f6188e05e..97d435874b14aa 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1362,9 +1362,6 @@ static ssize_t fuse_cache_write_iter(struct kiocb 
> *iocb, struct iov_iter *from)
>  writethrough:
>   inode_lock(inode);
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   err = generic_write_checks(iocb, from);
>   if (err <= 0)
>   goto out;
> @@ -1409,7 +1406,6 @@ static ssize_t fuse_cache_write_iter(struct kiocb 
> *iocb, struct iov_iter *from)
>

Re: [f2fs-dev] cleanup the filemap / direct I/O interaction

2023-05-22 Thread Darrick J. Wong

On Fri, May 19, 2023 at 11:35:08AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series cleans up some of the generic write helper calling
> conventions and the page cache writeback / invalidation for
> direct I/O.  This is a spinoff from the no-bufferhead kernel
> project, for which we'll want to use an iomap based buffered
> write path in the block layer.

Heh.

For patches 3 and 8, I wonder if you could just get rid of
current->backing_dev_info?

For patches 2, 4-6, and 10:
Acked-by: Darrick J. Wong 

For patches 1, 7, and 9:
Reviewed-by: Darrick J. Wong 

The fuse patches I have no idea about. :/

--D

> diffstat:
>  block/fops.c|   18 
>  fs/ceph/file.c  |6 -
>  fs/direct-io.c  |   10 --
>  fs/ext4/file.c  |   12 ---
>  fs/f2fs/file.c  |3 
>  fs/fuse/file.c  |   47 ++--
>  fs/gfs2/file.c  |7 -
>  fs/iomap/buffered-io.c  |   12 ++-
>  fs/iomap/direct-io.c|   88 --
>  fs/libfs.c  |   36 +
>  fs/nfs/file.c   |6 -
>  fs/xfs/xfs_file.c   |7 -
>  fs/zonefs/file.c|4 -
>  include/linux/fs.h  |7 -
>  include/linux/pagemap.h |4 +
>  mm/filemap.c|  184 
> +---
>  16 files changed, 190 insertions(+), 261 deletions(-)



Re: [f2fs-dev] [PATCH 08/13] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-05-22 Thread Darrick J. Wong

On Fri, May 19, 2023 at 11:35:16AM +0200, Christoph Hellwig wrote:
> Move the assignment to current->backing_dev_info from the callers into
> iomap_file_buffered_write to reduce boiler plate code and reduce the
> scope to just around the page dirtying loop.
> 
> Note that zonefs was missing this assignment before.

I'm still wondering (a) what the hell current->backing_dev_info is for,
and (b) if we need it around the iomap_unshare operation.

$ git grep current..backing_dev_info
fs/btrfs/file.c:1148:   current->backing_dev_info = inode_to_bdi(inode);
fs/btrfs/file.c:1169:   current->backing_dev_info = NULL;
fs/btrfs/file.c:1692:   current->backing_dev_info = NULL;
fs/ceph/file.c:1795:current->backing_dev_info = inode_to_bdi(inode);
fs/ceph/file.c:1943:current->backing_dev_info = NULL;
fs/ext4/file.c:288: current->backing_dev_info = inode_to_bdi(inode);
fs/ext4/file.c:290: current->backing_dev_info = NULL;
fs/f2fs/file.c:4520:current->backing_dev_info = inode_to_bdi(inode);
fs/f2fs/file.c:4522:current->backing_dev_info = NULL;
fs/fuse/file.c:1366:current->backing_dev_info = inode_to_bdi(inode);
fs/fuse/file.c:1412:current->backing_dev_info = NULL;
fs/gfs2/file.c:1044:current->backing_dev_info = inode_to_bdi(inode);
fs/gfs2/file.c:1048:current->backing_dev_info = NULL;
fs/nfs/file.c:652:  current->backing_dev_info = inode_to_bdi(inode);
fs/nfs/file.c:654:  current->backing_dev_info = NULL;
fs/ntfs/file.c:1914:current->backing_dev_info = inode_to_bdi(vi);
fs/ntfs/file.c:1918:current->backing_dev_info = NULL;
fs/ntfs3/file.c:823:current->backing_dev_info = inode_to_bdi(inode);
fs/ntfs3/file.c:996:current->backing_dev_info = NULL;
fs/xfs/xfs_file.c:721:  current->backing_dev_info = inode_to_bdi(inode);
fs/xfs/xfs_file.c:756:  current->backing_dev_info = NULL;
mm/filemap.c:3995:  current->backing_dev_info = inode_to_bdi(inode);
mm/filemap.c:4056:  current->backing_dev_info = NULL;

AFAICT nobody uses it at all?  Unless there's some bizarre user that
isn't extracting it from @current?

Oh, hey, new question (c) isn't this set incorrectly for xfs realtime
files?

--D

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/file.c | 3 ---
>  fs/iomap/buffered-io.c | 3 +++
>  fs/xfs/xfs_file.c  | 5 -
>  3 files changed, 3 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 499ef174dec138..261897fcfbc495 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -25,7 +25,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include "gfs2.h"
> @@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
> *iocb,
>   goto out_unlock;
>   }
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   pagefault_disable();
>   ret = iomap_file_buffered_write(iocb, from, _iomap_ops);
>   pagefault_enable();
> - current->backing_dev_info = NULL;
>   if (ret > 0)
>   written += ret;
>  
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 550525a525c45c..b2779bd1f10611 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2010 Red Hat, Inc.
>   * Copyright (C) 2016-2019 Christoph Hellwig.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -869,8 +870,10 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   iter.flags |= IOMAP_NOWAIT;
>  
> + current->backing_dev_info = inode_to_bdi(iter.inode);
>   while ((ret = iomap_iter(&iter, ops)) > 0)
>   iter.processed = iomap_write_iter(&iter, i);
> + current->backing_dev_info = NULL;
>  
>   if (unlikely(ret < 0))
>   return ret;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index bfba10e0b0f3c2..98d763cc3b114c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,7 +27,6 @@
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -717,9 +716,6 @@ xfs_file_buffered_write(
>   if (ret)
>   goto out;
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   trace_xfs_file_buffered_write(iocb, from);
>   ret = iomap_file_buffered_write(iocb, from,
>   _buffered_write_iomap_ops);
> @@ -751,7 +747,6 @@ xfs_file_buffered_write(
>   goto write_retry;
>   }
>  
> - current->backing_dev_info = NULL;
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 11/17] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-04-24 Thread Darrick J. Wong

On Mon, Apr 24, 2023 at 07:49:20AM +0200, Christoph Hellwig wrote:
> Move the assignment to current->backing_dev_info from the callers into
> iomap_file_buffered_write.  Note that zonefs was missing this assignment
> before.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/file.c | 3 ---
>  fs/iomap/buffered-io.c | 4 
>  fs/xfs/xfs_file.c  | 5 -
>  3 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 8c4fad359ff538..4d88c6080b3e30 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -25,7 +25,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  
>  #include "gfs2.h"
> @@ -1041,11 +1040,9 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
> *iocb,
>   goto out_unlock;
>   }
>  
> - current->backing_dev_info = inode_to_bdi(inode);
>   pagefault_disable();
>   ret = iomap_file_buffered_write(iocb, from, _iomap_ops);
>   pagefault_enable();
> - current->backing_dev_info = NULL;
>   if (ret > 0) {
>   iocb->ki_pos += ret;
>   written += ret;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 2986be63d2bea6..3d5042efda202a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -3,6 +3,7 @@
>   * Copyright (C) 2010 Red Hat, Inc.
>   * Copyright (C) 2016-2019 Christoph Hellwig.
>   */
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -876,8 +877,11 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
>   if (iocb->ki_flags & IOCB_NOWAIT)
>   iter.flags |= IOMAP_NOWAIT;
>  
> + current->backing_dev_info = inode_to_bdi(iter.inode);

Dumb question from me late on a Sunday night, but does the iomap_unshare
code need to set this too?  Since it works by dirtying pagecache folios
without actually changing the contents?

--D

>   while ((ret = iomap_iter(&iter, ops)) > 0)
>   iter.processed = iomap_write_iter(&iter, i);
> + current->backing_dev_info = NULL;
> +
>   if (iter.pos == iocb->ki_pos)
>   return ret;
>   return iter.pos - iocb->ki_pos;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 705250f9f90a1b..f5442e689baf15 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -27,7 +27,6 @@
>  
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -717,9 +716,6 @@ xfs_file_buffered_write(
>   if (ret)
>   goto out;
>  
> - /* We can write back this queue in page reclaim */
> - current->backing_dev_info = inode_to_bdi(inode);
> -
>   trace_xfs_file_buffered_write(iocb, from);
>   ret = iomap_file_buffered_write(iocb, from,
>   _buffered_write_iomap_ops);
> @@ -753,7 +749,6 @@ xfs_file_buffered_write(
>   goto write_retry;
>   }
>  
> - current->backing_dev_info = NULL;
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-05 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 06:02:21PM +0200, Andrey Albershteyn wrote:
> Hi Darrick,
> 
> On Tue, Apr 04, 2023 at 09:36:02AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> > > In case of different Merkle tree block size fs-verity expects
> > > ->read_merkle_tree_page() to return Merkle tree page filled with
> > > Merkle tree blocks. XFS stores each Merkle tree block under an
> > > extended attribute. Those attributes are addressed by block offset
> > > into the Merkle tree.
> > > 
> > > This patch makes ->read_merkle_tree_page() fetch multiple Merkle
> > > tree blocks based on the size ratio. Also, the reference to each xfs_buf
> > > is passed with page->private to ->drop_page().
> > > 
> > > Signed-off-by: Andrey Albershteyn 
> > > ---
> > >  fs/xfs/xfs_verity.c | 74 +++--
> > >  fs/xfs/xfs_verity.h |  8 +
> > >  2 files changed, 66 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> > > index a9874ff4efcd..ef0aff216f06 100644
> > > --- a/fs/xfs/xfs_verity.c
> > > +++ b/fs/xfs/xfs_verity.c
> > > @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
> > >   struct page *page = NULL;
> > >   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
> > >   uint32_tbs = 1 << log_blocksize;
> > > + int blocks_per_page =
> > > + (1 << (PAGE_SHIFT - log_blocksize));
> > > + int n = 0;
> > > + int offset = 0;
> > >   struct xfs_da_args  args = {
> > >   .dp = ip,
> > >   .attr_filter= XFS_ATTR_VERITY,
> > > @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
> > >   .valuelen   = bs,
> > >   };
> > >   int error = 0;
> > > + boolis_checked = true;
> > > + struct xfs_verity_buf_list  *buf_list;
> > >  
> > >   page = alloc_page(GFP_KERNEL);
> > >   if (!page)
> > >   return ERR_PTR(-ENOMEM);
> > >  
> > > - error = xfs_attr_get();
> > > - if (error) {
> > > - kmem_free(args.value);
> > > - xfs_buf_rele(args.bp);
> > > + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> > > + if (!buf_list) {
> > >   put_page(page);
> > > - return ERR_PTR(-EFAULT);
> > > + return ERR_PTR(-ENOMEM);
> > >   }
> > >  
> > > - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> > > + /*
> > > +  * Fill the page with Merkle tree blocks. The blcoks_per_page is higher
> > > +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> > > +  * PAGE SIZE
> > > +  */
> > > + for (n = 0; n < blocks_per_page; n++) {
> > 
> > Ahah, ok, that's why we can't pass the xfs_buf pages up to fsverity.
> > 
> > > + offset = bs * n;
> > > + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));
> > 
> > Really this ought to be a typechecked helper...
> > 
> > struct xfs_fsverity_merkle_key {
> > __be64  merkleoff;
> 
> Sure, thanks, will change this
> 
> > };
> > 
> > static inline void
> > > xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *k, loff_t pos)
> > > {
> > > k->merkleoff = cpu_to_be64(pos);
> > }
> > 
> > 
> > 
> > > + args.name = (const uint8_t *)
> > > +
> > > + error = xfs_attr_get();
> > > + if (error) {
> > > + kmem_free(args.value);
> > > + /*
> > > +  * No more Merkle tree blocks (e.g. this was the last
> > > +  * block of the tree)
> > > +  */
> > > + if (error == -ENOATTR)
> > > + break;
> > > + xfs_buf_rele(args.bp);
> > > + put_page(page);
> > > + kmem_free(buf_list);
> > > + return ERR_PTR(-EFAULT);
> > > + }
> > > +
> > > + buf_list->bufs[buf_list->buf_count++] = args.bp;
> > > +
> > > + /* One of the buffers was dropped */
> > > + if (!(ar

Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-05 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 05:01:42PM +0200, Andrey Albershteyn wrote:
> On Tue, Apr 04, 2023 at 09:10:47AM -0700, Darrick J. Wong wrote:
> > On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> > > The direct path is not supported on verity files. Attempts to use direct
> > > I/O path on such files should fall back to buffered I/O path.
> > > 
> > > Signed-off-by: Andrey Albershteyn 
> > > ---
> > >  fs/xfs/xfs_file.c | 14 +++---
> > >  1 file changed, 11 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 947b5c436172..9e072e82f6c1 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -244,7 +244,8 @@ xfs_file_dax_read(
> > >   struct kiocb*iocb,
> > >   struct iov_iter *to)
> > >  {
> > > - struct xfs_inode*ip = XFS_I(iocb->ki_filp->f_mapping->host);
> > > + struct inode*inode = iocb->ki_filp->f_mapping->host;
> > > + struct xfs_inode*ip = XFS_I(inode);
> > >   ssize_t ret = 0;
> > >  
> > >   trace_xfs_file_dax_read(iocb, to);
> > > @@ -297,10 +298,17 @@ xfs_file_read_iter(
> > >  
> > >   if (IS_DAX(inode))
> > >   ret = xfs_file_dax_read(iocb, to);
> > > - else if (iocb->ki_flags & IOCB_DIRECT)
> > > + else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
> > >   ret = xfs_file_dio_read(iocb, to);
> > > - else
> > > + else {
> > > + /*
> > > +  * In case fs-verity is enabled, we also fallback to the
> > > +  * buffered read from the direct read path. Therefore,
> > > +  * IOCB_DIRECT is set and need to be cleared
> > > +  */
> > > + iocb->ki_flags &= ~IOCB_DIRECT;
> > >   ret = xfs_file_buffered_read(iocb, to);
> > 
> > XFS doesn't usually allow directio fallback to the pagecache. Why
> > would fsverity be any different?
> 
> Didn't know that, this is what happens on ext4 so I did the same.
> Then it probably make sense to just error on DIRECT on verity
> sealed file.

Thinking about this a little more -- I suppose we shouldn't just go
breaking directio reads from a verity file if we can help it.  Is there
a way to ask fsverity to perform its validation against some arbitrary
memory buffer that happens to be fs-block aligned?  In which case we
could support fsblock-aligned directio reads without falling back to the
page cache?
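
To make the "fsblock-aligned" condition concrete, here is a minimal
userspace sketch of the check such a direct read would have to pass
before verification against the caller's buffer could even be
considered. The function name is hypothetical (it is not an existing
kernel or fsverity helper), and the sketch assumes the fs block size is
a nonzero power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not an existing kernel function: returns true when
 * both the file position and the length are multiples of the fs block size.
 * Assumption: blocksize is a nonzero power of two.
 */
static bool dio_read_is_fsblock_aligned(uint64_t pos, uint64_t len,
					uint32_t blocksize)
{
	uint64_t mask;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	mask = (uint64_t)blocksize - 1;

	/* Any low bit set in either value means a partial block. */
	return ((pos | len) & mask) == 0;
}
```

Reads that fail this test would still need some other answer (an error
or a fallback), which is the open question above.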

--D

> > 
> > --D
> > 
> > > + }
> > >  
> > >   if (ret > 0)
> > >   XFS_STATS_ADD(mp, xs_read_bytes, ret);
> > > -- 
> > > 2.38.4
> > > 
> > 
> 
> -- 
> - Andrey
> 



Re: [f2fs-dev] [PATCH v2 09/23] iomap: allow filesystem to implement read path verification

2023-04-05 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 01:01:16PM +0200, Andrey Albershteyn wrote:
> Hi Christoph,
> 
> On Tue, Apr 04, 2023 at 08:37:02AM -0700, Christoph Hellwig wrote:
> > >   if (iomap_block_needs_zeroing(iter, pos)) {
> > >   folio_zero_range(folio, poff, plen);
> > > + if (iomap->flags & IOMAP_F_READ_VERITY) {
> > 
> > > Why do we need the new flag vs just testing that folio_ops and
> > folio_ops->verify_folio is non-NULL?
> 
> Yes, it can be just test, haven't noticed that it's used only here,
> initially I used it in several places.
> 
> > 
> > > - ctx->bio = bio_alloc(iomap->bdev, bio_max_segs(nr_vecs),
> > > -  REQ_OP_READ, gfp);
> > > + ctx->bio = bio_alloc_bioset(iomap->bdev, bio_max_segs(nr_vecs),
> > > + REQ_OP_READ, GFP_NOFS, 
> > > _read_ioend_bioset);
> > 
> > All other callers don't really need the larger bioset, so I'd avoid
> > the unconditional allocation here, but more on that later.
> 
> Ok, make sense.
> 
> > 
> > > + ioend = container_of(ctx->bio, struct iomap_read_ioend,
> > > + read_inline_bio);
> > > + ioend->io_inode = iter->inode;
> > > + if (ctx->ops && ctx->ops->prepare_ioend)
> > > + ctx->ops->prepare_ioend(ioend);
> > > +
> > 
> > So what we're doing in writeback and direct I/O, is to:
> > 
> >  a) have a submit_bio hook
> >  b) allow the file system to then hook the bi_end_io caller
> >  c) (only in direct O/O for now) allow the file system to provide
> > a bio_set to allocate from
> 
> I see.
> 
> > 
> > I wonder if that also makes sense and keep all the deferral in the
> > file system.  We'll need that for the btrfs iomap conversion anyway,
> > and it seems more flexible.  The ioend processing would then move into
> > XFS.
> > 
> 
> Not sure what you mean here.

I /think/ Christoph is talking about allowing callers of iomap pagecache
operations to supply a custom submit_bio function and a bio_set so that
filesystems can add in their own post-IO processing and appropriately
sized (read: minimum you can get away with) bios.  I imagine btrfs has
quite a lot of (read) ioend processing they need to do, as will xfs now
that you're adding fsverity.

> > > @@ -156,6 +160,11 @@ struct iomap_folio_ops {
> > >* locked by the iomap code.
> > >*/
> > >   bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
> > > +
> > > + /*
> > > +  * Verify folio when successfully read
> > > +  */
> > > + bool (*verify_folio)(struct folio *folio, loff_t pos, unsigned int len);

Any reason why we shouldn't return the usual negative errno?
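
A rough userspace sketch of what an errno-returning hook could look
like, for illustration only. Every type and function name below is a
stand-in invented for the sketch (none of them are the iomap
definitions), and -EIO is just an example error code, not a statement
about what fsverity would return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct folio, only so the sketch compiles in userspace. */
struct folio_sketch {
	int corrupt;	/* nonzero simulates a failed verification */
};

/* Stand-in for the ops table that would carry the hook. */
struct read_ops_sketch {
	/* Returns 0 on success or a negative errno, instead of bool. */
	int (*verify_folio)(struct folio_sketch *folio, long long pos,
			    unsigned int len);
};

/* Example hook: reports simulated corruption as -EIO. */
static int sketch_verify(struct folio_sketch *folio, long long pos,
			 unsigned int len)
{
	(void)pos;
	(void)len;
	return folio->corrupt ? -EIO : 0;
}

/*
 * Caller side: a missing ops table or hook means there is nothing to
 * verify, and any error from the hook is passed up unchanged.
 */
static int finish_read(const struct read_ops_sketch *ops,
		       struct folio_sketch *folio, long long pos,
		       unsigned int len)
{
	assert(folio != NULL);
	if (!ops || !ops->verify_folio)
		return 0;
	return ops->verify_folio(folio, pos, len);
}
```

Testing the hook pointer for NULL also covers the earlier question
about whether a separate flag is needed at all.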

> > Why isn't this in iomap_readpage_ops?
> > 
> 
> Yes, it can be. But it appears to me to be more relevant to
> _folio_ops, any particular reason to move it there? Don't mind
> moving it to iomap_readpage_ops.

I think the point is that this is a general "check what we just read"
hook, and we're never going to need to re-validate verity contents,
right?  Hence it could be in readpage_ops instead of the general
iomap_folio_ops.

Is there a use case for ->verify_folio that isn't a read post-processing
step?

--D

> -- 
> - Andrey
> 



Re: [f2fs-dev] [PATCH 5/5] fstests/MAINTAINERS: add a co-maintainer for btrfs testing part

2023-04-04 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 01:14:11AM +0800, Zorro Lang wrote:
> Darrick J. Wong would like to nominate Anand Jain to help more on

In case anyone's wondering how this all came about -- Anand asked me how
he could do more upstream fstests review work, so I suggested that he
and I talk to Zorro about delegating some of the review and maintainer
work so that it's not all on Zorro to keep everything running.

> btrfs testing part (tests/btrfs and common/btrfs). He would like to
> be a co-maintainer of btrfs part, will help to review and test
> fstests btrfs related patches, and I might merge from him if there's
> big patchset. So CC him besides send to fstests@ list, when you have
> a btrfs fstests patch.
> 
> Signed-off-by: Zorro Lang 
> ---
> 
> Please btrfs list help to review this change, if you agree (or no objection),
> then I'll push this change.

This is what Zorro, Anand, and I sketched out as far as co-maintainer
responsibilities go:

> A co-maintainer will do:
> 1) Review patches are related with him.
> 2) Merge and test patches in his local git repo, and give the patch an ACK.
> 3) Maintainer will trust the ack from co-maintainer more (might merge 
> directly).
> 4) Maintainer might merge from co-maintainer when he has a big patchset wait 
> for
>merging.
> 
> Thanks,
> Zorro
> 
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 0ad12a38..9fc6c6b5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -108,6 +108,7 @@ Maintainers List
> or reviewer or co-maintainer can be in cc list.
>  
>  BTRFS
> +M:   Anand Jain 

I would like to hear agreement from the btrfs community about this
before making this particular change official.

--D

>  R:   Filipe Manana 
>  L:   linux-bt...@vger.kernel.org
>  S:   Supported
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 4/5] fstests/MAINTAINERS: add some specific reviewers

2023-04-04 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 01:14:10AM +0800, Zorro Lang wrote:
> Some people contribute to someone specific fs testing mostly, record
> some of them as Reviewer.
> 
> Signed-off-by: Zorro Lang 
> ---
> 
> If someone doesn't want to be in cc list of related fstests patch, please
> reply this email, I'll remove that reviewer line.
> 
> Or if someone else (who contribute to fstests very much) would like to a
> specific reviewer, nominate yourself to get a review.
> 
> Thanks,
> Zorro
> 
>  MAINTAINERS | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 620368cb..0ad12a38 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -108,6 +108,7 @@ Maintainers List
> or reviewer or co-maintainer can be in cc list.
>  
>  BTRFS
> +R:   Filipe Manana 
>  L:   linux-bt...@vger.kernel.org
>  S:   Supported
>  F:   tests/btrfs/
> @@ -137,16 +138,19 @@ F:  tests/f2fs/
>  F:   common/f2fs
>  
>  FSVERITY
> +R:   Eric Biggers 
>  L:   fsver...@lists.linux.dev
>  S:   Supported
>  F:   common/verity
>  
>  FSCRYPT
> +R:   Eric Biggers 
>  L:  linux-fscr...@vger.kernel.org
>  S:   Supported
>  F:   common/encrypt
>  
>  FS-IDMAPPED
> +R:   Christian Brauner 
>  L:   linux-fsde...@vger.kernel.org
>  S:   Supported
>  F:   src/vfs/
> @@ -163,6 +167,7 @@ S:Supported
>  F:   tests/ocfs2/
>  
>  OVERLAYFS
> +R:   Amir Goldstein 
>  L:   linux-unio...@vger.kernel.org
>  S:   Supported
>  F:   tests/overlay
> @@ -174,6 +179,7 @@ S:Supported
>  F:   tests/udf/
>  
>  XFS
> +R:   Darrick J. Wong 

For this one hunk,
Reviewed-by: Darrick J. Wong 

--D

>  L:   linux-...@vger.kernel.org
>  S:   Supported
>  F:   common/dump
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 3/5] fstests/MAINTAINERS: add supported mailing list

2023-04-04 Thread Darrick J. Wong

On Tue, Apr 04, 2023 at 03:16:53PM -0700, Eric Biggers wrote:
> Hi Zorro,
> 
> On Wed, Apr 05, 2023 at 01:14:09AM +0800, Zorro Lang wrote:
> > +FSVERITY
> > +L: fsver...@lists.linux.dev
> > +S: Supported
> > +F: common/verity
> > +
> > +FSCRYPT
> > +L:  linux-fscr...@vger.kernel.org
> > +S: Supported
> > +F: common/encrypt
> 
> Most of the encrypt and verity tests are in tests/generic/ and are in the
> 'encrypt' or 'verity' test groups.
> 
> These file patterns only pick up the common files, not the actual tests.
> 
> Have you considered adding a way to specify maintainers for a test group?
> Something like:
> 
> G:  encrypt
> 
> and
> 
> G:  verity

Yes, good suggestion.

--D

> - Eric



Re: [f2fs-dev] [PATCH 3/5] fstests/MAINTAINERS: add supported mailing list

2023-04-04 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 01:14:09AM +0800, Zorro Lang wrote:
> The fstests supports different kind of fs testing, better to cc
> specific fs mailing list for specific fs testing, to get better
> reviewing points. So record these mailing lists and files related
> with them in MAINTAINERS file.
> 
> Signed-off-by: Zorro Lang 
> ---
> 
> If someone mailing list doesn't want to be in cc list of related fstests
> patch, please reply this email, I'll remove that line.
> 
> Or if I missed someone mailing list, please feel free to tell me.
> 
> Thanks,
> Zorro
> 
>  MAINTAINERS | 77 +
>  1 file changed, 77 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 09b1a5a3..620368cb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -107,6 +107,83 @@ Maintainers List
> should send patch to fstests@ at least. Other relevant mailing list
> or reviewer or co-maintainer can be in cc list.
>  
> +BTRFS
> +L:   linux-bt...@vger.kernel.org
> +S:   Supported
> +F:   tests/btrfs/
> +F:   common/btrfs
> +
> +CEPH
> +L:   ceph-de...@vger.kernel.org
> +S:   Supported
> +F:   tests/ceph/
> +F:   common/ceph
> +
> +CIFS
> +L:   linux-c...@vger.kernel.org
> +S:   Supported
> +F:   tests/cifs
> +
> +EXT4
> +L:   linux-e...@vger.kernel.org
> +S:   Supported
> +F:   tests/ext4/
> +F:   common/ext4
> +
> +F2FS
> +L:   linux-f2fs-devel@lists.sourceforge.net
> +S:   Supported
> +F:   tests/f2fs/
> +F:   common/f2fs
> +
> +FSVERITY
> +L:   fsver...@lists.linux.dev
> +S:   Supported
> +F:   common/verity
> +
> +FSCRYPT
> +L:  linux-fscr...@vger.kernel.org
> +S:   Supported
> +F:   common/encrypt
> +
> +FS-IDMAPPED
> +L:   linux-fsde...@vger.kernel.org
> +S:   Supported
> +F:   src/vfs/
> +
> +NFS
> +L:   linux-...@vger.kernel.org
> +S:   Supported
> +F:   tests/nfs/
> +F:   common/nfs
> +
> +OCFS2
> +L:   ocfs2-de...@oss.oracle.com
> +S:   Supported
> +F:   tests/ocfs2/
> +
> +OVERLAYFS
> +L:   linux-unio...@vger.kernel.org
> +S:   Supported
> +F:   tests/overlay
> +F:   common/overlay
> +
> +UDF
> +R:   Jan Kara 
> +S:   Supported
> +F:   tests/udf/
> +
> +XFS
> +L:   linux-...@vger.kernel.org
> +S:   Supported
> +F:   common/dump
> +F:   common/fuzzy
> +F:   common/inject
> +F:   common/populate

note that populate and fuzzy apply to ext* as well.

> +F:   common/repair
> +F:   common/xfs
> +F:   tests/xfs/

Otherwise looks good to me,

Reviewed-by: Darrick J. Wong 

--D

> +
>  ALL
>  M:   Zorro Lang 
>  L:   fste...@vger.kernel.org
> -- 
> 2.39.2
> 



Re: [f2fs-dev] [PATCH 1/5] fstests: add MAINTAINERS and get_maintainer.pl files

2023-04-04 Thread Darrick J. Wong

On Wed, Apr 05, 2023 at 01:14:07AM +0800, Zorro Lang wrote:
> As fstests covers more and more fs testing, so we always get help
> from fs specific mailing list, due to they learn about their features
> and bugs more. Besides that, some folks help to review patches
> (relevant with them) more often.
> 
> So I'd like to bring in the similar way of linux/MAINTAINERS, records
> fs relevant mailing lists, reviewers or co-maintainers. To recognize
> the contribution from them, and help more users to know who or what
> mailing list can be added in CC list of a patch.
> 
> The MAINTAINERS and get_maintainer.pl are copied from linux project,
> then made some changes for fstests specially.
> 
> Signed-off-by: Zorro Lang 
> ---
>  MAINTAINERS |  116 ++
>  tools/get_maintainer.pl | 2616 +++
>  2 files changed, 2732 insertions(+)
>  create mode 100644 MAINTAINERS
>  create mode 100755 tools/get_maintainer.pl
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> new file mode 100644
> index ..09b1a5a3
> --- /dev/null
> +++ b/MAINTAINERS
> @@ -0,0 +1,116 @@
> +List of reviewers, co-maintainers and how to submit fstests changes
> +
> +
> +Please try to follow the guidelines below.  This will make things
> +easier on the maintainers.  Not all of these guidelines matter for every
> +trivial patch so apply some common sense.
> +
> +Tips for patch submitters
> +-
> +
> +1.   Always *test* your changes, however small, on at least 4 or
> + 5 people, preferably many more.
> +
> +2.   Make sure your changes compile correctly in multiple
> + configurations. In particular check that changes work both as a
> + module and built into the kernel.

Huh?  What kernel? ;)

--D

> +3.   When you are happy with a change make it generally available for
> + testing and await feedback.
> +
> +4.   Make a patch available to fstests@ list directly, that's the only
> + one mailing list which maintain the whole fstests project.
> +
> + PLEASE CC: the relevant reviewers, co-maintainers and mailing lists
> + that are generated by ``tools/get_maintainer.pl.``
> +
> + PLEASE try to include any credit lines you want added with the
> + patch. It avoids people being missed off by mistake and makes
> + it easier to know who wants adding and who doesn't.
> +
> + PLEASE document known bugs. If it doesn't work for everything
> + or does something very odd once a month document it.
> +
> +5.   Make sure you have the right to send any changes you make. If you
> + do changes at work you may find your employer owns the patch
> + not you.
> +
> +6.   Happy hacking.
> +
> +Descriptions of section entries and preferred order
> +---
> +
> + M: *Mail* patches to: FullName 
> +These people might be a co-maintainer (with Supported status) or
> +maintainer (with Maintained status).
> + R: Designated *Reviewer*: FullName 
> +These reviewers should be CCed on patches.
> + L: Besides fstests@ list itself, this *Mailing list* is relevant to
> +this area, should be CCed.
> + S: *Status*, one of the following (note: all things are maintained by
> +fste...@vger.kernel.org):
> +Supported:   Someone is actually paid to look after this.
> +Maintained:  Someone actually looks after it, has the privilege to
> + merge & push.
> +Odd Fixes:   It has a maintainer but they don't have time to do
> + much other than throw the odd patch in. See below..
> +Orphan:  No current maintainer [but maybe you could take the
> + role as you write your new code].
> +Obsolete:Old code. Something tagged obsolete generally means
> + it has been replaced by a better system and you
> + should be using that.
> + W: *Web-page* with status/info
> + Q: *Patchwork* web based patch tracking system site
> + B: URI for where to file *bugs*. A web-page with detailed bug
> +filing info, a direct bug tracker link, or a mailto: URI.
> + C: URI for *chat* protocol, server and channel where developers
> +usually hang out, for example irc://server/channel.
> + P: Subsystem Profile document for more details submitting
> +patches to the given subsystem. This is either an in-tree file,
> +or a URI.
> + T: *SCM* tree type and location.
> +Type is one of: git, hg, quilt, stgit, topgit
> + F: *Files* and directories wildcard patterns.
> +A trailing slash includes all files and subdirectory files.
> +F:   tests/xfs/  all files in and below tests/xfs
> +F:   tests/generic/* all files in tests/generic, but not below
> +F:   */ext4/*all files in "any top level directory"/ext4
> +One

Re: [f2fs-dev] [PATCH v2 00/23] fs-verity support for XFS

2023-04-04 Thread Darrick J. Wong

On Tue, Apr 04, 2023 at 04:52:56PM +0200, Andrey Albershteyn wrote:
> Hi all,
> 
> This is V2 of fs-verity support in XFS. In this series I did
> numerous changes from V1 which are described below.
> 
> This patchset introduces fs-verity [5] support for XFS. This
> implementation utilizes extended attributes to store fs-verity
> metadata. The Merkle tree blocks are stored in the remote extended
> attributes.
> 
> A few key points:
> - fs-verity metadata is stored in extended attributes
> - Direct path and DAX are disabled for inodes with fs-verity
> - Pages are verified in iomap's read IO path (offloaded to
>   workqueue)
> - New workqueue for verification processing
> - New ro-compat flag
> - Inodes with fs-verity have new on-disk diflag
> - xfs_attr_get() can return buffer with the attribute
> 
> The patchset is tested with xfstests -g auto on xfs_1k, xfs_4k,
> xfs_1k_quota, and xfs_4k_quota. Haven't found any major failures.
> 
> Patches [6/23] and [7/23] touch ext4, f2fs, btrfs, and patch [8/23]
> touches erofs, gfs2, and zonefs.
> 
> The patchset consist of four parts:
> - [1..4]: Patches from Parent Pointer patchset which add binary
>   xattr names with a few deps
> - [5..7]: Improvements to core fs-verity
> - [8..9]: Add read path verification to iomap
> - [10..23]: Integration of fs-verity to xfs
> 
> Changes from V1:
> - Added parent pointer patches for easier testing
> - Many issues and refactoring points fixed from the V1 review
> - Adjusted for recent changes in fs-verity core (folios, non-4k)
> - Dropped disabling of large folios
> - Completely new fsverity patches (fix, callout, log_blocksize)
> - Change approach to verification in iomap to the same one as in
>   write path. Callouts to fs instead of direct fs-verity use.
> - New XFS workqueue for post read folio verification
> - xfs_attr_get() can return underlying xfs_buf
> - xfs_bufs are marked with XBF_VERITY_CHECKED to track verified
>   blocks
> 
> kernel:
> [1]: https://github.com/alberand/linux/tree/xfs-verity-v2
> 
> xfsprogs:
> [2]: https://github.com/alberand/xfsprogs/tree/fsverity-v2

Will there be any means for xfs_repair to check the merkle tree contents?
Should it clear the ondisk inode flag if it decides to trash the xattr
structure, or is it ok to let the kernel deal with flag set and no
verity data?

--D

> xfstests:
> [3]: https://github.com/alberand/xfstests/tree/fsverity-v2
> 
> v1:
> [4]: 
> https://lore.kernel.org/linux-xfs/20221213172935.680971-1-aalbe...@redhat.com/
> 
> fs-verity:
> [5]: https://www.kernel.org/doc/html/latest/filesystems/fsverity.html
> 
> Thanks,
> Andrey
> 
> Allison Henderson (4):
>   xfs: Add new name to attri/d
>   xfs: add parent pointer support to attribute code
>   xfs: define parent pointer xattr format
>   xfs: Add xfs_verify_pptr
> 
> Andrey Albershteyn (19):
>   fsverity: make fsverity_verify_folio() accept folio's offset and size
>   fsverity: add drop_page() callout
>   fsverity: pass Merkle tree block size to ->read_merkle_tree_page()
>   iomap: hoist iomap_readpage_ctx from the iomap_readahead/_folio
>   iomap: allow filesystem to implement read path verification
>   xfs: add XBF_VERITY_CHECKED xfs_buf flag
>   xfs: add XFS_DA_OP_BUFFER to make xfs_attr_get() return buffer
>   xfs: introduce workqueue for post read IO work
>   xfs: add iomap's readpage operations
>   xfs: add attribute type for fs-verity
>   xfs: add fs-verity ro-compat flag
>   xfs: add inode on-disk VERITY flag
>   xfs: initialize fs-verity on file open and cleanup on inode
> destruction
>   xfs: don't allow to enable DAX on fs-verity sealsed inode
>   xfs: disable direct read path for fs-verity sealed files
>   xfs: add fs-verity support
>   xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE
>   xfs: add fs-verity ioctls
>   xfs: enable ro-compat fs-verity flag
> 
>  fs/btrfs/verity.c   |  15 +-
>  fs/erofs/data.c |  12 +-
>  fs/ext4/verity.c|   9 +-
>  fs/f2fs/verity.c|   9 +-
>  fs/gfs2/aops.c  |  10 +-
>  fs/ioctl.c  |   4 +
>  fs/iomap/buffered-io.c  |  89 ++-
>  fs/verity/read_metadata.c   |   7 +-
>  fs/verity/verify.c  |   9 +-
>  fs/xfs/Makefile |   1 +
>  fs/xfs/libxfs/xfs_attr.c|  81 +-
>  fs/xfs/libxfs/xfs_attr.h|   7 +-
>  fs/xfs/libxfs/xfs_attr_leaf.c   |   7 +
>  fs/xfs/libxfs/xfs_attr_remote.c |  13 +-
>  fs/xfs/libxfs/xfs_da_btree.h|   7 +-
>  fs/xfs/libxfs/xfs_da_format.h   |  46 +-
>  fs/xfs/libxfs/xfs_format.h  |  14 +-
>  fs/xfs/libxfs/xfs_log_format.h  |   8 +-
>  fs/xfs/libxfs/xfs_sb.c  |   2 +
>  fs/xfs/scrub/attr.c |   4 +-
>  fs/xfs/xfs_aops.c   |  61 +++-
>  fs/xfs/xfs_attr_item.c  | 142 +++---
>  fs/xfs/xfs_attr_item.h  |   1 +
>  fs/xfs/xfs_attr_list.c  |  17 ++-
>  fs/xfs/xfs_buf.h|  17 ++-
>

Re: [f2fs-dev] [PATCH v2 21/23] xfs: handle merkle tree block size != fs blocksize != PAGE_SIZE

2023-04-04 Thread Darrick J. Wong

On Tue, Apr 04, 2023 at 04:53:17PM +0200, Andrey Albershteyn wrote:
> In case of different Merkle tree block size fs-verity expects
> ->read_merkle_tree_page() to return Merkle tree page filled with
> Merkle tree blocks. The XFS stores each merkle tree block under
> extended attribute. Those attributes are addressed by block offset
> into Merkle tree.
> 
> This patch make ->read_merkle_tree_page() to fetch multiple merkle
> tree blocks based on size ratio. Also the reference to each xfs_buf
> is passed with page->private to ->drop_page().
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/xfs_verity.c | 74 +++--
>  fs/xfs/xfs_verity.h |  8 +
>  2 files changed, 66 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_verity.c b/fs/xfs/xfs_verity.c
> index a9874ff4efcd..ef0aff216f06 100644
> --- a/fs/xfs/xfs_verity.c
> +++ b/fs/xfs/xfs_verity.c
> @@ -134,6 +134,10 @@ xfs_read_merkle_tree_page(
>   struct page *page = NULL;
>   __be64  name = cpu_to_be64(index << PAGE_SHIFT);
>   uint32_t bs = 1 << log_blocksize;
> + int blocks_per_page =
> + (1 << (PAGE_SHIFT - log_blocksize));
> + int n = 0;
> + int offset = 0;
>   struct xfs_da_args  args = {
>   .dp = ip,
>   .attr_filter= XFS_ATTR_VERITY,
> @@ -143,26 +147,59 @@ xfs_read_merkle_tree_page(
>   .valuelen   = bs,
>   };
>   int error = 0;
> + bool is_checked = true;
> + struct xfs_verity_buf_list  *buf_list;
>  
>   page = alloc_page(GFP_KERNEL);
>   if (!page)
>   return ERR_PTR(-ENOMEM);
>  
> - error = xfs_attr_get();
> - if (error) {
> - kmem_free(args.value);
> - xfs_buf_rele(args.bp);
> + buf_list = kzalloc(sizeof(struct xfs_verity_buf_list), GFP_KERNEL);
> + if (!buf_list) {
>   put_page(page);
> - return ERR_PTR(-EFAULT);
> + return ERR_PTR(-ENOMEM);
>   }
>  
> - if (args.bp->b_flags & XBF_VERITY_CHECKED)
> + /*
> +  * Fill the page with Merkle tree blocks. The blocks_per_page is higher
> +  * than 1 when fs block size != PAGE_SIZE or Merkle tree block size !=
> +  * PAGE_SIZE
> +  */
> + for (n = 0; n < blocks_per_page; n++) {

Ahah, ok, that's why we can't pass the xfs_buf pages up to fsverity.

> + offset = bs * n;
> + name = cpu_to_be64(((index << PAGE_SHIFT) + offset));

Really this ought to be a typechecked helper...

struct xfs_fsverity_merkle_key {
__be64  merkleoff;
};

static inline void
xfs_fsverity_merkle_key_to_disk(struct xfs_fsverity_merkle_key *k, loff_t pos)
{
k->merkleoff = cpu_to_be64(pos);
}
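
For anyone who wants to play with the idea outside the kernel, here is
a small userspace sketch of the same thing: one struct type for the
key, one helper to encode it, and the per-block offset arithmetic from
the loop above. All names ending in _sketch, the byte-array stand-in
for __be64, and the 4 KiB page size are assumptions made for the
sketch, not what the patch or the kernel defines.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Userspace stand-in for the on-disk key; bytes are stored big-endian. */
struct merkle_key_sketch {
	uint8_t merkleoff[8];
};

/* Encode a byte offset into the Merkle tree as a big-endian key. */
static void merkle_key_to_disk(struct merkle_key_sketch *k, uint64_t pos)
{
	for (int i = 0; i < 8; i++)
		k->merkleoff[i] = (uint8_t)(pos >> (56 - 8 * i));
}

/* Decode the key back into a byte offset. */
static uint64_t merkle_key_from_disk(const struct merkle_key_sketch *k)
{
	uint64_t pos = 0;

	for (int i = 0; i < 8; i++)
		pos = (pos << 8) | k->merkleoff[i];
	return pos;
}

/* Byte offset into the Merkle tree of block n within page `index`. */
static uint64_t merkle_block_pos(uint64_t index, unsigned int log_blocksize,
				 unsigned int n)
{
	assert(log_blocksize <= SKETCH_PAGE_SHIFT);
	/* n must be below blocks_per_page for this block size. */
	assert(n < (1u << (SKETCH_PAGE_SHIFT - log_blocksize)));
	return (index << SKETCH_PAGE_SHIFT) + ((uint64_t)n << log_blocksize);
}
```

With 1 KiB Merkle blocks (log_blocksize = 10) there are four blocks per
page, so block 3 of page 2 lands at byte offset 8192 + 3072 = 11264.
Because every caller goes through the one struct and helper, a grep for
the type name finds every user of the key format.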



> + args.name = (const uint8_t *)
> +
> + error = xfs_attr_get();
> + if (error) {
> + kmem_free(args.value);
> + /*
> +  * No more Merkle tree blocks (e.g. this was the last
> +  * block of the tree)
> +  */
> + if (error == -ENOATTR)
> + break;
> + xfs_buf_rele(args.bp);
> + put_page(page);
> + kmem_free(buf_list);
> + return ERR_PTR(-EFAULT);
> + }
> +
> + buf_list->bufs[buf_list->buf_count++] = args.bp;
> +
> + /* One of the buffers was dropped */
> + if (!(args.bp->b_flags & XBF_VERITY_CHECKED))
> + is_checked = false;

If there's enough memory pressure to cause the merkle tree pages to get
evicted, what are the chances that the xfs_bufs survive the eviction?

> + memcpy(page_address(page) + offset, args.value, args.valuelen);
> + kmem_free(args.value);
> + args.value = NULL;
> + }
> +
> + if (is_checked)
>   SetPageChecked(page);
> + page->private = (unsigned long)buf_list;
>  
> - page->private = (unsigned long)args.bp;
> - memcpy(page_address(page), args.value, args.valuelen);
> -
> - kmem_free(args.value);
>   return page;
>  }
>  
> @@ -191,16 +228,21 @@ xfs_write_merkle_tree_block(
>  
>  static void
>  xfs_drop_page(
> - struct page *page)
> + struct page *page)
>  {
> - struct xfs_buf *buf = (struct xfs_buf *)page->private;
> + int i = 0;
> + struct xfs_verity_buf_list  *buf_list =
> + (struct xfs_verity_buf_list *)page->private;
>  
> - ASSERT(buf != NULL);
> + ASSERT(buf_list != NULL);
>  
> - if (PageChecked(page))
> - buf->b_flags |= XBF_VERITY_CHECKED;
> + for (i = 0; i < buf_list->buf_count; i++) {
> + if

Re: [f2fs-dev] [PATCH v2 20/23] xfs: add fs-verity support

2023-04-04 Thread Darrick J. Wong

On Tue, Apr 04, 2023 at 04:53:16PM +0200, Andrey Albershteyn wrote:
> Add integration with fs-verity. The XFS store fs-verity metadata in
> the extended attributes. The metadata consist of verity descriptor
> and Merkle tree blocks.
> 
> The descriptor is stored under "verity_descriptor" extended
> attribute. The Merkle tree blocks are stored under binary indexes.
> 
> When fs-verity is enabled on an inode, the XFS_IVERITY_CONSTRUCTION
> flag is set meaning that the Merkle tree is being build. The
> initialization ends with storing of verity descriptor and setting
> inode on-disk flag (XFS_DIFLAG2_VERITY).
> 
> The verification on read is done in iomap. Based on the inode verity
> flag the IOMAP_F_READ_VERITY is set in xfs_read_iomap_begin() to let
> iomap know that verification is needed.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/Makefile  |   1 +
>  fs/xfs/libxfs/xfs_attr.c |  13 +++
>  fs/xfs/xfs_inode.h   |   3 +-
>  fs/xfs/xfs_iomap.c   |   3 +
>  fs/xfs/xfs_ondisk.h  |   4 +
>  fs/xfs/xfs_super.c   |   8 ++
>  fs/xfs/xfs_verity.c  | 214 +++
>  fs/xfs/xfs_verity.h  |  19 
>  8 files changed, 264 insertions(+), 1 deletion(-)
>  create mode 100644 fs/xfs/xfs_verity.c
>  create mode 100644 fs/xfs/xfs_verity.h
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 92d88dc3c9f7..76174770d91a 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -130,6 +130,7 @@ xfs-$(CONFIG_XFS_POSIX_ACL)   += xfs_acl.o
>  xfs-$(CONFIG_SYSCTL) += xfs_sysctl.o
>  xfs-$(CONFIG_COMPAT) += xfs_ioctl32.o
>  xfs-$(CONFIG_EXPORTFS_BLOCK_OPS) += xfs_pnfs.o
> +xfs-$(CONFIG_FS_VERITY)  += xfs_verity.o
>  
>  # notify failure
>  ifeq ($(CONFIG_MEMORY_FAILURE),y)
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 298b74245267..39d9038fbeee 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -26,6 +26,7 @@
>  #include "xfs_trace.h"
>  #include "xfs_attr_item.h"
>  #include "xfs_xattr.h"
> +#include "xfs_verity.h"
>  
>  struct kmem_cache*xfs_attr_intent_cache;
>  
> @@ -1635,6 +1636,18 @@ xfs_attr_namecheck(
>   return xfs_verify_pptr(mp, (struct xfs_parent_name_rec *)name);
>   }
>  
> + if (flags & XFS_ATTR_VERITY) {
> + /* Merkle tree pages are stored under u64 indexes */
> + if (length == sizeof(__be64))

This ondisk structure should be actual structs that we can grep and
ctags on, not open-coded __be64 scattered around the xattr code.

> + return true;
> +
> + /* Verity descriptor blocks are held in a named attribute. */
> + if (length == XFS_VERITY_DESCRIPTOR_NAME_LEN)
> + return true;
> +
> + return false;
> + }
> +
>   return xfs_str_attr_namecheck(name, length);
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 69d21e42c10a..a95f28cb049f 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -324,7 +324,8 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>   * inactivation completes, both flags will be cleared and the inode is a
>   * plain old IRECLAIMABLE inode.
>   */
> -#define XFS_INACTIVATING (1 << 13)
> +#define XFS_INACTIVATING (1 << 13)
> +#define XFS_IVERITY_CONSTRUCTION (1 << 14) /* merkle tree construction */
>  
>  /* All inode state flags related to inode reclaim. */
>  #define XFS_ALL_IRECLAIM_FLAGS   (XFS_IRECLAIMABLE | \
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e0f3c5d709f6..0adde39f02a5 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -143,6 +143,9 @@ xfs_bmbt_to_iomap(
>   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>   iomap->flags |= IOMAP_F_DIRTY;
>  
> + if (fsverity_active(VFS_I(ip)))
> + iomap->flags |= IOMAP_F_READ_VERITY;
> +
>   iomap->validity_cookie = sequence_cookie;
>   iomap->folio_ops = &xfs_iomap_folio_ops;
>   return 0;
> diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
> index 9737b5a9f405..7fe88ccda519 100644
> --- a/fs/xfs/xfs_ondisk.h
> +++ b/fs/xfs/xfs_ondisk.h
> @@ -189,6 +189,10 @@ xfs_check_ondisk_structs(void)
>   XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MIN << XFS_DQ_BIGTIME_SHIFT, 4);
>   XFS_CHECK_VALUE(XFS_DQ_BIGTIME_EXPIRY_MAX << XFS_DQ_BIGTIME_SHIFT,
>   16299260424LL);
> +
> + /* fs-verity descriptor xattr name */
> + XFS_CHECK_VALUE(strlen(XFS_VERITY_DESCRIPTOR_NAME),

Are you encoding the trailing null in the xattr name too?  The attr name
length is stored explicitly, so the null isn't strictly necessary.

> + XFS_VERITY_DESCRIPTOR_NAME_LEN);
>  }
>  
>  #endif /* __XFS_ONDISK_H */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d40de32362b1..b6e99ed3b187 100644
> --- a/fs/xfs/xfs_super.c
> +++
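
The review comment above asks for a named on-disk structure instead of open-coded __be64 name lengths. A minimal userspace sketch of that idea follows. The names struct merkle_key_sketch, merkle_key_encode() and merkle_key_decode() are hypothetical and do not come from the patch or from XFS; the single big-endian 64-bit field is an assumption drawn from the "Merkle tree pages are stored under u64 indexes" comment in the quoted hunk.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical on-disk xattr name for one Merkle tree block (assumption). */
struct merkle_key_sketch {
	uint8_t	mk_index[8];	/* big-endian u64 index, stored byte by byte */
};

/* Encode a host-order index into the big-endian key. */
static void merkle_key_encode(struct merkle_key_sketch *key, uint64_t index)
{
	for (size_t i = 0; i < sizeof(key->mk_index); i++)
		key->mk_index[i] = (uint8_t)(index >> (56 - 8 * i));
}

/* Decode the big-endian key back into a host-order index. */
static uint64_t merkle_key_decode(const struct merkle_key_sketch *key)
{
	uint64_t index = 0;

	for (size_t i = 0; i < sizeof(key->mk_index); i++)
		index = (index << 8) | key->mk_index[i];
	return index;
}
```

With a named type like this, a name-length check could compare against sizeof(struct merkle_key_sketch), which is easier to find with grep or ctags than a bare sizeof(__be64).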

Re: [f2fs-dev] [PATCH v2 19/23] xfs: disable direct read path for fs-verity sealed files

2023-04-04 Thread Darrick J. Wong

On Tue, Apr 04, 2023 at 04:53:15PM +0200, Andrey Albershteyn wrote:
> The direct path is not supported on verity files. Attempts to use direct
> I/O path on such files should fall back to buffered I/O path.
> 
> Signed-off-by: Andrey Albershteyn 
> ---
>  fs/xfs/xfs_file.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 947b5c436172..9e072e82f6c1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -244,7 +244,8 @@ xfs_file_dax_read(
>   struct kiocb *iocb,
>   struct iov_iter *to)
>  {
> - struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
> + struct inode *inode = iocb->ki_filp->f_mapping->host;
> + struct xfs_inode *ip = XFS_I(inode);
>   ssize_t ret = 0;
>  
>   trace_xfs_file_dax_read(iocb, to);
> @@ -297,10 +298,17 @@ xfs_file_read_iter(
>  
>   if (IS_DAX(inode))
>   ret = xfs_file_dax_read(iocb, to);
> - else if (iocb->ki_flags & IOCB_DIRECT)
> + else if (iocb->ki_flags & IOCB_DIRECT && !fsverity_active(inode))
>   ret = xfs_file_dio_read(iocb, to);
> - else
> + else {
> + /*
> +  * In case fs-verity is enabled, we also fallback to the
> +  * buffered read from the direct read path. Therefore,
> +  * IOCB_DIRECT is set and need to be cleared
> +  */
> + iocb->ki_flags &= ~IOCB_DIRECT;
>   ret = xfs_file_buffered_read(iocb, to);

XFS doesn't usually allow directio fallback to the pagecache.  Why would
fsverity be any different?

--D

> + }
>  
>   if (ret > 0)
>   XFS_STATS_ADD(mp, xs_read_bytes, ret);
> -- 
> 2.38.4
> 


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-03 Thread Darrick J. Wong

On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote:
> On Thu, Nov 03, 2022 at 03:28:05PM -0700, Vishal Moola wrote:
> > On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> > > On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > > > Converted function to use folios throughout. This is in preparation for
> > > > the removal of find_get_pages_range_tag().
> > > > 
> > > > Signed-off-by: Vishal Moola (Oracle) 
> > > > ---
> > > >  mm/page-writeback.c | 44 +++-
> > > >  1 file changed, 23 insertions(+), 21 deletions(-)
> > > > 
> > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > index 032a7bf8d259..087165357a5a 100644
> > > > --- a/mm/page-writeback.c
> > > > +++ b/mm/page-writeback.c
> > > > @@ -2285,15 +2285,15 @@ int write_cache_pages(struct address_space 
> > > > *mapping,
> > > > int ret = 0;
> > > > int done = 0;
> > > > int error;
> > > > -   struct pagevec pvec;
> > > > -   int nr_pages;
> > > > +   struct folio_batch fbatch;
> > > > +   int nr_folios;
> > > > pgoff_t index;
> > > > > pgoff_t end; /* Inclusive */
> > > > pgoff_t done_index;
> > > > int range_whole = 0;
> > > > xa_mark_t tag;
> > > >  
> > > > > -   pagevec_init(&pvec);
> > > > > +   folio_batch_init(&fbatch);
> > > > if (wbc->range_cyclic) {
> > > > index = mapping->writeback_index; /* prev offset */
> > > > end = -1;
> > > > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space 
> > > > *mapping,
> > > > while (!done && (index <= end)) {
> > > > int i;
> > > >  
> > > > > -   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> > > > > -   tag);
> > > > > -   if (nr_pages == 0)
> > > > > +   nr_folios = filemap_get_folios_tag(mapping, &index, end,
> > > > > +   tag, &fbatch);
> > > 
> > > This can find and return dirty multi-page folios if the filesystem
> > > enables them in the mapping at instantiation time, right?
> > 
> > Yup, it will.
> > 
> > > > +
> > > > +   if (nr_folios == 0)
> > > > break;
> > > >  
> > > > -   for (i = 0; i < nr_pages; i++) {
> > > > -   struct page *page = pvec.pages[i];
> > > > +   for (i = 0; i < nr_folios; i++) {
> > > > +   struct folio *folio = fbatch.folios[i];
> > > >  
> > > > -   done_index = page->index;
> > > > +   done_index = folio->index;
> > > >  
> > > > -   lock_page(page);
> > > > +   folio_lock(folio);
> > > >  
> > > > /*
> > > > >  * Page truncated or invalidated. We can freely skip it
> > > > > @@ -2333,30 +2334,30 @@ int write_cache_pages(struct address_space *mapping,
> > > > >  * even if there is now a new, dirty page at the same
> > > >  * pagecache address.
> > > >  */
> > > > -   if (unlikely(page->mapping != mapping)) {
> > > > +   if (unlikely(folio->mapping != mapping)) {
> > > >  continue_unlock:
> > > > -   unlock_page(page);
> > > > +   folio_unlock(folio);
> > > > continue;
> > > > }
> > > >  
> > > > -   if (!PageDirty(page)) {
> > > > +   if (!folio_test_dirty(folio)) {
> > > > /* someone wrote it for us */
> > > > goto continue_unlock;
> > > > }
> > > >  
> > > > -   if (PageWriteback(page)) {
> > > > +   if (folio_test_writeback(folio)) {
> > > > if (wbc->sync_mode != WB_SYNC_NONE)
> > > > -   wait_on_page_writeback(page);
> > > > +   folio_wait_writeback(folio);
> > > > else
> > > > goto continue_unlock;
> > > > }
> > > >  
> > > > -   BUG_ON(PageWriteback(page));
> > > > -   if (!clear_page_dirty_for_io(page))
> > > > +   BUG_ON(folio_test_writeback(folio));
> > > > +   if (!folio_clear_dirty_for_io(folio))
> > > > goto continue_unlock;
> > > >  
> > > > > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > > > > -   error = (*writepage)(page, wbc, data);
> > > > > +   error = writepage(&folio->page, wbc, data);
> > > 
> > >

Re: [f2fs-dev] [PATCH RFC 5/7] fs/xfs: support `DISABLE_FS_CSUM_VERIFICATION` config option

2022-10-14 Thread Darrick J. Wong

On Fri, Oct 14, 2022 at 08:48:35AM +0000, Hrutvik Kanabar wrote:
> From: Hrutvik Kanabar 
> 
> When `DISABLE_FS_CSUM_VERIFICATION` is enabled, return truthy value for
> `xfs_verify_cksum`, which is the key function implementing checksum
> verification for XFS.
> 
> Signed-off-by: Hrutvik Kanabar 

NAK, we're not going to break XFS for the sake of automated fuzz tools.

You'll have to adapt your fuzzing tools to rewrite the block header
checksums, like the existing xfs fuzz testing framework does.  See
the xfs_db 'fuzz -d' command and the relevant fstests.

--D

> ---
>  fs/xfs/libxfs/xfs_cksum.h | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_cksum.h b/fs/xfs/libxfs/xfs_cksum.h
> index 999a290cfd72..ba55b1afa382 100644
> --- a/fs/xfs/libxfs/xfs_cksum.h
> +++ b/fs/xfs/libxfs/xfs_cksum.h
> @@ -76,7 +76,10 @@ xfs_verify_cksum(char *buffer, size_t length, unsigned long cksum_offset)
>  {
>   uint32_t crc = xfs_start_cksum_safe(buffer, length, cksum_offset);
>  
> - return *(__le32 *)(buffer + cksum_offset) == xfs_end_cksum(crc);
> + if (IS_ENABLED(CONFIG_DISABLE_FS_CSUM_VERIFICATION))
> + return 1;
> + else
> + return *(__le32 *)(buffer + cksum_offset) == xfs_end_cksum(crc);
>  }
>  
>  #endif /* _XFS_CKSUM_H */
> -- 
> 2.38.0.413.g74048e4d9e-goog
> 
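
The reply above asks fuzzers to rewrite block checksums rather than turn verification off. As a rough userspace sketch of what such a rewrite involves, the code below computes a CRC32C over a buffer while treating the 4-byte checksum field as zero, then stores the inverted result in little-endian order. It is assumed to mirror the shape of the quoted xfs_verify_cksum() helper (seed of all ones, final inversion) and is not a copy of kernel code; crc32c_sw(), block_cksum(), block_rewrite_cksum() and block_verify_cksum() are hypothetical names. For real filesystems the xfs_db route suggested above remains the safer choice.

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC32C (Castagnoli) update, reflected polynomial 0x82F63B78. */
static uint32_t crc32c_sw(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Checksum of buf[0..len) with the 4-byte field at cksum_off read as zero. */
static uint32_t block_cksum(const uint8_t *buf, size_t len, size_t cksum_off)
{
	static const uint8_t zero[4];
	uint32_t crc = 0xFFFFFFFFu;	/* seed assumed to be all ones */

	crc = crc32c_sw(crc, buf, cksum_off);
	crc = crc32c_sw(crc, zero, sizeof(zero));
	crc = crc32c_sw(crc, buf + cksum_off + 4, len - (cksum_off + 4));
	return ~crc;			/* final inversion */
}

/* Recompute the checksum and store it little-endian inside the block. */
static void block_rewrite_cksum(uint8_t *buf, size_t len, size_t cksum_off)
{
	uint32_t crc = block_cksum(buf, len, cksum_off);

	for (int i = 0; i < 4; i++)
		buf[cksum_off + i] = (uint8_t)(crc >> (8 * i));
}

/* Return nonzero if the stored little-endian checksum matches the data. */
static int block_verify_cksum(const uint8_t *buf, size_t len, size_t cksum_off)
{
	uint32_t crc = block_cksum(buf, len, cksum_off);
	uint32_t stored = 0;

	for (int i = 0; i < 4; i++)
		stored |= (uint32_t)buf[cksum_off + i] << (8 * i);
	return stored == crc;
}
```

A fuzzer built this way would mutate a field, call block_rewrite_cksum(), and then hand the block to the filesystem, so the verifier still runs but the mutated metadata gets past it.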



Re: [f2fs-dev] [man-pages PATCH v3] statx.2, open.2: document STATX_DIOALIGN

2022-10-10 Thread Darrick J. Wong

On Sat, Oct 08, 2022 at 03:56:22AM +0200, Alejandro Colomar wrote:
> Hi Eric,
> 
> On 10/4/22 19:43, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > Document the STATX_DIOALIGN support for statx()
> > (https://git.kernel.org/linus/725737e7c21d2d25).
> > 
> > Reviewed-by: Darrick J. Wong 
> > Signed-off-by: Eric Biggers 
> 
> Please see some formatting comments below.
> 
> > ---
> > 
> > I'm resending this now that support for STATX_DIOALIGN has been merged
> > upstream.
> 
> Thanks.
> 
> Cheers,
> Alex
> 
> > 
> > v3: updated mentions of Linux version, fixed some punctuation, and added
> >  a Reviewed-by
> > 
> > v2: rebased onto man-pages master branch, mentioned xfs, and updated
> >  link to patchset
> > 
> >   man2/open.2  | 43 ---
> >   man2/statx.2 | 29 +
> >   2 files changed, 61 insertions(+), 11 deletions(-)
> > 
> > diff --git a/man2/open.2 b/man2/open.2
> > index deba7e4ea..b8617e0d2 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -1732,21 +1732,42 @@ of user-space buffers and the file offset of I/Os.
> >   In Linux alignment
> >   restrictions vary by filesystem and kernel version and might be
> >   absent entirely.
> > -However there is currently no filesystem\-independent
> > -interface for an application to discover these restrictions for a given
> > -file or filesystem.
> > -Some filesystems provide their own interfaces
> > -for doing so, for example the
> > +The handling of misaligned
> > +.B O_DIRECT
> > +I/Os also varies; they can either fail with
> > +.B EINVAL
> > +or fall back to buffered I/O.
> > +.PP
> > +Since Linux 6.1,
> > +.B O_DIRECT
> > +support and alignment restrictions for a file can be queried using
> > +.BR statx (2),
> > +using the
> > +.B STATX_DIOALIGN
> > +flag.
> > +Support for
> > +.B STATX_DIOALIGN
> > +varies by filesystem; see
> > +.BR statx (2).
> > +.PP
> > +Some filesystems provide their own interfaces for querying
> > +.B O_DIRECT
> > +alignment restrictions, for example the
> >   .B XFS_IOC_DIOINFO
> >   operation in
> >   .BR xfsctl (3).
> > +.B STATX_DIOALIGN
> > +should be used instead when it is available.
> >   .PP
> > -Under Linux 2.4, transfer sizes, the alignment of the user buffer,
> > -and the file offset must all be multiples of the logical block size
> > -of the filesystem.
> > -Since Linux 2.6.0, alignment to the logical block size of the
> > -underlying storage (typically 512 bytes) suffices.
> > -The logical block size can be determined using the

I'm not so familiar with semantic newlines-- is there an automated
reflow program that fixes these problems mechanically, or is this
expected to be performed manually by manpage authors?

If manually, do the items in a comma-separated list count as clauses?

Would the next two paragraphs of this email reformat into semantic
newlines like so?

In the source of a manual page,
new sentences should be started on new lines,
long sentences should be split into lines at clause breaks
(commas, semicolons, colons, and so on),
and long clauses should be split at phrase boundaries.
This convention,
sometimes known as "semantic newlines",
makes it easier to see the effect of patches,
which often operate at the level of individual sentences, clauses, or phrases.

Do we still line-wrap at 72^W74^W78^W80 columns?

and would the proposed manpage text read:

If none of the above is available,
then direct I/O support and alignment restrictions can only be assumed
from known characteristics of the filesystem,
the individual file,
the underlying storage device(s),
and the kernel version.
In Linux 2.4,
most block device based filesystems require that the file offset and the
length and memory address of all I/O segments be multiples of the
filesystem block size
(typically 4096 bytes).
In Linux 2.6.0,
this was relaxed to the logical block size of the block device
(typically 512 bytes).
A block device's logical block size can be determined using the
.BR ioctl (2)
.B BLKSSZGET
operation or from the shell using the command:

--D

> > +If none of the above is available, then direct I/O support and alignment
> 
> Please use semantic newlines.
> 
> See man-pages(7):
>Use semantic newlines
>In the source of a manual page, new sentences  should

Re: [f2fs-dev] [man-pages PATCH v3] statx.2, open.2: document STATX_DIOALIGN

2022-10-06 Thread Darrick J. Wong

On Tue, Oct 04, 2022 at 10:43:07AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Document the STATX_DIOALIGN support for statx()
> (https://git.kernel.org/linus/725737e7c21d2d25).
> 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Eric Biggers 
> ---
> 
> I'm resending this now that support for STATX_DIOALIGN has been merged
> upstream.

Woo!  Thank you for getting this over the line! :)

--D

> v3: updated mentions of Linux version, fixed some punctuation, and added
> a Reviewed-by
> 
> v2: rebased onto man-pages master branch, mentioned xfs, and updated
> link to patchset
> 
>  man2/open.2  | 43 ---
>  man2/statx.2 | 29 +
>  2 files changed, 61 insertions(+), 11 deletions(-)
> 
> diff --git a/man2/open.2 b/man2/open.2
> index deba7e4ea..b8617e0d2 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -1732,21 +1732,42 @@ of user-space buffers and the file offset of I/Os.
>  In Linux alignment
>  restrictions vary by filesystem and kernel version and might be
>  absent entirely.
> -However there is currently no filesystem\-independent
> -interface for an application to discover these restrictions for a given
> -file or filesystem.
> -Some filesystems provide their own interfaces
> -for doing so, for example the
> +The handling of misaligned
> +.B O_DIRECT
> +I/Os also varies; they can either fail with
> +.B EINVAL
> +or fall back to buffered I/O.
> +.PP
> +Since Linux 6.1,
> +.B O_DIRECT
> +support and alignment restrictions for a file can be queried using
> +.BR statx (2),
> +using the
> +.B STATX_DIOALIGN
> +flag.
> +Support for
> +.B STATX_DIOALIGN
> +varies by filesystem; see
> +.BR statx (2).
> +.PP
> +Some filesystems provide their own interfaces for querying
> +.B O_DIRECT
> +alignment restrictions, for example the
>  .B XFS_IOC_DIOINFO
>  operation in
>  .BR xfsctl (3).
> +.B STATX_DIOALIGN
> +should be used instead when it is available.
>  .PP
> -Under Linux 2.4, transfer sizes, the alignment of the user buffer,
> -and the file offset must all be multiples of the logical block size
> -of the filesystem.
> -Since Linux 2.6.0, alignment to the logical block size of the
> -underlying storage (typically 512 bytes) suffices.
> -The logical block size can be determined using the
> +If none of the above is available, then direct I/O support and alignment
> +restrictions can only be assumed from known characteristics of the filesystem,
> +the individual file, the underlying storage device(s), and the kernel version.
> +In Linux 2.4, most block device based filesystems require that the file offset
> +and the length and memory address of all I/O segments be multiples of the
> +filesystem block size (typically 4096 bytes).
> +In Linux 2.6.0, this was relaxed to the logical block size of the block device
> +(typically 512 bytes).
> +A block device's logical block size can be determined using the
>  .BR ioctl (2)
>  .B BLKSSZGET
>  operation or from the shell using the command:
> diff --git a/man2/statx.2 b/man2/statx.2
> index 0d1b4591f..50397057d 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -61,7 +61,12 @@ struct statx {
> containing the filesystem where the file resides */
>  __u32 stx_dev_major;   /* Major ID */
>  __u32 stx_dev_minor;   /* Minor ID */
> +
>  __u64 stx_mnt_id;  /* Mount ID */
> +
> +/* Direct I/O alignment restrictions */
> +__u32 stx_dio_mem_align;
> +__u32 stx_dio_offset_align;
>  };
>  .EE
>  .in
> @@ -247,6 +252,8 @@ STATX_BTIME   Want stx_btime
>  STATX_ALLThe same as STATX_BASIC_STATS | STATX_BTIME.
>   It is deprecated and should not be used.
>  STATX_MNT_ID Want stx_mnt_id (since Linux 5.8)
> +STATX_DIOALIGN   Want stx_dio_mem_align and stx_dio_offset_align
> + (since Linux 6.1; support varies by filesystem)
>  .TE
>  .in
>  .PP
> @@ -407,6 +414,28 @@ This is the same number reported by
>  .BR name_to_handle_at (2)
>  and corresponds to the number in the first field in one of the records in
>  .IR /proc/self/mountinfo .
> +.TP
> +.I stx_dio_mem_align
> +The alignment (in bytes) required for user memory buffers for direct I/O
> +.BR "" ( O_DIRECT )
> +on this file, or 0 if direct I/O is not supported on this file.
> +.IP
> +.B STATX_DIOALIGN
> +.IR "" ( stx_dio_mem_align
> +and
> +.IR stx_dio_offset_align )
> +is supported on block devices since Linux 6.1.
> +The support on regular files varies by filesystem; it is supported by ext4,
> +f2fs, and xfs since Linux 6.1.
> +.TP
> +.I stx_dio_offset_align
> +The alignment (i

Re: [f2fs-dev] [PATCH v4 2/2] f2fs: introduce F2FS_IOC_START_ATOMIC_REPLACE

2022-10-06 Thread Darrick J. Wong

On Thu, Oct 06, 2022 at 01:33:34AM -0700, Christoph Hellwig wrote:
> On Tue, Oct 04, 2022 at 10:13:51AM -0700, Daeho Jeong wrote:
> > From: Daeho Jeong 
> > 
> > introduce a new ioctl to replace the whole content of a file atomically,
> > which means it induces truncate and content update at the same time.
> > We can start it with F2FS_IOC_START_ATOMIC_REPLACE and complete it with
> > F2FS_IOC_COMMIT_ATOMIC_WRITE. Or abort it with
> > F2FS_IOC_ABORT_ATOMIC_WRITE.
> 
> It would be great to Cc Darrick and linux-fsdevel as there have been
> attempts to do this properly at the VFS level instead of a completely
> undocumented ioctl.

It's been a while since I sent the last RFC, but yes, it's still in my
queue as part of the xfs online fsck patchserieses.

https://lore.kernel.org/linux-fsdevel/161723932606.3149451.12366114306150243052.stgit@magnolia/

More recent git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

--D



Re: [f2fs-dev] [PATCH] Documentation: filesystems: correct possessive "its"

2022-08-29 Thread Darrick J. Wong

On Mon, Aug 29, 2022 at 04:54:29PM -0700, Randy Dunlap wrote:
> Change occurrences of "it's" that are possessive to "its"
> so that they don't read as "it is".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-f2fs-devel@lists.sourceforge.net
> Cc: linux-...@vger.kernel.org
> Cc: Christian Brauner 
> Cc: Seth Forshee 

Looks correct to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  Documentation/filesystems/f2fs.rst   |2 +-
>  Documentation/filesystems/idmappings.rst |2 +-
>  Documentation/filesystems/qnx6.rst   |2 +-
>  Documentation/filesystems/xfs-delayed-logging-design.rst |6 +++---
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> --- a/Documentation/filesystems/f2fs.rst
> +++ b/Documentation/filesystems/f2fs.rst
> @@ -287,7 +287,7 @@ compress_algorithm=%s:%d Control compres
>lz4    3 - 16
>zstd   1 - 22
>  compress_log_size=%u  Support configuring compress cluster size, the size will
> -  be 4KB * (1 << %u), 16KB is minimum size, also it's
> +  be 4KB * (1 << %u), 16KB is minimum size, also its
>default size.
>  compress_extension=%s Support adding specified extension, so that f2fs can enable
>compression on those corresponding files, e.g. if all files
> --- a/Documentation/filesystems/idmappings.rst
> +++ b/Documentation/filesystems/idmappings.rst
> @@ -661,7 +661,7 @@ idmappings::
>   mount idmapping:  u0:k10000:r10000
>  
>  Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
> -to ``k21000`` according to it's idmapping. This is what is stored in the
> +to ``k21000`` according to its idmapping. This is what is stored in the
>  inode's ``i_uid`` and ``i_gid`` fields.
>  
>  When the caller queries the ownership of this file via ``stat()`` the kernel
> --- a/Documentation/filesystems/qnx6.rst
> +++ b/Documentation/filesystems/qnx6.rst
> @@ -176,7 +176,7 @@ Then userspace.
>  The requirement for a static, fixed preallocated system area comes from how
>  qnx6fs deals with writes.
>  
> -Each superblock got it's own half of the system area. So superblock #1
> +Each superblock got its own half of the system area. So superblock #1
>  always uses blocks from the lower half while superblock #2 just writes to
>  blocks represented by the upper half bitmap system area bits.
>  
> --- a/Documentation/filesystems/xfs-delayed-logging-design.rst
> +++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
> @@ -551,14 +551,14 @@ Essentially, this shows that an item tha
>  and relogged, so any tracking must be separate to the AIL infrastructure. As
>  such, we cannot reuse the AIL list pointers for tracking committed items, nor
>  can we store state in any field that is protected by the AIL lock. Hence the
> -committed item tracking needs it's own locks, lists and state fields in the log
> +committed item tracking needs its own locks, lists and state fields in the log
>  item.
>  
>  Similar to the AIL, tracking of committed items is done through a new list
>  called the Committed Item List (CIL).  The list tracks log items that have been
>  committed and have formatted memory buffers attached to them. It tracks objects
>  in transaction commit order, so when an object is relogged it is removed from
> -it's place in the list and re-inserted at the tail. This is entirely arbitrary
> +its place in the list and re-inserted at the tail. This is entirely arbitrary
>  and done to make it easy for debugging - the last items in the list are the
>  ones that are most recently modified. Ordering of the CIL is not necessary for
>  transactional integrity (as discussed in the next section) so the ordering is
> @@ -884,7 +884,7 @@ pin the object the first time it is inse
>  the CIL during a transaction commit, then we do not pin it again. Because there
>  can be multiple outstanding checkpoint contexts, we can still see elevated pin
>  counts, but as each checkpoint completes the pin count will retain the correct
> -value according to it's context.
> +value according to its context.
>  
>  Just to make matters more slightly more complex, this checkpoint level context
>  for the pin count means that the pinning of an item must take place under the



Re: [f2fs-dev] [PATCH v5 8/8] xfs: support STATX_DIOALIGN

2022-08-27 Thread Darrick J. Wong

On Fri, Aug 26, 2022 at 11:58:51PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Add support for STATX_DIOALIGN to xfs, so that direct I/O alignment
> restrictions are exposed to userspace in a generic way.
> 
> Signed-off-by: Eric Biggers 

Looks good to me; I particularly like the adjustment to report the
device's DMA alignment.  Someone should probably fix DIOINFO, or
perhaps turn it into a getattr wrapper and hoist it?  IMHO none of those
suggestions are necessary to land this patch, though.

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_iops.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 45518b8c613c9a..f51c60d7e2054a 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -604,6 +604,16 @@ xfs_vn_getattr(
>   stat->blksize = BLKDEV_IOSIZE;
>   stat->rdev = inode->i_rdev;
>   break;
> + case S_IFREG:
> + if (request_mask & STATX_DIOALIGN) {
> + struct xfs_buftarg  *target = xfs_inode_buftarg(ip);
> + struct block_device *bdev = target->bt_bdev;
> +
> + stat->result_mask |= STATX_DIOALIGN;
> + stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> + stat->dio_offset_align = bdev_logical_block_size(bdev);
> + }
> + fallthrough;
>   default:
>   stat->blksize = xfs_stat_blksize(ip);
>   stat->rdev = 0;
> -- 
> 2.37.2
> 



Re: [f2fs-dev] [PATCHv6 11/11] iomap: add support for dma aligned direct-io

2022-07-22 Thread Darrick J. Wong

On Fri, Jul 22, 2022 at 06:12:40PM +0000, Eric Biggers wrote:
> On Fri, Jul 22, 2022 at 10:53:42AM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 22, 2022 at 12:36:01AM -0700, Eric Biggers wrote:
> > > [+f2fs list and maintainers]
> > > 
> > > On Fri, Jun 10, 2022 at 12:58:30PM -0700, Keith Busch wrote:
> > > > From: Keith Busch 
> > > > 
> > > > Use the address alignment requirements from the block_device for direct
> > > > io instead of requiring addresses be aligned to the block size.
> > > > 
> > > > Signed-off-by: Keith Busch 
> > > > Reviewed-by: Christoph Hellwig 
> > > > ---
> > > >  fs/iomap/direct-io.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index 370c3241618a..5d098adba443 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -242,7 +242,6 @@ static loff_t iomap_dio_bio_iter(const struct 
> > > > iomap_iter *iter,
> > > > struct inode *inode = iter->inode;
> > > > unsigned int blkbits = 
> > > > blksize_bits(bdev_logical_block_size(iomap->bdev));
> > > > unsigned int fs_block_size = i_blocksize(inode), pad;
> > > > -   unsigned int align = iov_iter_alignment(dio->submit.iter);
> > > > loff_t length = iomap_length(iter);
> > > > loff_t pos = iter->pos;
> > > > unsigned int bio_opf;
> > > > @@ -253,7 +252,8 @@ static loff_t iomap_dio_bio_iter(const struct 
> > > > iomap_iter *iter,
> > > > size_t copied = 0;
> > > > size_t orig_count;
> > > >  
> > > > -   if ((pos | length | align) & ((1 << blkbits) - 1))
> > > > +   if ((pos | length) & ((1 << blkbits) - 1) ||
> > > > +   !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > 
> > How does this change intersect with "make statx() return DIO alignment
> > information" ?  Will the new STATX_DIOALIGN implementations have to be
> > adjusted to set stx_dio_mem_align = bdev_dma_alignment(...)?
> > 
> > I'm guessing the answer is yes, but I haven't seen any patches on the
> > list to do that, but more and more these days email behaves like a flood
> > of UDP traffic... :(
> > 
> 
> Yes.  I haven't done that in the STATX_DIOALIGN patchset yet because I've been
> basing it on upstream, which doesn't yet have this iomap patch.  I haven't been
> expecting STATX_DIOALIGN to make 5.20, given that it's a new UAPI that needs
> time to be properly reviewed, plus I've just been busy with other things.  So
> I've been planning to make the above change after this patch lands upstream.

Ok, I'm looking forward to it.  Thank you for your work on statx! :)

--D

> - Eric



Re: [f2fs-dev] [PATCHv6 11/11] iomap: add support for dma aligned direct-io

2022-07-22 Thread Darrick J. Wong

On Fri, Jul 22, 2022 at 12:36:01AM -0700, Eric Biggers wrote:
> [+f2fs list and maintainers]
> 
> On Fri, Jun 10, 2022 at 12:58:30PM -0700, Keith Busch wrote:
> > From: Keith Busch 
> > 
> > Use the address alignment requirements from the block_device for direct
> > io instead of requiring addresses be aligned to the block size.
> > 
> > Signed-off-by: Keith Busch 
> > Reviewed-by: Christoph Hellwig 
> > ---
> >  fs/iomap/direct-io.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 370c3241618a..5d098adba443 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -242,7 +242,6 @@ static loff_t iomap_dio_bio_iter(const struct 
> > iomap_iter *iter,
> > struct inode *inode = iter->inode;
> > unsigned int blkbits = 
> > blksize_bits(bdev_logical_block_size(iomap->bdev));
> > unsigned int fs_block_size = i_blocksize(inode), pad;
> > -   unsigned int align = iov_iter_alignment(dio->submit.iter);
> > loff_t length = iomap_length(iter);
> > loff_t pos = iter->pos;
> > unsigned int bio_opf;
> > @@ -253,7 +252,8 @@ static loff_t iomap_dio_bio_iter(const struct 
> > iomap_iter *iter,
> > size_t copied = 0;
> > size_t orig_count;
> >  
> > -   if ((pos | length | align) & ((1 << blkbits) - 1))
> > +   if ((pos | length) & ((1 << blkbits) - 1) ||
> > +   !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))

How does this change intersect with "make statx() return DIO alignment
information" ?  Will the new STATX_DIOALIGN implementations have to be
adjusted to set stx_dio_mem_align = bdev_dma_alignment(...)?

I'm guessing the answer is yes, but I haven't seen any patches on the
list to do that, but more and more these days email behaves like a flood
of UDP traffic... :(

--D

> > return -EINVAL;
> >  
> > if (iomap->type == IOMAP_UNWRITTEN) {
> 
> I noticed that this patch is going to break the following logic in
> f2fs_should_use_dio() in fs/f2fs/file.c:
> 
>   /*
>* Direct I/O not aligned to the disk's logical_block_size will be
>* attempted, but will fail with -EINVAL.
>*
>* f2fs additionally requires that direct I/O be aligned to the
>* filesystem block size, which is often a stricter requirement.
>* However, f2fs traditionally falls back to buffered I/O on requests
>* that are logical_block_size-aligned but not fs-block aligned.
>*
>* The below logic implements this behavior.
>*/
>   align = iocb->ki_pos | iov_iter_alignment(iter);
>   if (!IS_ALIGNED(align, i_blocksize(inode)) &&
>   IS_ALIGNED(align, bdev_logical_block_size(inode->i_sb->s_bdev)))
>   return false;
> 
>   return true;
> 
> So, f2fs assumes that __iomap_dio_rw() returns an error if the I/O isn't logical
> block aligned.  This patch changes that.  The result is that DIO will sometimes
> proceed in cases where the I/O doesn't have the fs block alignment required by
> f2fs for all DIO.
> 
> Does anyone have any thoughts about what f2fs should be doing here?  I think
> it's weird that f2fs has different behaviors for different degrees of
> misalignment: fail with EINVAL if not logical block aligned, else fallback to
> buffered I/O if not fs block aligned.  I think it should be one convention or
> the other.  Any opinions about which one it should be?
> 
> (Note: if you blame the above code, it was written by me.  But I was just
> preserving the existing behavior; I don't know the original motivation.)
> 
> - Eric
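
As an illustration only (not from the thread), the three outcomes described
in the quoted comment can be written as one classification function.  This
is a sketch of the behavior as it stood before the patch under discussion.
The enum and classify_dio() are hypothetical names, and align is assumed to
be the OR of the file position and the iov alignment, as in the quoted code.

```c
#include <assert.h>
#include <stdint.h>

enum dio_outcome {
	DIO_FAIL_EINVAL,       /* DIO attempted, rejected with -EINVAL */
	DIO_FALLBACK_BUFFERED, /* request served through the page cache */
	DIO_PROCEED            /* request issued as direct I/O */
};

/* Both block sizes must be powers of two for the mask tests to be valid. */
static enum dio_outcome classify_dio(uint64_t align, uint32_t fs_block_size,
				     uint32_t logical_block_size)
{
	assert(fs_block_size && !(fs_block_size & (fs_block_size - 1)));
	assert(logical_block_size &&
	       !(logical_block_size & (logical_block_size - 1)));

	/* Not even logical-block aligned: DIO is tried and fails. */
	if (align & (logical_block_size - 1))
		return DIO_FAIL_EINVAL;
	/* Logical-block aligned but not fs-block aligned: fall back. */
	if (align & (fs_block_size - 1))
		return DIO_FALLBACK_BUFFERED;
	return DIO_PROCEED;
}
```

Picking one convention, as the message asks, would amount to collapsing the
first two branches into a single outcome.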


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH v4 1/9] statx: add direct I/O alignment information

2022-07-22 Thread Darrick J. Wong

On Fri, Jul 22, 2022 at 12:12:20AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple.  For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
> 
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex.  Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
> 
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbu...@fb.com/T/#u).
> 
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility.  However, as discussed
> (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebigg...@kernel.org/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code.  It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
> 
> Therefore, let's expose this information via statx().  Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
> 
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
>   buffers for DIO, or 0 if DIO is not supported on the file.
> 
> * stx_dio_offset_align: the alignment (in bytes) required for file
>   offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>   on the file.  This will only be nonzero if stx_dio_mem_align is
>   nonzero, and vice versa.
> 
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN.  It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
> 
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN.  For
> regular files, individual filesystems will still need to add code to
> support it.  For block devices, a separate commit will wire it up too.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Eric Biggers 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D
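
For readers following along, a userspace caller might query the new fields
roughly as below.  This is a minimal sketch, not part of the patch.
query_dio_align() is a hypothetical helper, the raw statx system call is
used so that the example does not depend on a particular libc wrapper, and
the whole body is compiled out when the installed kernel headers do not
define STATX_DIOALIGN.

```c
#include <assert.h>
#include <fcntl.h>        /* AT_FDCWD */
#include <linux/stat.h>   /* struct statx, STATX_* */
#include <string.h>
#include <sys/syscall.h>  /* SYS_statx */
#include <unistd.h>       /* syscall() */

/*
 * Returns 1 and fills *mem_align / *offset_align if the kernel reported
 * STATX_DIOALIGN for path, 0 if the kernel did not report it, and -1 on
 * error or if the headers are too old to know about STATX_DIOALIGN.
 */
static int query_dio_align(const char *path, unsigned int *mem_align,
			   unsigned int *offset_align)
{
	assert(path && mem_align && offset_align);
#if defined(STATX_DIOALIGN) && defined(SYS_statx)
	struct statx stx;

	memset(&stx, 0, sizeof(stx));
	if (syscall(SYS_statx, AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) != 0)
		return -1;
	/* The fields are only meaningful if the kernel set the mask bit. */
	if (!(stx.stx_mask & STATX_DIOALIGN))
		return 0;
	*mem_align = stx.stx_dio_mem_align;
	*offset_align = stx.stx_dio_offset_align;
	return 1;
#else
	return -1;
#endif
}
```

A return of 1 with both values equal to 0 means the file does not support
direct I/O at all, per the commit message quoted above.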

> ---
>  fs/stat.c | 2 ++
>  include/linux/stat.h  | 2 ++
>  include/uapi/linux/stat.h | 4 +++-
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35d..a7930d74448304 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>   tmp.stx_dev_major = MAJOR(stat->dev);
>   tmp.stx_dev_minor = MINOR(stat->dev);
>   tmp.stx_mnt_id = stat->mnt_id;
> + tmp.stx_dio_mem_align = stat->dio_mem_align;
> + tmp.stx_dio_offset_align = stat->dio_offset_align;
>  
>   return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>  }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d85..ff277ced50e9fd 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
>   struct timespec64 btime;/* File creation time */
>   u64 blocks;
>   u64 mnt_id;
> + u32 dio_mem_align;
> + u32 dio_offset_align;
>  };
>  
>  #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041ae..7cab2c65d3d7fc 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
>   __u32   stx_dev_minor;
>   /* 0x90 */
>   __u64   stx_mnt_id;
> - __u64   __spare2;
> + __u32   stx_dio_mem_align;  /* Memory buffer alignment for direct I/O */
> + __u32   stx_dio_offset_align;   /* File offset alignment for direct I/O */
>   /* 0xa0 */
>   __u64   __spare3[12];   /* Spare space for future expansion */
>   /* 0x100 */
> @@ -152,6 +153,7 @@ struct statx {
>  #define

Re: [f2fs-dev] [man-pages RFC PATCH v2] statx.2, open.2: document STATX_DIOALIGN

2022-07-22 Thread Darrick J. Wong

On Fri, Jul 22, 2022 at 12:42:28AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Document the proposed STATX_DIOALIGN support for statx()
> (https://lore.kernel.org/linux-fsdevel/20220722071228.146690-1-ebigg...@kernel.org/T/#u).
> 
> Signed-off-by: Eric Biggers 
> ---
> 
> v2: rebased onto man-pages master branch, mentioned xfs, and updated
> link to patchset
> 
>  man2/open.2  | 43 ---
>  man2/statx.2 | 29 +
>  2 files changed, 61 insertions(+), 11 deletions(-)
> 
> diff --git a/man2/open.2 b/man2/open.2
> index d1485999f..ef29847c3 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -1732,21 +1732,42 @@ of user-space buffers and the file offset of I/Os.
>  In Linux alignment
>  restrictions vary by filesystem and kernel version and might be
>  absent entirely.
> -However there is currently no filesystem\-independent
> -interface for an application to discover these restrictions for a given
> -file or filesystem.
> -Some filesystems provide their own interfaces
> -for doing so, for example the
> +The handling of misaligned
> +.B O_DIRECT
> +I/Os also varies; they can either fail with
> +.B EINVAL
> +or fall back to buffered I/O.
> +.PP
> +Since Linux 5.20,
> +.B O_DIRECT
> +support and alignment restrictions for a file can be queried using
> +.BR statx (2),
> +using the
> +.B STATX_DIOALIGN
> +flag.
> +Support for
> +.B STATX_DIOALIGN
> +varies by filesystem; see
> +.BR statx (2).
> +.PP
> +Some filesystems provide their own interfaces for querying
> +.B O_DIRECT
> +alignment restrictions, for example the
>  .B XFS_IOC_DIOINFO
>  operation in
>  .BR xfsctl (3).
> +.B STATX_DIOALIGN
> +should be used instead when it is available.
>  .PP
> -Under Linux 2.4, transfer sizes, the alignment of the user buffer,
> -and the file offset must all be multiples of the logical block size
> -of the filesystem.
> -Since Linux 2.6.0, alignment to the logical block size of the
> -underlying storage (typically 512 bytes) suffices.
> -The logical block size can be determined using the
> +If none of the above is available, then direct I/O support and alignment
> +restrictions can only be assumed from known characteristics of the filesystem,
> +the individual file, the underlying storage device(s), and the kernel version.
> +In Linux 2.4, most block device based filesystems require that the file offset
> +and the length and memory address of all I/O segments be multiples of the
> +filesystem block size (typically 4096 bytes).
> +In Linux 2.6.0, this was relaxed to the logical block size of the block device
> +(typically 512 bytes).
> +A block device's logical block size can be determined using the
>  .BR ioctl (2)
>  .B BLKSSZGET
>  operation or from the shell using the command:
> diff --git a/man2/statx.2 b/man2/statx.2
> index 0326e9af0..ea38ec829 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -61,7 +61,12 @@ struct statx {
> containing the filesystem where the file resides */
>  __u32 stx_dev_major;   /* Major ID */
>  __u32 stx_dev_minor;   /* Minor ID */
> +
>  __u64 stx_mnt_id;  /* Mount ID */
> +
> +/* Direct I/O alignment restrictions */
> +__u32 stx_dio_mem_align;
> +__u32 stx_dio_offset_align;
>  };
>  .EE
>  .in
> @@ -247,6 +252,8 @@ STATX_BTIME   Want stx_btime
>  STATX_ALLThe same as STATX_BASIC_STATS | STATX_BTIME.
>   It is deprecated and should not be used.
>  STATX_MNT_ID Want stx_mnt_id (since Linux 5.8)
> +STATX_DIOALIGN   Want stx_dio_mem_align and stx_dio_offset_align
> + (since Linux 5.20; support varies by filesystem)
>  .TE
>  .in
>  .PP
> @@ -407,6 +414,28 @@ This is the same number reported by
>  .BR name_to_handle_at (2)
>  and corresponds to the number in the first field in one of the records in
>  .IR /proc/self/mountinfo .
> +.TP
> +.I stx_dio_mem_align
> +The alignment (in bytes) required for user memory buffers for direct I/O
> +.BR "" ( O_DIRECT )
> +on this file. or 0 if direct I/O is not supported on this file.

Nit: "..on this file, or 0 if direct..."

> +.IP
> +.B STATX_DIOALIGN
> +.IR "" ( stx_dio_mem_align
> +and
> +.IR stx_dio_offset_align )
> +is supported on block devices since Linux 5.20.
> +The support on regular files varies by filesystem; it is supported by ext4,
> +f2fs, and xfs since Linux 5.20.
> +.TP
> +.I stx_dio_offset_align
> +The alignment (in bytes) required for file offsets and I/O segment lengths for
> +direct I/O
> +.BR "" ( O_DIRECT )
> +on this file, or 0 if direct I/O is not supported on this file

Re: [f2fs-dev] [PATCH v4 9/9] xfs: support STATX_DIOALIGN

2022-07-22 Thread Darrick J. Wong

On Fri, Jul 22, 2022 at 12:12:28AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Add support for STATX_DIOALIGN to xfs, so that direct I/O alignment
> restrictions are exposed to userspace in a generic way.
> 
> Signed-off-by: Eric Biggers 

LGTM
Reviewed-by: Darrick J. Wong 

--D
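
As an illustration only (not from the thread), an application could use the
reported memory alignment when allocating its I/O buffer.  This is a sketch
under stated assumptions: alloc_dio_buffer() is a hypothetical helper, and
mem_align is assumed to be a power of two, which the patch does not promise
in general.  posix_memalign() also requires the alignment to be a multiple
of sizeof(void *), so small values are rounded up.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate len bytes aligned to mem_align (as reported in
 * stx_dio_mem_align).  Returns NULL if mem_align is 0 (direct I/O not
 * supported), is not a power of two, or if the allocation fails.
 */
static void *alloc_dio_buffer(size_t len, unsigned int mem_align)
{
	size_t align = mem_align;
	void *buf = NULL;

	if (align == 0 || (align & (align - 1)) != 0)
		return NULL;
	/* posix_memalign() rejects alignments below sizeof(void *). */
	if (align < sizeof(void *))
		align = sizeof(void *);
	if (posix_memalign(&buf, align, len) != 0)
		return NULL;
	assert(((uintptr_t)buf & (mem_align - 1)) == 0);
	return buf;
}
```

The caller frees the buffer with free().  File offsets and lengths still
have to be checked separately against stx_dio_offset_align.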

> ---
>  fs/xfs/xfs_iops.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 29f5b8b8aca69a..bac3f56141801e 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -605,6 +605,15 @@ xfs_vn_getattr(
>   stat->blksize = BLKDEV_IOSIZE;
>   stat->rdev = inode->i_rdev;
>   break;
> + case S_IFREG:
> + if (request_mask & STATX_DIOALIGN) {
> + struct xfs_buftarg  *target = xfs_inode_buftarg(ip);
> +
> + stat->result_mask |= STATX_DIOALIGN;
> + stat->dio_mem_align = target->bt_logical_sectorsize;
> + stat->dio_offset_align = target->bt_logical_sectorsize;
> + }
> + fallthrough;
>   default:
>   stat->blksize = xfs_stat_blksize(ip);
>   stat->rdev = 0;
> -- 
> 2.37.0
> 



Re: [f2fs-dev] [man-pages RFC PATCH] statx.2, open.2: document STATX_DIOALIGN

2022-06-23 Thread Darrick J. Wong

On Thu, Jun 16, 2022 at 01:21:41PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Document the proposed STATX_DIOALIGN support for statx()
> (https://lore.kernel.org/linux-fsdevel/20220616201506.124209-1-ebigg...@kernel.org).
> 
> Signed-off-by: Eric Biggers 
> ---
>  man2/open.2  | 43 ---
>  man2/statx.2 | 32 +++-
>  2 files changed, 63 insertions(+), 12 deletions(-)
> 
> diff --git a/man2/open.2 b/man2/open.2
> index d1485999f..ef29847c3 100644
> --- a/man2/open.2
> +++ b/man2/open.2
> @@ -1732,21 +1732,42 @@ of user-space buffers and the file offset of I/Os.
>  In Linux alignment
>  restrictions vary by filesystem and kernel version and might be
>  absent entirely.
> -However there is currently no filesystem\-independent
> -interface for an application to discover these restrictions for a given
> -file or filesystem.
> -Some filesystems provide their own interfaces
> -for doing so, for example the
> +The handling of misaligned
> +.B O_DIRECT
> +I/Os also varies; they can either fail with
> +.B EINVAL
> +or fall back to buffered I/O.
> +.PP
> +Since Linux 5.20,
> +.B O_DIRECT
> +support and alignment restrictions for a file can be queried using
> +.BR statx (2),
> +using the
> +.B STATX_DIOALIGN
> +flag.
> +Support for
> +.B STATX_DIOALIGN
> +varies by filesystem; see
> +.BR statx (2).
> +.PP
> +Some filesystems provide their own interfaces for querying
> +.B O_DIRECT
> +alignment restrictions, for example the
>  .B XFS_IOC_DIOINFO
>  operation in
>  .BR xfsctl (3).
> +.B STATX_DIOALIGN
> +should be used instead when it is available.
>  .PP
> -Under Linux 2.4, transfer sizes, the alignment of the user buffer,
> -and the file offset must all be multiples of the logical block size
> -of the filesystem.
> -Since Linux 2.6.0, alignment to the logical block size of the
> -underlying storage (typically 512 bytes) suffices.
> -The logical block size can be determined using the
> +If none of the above is available, then direct I/O support and alignment
> +restrictions can only be assumed from known characteristics of the filesystem,
> +the individual file, the underlying storage device(s), and the kernel version.
> +In Linux 2.4, most block device based filesystems require that the file offset
> +and the length and memory address of all I/O segments be multiples of the
> +filesystem block size (typically 4096 bytes).
> +In Linux 2.6.0, this was relaxed to the logical block size of the block device
> +(typically 512 bytes).
> +A block device's logical block size can be determined using the
>  .BR ioctl (2)
>  .B BLKSSZGET
>  operation or from the shell using the command:
> diff --git a/man2/statx.2 b/man2/statx.2
> index a8620be6f..fff0a63ec 100644
> --- a/man2/statx.2
> +++ b/man2/statx.2
> @@ -61,7 +61,12 @@ struct statx {
> containing the filesystem where the file resides */
>  __u32 stx_dev_major;   /* Major ID */
>  __u32 stx_dev_minor;   /* Minor ID */
> +
>  __u64 stx_mnt_id;  /* Mount ID */
> +
> +/* Direct I/O alignment restrictions */
> +__u32 stx_dio_mem_align;
> +__u32 stx_dio_offset_align;
>  };
>  .EE
>  .in
> @@ -244,8 +249,11 @@ STATX_SIZE   Want stx_size
>  STATX_BLOCKS Want stx_blocks
>  STATX_BASIC_STATS[All of the above]
>  STATX_BTIME  Want stx_btime
> +STATX_ALLThe same as STATX_BASIC_STATS | STATX_BTIME.
> + This is deprecated and should not be used.

STATX_ALL is deprecated??  I was under the impression that _ALL meant
all the known bits for that kernel release, but...

>  STATX_MNT_ID Want stx_mnt_id (since Linux 5.8)

...I guess that is not correct.

> -STATX_ALL[All currently available fields]
> +STATX_DIOALIGN   Want stx_dio_mem_align and stx_dio_offset_align
> + (since Linux 5.20; support varies by filesystem)
>  .TE
>  .in
>  .PP
> @@ -406,6 +414,28 @@ This is the same number reported by
>  .BR name_to_handle_at (2)
>  and corresponds to the number in the first field in one of the records in
>  .IR /proc/self/mountinfo .
> +.TP
> +.I stx_dio_mem_align
> +The alignment (in bytes) required for user memory buffers for direct I/O
> +.BR "" ( O_DIRECT )
> +on this file. or 0 if direct I/O is not supported on this file.

"...on this file, or 0 if..."

> +.IP
> +.B STATX_DIOALIGN
> +.IR "" ( stx_dio_mem_align
> +and
> +.IR stx_dio_offset_align )
> +is supported on block devices since Linux 5.20.
> +The support on regular files varies by filesystem; it is supported by ext4 and
> +f2fs since Linux 5.20.

If the VFS changes don't provoke further bikeshedding, I'll contribute
an XFS patch to go with your series.

--D

> +.TP
> +.I stx_dio_offset_align
> +The alignment (in bytes) required for file offsets and I/O segment lengths for
> +direct I/O
> +.BR "" ( O_DIRECT )
> +on this file, or 0 if direct I/O is not supported on this file.
> +This will only be nonzero if
> +.I stx_dio_mem_align
> +is nonzero, and vice

Re: [f2fs-dev] [PATCH v3 1/8] statx: add direct I/O alignment information

2022-06-23 Thread Darrick J. Wong

On Thu, Jun 16, 2022 at 01:14:59PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple.  For both block devices and regular files, DIO had
> to be aligned to the logical block size of the block device.
> 
> However, due to filesystem features that have been added over time (e.g.
> multi-device support, data journalling, inline data, encryption, verity,
> compression, checkpoint disabling, log-structured mode), the conditions
> for when DIO is allowed on a regular file have gotten increasingly
> complex.  Whether a particular regular file supports DIO, and with what
> alignment, can depend on various file attributes and filesystem mount
> options, as well as which block device(s) the file's data is located on.
> 
> Moreover, the general rule of DIO needing to be aligned to the block
> device's logical block size is being relaxed to allow user buffers (but
> not file offsets) aligned to the DMA alignment instead
> (https://lore.kernel.org/linux-block/20220610195830.3574005-1-kbu...@fb.com/T/#u).
> 
> XFS has an ioctl XFS_IOC_DIOINFO that exposes DIO alignment information.
> Uplifting this to the VFS is one possibility.  However, as discussed
> (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebigg...@kernel.org/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code.  It was also never intended to indicate when a file
> doesn't support DIO at all, nor was it intended for block devices.
> 
> Therefore, let's expose this information via statx().  Add the
> STATX_DIOALIGN flag and two new statx fields associated with it:
> 
> * stx_dio_mem_align: the alignment (in bytes) required for user memory
>   buffers for DIO, or 0 if DIO is not supported on the file.
> 
> * stx_dio_offset_align: the alignment (in bytes) required for file
>   offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>   on the file.  This will only be nonzero if stx_dio_mem_align is
>   nonzero, and vice versa.
> 
> Note that as with other statx() extensions, if STATX_DIOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the file is neither a regular file nor a block
> device, or if the file is a regular file and the filesystem doesn't
> support STATX_DIOALIGN.  It might also happen if the caller didn't
> include STATX_DIOALIGN in the request mask, since statx() isn't required
> to return unrequested information.
> 
> This commit only adds the VFS-level plumbing for STATX_DIOALIGN.  For
> regular files, individual filesystems will still need to add code to
> support it.  For block devices, a separate commit will wire it up too.
> 
> Signed-off-by: Eric Biggers 
> ---
>  fs/stat.c | 2 ++
>  include/linux/stat.h  | 2 ++
>  include/uapi/linux/stat.h | 4 +++-
>  3 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 9ced8860e0f35..a7930d7444830 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>   tmp.stx_dev_major = MAJOR(stat->dev);
>   tmp.stx_dev_minor = MINOR(stat->dev);
>   tmp.stx_mnt_id = stat->mnt_id;
> + tmp.stx_dio_mem_align = stat->dio_mem_align;
> + tmp.stx_dio_offset_align = stat->dio_offset_align;
>  
>   return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>  }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..ff277ced50e9f 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,8 @@ struct kstat {
>   struct timespec64 btime;/* File creation time */
>   u64 blocks;
>   u64 mnt_id;
> + u32 dio_mem_align;
> + u32 dio_offset_align;

Hmm.  Does the XFS port of XFS_IOC_DIOINFO to STATX_DIOALIGN look like
this?

struct xfs_buftarg  *target = xfs_inode_buftarg(ip);

kstat.dio_mem_align = target->bt_logical_sectorsize;
kstat.dio_offset_align = target->bt_logical_sectorsize;
kstat.result_mask |= STATX_DIOALIGN;

And I guess you're tabling the "optimal" IO discussions for now, because
there are too many variants of what that means?

--D

>  };
>  
>  #endif
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 1500a0f58041a..7cab2c65d3d7f 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -124,7 +124,8 @@ struct statx {
>   __u32   stx_dev_minor;
>   /* 0x90 */
>   __u64   stx_mnt_id;
> - __u64   __spare2;
> + __u32   stx_dio_mem_align;  /* Memory buffer alignment for direct I/O */
> + __u32   stx_dio_offset_align;   /* File offset alignment for direct I/O */
>   /* 0xa0 */
>   __u64   __spare3[12];   /* Spare space for future expansion */
>   /* 0x100 */
> @@ -152,6 +153,7 @@

Re: [f2fs-dev] [PATCH v2 11/19] mm/migrate: Add filemap_migrate_folio()

2022-06-08 Thread Darrick J. Wong

On Wed, Jun 08, 2022 at 04:02:41PM +0100, Matthew Wilcox (Oracle) wrote:
> There is nothing iomap-specific about iomap_migratepage(), and it fits
> a pattern used by several other filesystems, so move it to mm/migrate.c,
> convert it to be filemap_migrate_folio() and convert the iomap filesystems
> to use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/gfs2/aops.c  |  2 +-
>  fs/iomap/buffered-io.c  | 25 -
>  fs/xfs/xfs_aops.c   |  2 +-
>  fs/zonefs/super.c   |  2 +-
>  include/linux/iomap.h   |  6 --
>  include/linux/pagemap.h |  6 ++
>  mm/migrate.c| 20 
>  7 files changed, 29 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 106e90a36583..57ff883d432c 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -774,7 +774,7 @@ static const struct address_space_operations gfs2_aops = {
>   .invalidate_folio = iomap_invalidate_folio,
>   .bmap = gfs2_bmap,
>   .direct_IO = noop_direct_IO,
> - .migratepage = iomap_migrate_page,
> + .migrate_folio = filemap_migrate_folio,
>   .is_partially_uptodate = iomap_is_partially_uptodate,
>   .error_remove_page = generic_error_remove_page,
>  };
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 66278a14bfa7..5a91aa1db945 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -489,31 +489,6 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
>  }
>  EXPORT_SYMBOL_GPL(iomap_invalidate_folio);
>  
> -#ifdef CONFIG_MIGRATION
> -int
> -iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> - struct page *page, enum migrate_mode mode)
> -{
> - struct folio *folio = page_folio(page);
> - struct folio *newfolio = page_folio(newpage);
> - int ret;
> -
> - ret = folio_migrate_mapping(mapping, newfolio, folio, 0);
> - if (ret != MIGRATEPAGE_SUCCESS)
> - return ret;
> -
> - if (folio_test_private(folio))
> - folio_attach_private(newfolio, folio_detach_private(folio));
> -
> - if (mode != MIGRATE_SYNC_NO_COPY)
> - folio_migrate_copy(newfolio, folio);
> - else
> - folio_migrate_flags(newfolio, folio);
> - return MIGRATEPAGE_SUCCESS;
> -}
> -EXPORT_SYMBOL_GPL(iomap_migrate_page);
> -#endif /* CONFIG_MIGRATION */
> -
>  static void
>  iomap_write_failed(struct inode *inode, loff_t pos, unsigned len)
>  {
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 8ec38b25187b..5d1a995b15f8 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -570,7 +570,7 @@ const struct address_space_operations xfs_address_space_operations = {
>   .invalidate_folio   = iomap_invalidate_folio,
>   .bmap   = xfs_vm_bmap,
>   .direct_IO  = noop_direct_IO,
> - .migratepage= iomap_migrate_page,
> + .migrate_folio  = filemap_migrate_folio,
>   .is_partially_uptodate  = iomap_is_partially_uptodate,
>   .error_remove_page  = generic_error_remove_page,
>   .swap_activate  = xfs_iomap_swapfile_activate,
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index bcb21aea990a..d4c3f28f34ee 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -237,7 +237,7 @@ static const struct address_space_operations zonefs_file_aops = {
>   .dirty_folio= filemap_dirty_folio,
>   .release_folio  = iomap_release_folio,
>   .invalidate_folio   = iomap_invalidate_folio,
> - .migratepage= iomap_migrate_page,
> + .migrate_folio  = filemap_migrate_folio,
>   .is_partially_uptodate  = iomap_is_partially_uptodate,
>   .error_remove_page  = generic_error_remove_page,
>   .direct_IO  = noop_direct_IO,
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index e552097c67e0..758a1125e72f 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -231,12 +231,6 @@ void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
> -#ifdef CONFIG_MIGRATION
> -int iomap_migrate_page(struct address_space *mapping, struct page *newpage,
> - struct page *page, enum migrate_mode mode);
> -#else
> -#define iomap_migrate_p

Re: [f2fs-dev] [PATCH] f2fs: add sysfs entry to avoid FUA

2022-05-27 Thread Darrick J. Wong

On Fri, May 27, 2022 at 06:06:08PM -0700, Jaegeuk Kim wrote:
> On 05/27, Eric Biggers wrote:
> > [+Cc linux-block for FUA, and linux-xfs for iomap]
> > 
> > On Fri, May 27, 2022 at 01:59:55PM -0700, Jaegeuk Kim wrote:
> > > Some UFS storage gives slower performance on FUA than write+cache_flush.
> > > Let's give a way to manage it.
> > > 
> > > Signed-off-by: Jaegeuk Kim 
> > 
> > > Should the driver even be saying that it has FUA support in this case?  If the
> > > driver didn't claim FUA support, that would also solve this problem.
> 
> I think there's still some benefit to use FUA such as small chunk writes
> for checkpoint.
> 
> > 
> > > ---
> > >  Documentation/ABI/testing/sysfs-fs-f2fs | 7 +++
> > >  fs/f2fs/data.c  | 2 ++
> > >  fs/f2fs/f2fs.h  | 1 +
> > >  fs/f2fs/sysfs.c | 2 ++
> > >  4 files changed, 12 insertions(+)
> > > 
> > > > diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
> > > index 9b583dd0298b..cd96b09d7182 100644
> > > --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> > > +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> > > @@ -434,6 +434,7 @@ Date: April 2020
> > >  Contact: "Daeho Jeong" 
> > > >  Description: Give a way to change iostat_period time. 3secs by default.
> > >   The new iostat trace gives stats gap given the period.
> > > +
> > >  What:/sys/fs/f2fs//max_io_bytes
> > >  Date:December 2020
> > >  Contact: "Jaegeuk Kim" 
> > > > @@ -442,6 +443,12 @@ Description: This gives a control to limit the bio size in f2fs.
> > >   whereas, if it has a certain bytes value, f2fs won't submit a
> > >   bio larger than that size.
> > >  
> > > +What:/sys/fs/f2fs//no_fua_dio
> > > +Date:May 2022
> > > +Contact: "Jaegeuk Kim" 
> > > > +Description: This gives a signal to iomap, which should not use FUA for
> > > + direct IOs. Default: 0.
> > 
> > > iomap is an implementation detail, so it shouldn't be mentioned in UAPI
> > > documentation.  UAPI documentation should describe user-visible behavior only.
> 
> Ok.
> 
> > 
> > > +
> > >  What:/sys/fs/f2fs//stat/sb_status
> > >  Date:December 2020
> > >  Contact: "Chao Yu" 
> > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> > > index f5f2b7233982..23486486eab2 100644
> > > --- a/fs/f2fs/data.c
> > > +++ b/fs/f2fs/data.c
> > > > @@ -4153,6 +4153,8 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > >   if ((inode->i_state & I_DIRTY_DATASYNC) ||
> > >   offset + length > i_size_read(inode))
> > >   iomap->flags |= IOMAP_F_DIRTY;
> > > + if (F2FS_I_SB(inode)->no_fua_dio)
> > > + iomap->flags |= IOMAP_F_DIRTY;
> > 
> > > This is overloading the IOMAP_F_DIRTY flag to mean something other than dirty.
> > Perhaps this flag needs to be renamed, or a new flag should be added?
> 
> I'm not sure it's acceptable to add another flag for f2fs only.

I think Al and willy have been throwing around patches to tell
iomap_dio_rw or someone that the caller will handle cache flushes and
that it shouldn't initiate them on its own; would that help here?

--D

> > 
> > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > > index e10838879538..c2400ea0080b 100644
> > > --- a/fs/f2fs/f2fs.h
> > > +++ b/fs/f2fs/f2fs.h
> > > @@ -1671,6 +1671,7 @@ struct f2fs_sb_info {
> > >   int dir_level;  /* directory level */
> > >   int readdir_ra; /* readahead inode in readdir */
> > >   u64 max_io_bytes;   /* max io bytes to merge IOs */
> > > + int no_fua_dio; /* avoid FUA in DIO */
> > 
> > Make this a bool?
> 
> Done.
> 
> > 
> > > diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
> > > index 4c50aedd5144..24d628ca92cc 100644
> > > --- a/fs/f2fs/sysfs.c
> > > +++ b/fs/f2fs/sysfs.c
> > > > @@ -771,6 +771,7 @@ F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, iostat_period_ms, iostat_period_ms);
> > >  #endif
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, readdir_ra, readdir_ra);
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, max_io_bytes, max_io_bytes);
> > > +F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, no_fua_dio, no_fua_dio);
> > > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_sb_info, gc_pin_file_thresh, gc_pin_file_threshold);
> > >  F2FS_RW_ATTR(F2FS_SBI, f2fs_super_block, extension_list, extension_list);
> > >  #ifdef CONFIG_F2FS_FAULT_INJECTION
> > > @@ -890,6 +891,7 @@ static struct attribute *f2fs_attrs[] = {
> > >  #endif
> > >   ATTR_LIST(readdir_ra),
> > >   ATTR_LIST(max_io_bytes),
> > > + ATTR_LIST(no_fua_dio),
> > 
> > Where is it validated that only valid values (0 or 1) can be written to this
> > file?
> 
> Added.
> 
> > 
> > - Eric



Re: [f2fs-dev] [RFC PATCH v2 1/7] statx: add I/O alignment information

2022-05-27 Thread Darrick J. Wong

On Fri, May 27, 2022 at 11:02:46AM +0200, Florian Weimer wrote:
> * Eric Biggers:
> 
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index 1500a0f58041a..f822b23e81091 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -124,9 +124,13 @@ struct statx {
> > __u32   stx_dev_minor;
> > /* 0x90 */
> > __u64   stx_mnt_id;
> > -   __u64   __spare2;
> > +   __u32   stx_mem_align_dio;  /* Memory buffer alignment for direct I/O */
> > +   __u32   stx_offset_align_dio;   /* File offset alignment for direct I/O */
> > /* 0xa0 */
> > -   __u64   __spare3[12];   /* Spare space for future expansion */
> > +   __u32   stx_offset_align_optimal; /* Optimal file offset alignment for I/O */
> > +   __u32   __spare2;
> > +   /* 0xa8 */
> > +   __u64   __spare3[11];   /* Spare space for future expansion */
> > /* 0x100 */
> >  };
> 
> Are 32 bits enough?  Would it make sense to store the base-2 logarithm
> instead?

I don't think a log2 will work here; XFS will want to report things like
RAID stripe sizes, which can be any multiple of the fs blocksize.

32 bits is probably enough, seeing as the kernel won't do an IO larger
than 2GB anyway.

--D
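
To make the log2 objection concrete, here is a small sketch (not from the
thread).  The figure 12288 is a made-up example of a stripe spanning three
4096-byte blocks.  It is a multiple of the fs blocksize but not a power of
two, so a base-2 logarithm cannot represent it exactly.  Both helper names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v has exactly one bit set. */
static bool is_power_of_two(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* floor(log2(v)) for v > 0: the value a log2-encoded field would hold. */
static unsigned int floor_log2(uint32_t v)
{
	unsigned int n = 0;

	assert(v != 0);
	while (v >>= 1)
		n++;
	return n;
}
```

Encoding 12288 as floor_log2() and decoding it again yields 8192, so the
stripe size would be silently misreported.  A plain 32-bit byte count
avoids that, and 2GB (2^31) still fits.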

> Thanks,
> Florian
> 



Re: [f2fs-dev] [RFC PATCH v2 1/7] statx: add I/O alignment information

2022-05-19 Thread Darrick J. Wong

On Wed, May 18, 2022 at 04:50:05PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Traditionally, the conditions for when DIO (direct I/O) is supported
> were fairly simple: filesystems either supported DIO aligned to the
> block device's logical block size, or didn't support DIO at all.
> 
> However, due to filesystem features that have been added over time (e.g,
> data journalling, inline data, encryption, verity, compression,
> checkpoint disabling, log-structured mode), the conditions for when DIO
> is allowed on a file have gotten increasingly complex.  Whether a
> particular file supports DIO, and with what alignment, can depend on
> various file attributes and filesystem mount options, as well as which
> block device(s) the file's data is located on.
> 
> XFS has an ioctl XFS_IOC_DIOINFO which exposes this information to
> applications.  However, as discussed
> (https://lore.kernel.org/linux-fsdevel/20220120071215.123274-1-ebigg...@kernel.org/T/#u),
> this ioctl is rarely used and not known to be used outside of
> XFS-specific code.  It also was never intended to indicate when a file
> doesn't support DIO at all, and it only exposes the minimum I/O
> alignment, not the optimal I/O alignment which has been requested too.
> 
> Therefore, let's expose this information via statx().  Add the
> STATX_IOALIGN flag and three fields associated with it:
> 
> * stx_mem_align_dio: the alignment (in bytes) required for user memory
>   buffers for DIO, or 0 if DIO is not supported on the file.
> 
> * stx_offset_align_dio: the alignment (in bytes) required for file
>   offsets and I/O segment lengths for DIO, or 0 if DIO is not supported
>   on the file.  This will only be nonzero if stx_mem_align_dio is
>   nonzero, and vice versa.
> 
> * stx_offset_align_optimal: the alignment (in bytes) suggested for file
>   offsets and I/O segment lengths to get optimal performance.  This
>   applies to both DIO and buffered I/O.  It differs from stx_blocksize
>   in that stx_offset_align_optimal will contain the real optimum I/O
>   size, which may be a large value.  In contrast, for compatibility
>   reasons stx_blocksize is the minimum size needed to avoid page cache
>   read/write/modify cycles, which may be much smaller than the optimum
>   I/O size.  For more details about the motivation for this field, see
>   https://lore.kernel.org/r/20220210040304.gm59...@dread.disaster.area

Hmm.  So I guess this is supposed to be the filesystem's best guess at
the IO size that will minimize RMW cycles in the entire stack?  i.e. if
the user does not want RMW of pagecache pages, of file allocation units
(if COW is enabled), of RAID stripes, or in the storage itself, then it
should ensure that all IOs are aligned to this value?

I guess that means for XFS it's effectively max(pagesize, i_blocksize,
bdev io_opt, sb_width, and (pretend XFS can reflink the realtime volume)
the rt extent size)?  I didn't see a manpage update for statx(2) but
that's mostly what I'm interested in. :)
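
Roughly, as a userspace sketch: the struct and its field names below are
hypothetical stand-ins for the per-layer sizes listed above, all in
bytes, with 0 meaning "not applicable".  Note that taking the max only
gives a value that is aligned for every layer if the smaller sizes all
divide the largest one, which this sketch simply assumes:

```c
#include <stdint.h>

/* Hypothetical bundle of the sizes named above, in bytes (0 = unset). */
struct io_geometry {
	uint32_t page_size;
	uint32_t fs_block_size;		/* i_blocksize */
	uint32_t bdev_io_opt;		/* block device optimal I/O size */
	uint32_t stripe_width;		/* sb_width, converted to bytes */
	uint32_t rt_extent_size;	/* realtime extent size */
};

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Best guess at the alignment that avoids RMW in every layer. */
static uint32_t guess_optimal_align(const struct io_geometry *g)
{
	uint32_t v = g->page_size;

	v = max_u32(v, g->fs_block_size);
	v = max_u32(v, g->bdev_io_opt);
	v = max_u32(v, g->stripe_width);
	v = max_u32(v, g->rt_extent_size);
	return v;
}
```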

Looking ahead, the ext4/f2fs implementations only seem to be returning
max(i_blocksize, bdev io_opt)?  But not the pagesize?  Did I
misunderstand this, then?

(The plumbing changes in this patch look ok.)

--D

> Note that as with other statx() extensions, if STATX_IOALIGN isn't set
> in the returned statx struct, then these new fields won't be filled in.
> This will happen if the filesystem doesn't support STATX_IOALIGN, or if
> the file isn't a regular file.  (It might be supported on block device
> files in the future.)  It might also happen if the caller didn't include
> STATX_IOALIGN in the request mask, since statx() isn't required to
> return information that wasn't requested.
> 
> This commit adds the VFS-level plumbing for STATX_IOALIGN.  Individual
> filesystems will still need to add code to support it.
> 
> Signed-off-by: Eric Biggers 
> ---
>  fs/stat.c | 3 +++
>  include/linux/stat.h  | 3 +++
>  include/uapi/linux/stat.h | 9 +++--
>  3 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 5c2c94464e8b0..9d477218545b8 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -611,6 +611,9 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
>   tmp.stx_dev_major = MAJOR(stat->dev);
>   tmp.stx_dev_minor = MINOR(stat->dev);
>   tmp.stx_mnt_id = stat->mnt_id;
> + tmp.stx_mem_align_dio = stat->mem_align_dio;
> + tmp.stx_offset_align_dio = stat->offset_align_dio;
> + tmp.stx_offset_align_optimal = stat->offset_align_optimal;
>  
>   return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
>  }
> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 7df06931f25d8..48b8b1ad1567c 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -50,6 +50,9 @@ struct kstat {
>   struct timespec64 btime;/* File creation time */
>   u64 blocks;
>   u64 mnt_id;
>

Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-20 Thread Darrick J. Wong

On Fri, Jan 21, 2022 at 10:57:55AM +1100, Dave Chinner wrote:
> On Thu, Jan 20, 2022 at 02:48:52PM -0800, Eric Biggers wrote:
> > On Fri, Jan 21, 2022 at 09:04:14AM +1100, Dave Chinner wrote:
> > > On Thu, Jan 20, 2022 at 01:00:27PM -0800, Darrick J. Wong wrote:
> > > > On Thu, Jan 20, 2022 at 12:39:14PM -0800, Eric Biggers wrote:
> > > > > On Thu, Jan 20, 2022 at 09:10:27AM -0800, Darrick J. Wong wrote:
> > > > > > On Thu, Jan 20, 2022 at 12:30:23AM -0800, Christoph Hellwig wrote:
> > > > > > > On Wed, Jan 19, 2022 at 11:12:10PM -0800, Eric Biggers wrote:
> > > > > > > > 
> > > > > > > > Given the above, as far as I know the only remaining objection to this
> > > > > > > > patchset would be that DIO constraints aren't sufficiently discoverable
> > > > > > > > by userspace.  Now, to put this in context, this is a longstanding issue
> > > > > > > > with all Linux filesystems, except XFS which has XFS_IOC_DIOINFO.  It's
> > > > > > > > not specific to this feature, and it doesn't actually seem to be too
> > > > > > > > important in practice; many other filesystem features place constraints
> > > > > > > > on DIO, and f2fs even *only* allows fully FS block size aligned DIO.
> > > > > > > > (And for better or worse, many systems using fscrypt already have
> > > > > > > > out-of-tree patches that enable DIO support, and people don't seem to
> > > > > > > > have trouble with the FS block size alignment requirement.)
> > > > > > > 
> > > > > > > It might make sense to use this as an opportunity to implement
> > > > > > > XFS_IOC_DIOINFO for ext4 and f2fs.
> > > > > > 
> > > > > > Hmm.  A potential problem with DIOINFO is that it doesn't explicitly
> > > > > > list the /file/ position alignment requirement:
> > > > > > 
> > > > > > > struct dioattr {
> > > > > > > __u32   d_mem;  /* data buffer memory alignment */
> > > > > > > __u32   d_miniosz;  /* min xfer size */
> > > > > > > __u32   d_maxiosz;  /* max xfer size */
> > > > > > > };
> > > > > 
> > > > > Well, the comment above struct dioattr says:
> > > > > 
> > > > >   /*
> > > > >* Direct I/O attribute record used with XFS_IOC_DIOINFO
> > > > >* d_miniosz is the min xfer size, xfer size multiple and file 
> > > > > seek offset
> > > > >* alignment.
> > > > >*/
> > > > > 
> > > > > So d_miniosz serves that purpose already.
> > > > > 
> > > > > > 
> > > > > > > Since I /think/ fscrypt requires that directio writes be aligned to file
> > > > > > > block size, right?
> > > > > > 
> > > > > > The file position must be a multiple of the filesystem block size, yes.
> > > > > > Likewise for the "minimum xfer size" and "xfer size multiple", and the "data
> > > > > > buffer memory alignment" for that matter.  So I think XFS_IOC_DIOINFO would be
> > > > > > good enough for the fscrypt direct I/O case.
> > > > 
> > > > Oh, ok then.  In that case, just hoist XFS_IOC_DIOINFO to the VFS and
> > > > add a couple of implementations for ext4 and f2fs, and I think that'll
> > > > be enough to get the fscrypt patchset moving again.
> > > 
> > > On the contrary, I'd much prefer to see this information added to
> > > statx(). The file offset alignment info is a property of the current
> > > file (e.g. XFS can have different per-file requirements depending on
> > > whether the file data is hosted on the data or RT device, etc) and
> > &

Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-20 Thread Darrick J. Wong

On Thu, Jan 20, 2022 at 12:39:14PM -0800, Eric Biggers wrote:
> On Thu, Jan 20, 2022 at 09:10:27AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 20, 2022 at 12:30:23AM -0800, Christoph Hellwig wrote:
> > > On Wed, Jan 19, 2022 at 11:12:10PM -0800, Eric Biggers wrote:
> > > > 
> > > > Given the above, as far as I know the only remaining objection to this
> > > > patchset would be that DIO constraints aren't sufficiently discoverable
> > > > by userspace.  Now, to put this in context, this is a longstanding issue
> > > > with all Linux filesystems, except XFS which has XFS_IOC_DIOINFO.  It's
> > > > not specific to this feature, and it doesn't actually seem to be too
> > > > important in practice; many other filesystem features place constraints
> > > > on DIO, and f2fs even *only* allows fully FS block size aligned DIO.
> > > > (And for better or worse, many systems using fscrypt already have
> > > > out-of-tree patches that enable DIO support, and people don't seem to
> > > > have trouble with the FS block size alignment requirement.)
> > > 
> > > It might make sense to use this as an opportunity to implement
> > > XFS_IOC_DIOINFO for ext4 and f2fs.
> > 
> > Hmm.  A potential problem with DIOINFO is that it doesn't explicitly
> > list the /file/ position alignment requirement:
> > 
> > struct dioattr {
> > __u32   d_mem;  /* data buffer memory alignment */
> > __u32   d_miniosz;  /* min xfer size*/
> > __u32   d_maxiosz;  /* max xfer size*/
> > };
> 
> Well, the comment above struct dioattr says:
> 
>   /*
>    * Direct I/O attribute record used with XFS_IOC_DIOINFO
>    * d_miniosz is the min xfer size, xfer size multiple and file seek offset
>    * alignment.
>    */
> 
> So d_miniosz serves that purpose already.
> 
> > 
> > Since I /think/ fscrypt requires that directio writes be aligned to file
> > block size, right?
> 
> The file position must be a multiple of the filesystem block size, yes.
> Likewise for the "minimum xfer size" and "xfer size multiple", and the "data
> buffer memory alignment" for that matter.  So I think XFS_IOC_DIOINFO would be
> good enough for the fscrypt direct I/O case.

Oh, ok then.  In that case, just hoist XFS_IOC_DIOINFO to the VFS and
add a couple of implementations for ext4 and f2fs, and I think that'll
be enough to get the fscrypt patchset moving again.
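
For anyone following along, this is roughly how a caller would apply the
three dioattr fields, given that d_miniosz doubles as the file offset
alignment.  It is only a sketch: the struct is a local copy of the
layout quoted above so that the example is self-contained, the function
name is made up, and treating a zero field as "no DIO" is this sketch's
own guard against dividing by zero, not something DIOINFO defines:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copy of the struct dioattr layout, for illustration only. */
struct dioattr_sketch {
	uint32_t d_mem;		/* data buffer memory alignment */
	uint32_t d_miniosz;	/* min xfer size, also offset alignment */
	uint32_t d_maxiosz;	/* max xfer size */
};

/* Would a direct I/O of len bytes at pos from buf satisfy the limits? */
static bool dio_request_ok(const struct dioattr_sketch *da,
			   const void *buf, uint64_t pos, uint64_t len)
{
	/* Assumption of this sketch: zero means "cannot do DIO". */
	if (!da->d_mem || !da->d_miniosz)
		return false;
	if ((uintptr_t)buf % da->d_mem)
		return false;
	if (pos % da->d_miniosz || len % da->d_miniosz)
		return false;
	return len >= da->d_miniosz && len <= da->d_maxiosz;
}
```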

> The real question is whether there are any direct I/O implementations where
> XFS_IOC_DIOINFO would *not* be good enough, for example due to "xfer size
> multiple" != "file seek offset alignment" being allowed.  In that case we would
> need to define a new ioctl that is more general (like the one you described
> below) rather than simply uplifting XFS_IOC_DIOINFO.

I don't think there are any currently, but if anyone ever redesigns
DIOINFO we might as well make all those pieces explicit.

> More general is nice, but it's not helpful if no one will actually use the extra
> information.  So we need to figure out what is actually useful.

 Clearly I haven't wanted d_opt_fpos badly enough to propose
revving the ioctl. ;)

--D

> 
> > How about something like this:
> > 
> > struct dioattr2 {
> > __u32   d_mem;  /* data buffer memory alignment */
> > __u32   d_miniosz;  /* min xfer size*/
> > __u32   d_maxiosz;  /* max xfer size*/
> > 
> > /* file range must be aligned to this value */
> > __u32   d_min_fpos;
> > 
> > /* for optimal performance, align file range to this */
> > __u32   d_opt_fpos;
> > 
> > __u32   d_padding[11];
> > };
> > 
> 
> - Eric



Re: [f2fs-dev] [PATCH v10 0/5] add support for direct I/O with fscrypt using blk-crypto

2022-01-20 Thread Darrick J. Wong

On Thu, Jan 20, 2022 at 12:30:23AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 19, 2022 at 11:12:10PM -0800, Eric Biggers wrote:
> > 
> > Given the above, as far as I know the only remaining objection to this
> > patchset would be that DIO constraints aren't sufficiently discoverable
> > by userspace.  Now, to put this in context, this is a longstanding issue
> > with all Linux filesystems, except XFS which has XFS_IOC_DIOINFO.  It's
> > not specific to this feature, and it doesn't actually seem to be too
> > important in practice; many other filesystem features place constraints
> > on DIO, and f2fs even *only* allows fully FS block size aligned DIO.
> > (And for better or worse, many systems using fscrypt already have
> > out-of-tree patches that enable DIO support, and people don't seem to
> > have trouble with the FS block size alignment requirement.)
> 
> It might make sense to use this as an opportunity to implement
> XFS_IOC_DIOINFO for ext4 and f2fs.

Hmm.  A potential problem with DIOINFO is that it doesn't explicitly
list the /file/ position alignment requirement:

struct dioattr {
	__u32   d_mem;      /* data buffer memory alignment */
	__u32   d_miniosz;  /* min xfer size */
	__u32   d_maxiosz;  /* max xfer size */
};

Since I /think/ fscrypt requires that directio writes be aligned to file
block size, right?

> > I plan to propose a new generic ioctl to address the issue of DIO
> > constraints being insufficiently discoverable.  But until then, I'm

Which is what I suspect Eric meant by this sentence. :)

> > wondering if people are willing to consider this patchset again, or
> > whether it is considered blocked by this issue alone.  (And if this
> > patchset is still unacceptable, would it be acceptable with f2fs support
> > only, given that f2fs *already* only allows FS block size aligned DIO?)
> 
> I think the patchset looks fine, but I'd really love to have a way for
> the alignment restrictions to be discoverable from the start.

I agree.  The mechanics of the patchset look ok to me, but it's very
unfortunate that there's no way for userspace programs to ask the kernel
about the directio geometry for a file.

Ever since we added reflink to XFS I've wanted to add a way to tell
userspace that direct writes to a reflink(able) file will be much more
efficient if they can align the io request to 1 fs block instead of 1
sector.

How about something like this:

struct dioattr2 {
	__u32   d_mem;      /* data buffer memory alignment */
	__u32   d_miniosz;  /* min xfer size */
	__u32   d_maxiosz;  /* max xfer size */

	/* file range must be aligned to this value */
	__u32   d_min_fpos;

	/* for optimal performance, align file range to this */
	__u32   d_opt_fpos;

	__u32   d_padding[11];
};
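
As a sketch of what userspace might do with d_opt_fpos (nothing here is
an existing interface, and the helper names are made up): widen the
byte range of a write outward to the optimal alignment, without assuming
the alignment is a power of two.

```c
#include <stdint.h>

/* Round pos down to a multiple of align (align > 0, any value). */
static uint64_t round_down_to(uint64_t pos, uint32_t align)
{
	return pos - pos % align;
}

/* Round pos up to a multiple of align (align > 0, any value). */
static uint64_t round_up_to(uint64_t pos, uint32_t align)
{
	uint64_t rem = pos % align;

	return rem ? pos + (align - rem) : pos;
}
```

For the reflink case, d_min_fpos might be the sector size while
d_opt_fpos is the fs block size, so a write to bytes 5000..6000 would be
widened to 4096..8192 to avoid the read-modify-write.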

--D



[f2fs-dev] [GIT PULL] vfs: new code for 5.15

2021-08-31 Thread Darrick J. Wong

Hi Linus,

Please pull this single VFS patch that prevents userspace from setting
project quota ids on files that the VFS considers invalid.

This branch merges cleanly against your upstream branch as of a few
minutes ago, and does not introduce any fstests regressions for ext4 or
xfs.

--D

The following changes since commit c500bee1c5b2f1d59b1081ac879d73268ab0ff17:

  Linux 5.14-rc4 (2021-08-01 17:04:17 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/vfs-5.15-merge-1

for you to fetch changes up to d03ef4daf33a33da8d7c397102fff8ae87d04a93:

  fs: forbid invalid project ID (2021-08-03 09:48:04 -0700)


New code for 5.15:
 - Strengthen parameter checking for project quota ids.


Wang Shilong (1):
  fs: forbid invalid project ID

 fs/ioctl.c | 8 
 1 file changed, 8 insertions(+)



Re: [f2fs-dev] [PATCH 3/9] f2fs: rework write preallocations

2021-07-27 Thread Darrick J. Wong

On Tue, Jul 27, 2021 at 04:30:16PM +0800, Chao Yu wrote:
> On 2021/7/27 15:38, Eric Biggers wrote:
> > That's somewhat helpful, but I've been doing some more investigation and now I'm
> > even more confused.  How can f2fs support non-overwrite DIO writes at all
> > (meaning DIO writes in LFS mode as well as DIO writes to holes in non-LFS mode),
> > given that it has no support for unwritten extents?  AFAICS, as-is users can
> 
> I'm trying to pick up the DAX support patch created by Qiuyang from Huawei, and it
> looks like it faces the same issue, so it tries to fix this by calling
> sb_issue_zeroout() in f2fs_map_blocks() before it returns.

I really hope you don't, because zeroing the region before memcpy'ing it
is absurd.  I don't know if f2fs can do that (xfs can't really) without
pinning resources during a potentially lengthy memcpy operation, but you
/could/ allocate the space in ->iomap_begin, attach some record of that
to iomap->private, and only commit the mapping update in ->iomap_end.
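
Something along these lines, as a toy userspace model only.  None of
these types or functions are the kernel's; the names are invented to
show the ordering: space is reserved up front, the reservation rides
along in the private pointer, and the file mapping is only updated for
blocks that actually got written, so a short write never exposes
uninitialized blocks.

```c
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-ins; these are not the kernel's structures. */
struct toy_reservation {
	uint64_t start_block;
	uint64_t nr_blocks;
};

struct toy_iomap {
	void *private;		/* per-mapping cookie, like iomap->private */
};

struct toy_fs {
	uint64_t next_free;	/* trivial bump allocator */
	uint64_t mapped_blocks;	/* blocks visible in the file mapping */
};

/* "begin": allocate space but do not publish it in the file mapping. */
static int toy_iomap_begin(struct toy_fs *fs, struct toy_iomap *iomap,
			   uint64_t nr_blocks)
{
	struct toy_reservation *res = malloc(sizeof(*res));

	if (!res)
		return -1;
	res->start_block = fs->next_free;
	res->nr_blocks = nr_blocks;
	fs->next_free += nr_blocks;
	iomap->private = res;
	return 0;
}

/* "end": publish only what was written; the remainder is just dropped. */
static void toy_iomap_end(struct toy_fs *fs, struct toy_iomap *iomap,
			  uint64_t written_blocks)
{
	struct toy_reservation *res = iomap->private;

	if (written_blocks > res->nr_blocks)
		written_blocks = res->nr_blocks;
	fs->mapped_blocks += written_blocks;
	free(res);
	iomap->private = NULL;
}
```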

--D

> > easily leak uninitialized disk contents on f2fs by issuing a DIO write that
> > won't complete fully (or might not complete fully), then reading back the blocks
> > that got allocated but not written to.
> > 
> > I think that f2fs will have to take the ext2 approach of not allowing
> > non-overwrite DIO writes at all...
> Yes,
> 
> Another option is to enhance f2fs metadata's scalability which needs to update layout
> of dnode block or SSA block, after that we can record the status of unwritten data block
> there... it's a big change though...
> 
> Thanks,
> 
> > 
> > - Eric
> > 



Re: [f2fs-dev] [PATCH v9 6/9] iomap: support direct I/O with fscrypt using blk-crypto

2021-07-22 Thread Darrick J. Wong

On Fri, Jun 04, 2021 at 09:09:05PM +0000, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Set bio crypt contexts on bios by calling into fscrypt when required.
> No DUN contiguity checks are done - callers are expected to set up the
> iomap correctly to ensure that each bio submitted by iomap will not have
> blocks with incontiguous DUNs by calling fscrypt_limit_io_blocks()
> appropriately.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 

Looks like a straightforward conversion...

Acked-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/direct-io.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..1c825deb36a9 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -6,6 +6,7 @@
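>  #include <linux/module.h>
>  #include <linux/compiler.h>
>  #include <linux/fs.h>
> +#include <linux/fscrypt.h>
>  #include <linux/iomap.h>
>  #include <linux/backing-dev.h>
>  #include <linux/uio.h>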
> @@ -185,11 +186,14 @@ static void
>  iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>   unsigned len)
>  {
> + struct inode *inode = file_inode(dio->iocb->ki_filp);
>   struct page *page = ZERO_PAGE(0);
>   int flags = REQ_SYNC | REQ_IDLE;
>   struct bio *bio;
>  
>   bio = bio_alloc(GFP_KERNEL, 1);
> + fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> +   GFP_KERNEL);
>   bio_set_dev(bio, iomap->bdev);
>   bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>   bio->bi_private = dio;
> @@ -306,6 +310,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   }
>  
>   bio = bio_alloc(GFP_KERNEL, nr_pages);
> + fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> +   GFP_KERNEL);
>   bio_set_dev(bio, iomap->bdev);
>   bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>   bio->bi_write_hint = dio->iocb->ki_hint;
> -- 
> 2.32.0.rc1.229.g3e70b5a671-goog
> 



Re: [f2fs-dev] [PATCH 0/14 v10] fs: Hole punch vs page cache filling races

2021-07-16 Thread Darrick J. Wong

On Fri, Jul 16, 2021 at 07:02:19AM +0100, Christoph Hellwig wrote:
> On Thu, Jul 15, 2021 at 03:40:10PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > here is another version of my patches to address races between hole punching
> > and page cache filling functions for ext4 and other filesystems. The only
> > change since the last time is a small cleanup applied to changes of
> > filemap_fault() in patch 3/14 based on Christoph's & Darrick's feedback (thanks
> > guys!).  Darrick, Christoph, is the patch fine now?
> 
> Looks fine to me.

Me too.

--D



Re: [f2fs-dev] [PATCH v4] fs: forbid invalid project ID

2021-07-13 Thread Darrick J. Wong

On Sat, Jul 10, 2021 at 10:39:59PM +0800, Wang Shilong wrote:
> From: Wang Shilong 
> 
> fileattr_set_prepare() should check if project ID
> is valid, otherwise dqget() will return NULL for
> such project ID quota.
> 
> Signed-off-by: Wang Shilong 
> ---
> v3->v4:
> only check project Id if caller is allowed
> to change and being changed.
> 
> v2->v3: move check before @fsx_projid is accessed
> and use make_kprojid() helper.
> 
> v1->v2: try to fix in the VFS
>  fs/ioctl.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 1e2204fa9963..d4fabb5421cd 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -817,6 +817,14 @@ static int fileattr_set_prepare(struct inode *inode,
>   if ((old_ma->fsx_xflags ^ fa->fsx_xflags) &
>   FS_XFLAG_PROJINHERIT)
>   return -EINVAL;
> + } else {
> + /*
> +  * Caller is allowed to change the project ID. If it is being
> +  * changed, make sure that the new value is valid.
> +  */
> + if (old_ma->fsx_projid != fa->fsx_projid &&
> + !projid_valid(make_kprojid(&init_user_ns, fa->fsx_projid)))
> + return -EINVAL;

Hmm, for XFS this is sort of a userspace-breaking change in the sense
that (technically) we've never rejected -1 before.  xfs_quota won't have
anything to do with that, and (assuming I read the helper/macro
gooeyness correctly) the vfs quota code won't either, so

Reviewed-by: Darrick J. Wong 
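
For reference, the new check boils down to the following, written as a
userspace sketch.  The names are made up, and the user namespace mapping
is ignored on the assumption that the initial namespace maps project ids
one to one; the invalid id is (u32)-1, as in the kernel's INVALID_PROJID:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_PROJID	((uint32_t)-1)

static bool sketch_projid_valid(uint32_t projid)
{
	return projid != SKETCH_INVALID_PROJID;
}

/* Returns 0 if the change is acceptable, -EINVAL otherwise. */
static int sketch_check_projid(uint32_t old_projid, uint32_t new_projid)
{
	/* Only a project id that is actually being changed is validated. */
	if (old_projid != new_projid && !sketch_projid_valid(new_projid))
		return -EINVAL;
	return 0;
}
```

So a file that somehow already carries -1 can still have its other
attributes updated, but nobody can newly set -1.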

--D

>   }
>  
>   /* Check extent size hints. */
> -- 
> 2.27.0
> 



Re: [f2fs-dev] [PATCH 03/14] mm: Protect operations adding pages to page cache with invalidate_lock

2021-07-12 Thread Darrick J. Wong

On Mon, Jul 12, 2021 at 06:55:54PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)                 read / fault / ..
>   truncate_inode_pages_range()
>                                                   <create pages in page
>                                                    cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Reviewed-by: Darrick J. Wong 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 62 +--
>  fs/inode.c|  2 +
>  include/linux/fs.h| 33 ++
>  mm/filemap.c  | 88 +++
>  mm/readahead.c|  2 +
>  mm/rmap.c | 37 +--
>  mm/truncate.c |  3 +-
>  7 files changed, 176 insertions(+), 51 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index cdf15492c699..38a3097b6f1c 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -====================== ======================== =========
> -ops                    PageLocked(page)         i_rwsem
> -====================== ======================== =========
> +====================== ======================== ========= ===============
> +ops                    PageLocked(page)         i_rwsem   invalidate_lock
> +====================== ======================== ========= ===============
>  writepage:             yes, unlocks (see below)
> -readpage:              yes, unlocks
> +readpage:              yes, unlocks                       shared
>  writepages:
>  set_page_dirty         no
> -readahead:             yes, unlocks
> -readpages:             no
> +readahead:             yes, unlocks                       shared
> +readpages:             no                                 shared
>  write_begin:           locks the page           exclusive
>  write_end:             yes, unlocks             exclusive
>  bmap:
> -invalidatepage:        yes
> +invalidatepage:        yes                                exclusive
>  releasepage:           yes
>  freepage:              yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem must exclusively acquire
> +invalidate_lock before invalidating page cache in truncate / hole punch path
> +(and thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,25 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the disk. Si

Re: [f2fs-dev] [PATCH 02/14] documentation: Sync file_operations members with reality

2021-07-12 Thread Darrick J. Wong

On Mon, Jul 12, 2021 at 06:55:53PM +0200, Jan Kara wrote:
> Sync listing of struct file_operations members with the real one in
> fs.h.
> 
> Acked-by: Darrick J. Wong 

Might as well upgrade this to:
Reviewed-by: Darrick J. Wong 

--D

> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 2183fd8cc350..cdf15492c699 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -506,6 +506,7 @@ prototypes::
>   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
>   ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>   ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
> + int (*iopoll) (struct kiocb *kiocb, bool spin);
>   int (*iterate) (struct file *, struct dir_context *);
>   int (*iterate_shared) (struct file *, struct dir_context *);
>   __poll_t (*poll) (struct file *, struct poll_table_struct *);
> @@ -518,12 +519,6 @@ prototypes::
>   int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
>   int (*fasync) (int, struct file *, int);
>   int (*lock) (struct file *, int, struct file_lock *);
> - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
> - loff_t *);
> - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
> - loff_t *);
> - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
> - void __user *);
>   ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
>   loff_t *, int);
>   unsigned long (*get_unmapped_area)(struct file *, unsigned long,
> @@ -536,6 +531,14 @@ prototypes::
>   size_t, unsigned int);
>   int (*setlease)(struct file *, long, struct file_lock **, void **);
>   long (*fallocate)(struct file *, int, loff_t, loff_t);
> + void (*show_fdinfo)(struct seq_file *m, struct file *f);
> + unsigned (*mmap_capabilities)(struct file *);
> + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
> + loff_t, size_t, unsigned int);
> + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t len, unsigned int remap_flags);
> + int (*fadvise)(struct file *, loff_t, loff_t, int);
>  
>  locking rules:
>   All may block.
> -- 
> 2.26.2
> 



Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-17 Thread Darrick J. Wong

On Thu, Jun 17, 2021 at 09:29:20AM -0700, Darrick J. Wong wrote:
> On Wed, Jun 16, 2021 at 05:57:12PM +0200, Jan Kara wrote:
> > On Wed 16-06-21 08:47:05, Darrick J. Wong wrote:
> > > On Wed, Jun 16, 2021 at 10:53:04AM +0200, Jan Kara wrote:
> > > > On Wed 16-06-21 06:37:12, Christoph Hellwig wrote:
> > > > > On Tue, Jun 15, 2021 at 11:17:57AM +0200, Jan Kara wrote:
> > > > > > From: Pavel Reichl 
> > > > > > 
> > > > > > Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> > > > > > __xfs_rwsem_islocked() is a helper function which encapsulates checking
> > > > > > state of rw_semaphores held by inode.
> > > > > 
> > > > > __xfs_rwsem_islocked doesn't seem to actually exist in any tree I
> > > > > checked yet?
> > > > 
> > > > __xfs_rwsem_islocked is introduced by this patch so I'm not sure what
> > > > you are asking about... :)
> > > 
> > > The sentence structure implies that __xfs_rwsem_islocked was previously
> > > introduced.  You might change the commit message to read:
> > > 
> > > "Introduce a new __xfs_rwsem_islocked predicate to encapsulate checking
> > > the state of a rw_semaphore, then refactor xfs_isilocked to use it."
> > > 
> > > Since it's not quite a straight copy-paste of the old code.
> > 
> > Ah, ok. Sure, I can rephrase the changelog (or we can just update it on
> > commit if that's the only problem with this series...). Oh, now I've
> > remembered I've promised you a branch to pull :) Here it is with this
> > change and Christoph's Reviewed-by tags:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git hole_punch_fixes
> 
> To catch-up the list with the ext4 concall:
> 
> Dave Chinner and I have been experimenting with accepting tagged pull
> requests, where the tag message is the most recent cover letter so that
> the git history can capture the broader justification for the series and
> the development revision history.  Signed tags would be ideal too,
> though given the impossibility of meeting in person to exchange gnupg
> keys (and the fact that one has to verify that the patches in the branch
> more or less match what's on the list) I don't consider that an
> impediment.
> 
> Also, if you want me to take this through the xfs tree then it would
> make things much easier if you could base this branch off 5.13-rc4, or
> something that won't cause a merge request to pull in a bunch of
> unrelated upstream changes.

Oh, and also: Please send pull requests as a new thread tagged '[GIT
PULL]' so the requests don't get buried in a patch reply thread.

--D

> --D
> 
> > 
> > Honza
> > -- 
> > Jan Kara 
> > SUSE Labs, CR



Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-17 Thread Darrick J. Wong

On Wed, Jun 16, 2021 at 05:57:12PM +0200, Jan Kara wrote:
> On Wed 16-06-21 08:47:05, Darrick J. Wong wrote:
> > On Wed, Jun 16, 2021 at 10:53:04AM +0200, Jan Kara wrote:
> > > On Wed 16-06-21 06:37:12, Christoph Hellwig wrote:
> > > > On Tue, Jun 15, 2021 at 11:17:57AM +0200, Jan Kara wrote:
> > > > > From: Pavel Reichl 
> > > > > 
> > > > > Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> > > > > __xfs_rwsem_islocked() is a helper function which encapsulates checking
> > > > > state of rw_semaphores hold by inode.
> > > > 
> > > > __xfs_rwsem_islocked doesn't seem to actually exist in any tree I
> > > > checked yet?
> > > 
> > > __xfs_rwsem_islocked is introduced by this patch so I'm not sure what you
> > > are asking about... :)
> > 
> > The sentence structure implies that __xfs_rwsem_islocked was previously
> > introduced.  You might change the commit message to read:
> > 
> > "Introduce a new __xfs_rwsem_islocked predicate to encapsulate checking
> > the state of a rw_semaphore, then refactor xfs_isilocked to use it."
> > 
> > Since it's not quite a straight copy-paste of the old code.
> 
> Ah, ok. Sure, I can rephrase the changelog (or we can just update it on
> commit if that's the only problem with this series...). Oh, now I've
> remembered I've promised you a branch to pull :) Here it is with this
> change and Christoph's Reviewed-by tags:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git hole_punch_fixes

To catch the list up on the ext4 concall:

Dave Chinner and I have been experimenting with accepting tagged pull
requests, where the tag message is the most recent cover letter so that
the git history can capture the broader justification for the series and
the development revision history.  Signed tags would be ideal too,
though given the impossibility of meeting in person to exchange gnupg
keys (and the fact that one has to verify that the patches in the branch
more or less match what's on the list) I don't consider that an
impediment.

Also, if you want me to take this through the xfs tree then it would
make things much easier if you could base this branch off 5.13-rc4, or
something that won't cause a merge request to pull in a bunch of
unrelated upstream changes.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [f2fs-dev] [PATCH 05/14] ext4: Convert to use mapping->invalidate_lock

2021-06-17 Thread Darrick J. Wong

On Tue, Jun 15, 2021 at 11:17:55AM +0200, Jan Kara wrote:
> Convert ext4 to use mapping->invalidate_lock instead of its private
> EXT4_I(inode)->i_mmap_sem. This is mostly search-and-replace. By this
> conversion we fix a long standing race between hole punching and read(2)
> / readahead(2) paths that can lead to stale page cache contents.
> 
> CC: 
> CC: Ted Tso 

Hmm, still no ACK from Ted?

This looks like a pretty straightforward i_mmap_sem conversion, though
in general I'd like /some/ kind of response from anyone in the ext4
community who has been writing code more recently than me...

Reviewed-by: Darrick J. Wong 

--D

> Signed-off-by: Jan Kara 
> ---
>  fs/ext4/ext4.h | 10 --
>  fs/ext4/extents.c  | 25 +---
>  fs/ext4/file.c | 13 +++--
>  fs/ext4/inode.c| 47 +-
>  fs/ext4/ioctl.c|  4 ++--
>  fs/ext4/super.c| 13 +
>  fs/ext4/truncate.h |  8 +---
>  7 files changed, 50 insertions(+), 70 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 37002663d521..ed64b4b217a1 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1077,15 +1077,6 @@ struct ext4_inode_info {
>* by other means, so we have i_data_sem.
>*/
>   struct rw_semaphore i_data_sem;
> - /*
> -  * i_mmap_sem is for serializing page faults with truncate / punch hole
> -  * operations. We have to make sure that new page cannot be faulted in
> -  * a section of the inode that is being punched. We cannot easily use
> -  * i_data_sem for this since we need protection for the whole punch
> -  * operation and i_data_sem ranks below transaction start so we have
> -  * to occasionally drop it.
> -  */
> - struct rw_semaphore i_mmap_sem;
>   struct inode vfs_inode;
>   struct jbd2_inode *jinode;
>  
> @@ -2962,7 +2953,6 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
>  extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
>loff_t lstart, loff_t lend);
>  extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
> -extern vm_fault_t ext4_filemap_fault(struct vm_fault *vmf);
>  extern qsize_t *ext4_get_reserved_space(struct inode *inode);
>  extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
>  extern void ext4_da_release_space(struct inode *inode, int to_free);
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index cbf37b2cf871..db5d38af9ba8 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4470,6 +4470,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>   loff_t len, int mode)
>  {
>   struct inode *inode = file_inode(file);
> + struct address_space *mapping = file->f_mapping;
>   handle_t *handle = NULL;
>   unsigned int max_blocks;
>   loff_t new_size = 0;
> @@ -4556,17 +4557,17 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>* Prevent page faults from reinstantiating pages we have
>* released from page cache.
>*/
> - down_write(&EXT4_I(inode)->i_mmap_sem);
> + filemap_invalidate_lock(mapping);
>  
>   ret = ext4_break_layouts(inode);
>   if (ret) {
> - up_write(&EXT4_I(inode)->i_mmap_sem);
> + filemap_invalidate_unlock(mapping);
>   goto out_mutex;
>   }
>  
>   ret = ext4_update_disksize_before_punch(inode, offset, len);
>   if (ret) {
> - up_write(&EXT4_I(inode)->i_mmap_sem);
> + filemap_invalidate_unlock(mapping);
>   goto out_mutex;
>   }
>   /* Now release the pages and zero block aligned part of pages */
> @@ -4575,7 +4576,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  
>   ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
>flags);
> - up_write(&EXT4_I(inode)->i_mmap_sem);
> + filemap_invalidate_unlock(mapping);
>   if (ret)
>   goto out_mutex;
>   }
> @@ -5217,6 +5218,7 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle,
>  static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>  {
>   struct super_block *sb = inode->i_sb;
> + struct address_space *mapping = inode->i_mapping;
>   ext4_lblk_t punch_start, punch_stop;
>   handle_t *handle;
>   unsigned i

Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-17 Thread Darrick J. Wong

On Tue, Jun 15, 2021 at 11:17:57AM +0200, Jan Kara wrote:
> From: Pavel Reichl 
> 
> Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> __xfs_rwsem_islocked() is a helper function which encapsulates checking
> state of rw_semaphores hold by inode.
> 
> Signed-off-by: Pavel Reichl 
> Suggested-by: Dave Chinner 
> Suggested-by: Eric Sandeen 
> Suggested-by: Darrick J. Wong 
> Signed-off-by: Jan Kara 

With the commit message updated,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_inode.c | 34 ++
>  fs/xfs/xfs_inode.h |  2 +-
>  2 files changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e4c2da4566f1..ffd47217a8fa 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -342,9 +342,29 @@ xfs_ilock_demote(
>  }
>  
>  #if defined(DEBUG) || defined(XFS_WARN)
> -int
> +static inline bool
> +__xfs_rwsem_islocked(
> + struct rw_semaphore *rwsem,
> + bool shared)
> +{
> + if (!debug_locks)
> + return rwsem_is_locked(rwsem);
> +
> + if (!shared)
> + return lockdep_is_held_type(rwsem, 0);
> +
> + /*
> +  * We are checking that the lock is held at least in shared
> +  * mode but don't care that it might be held exclusively
> +  * (i.e. shared | excl). Hence we check if the lock is held
> +  * in any mode rather than an explicit shared mode.
> +  */
> + return lockdep_is_held_type(rwsem, -1);
> +}
> +
> +bool
>  xfs_isilocked(
> - xfs_inode_t *ip,
> + struct xfs_inode *ip,
>   uint lock_flags)
>  {
>   if (lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) {
> @@ -359,15 +379,13 @@ xfs_isilocked(
>   return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
>   }
>  
> - if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
> - if (!(lock_flags & XFS_IOLOCK_SHARED))
> - return !debug_locks ||
> - lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
> - return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
> + if (lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) {
> + return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem,
> + (lock_flags & XFS_IOLOCK_SHARED));
>   }
>  
>   ASSERT(0);
> - return 0;
> + return false;
>  }
>  #endif
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ca826cfba91c..4659e1568966 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -410,7 +410,7 @@ void  xfs_ilock(xfs_inode_t *, uint);
>  int  xfs_ilock_nowait(xfs_inode_t *, uint);
>  void xfs_iunlock(xfs_inode_t *, uint);
>  void xfs_ilock_demote(xfs_inode_t *, uint);
> -int  xfs_isilocked(xfs_inode_t *, uint);
> +bool xfs_isilocked(struct xfs_inode *, uint);
>  uint xfs_ilock_data_map_shared(struct xfs_inode *);
>  uint xfs_ilock_attr_map_shared(struct xfs_inode *);
>  
> -- 
> 2.26.2
> 



Re: [f2fs-dev] [PATCH 03/14] mm: Protect operations adding pages to page cache with invalidate_lock

2021-06-17 Thread Darrick J. Wong

On Tue, Jun 15, 2021 at 11:17:53AM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)   read / fault / ..
>   truncate_inode_pages_range()
>                                     <create pages in page
>                                      cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara 

Looks good to me now,
Reviewed-by: Darrick J. Wong 

--D
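
The race in the quoted changelog, and the way a shared/exclusive lock closes it, can be modelled in user space. The following is only a single-threaded toy under stated assumptions, not the kernel implementation: toy_rwlock, toy_mapping, toy_fill_page(), toy_punch_begin() and toy_punch_end() are hypothetical names, and try-lock calls stand in for the blocking rw_semaphore so that the ordering can be checked deterministically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy reader/writer lock: try-lock only, no blocking, no threads. */
struct toy_rwlock {
	int	readers;
	bool	writer;
};

/* Toy address_space: a present bit per page plus the invalidate lock. */
struct toy_mapping {
	struct toy_rwlock	invalidate_lock;
	bool			present[TOY_NR_PAGES];
};

static bool toy_read_trylock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static bool toy_write_trylock(struct toy_rwlock *l)
{
	if (l->writer || l->readers)
		return false;
	l->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Read / fault / readahead side: add a page only under the shared lock. */
static bool toy_fill_page(struct toy_mapping *m, size_t idx)
{
	if (idx >= TOY_NR_PAGES || !toy_read_trylock(&m->invalidate_lock))
		return false;
	m->present[idx] = true;
	toy_read_unlock(&m->invalidate_lock);
	return true;
}

/* Hole punch side: take the lock exclusively, then drop cached pages. */
static bool toy_punch_begin(struct toy_mapping *m, size_t first, size_t last)
{
	if (!toy_write_trylock(&m->invalidate_lock))
		return false;
	for (size_t i = first; i <= last && i < TOY_NR_PAGES; i++)
		m->present[i] = false;
	return true;
}

/* Called once the (imaginary) block mapping update has finished. */
static void toy_punch_end(struct toy_mapping *m)
{
	toy_write_unlock(&m->invalidate_lock);
}
```

Between toy_punch_begin() and toy_punch_end() every toy_fill_page() call is refused, which is the window in which the real filesystem updates its block mapping; in the kernel the reader would sleep on the semaphore instead of failing.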

> ---
>  Documentation/filesystems/locking.rst | 62 +
>  fs/inode.c|  2 +
>  include/linux/fs.h| 33 ++
>  mm/filemap.c  | 65 ++-
>  mm/readahead.c|  2 +
>  mm/rmap.c | 37 +++
>  mm/truncate.c |  3 +-
>  7 files changed, 154 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 4ed2b22bd0a8..3b27319dd187 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -======================  ======================== =========
> -ops                     PageLocked(page)         i_rwsem
> -======================  ======================== =========
> +======================  ======================== =========  ===============
> +ops                     PageLocked(page)         i_rwsem    invalidate_lock
> +======================  ======================== =========  ===============
>  writepage:              yes, unlocks (see below)
> -readpage:               yes, unlocks
> +readpage:               yes, unlocks                        shared
>  writepages:
>  set_page_dirty          no
> -readahead:              yes, unlocks
> -readpages:              no
> +readahead:              yes, unlocks                        shared
> +readpages:              no                                  shared
>  write_begin:            locks the page           exclusive
>  write_end:              yes, unlocks             exclusive
>  bmap:
> -invalidatepage:         yes
> +invalidatepage:         yes                                 exclusive
>  releasepage:            yes
>  freepage:               yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem must exclusively acquire
> +invalidate_lock before invalidating page cache in truncate / hole punch path
> +(and thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,25 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the d

Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-16 Thread Darrick J. Wong

On Wed, Jun 16, 2021 at 10:53:04AM +0200, Jan Kara wrote:
> On Wed 16-06-21 06:37:12, Christoph Hellwig wrote:
> > On Tue, Jun 15, 2021 at 11:17:57AM +0200, Jan Kara wrote:
> > > From: Pavel Reichl 
> > > 
> > > Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> > > __xfs_rwsem_islocked() is a helper function which encapsulates checking
> > > state of rw_semaphores hold by inode.
> > 
> > __xfs_rwsem_islocked doesn't seem to actually exist in any tree I
> > checked yet?
> 
> __xfs_rwsem_islocked is introduced by this patch so I'm not sure what you
> are asking about... :)

The sentence structure implies that __xfs_rwsem_islocked was previously
introduced.  You might change the commit message to read:

"Introduce a new __xfs_rwsem_islocked predicate to encapsulate checking
the state of a rw_semaphore, then refactor xfs_isilocked to use it."

Since it's not quite a straight copy-paste of the old code.

--D
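
For readers following along, the semantics such a predicate encapsulates (as described in the comment of the posted patch) can be sketched in user space. This is a minimal model under assumptions, not the XFS code: toy_rwsem, enum toy_hold and toy_rwsem_islocked() are hypothetical names, and the hold state is tracked explicitly instead of being queried from lockdep.

```c
#include <assert.h>
#include <stdbool.h>

/* How the toy semaphore is currently held by the caller. */
enum toy_hold {
	TOY_UNLOCKED,
	TOY_HELD_SHARED,
	TOY_HELD_EXCL,
};

struct toy_rwsem {
	enum toy_hold	hold;
};

/*
 * Toy predicate: with 'shared' set, any hold satisfies the check
 * (shared or exclusive); without it, only an exclusive hold does.
 */
static bool toy_rwsem_islocked(const struct toy_rwsem *sem, bool shared)
{
	if (shared)
		return sem->hold != TOY_UNLOCKED;
	return sem->hold == TOY_HELD_EXCL;
}
```

The asymmetry is the point: a caller asserting "at least shared" should not trip when the lock happens to be held exclusively, while a caller asserting "exclusive" must not be satisfied by a shared hold.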

> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR



Re: [f2fs-dev] [PATCH 03/14] mm: Protect operations adding pages to page cache with invalidate_lock

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:13PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)   read / fault / ..
>   truncate_inode_pages_range()
>                                     <create pages in page
>                                      cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 64 ++
>  fs/inode.c|  2 +
>  include/linux/fs.h| 33 ++
>  mm/filemap.c  | 65 ++-
>  mm/readahead.c|  2 +
>  mm/rmap.c | 37 +++
>  mm/truncate.c |  3 +-
>  7 files changed, 156 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 4ed2b22bd0a8..fcb4c0f05050 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -======================  ======================== =========
> -ops                     PageLocked(page)         i_rwsem
> -======================  ======================== =========
> +======================  ======================== =========  ===============
> +ops                     PageLocked(page)         i_rwsem    invalidate_lock
> +======================  ======================== =========  ===============
>  writepage:              yes, unlocks (see below)
> -readpage:               yes, unlocks
> +readpage:               yes, unlocks                        shared
>  writepages:
>  set_page_dirty          no
> -readahead:              yes, unlocks
> -readpages:              no
> +readahead:              yes, unlocks                        shared
> +readpages:              no                                  shared
>  write_begin:            locks the page           exclusive
>  write_end:              yes, unlocks             exclusive
>  bmap:
> -invalidatepage:         yes
> +invalidatepage:         yes                                 exclusive
>  releasepage:            yes
>  freepage:               yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem must exclusively acquire
> +invalidate_lock before invalidating page cache in truncate / hole punch path
> +(and thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,27 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> +and acquires it in shared mode in paths loading pages from disk
> +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> +responsible for taking this lock in its fallocate implementation and generally
> +whenever the page cache contents needs to be invalidated because a block is
> +moving from

Re: [f2fs-dev] [PATCH 09/14] xfs: Convert double locking of MMAPLOCK to use VFS helpers

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:19PM +0200, Jan Kara wrote:
> Convert places in XFS that take MMAPLOCK for two inodes to use helper
> VFS provides for it (filemap_invalidate_down_write_two()). Note that
> this changes lock ordering for MMAPLOCK from inode number based ordering
> to pointer based ordering VFS generally uses.
> 
> CC: "Darrick J. Wong" 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Jan Kara 

Looks straightforward,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_bmap_util.c | 15 ---
>  fs/xfs/xfs_inode.c | 37 +++--
>  2 files changed, 19 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index a5e9d7d34023..7421d6ec4def 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1582,7 +1582,6 @@ xfs_swap_extents(
>   struct xfs_bstat *sbp = &sxp->sx_stat;
>   int src_log_flags, target_log_flags;
>   int error = 0;
> - int lock_flags;
>   uint64_t f;
>   int resblks = 0;
>   unsigned int flags = 0;
> @@ -1594,8 +1593,8 @@ xfs_swap_extents(
>* do the rest of the checks.
>*/
>   lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
> - lock_flags = XFS_MMAPLOCK_EXCL;
> - xfs_lock_two_inodes(ip, XFS_MMAPLOCK_EXCL, tip, XFS_MMAPLOCK_EXCL);
> + filemap_invalidate_lock_two(VFS_I(ip)->i_mapping,
> + VFS_I(tip)->i_mapping);
>  
>   /* Verify that both files have the same format */
>   if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
> @@ -1667,7 +1666,6 @@ xfs_swap_extents(
>* or cancel will unlock the inodes from this point onwards.
>*/
>   xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
> - lock_flags |= XFS_ILOCK_EXCL;
>   xfs_trans_ijoin(tp, ip, 0);
>   xfs_trans_ijoin(tp, tip, 0);
>  
> @@ -1786,13 +1784,16 @@ xfs_swap_extents(
>   trace_xfs_swap_extent_after(ip, 0);
>   trace_xfs_swap_extent_after(tip, 1);
>  
> +out_unlock_ilock:
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_iunlock(tip, XFS_ILOCK_EXCL);
>  out_unlock:
> - xfs_iunlock(ip, lock_flags);
> - xfs_iunlock(tip, lock_flags);
> + filemap_invalidate_unlock_two(VFS_I(ip)->i_mapping,
> +   VFS_I(tip)->i_mapping);
>   unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
>   return error;
>  
>  out_trans_cancel:
>   xfs_trans_cancel(tp);
> - goto out_unlock;
> + goto out_unlock_ilock;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index e1854a660809..0468f56f3bbb 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -556,12 +556,10 @@ xfs_lock_inodes(
>  }
>  
>  /*
> - * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
> - * the mmaplock or the ilock, but not more than one type at a time. If we lock
> - * more than one at a time, lockdep will report false positives saying we have
> - * violated locking orders.  The iolock must be double-locked separately since
> - * we use i_rwsem for that.  We now support taking one lock EXCL and the other
> - * SHARED.
> + * xfs_lock_two_inodes() can only be used to lock ilock. The iolock and
> + * mmaplock must be double-locked separately since we use i_rwsem and
> + * invalidate_lock for that. We now support taking one lock EXCL and the
> + * other SHARED.
>   */
>  void
>  xfs_lock_two_inodes(
> @@ -579,15 +577,8 @@ xfs_lock_two_inodes(
>   ASSERT(hweight32(ip1_mode) == 1);
>   ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
>   ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> -
> + ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
> + ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
>   ASSERT(ip0->i_ino != ip1->i_ino);
>

Re: [f2fs-dev] [PATCH 08/14] xfs: Convert to use invalidate_lock

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:18PM +0200, Jan Kara wrote:
> Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
> purpose of invalidate_lock is exactly the same. Note that the locking in
> __xfs_filemap_fault() slightly changes as filemap_fault() already takes
> invalidate_lock.
> 
> Reviewed-by: Christoph Hellwig 
> CC: 
> CC: "Darrick J. Wong" 
> Signed-off-by: Jan Kara 
> ---
>  fs/xfs/xfs_file.c  | 13 +++-
>  fs/xfs/xfs_inode.c | 50 --
>  fs/xfs/xfs_inode.h |  1 -
>  fs/xfs/xfs_super.c |  2 --
>  4 files changed, 34 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..7cb7703c2209 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1282,7 +1282,7 @@ xfs_file_llseek(
>   *
>   * mmap_lock (MM)
>   *   sb_start_pagefault(vfs, freeze)
> - * i_mmaplock (XFS - truncate serialisation)
> + * invalidate_lock (vfs/XFS_MMAPLOCK - truncate serialisation)
>   *   page_lock (MM)
>   * i_lock (XFS - extent map serialisation)
>   */
> @@ -1303,24 +1303,27 @@ __xfs_filemap_fault(
>   file_update_time(vmf->vma->vm_file);
>   }
>  
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   if (IS_DAX(inode)) {
>   pfn_t pfn;
>  
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
>    &xfs_direct_write_iomap_ops :
>    &xfs_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

I've been wondering if iomap_page_mkwrite and dax_iomap_fault should be
taking these locks?  I guess that would violate the premise that iomap
requires that callers arrange for concurrency control (i.e. iomap
doesn't take locks).

Code changes look fine, though.

Reviewed-by: Darrick J. Wong 

--D
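
The premise mentioned above, that the library helper takes no locks and the caller arranges concurrency control, can be illustrated with a small user-space sketch. Everything here is hypothetical (toy_inode, toy_lib_fault(), toy_fs_fault()); the counter merely stands in for a shared hold of the MMAPLOCK / invalidate_lock, and nothing below is the iomap or XFS API.

```c
#include <assert.h>

struct toy_inode {
	int	mmaplock_readers;	/* shared holders of the toy lock */
	int	faults_handled;
};

/* Library-style helper: takes no locks, only checks that the caller did. */
static int toy_lib_fault(struct toy_inode *ip)
{
	assert(ip->mmaplock_readers > 0);
	ip->faults_handled++;
	return 0;
}

/* Filesystem-style wrapper: owns the locking around the helper. */
static int toy_fs_fault(struct toy_inode *ip)
{
	int ret;

	ip->mmaplock_readers++;		/* stands in for taking the lock shared */
	ret = toy_lib_fault(ip);
	ip->mmaplock_readers--;		/* stands in for dropping it */
	return ret;
}
```

Keeping the lock in the wrapper leaves each filesystem free to choose which lock protects the fault path, at the cost of every caller having to remember to take it.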

>   } else {
> - if (write_fault)
> + if (write_fault) {
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = iomap_page_mkwrite(vmf,
>   &xfs_buffered_write_iomap_ops);
> - else
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + } else {
>   ret = filemap_fault(vmf);
> + }
>   }
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>   if (write_fault)
>   sb_end_pagefault(inode->i_sb);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6247977870bd..e1854a660809 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -131,7 +131,7 @@ xfs_ilock_attr_map_shared(
>  
>  /*
>   * In addition to i_rwsem in the VFS inode, the xfs inode contains 2
> - * multi-reader locks: i_mmap_lock and the i_lock.  This routine allows
> + * multi-reader locks: invalidate_lock and the i_lock.  This routine allows
>   * various combinations of the locks to be obtained.
>   *
>   * The 3 locks should always be ordered so that the IO lock is obtained first,
> @@ -139,23 +139,23 @@ xfs_ilock_attr_map_shared(
>   *
>   * Basic locking order:
>   *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * i_rwsem -> invalidate_lock -> page_lock -> i_ilock
>   *
>   * mmap_lock locking order:
>   *
>   * i_rwsem -> page lock -> mmap_lock
> - * mmap_lock -> i_mmap_lock -> page_lock
> + * mmap_lock -> invalidate_lock -> page_lock
>   *
>   * The difference in mmap_lock locking order mean that we cannot hold the
> - * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> - * fault in pages during copy in/out (for buffered IO) or require the mmap_lock
> - * in get_user_pages() to map the user pages into the kernel address space for
> - * direct IO. Similarly the i_rwsem cannot be taken inside a page fault because
> - * page faults already hold the mmap_lock.
> + * invalidate_lock over syscall based read(2)/write(2) based IO. These IO paths
> + * can fault in pages during copy in/out (for buffered IO) or require the
> + * mmap_lock in get_user_pages() to map the user pages into the kernel address
> + * space for direct IO. Similarly the i_rwsem cannot be taken inside a page
> + * fault because page faults already hold the mmap_lock.
>   *
>   * Hence to serialise fully against both syscall and mmap based IO, we

Re: [f2fs-dev] [PATCH 07/14] xfs: Refactor xfs_isilocked()

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:17PM +0200, Jan Kara wrote:
> From: Pavel Reichl 
> 
> Refactor xfs_isilocked() to use newly introduced __xfs_rwsem_islocked().
> __xfs_rwsem_islocked() is a helper function which encapsulates checking
> state of rw_semaphores hold by inode.
> 
> Signed-off-by: Pavel Reichl 
> Suggested-by: Dave Chinner 
> Suggested-by: Eric Sandeen 
> Suggested-by: Darrick J. Wong 
> Reviewed-by: Darrick J. Wong 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jan Kara 
> ---
>  fs/xfs/xfs_inode.c | 39 +++
>  fs/xfs/xfs_inode.h | 21 ++---
>  2 files changed, 45 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0369eb22c1bb..6247977870bd 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -342,9 +342,34 @@ xfs_ilock_demote(
>  }
>  
>  #if defined(DEBUG) || defined(XFS_WARN)
> -int
> +static inline bool
> +__xfs_rwsem_islocked(
> + struct rw_semaphore *rwsem,
> + int lock_flags,
> + int shift)
> +{
> + lock_flags >>= shift;
> +
> + if (!debug_locks)
> + return rwsem_is_locked(rwsem);
> + /*
> +  * If the shared flag is not set, pass 0 to explicitly check for
> +  * exclusive access to the lock. If the shared flag is set, we typically
> +  * want to make sure the lock is at least held in shared mode
> +  * (i.e., shared | excl) but we don't necessarily care that it might
> +  * actually be held exclusive. Therefore, pass -1 to check whether the
> +  * lock is held in any mode rather than one of the explicit shared mode
> +  * values (1 or 2)."

Extra double-quote not needed here.

With that fixed,
Reviewed-by: Darrick J. Wong 

(You can delete the previous review tag.)

--D
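
The shift-based flag layout in this version of the patch can be checked with a small stand-alone sketch. The values are copied from the quoted xfs_inode.h hunk; the TOY_ prefix and the helper toy_wants_shared() are hypothetical and only mirror the "shift, then test the shared bit" step of the posted __xfs_rwsem_islocked(), not its lockdep calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-lock-class shifts, as in the quoted hunk. */
#define TOY_IOLOCK_FLAG_SHIFT	0
#define TOY_ILOCK_FLAG_SHIFT	2
#define TOY_MMAPLOCK_FLAG_SHIFT	4

/* Within a class, the shared bit sits one position above the excl bit. */
#define TOY_SHARED_LOCK_SHIFT	1

#define TOY_IOLOCK_EXCL		(1u << TOY_IOLOCK_FLAG_SHIFT)
#define TOY_IOLOCK_SHARED	(TOY_IOLOCK_EXCL << TOY_SHARED_LOCK_SHIFT)
#define TOY_ILOCK_EXCL		(1u << TOY_ILOCK_FLAG_SHIFT)
#define TOY_ILOCK_SHARED	(TOY_ILOCK_EXCL << TOY_SHARED_LOCK_SHIFT)
#define TOY_MMAPLOCK_EXCL	(1u << TOY_MMAPLOCK_FLAG_SHIFT)
#define TOY_MMAPLOCK_SHARED	(TOY_MMAPLOCK_EXCL << TOY_SHARED_LOCK_SHIFT)

/* True when the lock class selected by 'shift' was requested shared. */
static bool toy_wants_shared(unsigned int lock_flags, int shift)
{
	lock_flags >>= shift;
	return (lock_flags & (1u << TOY_SHARED_LOCK_SHIFT)) != 0;
}
```

The new macros produce the same bit values as the old literal (1<<0) through (1<<5) definitions, so the layout change by itself does not alter any flag seen by existing callers.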

> +  */
> + if (lock_flags & (1 << XFS_SHARED_LOCK_SHIFT)) {
> + return lockdep_is_held_type(rwsem, -1);
> + }
> + return lockdep_is_held_type(rwsem, 0);
> +}
> +
> +bool
>  xfs_isilocked(
> - xfs_inode_t *ip,
> + struct xfs_inode *ip,
>   uint lock_flags)
>  {
>   if (lock_flags & (XFS_ILOCK_EXCL|XFS_ILOCK_SHARED)) {
> @@ -359,15 +384,13 @@ xfs_isilocked(
>   return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
>   }
>  
> - if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
> - if (!(lock_flags & XFS_IOLOCK_SHARED))
> - return !debug_locks ||
> - lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
> - return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
> + if (lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) {
> + return __xfs_rwsem_islocked(&VFS_I(ip)->i_rwsem, lock_flags,
> + XFS_IOLOCK_FLAG_SHIFT);
>   }
>  
>   ASSERT(0);
> - return 0;
> + return false;
>  }
>  #endif
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ca826cfba91c..1c0e15c480bc 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -262,12 +262,19 @@ static inline bool xfs_inode_has_bigtime(struct xfs_inode *ip)
>   * Bit ranges:   1<<1  - 1<<16-1 -- iolock/ilock modes (bitfield)
>   *   1<<16 - 1<<32-1 -- lockdep annotation (integers)
>   */
> -#define XFS_IOLOCK_EXCL         (1<<0)
> -#define XFS_IOLOCK_SHARED       (1<<1)
> -#define XFS_ILOCK_EXCL          (1<<2)
> -#define XFS_ILOCK_SHARED        (1<<3)
> -#define XFS_MMAPLOCK_EXCL       (1<<4)
> -#define XFS_MMAPLOCK_SHARED     (1<<5)
> +
> +#define XFS_IOLOCK_FLAG_SHIFT   0
> +#define XFS_ILOCK_FLAG_SHIFT    2
> +#define XFS_MMAPLOCK_FLAG_SHIFT 4
> +
> +#define XFS_SHARED_LOCK_SHIFT   1
> +
> +#define XFS_IOLOCK_EXCL         (1 << XFS_IOLOCK_FLAG_SHIFT)
> +#define XFS_IOLOCK_SHARED       (XFS_IOLOCK_EXCL << XFS_SHARED_LOCK_SHIFT)
> +#define XFS_ILOCK_EXCL          (1 << XFS_ILOCK_FLAG_SHIFT)
> +#define XFS_ILOCK_SHARED        (XFS_ILOCK_EXCL << XFS_SHARED_LOCK_SHIFT)
> +#define XFS_MMAPLOCK_EXCL       (1 << XFS_MMAPLOCK_FLAG_SHIFT)
> +#define XFS_MMAPLOCK_SHARED     (XFS_MMAPLOCK_EXCL << XFS_SHARED_LOCK_SHIFT)
>  
>  #define XFS_LOCK_MASK           (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
>   | XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> @@ -410,7 +417,7 @@ void  xfs_ilock(xfs_inode_t *, uint);
>  int  xfs_ilock_nowait(xfs_inode_t *, uint);
>  void xfs_iunloc

Re: [f2fs-dev] [PATCH 04/14] mm: Add functions to lock invalidate_lock for two mappings

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:14PM +0200, Jan Kara wrote:
> Some operations such as reflinking blocks among files will need to lock
> invalidate_lock for two mappings. Add helper functions to do that.
> 
> Signed-off-by: Jan Kara 

Straightforward lift from xfs, though now with vfs lock ordering
rules...

Reviewed-by: Darrick J. Wong 

--D
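
The pointer-based ordering rule used by the new helpers can be sketched outside the kernel. This is a toy model under assumptions, not the mm/filemap.c code: toy_lock, toy_lock_two() and toy_unlock_two() are hypothetical names, the "lock" only records a held flag and an acquisition sequence number, and pointers are compared through uintptr_t so the comparison is well defined in ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_lock {
	int	held;
	int	seq;	/* global sequence number at acquisition, 0 if never taken */
};

static int toy_seq;

static void toy_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->seq = ++toy_seq;
}

static void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Take both locks, lower address first; NULL and duplicates are skipped. */
static void toy_lock_two(struct toy_lock *a, struct toy_lock *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_lock *tmp = a;

		a = b;
		b = tmp;
	}
	if (a)
		toy_acquire(a);
	if (b && a != b)
		toy_acquire(b);
}

/* Release both locks; release order does not matter for deadlock safety. */
static void toy_unlock_two(struct toy_lock *a, struct toy_lock *b)
{
	if (a)
		toy_release(a);
	if (b && a != b)
		toy_release(b);
}
```

Because every caller sorts by address before acquiring, two tasks locking the same pair in opposite argument order still take the locks in the same sequence, which is what rules out an ABBA deadlock.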

> ---
>  include/linux/fs.h |  6 ++
>  mm/filemap.c   | 38 ++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d8afbc9661d7..ddc11bafc183 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -849,6 +849,12 @@ static inline void filemap_invalidate_unlock_shared(
>  void lock_two_nondirectories(struct inode *, struct inode*);
>  void unlock_two_nondirectories(struct inode *, struct inode*);
>  
> +void filemap_invalidate_lock_two(struct address_space *mapping1,
> +  struct address_space *mapping2);
> +void filemap_invalidate_unlock_two(struct address_space *mapping1,
> +struct address_space *mapping2);
> +
> +
>  /*
>   * NOTE: in a 32bit arch with a preemptable kernel and
>   * an UP compile the i_size_read/write must be atomic
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c8e7e451d81e..b8e9bccecd9f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1009,6 +1009,44 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
>  
> +/*
> + * filemap_invalidate_lock_two - lock invalidate_lock for two mappings
> + *
> + * Lock exclusively invalidate_lock of any passed mapping that is not NULL.
> + *
> + * @mapping1: the first mapping to lock
> + * @mapping2: the second mapping to lock
> + */
> +void filemap_invalidate_lock_two(struct address_space *mapping1,
> +  struct address_space *mapping2)
> +{
> + if (mapping1 > mapping2)
> + swap(mapping1, mapping2);
> + if (mapping1)
> + down_write(&mapping1->invalidate_lock);
> + if (mapping2 && mapping1 != mapping2)
> + down_write_nested(&mapping2->invalidate_lock, 1);
> +}
> +EXPORT_SYMBOL(filemap_invalidate_lock_two);
> +
> +/*
> + * filemap_invalidate_unlock_two - unlock invalidate_lock for two mappings
> + *
> + * Unlock exclusive invalidate_lock of any passed mapping that is not NULL.
> + *
> + * @mapping1: the first mapping to unlock
> + * @mapping2: the second mapping to unlock
> + */
> +void filemap_invalidate_unlock_two(struct address_space *mapping1,
> +struct address_space *mapping2)
> +{
> + if (mapping1)
> + up_write(&mapping1->invalidate_lock);
> + if (mapping2 && mapping1 != mapping2)
> + up_write(&mapping2->invalidate_lock);
> +}
> +EXPORT_SYMBOL(filemap_invalidate_unlock_two);
> +
>  /*
>   * In order to wait for pages to become available there must be
>   * waitqueues associated with pages. By using a hash table of
> -- 
> 2.26.2
> 
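
The ordering rule in the helpers above (swap so the lower address goes
first, skip NULL, lock a repeated mapping only once) can be tried out in
userspace. The sketch below is a single-threaded toy model, not kernel
code: struct toy_mapping and every toy_* function are hypothetical
stand-ins, and the "lock" only records the order in which it was taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct address_space (hypothetical, illustration only). */
struct toy_mapping {
	int locked;	/* 1 while "held" exclusively */
	int seq;	/* acquisition sequence number, 0 = never taken */
};

static int toy_seq;	/* global acquisition counter */

static void toy_lock(struct toy_mapping *m)
{
	assert(!m->locked);	/* single-threaded model: never contended */
	m->locked = 1;
	m->seq = ++toy_seq;
}

static void toy_unlock(struct toy_mapping *m)
{
	assert(m->locked);
	m->locked = 0;
}

/* Same shape as the quoted helper: lower address first, NULL skipped. */
void toy_lock_two(struct toy_mapping *m1, struct toy_mapping *m2)
{
	if ((uintptr_t)m1 > (uintptr_t)m2) {
		struct toy_mapping *tmp = m1;

		m1 = m2;
		m2 = tmp;
	}
	if (m1)
		toy_lock(m1);
	if (m2 && m1 != m2)
		toy_lock(m2);
}

/* Unlock order does not matter for correctness, so no swap is needed. */
void toy_unlock_two(struct toy_mapping *m1, struct toy_mapping *m2)
{
	if (m1)
		toy_unlock(m1);
	if (m2 && m1 != m2)
		toy_unlock(m2);
}
```

Whichever way round a caller passes the pair, the lower-addressed mapping
is taken first, which is what keeps two tasks locking the same pair from
deadlocking against each other.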


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH 01/14] mm: Fix comments mentioning i_mutex

2021-06-07 Thread Darrick J. Wong

On Mon, Jun 07, 2021 at 04:52:11PM +0200, Jan Kara wrote:
> inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
> comments still mentioning i_mutex.
> 
> Reviewed-by: Christoph Hellwig 
> Acked-by: Hugh Dickins 
> Signed-off-by: Jan Kara 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  mm/filemap.c| 10 +-
>  mm/madvise.c|  2 +-
>  mm/memory-failure.c |  2 +-
>  mm/rmap.c   |  6 +++---
>  mm/shmem.c  | 20 ++--
>  mm/truncate.c   |  8 
>  6 files changed, 24 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 66f7e9fdfbc4..ba1068a1837f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -76,7 +76,7 @@
>   *  ->swap_lock  (exclusive_swap_page, others)
>   *->i_pages lock
>   *
> - *  ->i_mutex
> + *  ->i_rwsem
>   *->i_mmap_rwsem (truncate->unmap_mapping_range)
>   *
>   *  ->mmap_lock
> @@ -87,7 +87,7 @@
>   *  ->mmap_lock
>   *->lock_page(access_process_vm)
>   *
> - *  ->i_mutex(generic_perform_write)
> + *  ->i_rwsem(generic_perform_write)
>   *->mmap_lock(fault_in_pages_readable->do_page_fault)
>   *
>   *  bdi->wb.list_lock
> @@ -3710,12 +3710,12 @@ EXPORT_SYMBOL(generic_perform_write);
>   * modification times and calls proper subroutines depending on whether we
>   * do direct IO or a standard buffered write.
>   *
> - * It expects i_mutex to be grabbed unless we work on a block device or similar
> + * It expects i_rwsem to be grabbed unless we work on a block device or similar
>   * object which does not need locking at all.
>   *
>   * This function does *not* take care of syncing data in case of O_SYNC write.
>   * A caller has to handle it. This is mainly due to the fact that we want to
> - * avoid syncing under i_mutex.
> + * avoid syncing under i_rwsem.
>   *
>   * Return:
>   * * number of bytes written, even for truncated writes
> @@ -3803,7 +3803,7 @@ EXPORT_SYMBOL(__generic_file_write_iter);
>   *
>   * This is a wrapper around __generic_file_write_iter() to be used by most
>   * filesystems. It takes care of syncing the file in case of O_SYNC file
> - * and acquires i_mutex as needed.
> + * and acquires i_rwsem as needed.
>   * Return:
>   * * negative error code if no data has been written at all of
>   *   vfs_fsync_range() failed for a synchronous write
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 63e489e5bfdb..a0137706b92a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -853,7 +853,7 @@ static long madvise_remove(struct vm_area_struct *vma,
>   + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>  
>   /*
> -  * Filesystem's fallocate may need to take i_mutex.  We need to
> +  * Filesystem's fallocate may need to take i_rwsem.  We need to
>* explicitly grab a reference because the vma (and hence the
>* vma's reference to the file) can go away as soon as we drop
>* mmap_lock.
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 85ad98c00fd9..9dcc9bcea731 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -704,7 +704,7 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
>   /*
>* Truncation is a bit tricky. Enable it per file system for now.
>*
> -  * Open: to take i_mutex or not for this? Right now we don't.
> +  * Open: to take i_rwsem or not for this? Right now we don't.
>*/
>   return truncate_error_page(p, pfn, mapping);
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 693a610e181d..a35cbbbded0d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -20,9 +20,9 @@
>  /*
>   * Lock ordering in mm:
>   *
> - * inode->i_mutex(while writing or truncating, not reading or faulting)
> + * inode->i_rwsem(while writing or truncating, not reading or faulting)
>   *   mm->mmap_lock
> - * page->flags PG_locked (lock_page)   * (see huegtlbfs below)
> + * page->flags PG_locked (lock_page)   * (see hugetlbfs below)
>   *   hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
>   * mapping->i_mmap_rwsem
>   *   hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> @@ -41,7 +41,7 @@
>   * in arch-dependent flush_dcache_mmap_lock,
>   * within bdi.wb->list_lock in __sync_single_inode)
>   *
> - * anon_vma->rwsem,mapping->i_mutex  (memory_failure, collect_procs_anon)
> + * anon_vma->rwsem,mappi

Re: [f2fs-dev] [PATCH 07/13] xfs: Convert to use invalidate_lock

2021-05-26 Thread Darrick J. Wong

On Wed, May 26, 2021 at 12:18:40PM +0200, Jan Kara wrote:
> On Tue 25-05-21 14:37:29, Darrick J. Wong wrote:
> > On Tue, May 25, 2021 at 03:50:44PM +0200, Jan Kara wrote:
> > > Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
> > > purpose of invalidate_lock is exactly the same. Note that the locking in
> > > __xfs_filemap_fault() slightly changes as filemap_fault() already takes
> > > invalidate_lock.
> > > 
> > > Reviewed-by: Christoph Hellwig 
> > > CC: 
> > > CC: "Darrick J. Wong" 
> > 
> > It's djw...@kernel.org now.
> 
> OK, updated.
> 
> > > @@ -355,8 +358,11 @@ xfs_isilocked(
> > >  
> > >   if (lock_flags & (XFS_MMAPLOCK_EXCL|XFS_MMAPLOCK_SHARED)) {
> > >   if (!(lock_flags & XFS_MMAPLOCK_SHARED))
> > > - return !!ip->i_mmaplock.mr_writer;
> > > - return rwsem_is_locked(&ip->i_mmaplock.mr_lock);
> > > + return !debug_locks ||
> > > + lockdep_is_held_type(
> > > + &VFS_I(ip)->i_mapping->invalidate_lock,
> > > + 0);
> > > + return rwsem_is_locked(&VFS_I(ip)->i_mapping->invalidate_lock);
> > 
> > This doesn't look right...
> > 
> > If lockdep is disabled, we always return true for
> > xfs_isilocked(ip, XFS_MMAPLOCK_EXCL) even if nobody holds the lock?
> > 
> > Granted, you probably just copy-pasted from the IOLOCK_SHARED clause
> > beneath it.  Er... oh right, preichl was messing with all that...
> > 
> > https://lore.kernel.org/linux-xfs/20201016021005.548850-2-prei...@redhat.com/
> 
> Indeed copy-paste programming ;) It certainly makes the assertions happy
> but useless. Should I pull the patch you reference into the series? It
> seems to have been uncontroversial and reviewed. Or will you pull the
> series to xfs tree so I can just rebase on top?

The full conversion series introduced assertion failures because lockdep
can't handle some of the ILOCK usage patterns, specifically the fact
that a thread sometimes takes the ILOCK but then hands the inode to a
workqueue to avoid overflowing the first thread's stack.  That's why it
never got merged into the xfs tree.

However, that kind of switcheroo isn't done with the
MMAPLOCK/invalidate_lock, so you could simply pull the patch I linked
above into your series.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
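
The problem raised above (an exclusive-lock check that reports "held"
whenever lock debugging is off) is easy to reproduce in a toy model. The
sketch below is userspace C with hypothetical names throughout; it only
mirrors the boolean shape of the quoted hunk and of the removed
`!!ip->i_mmaplock.mr_writer` test, and is not the fix from the linked
patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock state; all names here are hypothetical stand-ins. */
struct toy_rwsem {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

/*
 * Same shape as the check in the quoted hunk: with lock debugging off,
 * the question "is it held for write?" is always answered with yes.
 */
bool toy_is_write_locked_lockdep(const struct toy_rwsem *sem, bool debug_locks)
{
	return !debug_locks || sem->writer;
}

/*
 * Same shape as the removed mr_writer test: look at the lock state
 * itself, so the answer does not depend on lock debugging.
 */
bool toy_is_write_locked_state(const struct toy_rwsem *sem)
{
	return sem->writer;
}
```

An assertion built on the first form can never fire on a kernel without
lock debugging, which is the "happy but useless" behaviour described in
the reply.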



Re: [f2fs-dev] [PATCH 04/13] mm: Add functions to lock invalidate_lock for two mappings

2021-05-26 Thread Darrick J. Wong

On Wed, May 26, 2021 at 03:45:18PM +0200, Jan Kara wrote:
> On Wed 26-05-21 12:11:43, Damien Le Moal wrote:
> > On 2021/05/26 19:07, Jan Kara wrote:
> > > On Tue 25-05-21 13:48:05, Darrick J. Wong wrote:
> > >> On Tue, May 25, 2021 at 03:50:41PM +0200, Jan Kara wrote:
> > >>> Some operations such as reflinking blocks among files will need to lock
> > >>> invalidate_lock for two mappings. Add helper functions to do that.
> > >>>
> > >>> Signed-off-by: Jan Kara 
> > >>> ---
> > >>>  include/linux/fs.h |  6 ++
> > >>>  mm/filemap.c   | 38 ++
> > >>>  2 files changed, 44 insertions(+)
> > >>>
> > >>> diff --git a/include/linux/fs.h b/include/linux/fs.h
> > >>> index 897238d9f1e0..e6f7447505f5 100644
> > >>> --- a/include/linux/fs.h
> > >>> +++ b/include/linux/fs.h
> > >>> @@ -822,6 +822,12 @@ static inline void inode_lock_shared_nested(struct 
> > >>> inode *inode, unsigned subcla
> > >>>  void lock_two_nondirectories(struct inode *, struct inode*);
> > >>>  void unlock_two_nondirectories(struct inode *, struct inode*);
> > >>>  
> > >>> +void filemap_invalidate_down_write_two(struct address_space *mapping1,
> > >>> +  struct address_space *mapping2);
> > >>> +void filemap_invalidate_up_write_two(struct address_space *mapping1,
> > >>
> > >> TBH I find myself wishing that the invalidate_lock used the same
> > >> lock/unlock style wrappers that we use for i_rwsem.
> > >>
> > >> filemap_invalidate_lock(inode1->mapping);
> > >> filemap_invalidate_lock_two(inode1->i_mapping, inode2->i_mapping);
> > > 
> > > OK, and filemap_invalidate_lock_shared() for down_read()? I guess that
> > > works for me.
> > 
> > > What about filemap_invalidate_lock_read() and
> > > filemap_invalidate_lock_write()?
> > > That recalls down_read()/down_write() without the slightly confusing
> > > down/up.
> 
> Well, if we go for lock wrappers as Darrick suggested, I'd mirror naming
> used for inode_lock(). That is IMO the least confusing option... And that
> naming has _lock and _lock_shared suffixes.

I'd like filemap_invalidate_lock and filemap_invalidate_lock_shared.

--D

> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR
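
For reference, the naming being converged on here (a plain _lock/_unlock
pair for exclusive access and a _shared pair for readers, mirroring
inode_lock()) might look like the sketch below. It is a single-threaded
userspace model with hypothetical toy_* names and a toy lock, just to
show which wrapper corresponds to down_write()/up_write() and which to
down_read()/up_read().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy rw-semaphore and mapping; hypothetical stand-ins, not kernel types. */
struct toy_rwsem {
	int readers;	/* number of shared holders */
	bool writer;	/* true while held exclusively */
};

struct toy_mapping {
	struct toy_rwsem invalidate_lock;
};

/* Exclusive side: plays the role of down_write() / up_write(). */
void toy_invalidate_lock(struct toy_mapping *m)
{
	assert(!m->invalidate_lock.writer && m->invalidate_lock.readers == 0);
	m->invalidate_lock.writer = true;
}

void toy_invalidate_unlock(struct toy_mapping *m)
{
	assert(m->invalidate_lock.writer);
	m->invalidate_lock.writer = false;
}

/* Shared side: plays the role of down_read() / up_read(). */
void toy_invalidate_lock_shared(struct toy_mapping *m)
{
	assert(!m->invalidate_lock.writer);
	m->invalidate_lock.readers++;
}

void toy_invalidate_unlock_shared(struct toy_mapping *m)
{
	assert(m->invalidate_lock.readers > 0);
	m->invalidate_lock.readers--;
}
```

The point of the wrappers is that callers read as lock/unlock rather than
down/up, the same way inode_lock() hides i_rwsem.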



Re: [f2fs-dev] [PATCH 08/13] xfs: Convert double locking of MMAPLOCK to use VFS helpers

2021-05-25 Thread Darrick J. Wong

On Tue, May 25, 2021 at 03:50:45PM +0200, Jan Kara wrote:
> Convert places in XFS that take MMAPLOCK for two inodes to use helper
> VFS provides for it (filemap_invalidate_down_write_two()). Note that
> this changes lock ordering for MMAPLOCK from inode number based ordering
> to pointer based ordering VFS generally uses.
> 
> Signed-off-by: Jan Kara 

Reviewed-by: Darrick J. Wong 

--D
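
To make the ordering change in the commit message concrete: inode-number
ordering and pointer ordering are each consistent on their own, but they
can disagree about which of two inodes is locked first. A minimal
userspace sketch, with struct toy_inode and both helpers being
hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toy inode carrying only the field that matters here (hypothetical). */
struct toy_inode {
	unsigned long i_ino;
};

/* Inode number based ordering: lower i_ino is locked first. */
const struct toy_inode *first_by_ino(const struct toy_inode *a,
				     const struct toy_inode *b)
{
	return a->i_ino <= b->i_ino ? a : b;
}

/* Pointer based ordering: lower address is locked first. */
const struct toy_inode *first_by_addr(const struct toy_inode *a,
				      const struct toy_inode *b)
{
	return (uintptr_t)a <= (uintptr_t)b ? a : b;
}
```

Either rule gives every caller the same answer for a given pair at a
given time, which is all deadlock avoidance needs; they differ only in
which inode that answer is.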

> ---
>  fs/xfs/xfs_bmap_util.c | 15 ---
>  fs/xfs/xfs_inode.c | 37 +++--
>  2 files changed, 19 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index a5e9d7d34023..8a5cede59f3f 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1582,7 +1582,6 @@ xfs_swap_extents(
>   struct xfs_bstat *sbp = &sxp->sx_stat;
>   int src_log_flags, target_log_flags;
>   int error = 0;
> - int lock_flags;
>   uint64_t f;
>   int resblks = 0;
>   unsigned int flags = 0;
> @@ -1594,8 +1593,8 @@ xfs_swap_extents(
>* do the rest of the checks.
>*/
>   lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
> - lock_flags = XFS_MMAPLOCK_EXCL;
> - xfs_lock_two_inodes(ip, XFS_MMAPLOCK_EXCL, tip, XFS_MMAPLOCK_EXCL);
> + filemap_invalidate_down_write_two(VFS_I(ip)->i_mapping,
> +   VFS_I(tip)->i_mapping);
>  
>   /* Verify that both files have the same format */
>   if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
> @@ -1667,7 +1666,6 @@ xfs_swap_extents(
>* or cancel will unlock the inodes from this point onwards.
>*/
>   xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
> - lock_flags |= XFS_ILOCK_EXCL;
>   xfs_trans_ijoin(tp, ip, 0);
>   xfs_trans_ijoin(tp, tip, 0);
>  
> @@ -1786,13 +1784,16 @@ xfs_swap_extents(
>   trace_xfs_swap_extent_after(ip, 0);
>   trace_xfs_swap_extent_after(tip, 1);
>  
> +out_unlock_ilock:
> + xfs_iunlock(ip, XFS_ILOCK_EXCL);
> + xfs_iunlock(tip, XFS_ILOCK_EXCL);
>  out_unlock:
> - xfs_iunlock(ip, lock_flags);
> - xfs_iunlock(tip, lock_flags);
> + filemap_invalidate_up_write_two(VFS_I(ip)->i_mapping,
> + VFS_I(tip)->i_mapping);
>   unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
>   return error;
>  
>  out_trans_cancel:
>   xfs_trans_cancel(tp);
> - goto out_unlock;
> + goto out_unlock_ilock;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 53bb5fc33621..11616c9b37f8 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -537,12 +537,10 @@ xfs_lock_inodes(
>  }
>  
>  /*
> - * xfs_lock_two_inodes() can only be used to lock one type of lock at a time -
> - * the mmaplock or the ilock, but not more than one type at a time. If we lock
> - * more than one at a time, lockdep will report false positives saying we have
> - * violated locking orders.  The iolock must be double-locked separately since
> - * we use i_rwsem for that.  We now support taking one lock EXCL and the other
> - * SHARED.
> + * xfs_lock_two_inodes() can only be used to lock ilock. The iolock and
> + * mmaplock must be double-locked separately since we use i_rwsem and
> + * invalidate_lock for that. We now support taking one lock EXCL and the
> + * other SHARED.
>   */
>  void
>  xfs_lock_two_inodes(
> @@ -560,15 +558,8 @@ xfs_lock_two_inodes(
>   ASSERT(hweight32(ip1_mode) == 1);
>   ASSERT(!(ip0_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
>   ASSERT(!(ip1_mode & (XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL)));
> - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip0_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> - ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> -!(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> -
> + ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
> + ASSERT(!(ip1_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)));
>   ASSERT(ip0->i_ino != ip1->i_ino);
>  
>   if (ip0->i_ino > ip1->i_ino) {
> @@ -3731,11 +3722,8 @@ xf

Re: [f2fs-dev] [PATCH 07/13] xfs: Convert to use invalidate_lock

2021-05-25 Thread Darrick J. Wong

On Tue, May 25, 2021 at 03:50:44PM +0200, Jan Kara wrote:
> Use invalidate_lock instead of XFS internal i_mmap_lock. The intended
> purpose of invalidate_lock is exactly the same. Note that the locking in
> __xfs_filemap_fault() slightly changes as filemap_fault() already takes
> invalidate_lock.
> 
> Reviewed-by: Christoph Hellwig 
> CC: 
> CC: "Darrick J. Wong" 

It's djw...@kernel.org now.

> Signed-off-by: Jan Kara 
> ---
>  fs/xfs/xfs_file.c  | 12 ++-
>  fs/xfs/xfs_inode.c | 52 ++
>  fs/xfs/xfs_inode.h |  1 -
>  fs/xfs/xfs_super.c |  2 --
>  4 files changed, 36 insertions(+), 31 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..dc9cb5c20549 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1282,7 +1282,7 @@ xfs_file_llseek(
>   *
>   * mmap_lock (MM)
>   *   sb_start_pagefault(vfs, freeze)
> - * i_mmaplock (XFS - truncate serialisation)
> + * invalidate_lock (vfs/XFS_MMAPLOCK - truncate serialisation)
>   *   page_lock (MM)
>   * i_lock (XFS - extent map serialisation)
>   */
> @@ -1303,24 +1303,26 @@ __xfs_filemap_fault(
>   file_update_time(vmf->vma->vm_file);
>   }
>  
> - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   if (IS_DAX(inode)) {
>   pfn_t pfn;
>  
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
>    &xfs_direct_write_iomap_ops :
>    &xfs_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   } else {
> - if (write_fault)
> + if (write_fault) {
> + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>   ret = iomap_page_mkwrite(vmf,
>   &xfs_buffered_write_iomap_ops);
> - else
> + xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> + } else
>   ret = filemap_fault(vmf);
>   }
> - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>  
>   if (write_fault)
>   sb_end_pagefault(inode->i_sb);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0369eb22c1bb..53bb5fc33621 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -131,7 +131,7 @@ xfs_ilock_attr_map_shared(
>  
>  /*
>   * In addition to i_rwsem in the VFS inode, the xfs inode contains 2
> - * multi-reader locks: i_mmap_lock and the i_lock.  This routine allows
> + * multi-reader locks: invalidate_lock and the i_lock.  This routine allows
>   * various combinations of the locks to be obtained.
>   *
>   * The 3 locks should always be ordered so that the IO lock is obtained first,
> @@ -139,23 +139,23 @@ xfs_ilock_attr_map_shared(
>   *
>   * Basic locking order:
>   *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * i_rwsem -> invalidate_lock -> page_lock -> i_ilock
>   *
>   * mmap_lock locking order:
>   *
>   * i_rwsem -> page lock -> mmap_lock
> - * mmap_lock -> i_mmap_lock -> page_lock
> + * mmap_lock -> invalidate_lock -> page_lock
>   *
>   * The difference in mmap_lock locking order mean that we cannot hold the
> - * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> - * fault in pages during copy in/out (for buffered IO) or require the mmap_lock
> - * in get_user_pages() to map the user pages into the kernel address space for
> - * direct IO. Similarly the i_rwsem cannot be taken inside a page fault because
> - * page faults already hold the mmap_lock.
> + * invalidate_lock over syscall based read(2)/write(2) based IO. These IO paths
> + * can fault in pages during copy in/out (for buffered IO) or require the
> + * mmap_lock in get_user_pages() to map the user pages into the kernel address
> + * space for direct IO. Similarly the i_rwsem cannot be taken inside a page
> + * fault because page faults already hold the mmap_lock.
>   *
>   * Hence to serialise fully against both syscall and mmap based IO, we need to
> - * take both the i_rwsem and the i_mmap_lock. These locks should *only* be both
> - * taken in places where we need to invalidate the page cache in a race
> + * take both the i_rwsem and the invalidate_lock. These locks should *only* be
> + * both tak

Re: [f2fs-dev] [PATCH 03/13] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-25 Thread Darrick J. Wong

On Tue, May 25, 2021 at 03:50:40PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)   read / fault / ..
>   truncate_inode_pages_range()
>                                     <create pages in page
>                                      cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 64 ++
>  fs/inode.c|  2 +
>  include/linux/fs.h|  6 +++
>  mm/filemap.c  | 65 ++-
>  mm/readahead.c|  2 +
>  mm/rmap.c | 37 +++
>  mm/truncate.c |  3 +-
>  7 files changed, 129 insertions(+), 50 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 4ed2b22bd0a8..af425bef55d3 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -======================  ======================== =========
> -ops                     PageLocked(page)         i_rwsem
> -======================  ======================== =========
> +======================  ======================== =========  ===============
> +ops                     PageLocked(page)         i_rwsem    invalidate_lock
> +======================  ======================== =========  ===============
>  writepage:              yes, unlocks (see below)
> -readpage:               yes, unlocks
> +readpage:               yes, unlocks                        shared
>  writepages:
>  set_page_dirty          no
> -readahead:              yes, unlocks
> -readpages:              no
> +readahead:              yes, unlocks                        shared
> +readpages:              no                                  shared
>  write_begin:            locks the page           exclusive
>  write_end:              yes, unlocks             exclusive
>  bmap:
> -invalidatepage:         yes
> +invalidatepage:         yes                                 exclusive
>  releasepage:            yes
>  freepage:               yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem should exclusively acquire

s/should/must/ ?  It's not really optional to lock out invalidations
anymore now that the page cache synchronizes on invalidate_lock, right?

> +invalidate_lock before invalidating page cache in truncate / hole punch path
> +(and thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,27 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> +and acquires it in shared mode in paths loading pages from disk
> +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> +responsible for taking this lock in its
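
The rule in the documentation text above (hole punch holds
invalidate_lock exclusively from page cache truncation until the block
mapping is updated, while paths that fill the page cache hold it shared)
can be walked through with a toy model. The sketch is single-threaded
userspace C; struct toy_file and the toy_* functions are hypothetical,
and a reader that would sleep on the lock is modelled by returning false.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HOLE 0	/* block number meaning "no block mapped" */

/* One-page, one-block toy file (hypothetical, illustration only). */
struct toy_file {
	bool excl;		/* invalidate lock held exclusively */
	int shared;		/* number of shared holders */
	int block;		/* file offset -> disk block mapping */
	bool page_cached;	/* is the page in the page cache? */
	int page_from_block;	/* block the cached page was read from */
};

/* Start of hole punch: take the lock, then drop the cached page. */
bool toy_punch_begin(struct toy_file *f)
{
	if (f->excl || f->shared)
		return false;		/* a real caller would wait here */
	f->excl = true;
	f->page_cached = false;		/* truncate_inode_pages_range() step */
	return true;
}

/* End of hole punch: update the mapping, only then release the lock. */
void toy_punch_end(struct toy_file *f)
{
	f->block = TOY_HOLE;		/* update mapping and free blocks */
	f->excl = false;
}

/* Read / fault path: fill the page cache under the shared lock. */
bool toy_read_fill(struct toy_file *f)
{
	if (f->excl)
		return false;		/* a real reader would sleep here */
	f->shared++;
	if (!f->page_cached) {
		f->page_from_block = f->block;
		f->page_cached = true;
	}
	f->shared--;
	return true;
}
```

Because the reader cannot get in between the truncation and the mapping
update, the page it eventually instantiates reflects the hole rather than
the block that was just freed.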

Re: [f2fs-dev] [PATCH 04/13] mm: Add functions to lock invalidate_lock for two mappings

2021-05-25 Thread Darrick J. Wong

On Tue, May 25, 2021 at 03:50:41PM +0200, Jan Kara wrote:
> Some operations such as reflinking blocks among files will need to lock
> invalidate_lock for two mappings. Add helper functions to do that.
> 
> Signed-off-by: Jan Kara 
> ---
>  include/linux/fs.h |  6 ++
>  mm/filemap.c   | 38 ++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 897238d9f1e0..e6f7447505f5 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -822,6 +822,12 @@ static inline void inode_lock_shared_nested(struct inode *inode, unsigned subcla
>  void lock_two_nondirectories(struct inode *, struct inode*);
>  void unlock_two_nondirectories(struct inode *, struct inode*);
>  
> +void filemap_invalidate_down_write_two(struct address_space *mapping1,
> +struct address_space *mapping2);
> +void filemap_invalidate_up_write_two(struct address_space *mapping1,

TBH I find myself wishing that the invalidate_lock used the same
lock/unlock style wrappers that we use for i_rwsem.

filemap_invalidate_lock(inode1->mapping);
filemap_invalidate_lock_two(inode1->i_mapping, inode2->i_mapping);

To be fair, I've never been able to keep straight that down means lock
and up means unlock.  Ah well, at least you didn't use "p" and "v".

Mechanically, the changes look ok to me.
Acked-by: Darrick J. Wong 

--D

> +  struct address_space *mapping2);
> +
> +
>  /*
>   * NOTE: in a 32bit arch with a preemptable kernel and
>   * an UP compile the i_size_read/write must be atomic
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4d9ec4c6cc34..d3801a9739aa 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1009,6 +1009,44 @@ struct page *__page_cache_alloc(gfp_t gfp)
>  EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
>  
> +/*
> + * filemap_invalidate_down_write_two - lock invalidate_lock for two mappings
> + *
> + * Lock exclusively invalidate_lock of any passed mapping that is not NULL.
> + *
> + * @mapping1: the first mapping to lock
> + * @mapping2: the second mapping to lock
> + */
> +void filemap_invalidate_down_write_two(struct address_space *mapping1,
> +struct address_space *mapping2)
> +{
> + if (mapping1 > mapping2)
> + swap(mapping1, mapping2);
> + if (mapping1)
> + down_write(&mapping1->invalidate_lock);
> + if (mapping2 && mapping1 != mapping2)
> + down_write_nested(&mapping2->invalidate_lock, 1);
> +}
> +EXPORT_SYMBOL(filemap_invalidate_down_write_two);
> +
> +/*
> + * filemap_invalidate_up_write_two - unlock invalidate_lock for two mappings
> + *
> + * Unlock exclusive invalidate_lock of any passed mapping that is not NULL.
> + *
> + * @mapping1: the first mapping to unlock
> + * @mapping2: the second mapping to unlock
> + */
> +void filemap_invalidate_up_write_two(struct address_space *mapping1,
> +  struct address_space *mapping2)
> +{
> + if (mapping1)
> + up_write(&mapping1->invalidate_lock);
> + if (mapping2 && mapping1 != mapping2)
> + up_write(&mapping2->invalidate_lock);
> +}
> +EXPORT_SYMBOL(filemap_invalidate_up_write_two);
> +
>  /*
>   * In order to wait for pages to become available there must be
>   * waitqueues associated with pages. By using a hash table of
> -- 
> 2.26.2
> 



Re: [f2fs-dev] [PATCH 02/13] documentation: Sync file_operations members with reality

2021-05-25 Thread Darrick J. Wong

On Tue, May 25, 2021 at 03:50:39PM +0200, Jan Kara wrote:
> Sync listing of struct file_operations members with the real one in
> fs.h.
> 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 1e894480115b..4ed2b22bd0a8 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -506,6 +506,7 @@ prototypes::
>   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
>   ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>   ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
> + int (*iopoll) (struct kiocb *kiocb, bool spin);
>   int (*iterate) (struct file *, struct dir_context *);
>   int (*iterate_shared) (struct file *, struct dir_context *);
>   __poll_t (*poll) (struct file *, struct poll_table_struct *);
> @@ -518,12 +519,6 @@ prototypes::
>   int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
>   int (*fasync) (int, struct file *, int);
>   int (*lock) (struct file *, int, struct file_lock *);
> - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
> - loff_t *);
> - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
> - loff_t *);
> - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t,
> - void __user *);
>   ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
>   loff_t *, int);
>   unsigned long (*get_unmapped_area)(struct file *, unsigned long,
> @@ -536,6 +531,14 @@ prototypes::
>   size_t, unsigned int);
>   int (*setlease)(struct file *, long, struct file_lock **, void **);
>   long (*fallocate)(struct file *, int, loff_t, loff_t);
> + void (*show_fdinfo)(struct seq_file *m, struct file *f);
> + unsigned (*mmap_capabilities)(struct file *);
> + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
> + loff_t, size_t, unsigned int);
> + loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
> +         struct file *file_out, loff_t pos_out,
> + loff_t len, unsigned int remap_flags);

Acked-by: Darrick J. Wong 

The remap_file_range part looks correct to me.  At a glance the others
seem fine too, but I'm not as familiar with them...

--D

> + int (*fadvise)(struct file *, loff_t, loff_t, int);
>  
>  locking rules:
>   All may block.
> -- 
> 2.26.2
> 



Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-14 Thread Darrick J. Wong

On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > +page cache contents. Usually the filesystem needs to call
> > > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > > +moving from under a page.
> > > > > +
> > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > +against modifications of file data while the operation is running. For blocking
> > > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > > +implementation.
> > > > 
> > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > being hoisted here (increasing order of i_ino)?
> > > > 
> > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > calls in xfs_ilock2_io_mmap in this series.
> > > 
> > > Good question. Technically, I don't think there's real need to establish a
> > > single ordering because locks among different filesystems are never going
> > > to be acquired together (effectively each lock type is local per sb and we
> > > are free to define an ordering for each lock type differently). But to
> > > maintain some sanity I guess having the same locking order for doublelock
> > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > by-ino ordering? So that we don't have to consider two different orders in
> > > xfs_lock_two_inodes()...
> > 
> > I imagine Dave will chime in on this, but I suspect the reason is
> > hysterical raisins^Wreasons.
> 
> It's the locking rules that XFS has used pretty much forever.
> Locking by inode number always guarantees the same locking order of
> two inodes in the same filesystem, regardless of the specific
> in-memory instances of the two inodes.
> 
> e.g. if we lock based on the inode structure address, in one
> instance, we could get A -> B, then B gets recycled and
> reallocated, then we get B -> A as the locking order for the same
> two inodes.
> 
> That, IMNSHO, is utterly crazy because with non-deterministic inode
> lock ordered like this you can't make consistent locking rules for
> locking the physical inode cluster buffers underlying the inodes in
> the situation where they also need to be locked.

 That's protected by the ILOCK, correct?

> We've been down this path before more than a decade ago when the
> powers that be decreed that inode locking order is to be "by
> structure address" rather than inode number, because "inode number

Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-13 Thread Darrick J. Wong

On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > +->fallocate implementation must be really careful to maintain page cache
> > > > +consistency when punching holes or performing other operations that invalidate
> > > > +page cache contents. Usually the filesystem needs to call
> > > > +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > +and acquires it in shared mode in paths loading pages from disk
> > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > +whenever the page cache contents needs to be invalidated because a block is
> > > > +moving from under a page.
> > > > +
> > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > +against modifications of file data while the operation is running. For blocking
> > > > +changes through write(2) and similar operations inode->i_rwsem can be used. For
> > > > +blocking changes through memory mapping, the filesystem can use
> > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > +implementation.
> > 
> > Question: What is the locking order when acquiring the invalidate_lock
> > of two different files?  Is it the same as i_rwsem (increasing order of
> > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > being hoisted here (increasing order of i_ino)?
> > 
> > The reason I ask is that remap_file_range has to do that, but I don't
> > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > calls in xfs_ilock2_io_mmap in this series.
> 
> Good question. Technically, I don't think there's real need to establish a
> single ordering because locks among different filesystems are never going
> to be acquired together (effectively each lock type is local per sb and we
> are free to define an ordering for each lock type differently). But to
> maintain some sanity I guess having the same locking order for doublelock
> of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> by-ino ordering? So that we don't have to consider two different orders in
> xfs_lock_two_inodes()...

I imagine Dave will chime in on this, but I suspect the reason is
hysterical raisins^Wreasons.  It might simply be time to convert all
three XFS inode locks to use the same ordering rules.
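
To make the ordering question concrete, here is a minimal userspace sketch of
taking two locks in increasing order of a stable number (the by-i_ino rule)
rather than by structure address.  Everything named demo_* is hypothetical and
only models the idea with pthread mutexes; it is not the XFS or VFS code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode: a stable number plus one lock. */
struct demo_inode {
	uint64_t ino;		/* stays the same if the in-memory object is recycled */
	pthread_mutex_t lock;
};

/*
 * Lock two distinct inodes in increasing order of inode number.
 * Returns the inode that was locked first so the order can be observed.
 */
static struct demo_inode *
demo_lock_two(struct demo_inode *a, struct demo_inode *b)
{
	struct demo_inode *first = a, *second = b;

	assert(a != b && a->ino != b->ino);
	if (a->ino > b->ino) {
		first = b;
		second = a;
	}
	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);
	return first;
}

/* Release both locks; unlock order does not matter for deadlock avoidance. */
static void
demo_unlock_two(struct demo_inode *a, struct demo_inode *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}
```

Because the key is the inode number and not the pointer, the same pair is
always locked in the same order no matter which argument comes first or where
the structures happen to live in memory.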

--D

> 
>   Honza
> 
> > > +
> > >  dquot_operations
> > >  
> > >  
> > > @@ -634,9 +658,9 @@ access:   yes
> > > >  to be faulted in. The filesystem must find and return the page associated
> > > >  with the passed in "pgoff" in the vm_fault structure. If it is possible that
> > > >  the page may be truncated and/or invalidated, then the filesystem must lock
> > > > -the page, then ensure it is not already truncated (the page lock will block
> > > > -subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
> > > > -locked. The VM will unlock the page.
> > > > +invalidate_lock, then ensure the page is not already truncated (invalidate_lock
> > > > +will block subsequent truncate), and then return with VM_FAULT_LOCKED, and the
> > > > +page locked. The VM will unlock the page.
> > > >  
> > > >  ->map_pages() is called when VM asks to map easy accessible pages.
> > > >  Filesystem should find and map pages associated with offsets from "start_pgoff"
> > > > @@ -647,12 +671,14 @@ page table entry. Pointer to entry associated with the page is passed in
> > > >  "pte" field in vm_fault structure. Pointers to entries for other offsets
> > > >  should be calculated relative to "pte".
> > >  
> > > --

Re: [f2fs-dev] [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock

2021-05-12 Thread Darrick J. Wong

On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)   read / fault / ..
>   truncate_inode_pages_range()
>                                   <create pages in page
>                                    cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the address_space - invalidate_lock -
> that protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara 
> ---
>  Documentation/filesystems/locking.rst | 60 ++---
>  fs/inode.c                            |  3 ++
>  include/linux/fs.h                    |  6 +++
>  mm/filemap.c                          | 65 ++-
>  mm/readahead.c                        |  2 +
>  mm/rmap.c                             | 37 +++
>  mm/truncate.c                         |  2 +-
>  7 files changed, 127 insertions(+), 48 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 4ed2b22bd0a8..b73666a3da42 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -271,19 +271,19 @@ prototypes::
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -====================== ======================== =========
> -ops                    PageLocked(page)         i_rwsem
> -====================== ======================== =========
> +====================== ======================== ========= ===============
> +ops                    PageLocked(page)         i_rwsem   invalidate_lock
> +====================== ======================== ========= ===============
>  writepage:             yes, unlocks (see below)
> -readpage:              yes, unlocks
> +readpage:              yes, unlocks                       shared
>  writepages:
>  set_page_dirty         no
> -readahead:             yes, unlocks
> -readpages:             no
> +readahead:             yes, unlocks                       shared
> +readpages:             no                                 shared
>  write_begin:           locks the page           exclusive
>  write_end:             yes, unlocks             exclusive
>  bmap:
> -invalidatepage:        yes
> +invalidatepage:        yes                                exclusive
>  releasepage:           yes
>  freepage:              yes
>  direct_IO:
> @@ -378,7 +378,10 @@ keep it that way and don't breed new callers.
>  ->invalidatepage() is called when the filesystem must attempt to drop
>  some or all of the buffers from the page when it is being truncated. It
>  returns zero on success. If ->invalidatepage is zero, the kernel uses
> -block_invalidatepage() instead.
> +block_invalidatepage() instead. The filesystem should exclusively acquire
> +invalidate_lock before invalidating page cache in truncate / hole punch path (and
> +thus calling into ->invalidatepage) to block races between page cache
> +invalidation and page cache filling functions (fault, read, ...).
>  
>  ->releasepage() is called when the kernel is about to try to drop the
>  buffers from the page in preparation for freeing it.  It returns zero to
> @@ -573,6 +576,27 @@ in sys_read() and friends.
>  the lease within the individual filesystem to record the result of the
>  operation
>  
> +->fallocate implementation must be really careful to maintain page cache
> +consistency when punching holes or performing other operations that invalidate
> +page cache contents. Usually the filesystem needs to call
> +truncate_inode_pages_range() to invalidate relevant range of the page cache.
> +However the filesystem usually also needs to update its internal (and on disk)
> +view of file offset -> disk block mapping. Until this update is finished, the
> +filesystem needs to block page faults and reads from reloading now-stale page
> +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> +and acquires it in shared mode in paths loading pages from disk
> +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> +responsible for taking this lock in its fallocate implementation and generally
> +whenever the page cache contents needs to be invalidated because a block is
> +moving from under a

Re: [f2fs-dev] [PATCH v2 12/12] xfs: remove a stale comment from xfs_file_aio_write_checks()

2021-01-12 Thread Darrick J. Wong

On Fri, Jan 08, 2021 at 11:59:03PM -0800, Eric Biggers wrote:
> From: Eric Biggers 
> 
> The comment in xfs_file_aio_write_checks() about calling file_modified()
> after dropping the ilock doesn't make sense, because the code that
> unconditionally acquires and drops the ilock was removed by
> commit 467f78992a07 ("xfs: reduce ilock hold times in
> xfs_file_aio_write_checks").
> 
> Remove this outdated comment.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Eric Biggers 

Yep, thanks for the update. :)

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/xfs_file.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5b0f93f738372..4927c6653f15d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -389,12 +389,6 @@ xfs_file_aio_write_checks(
>   } else
>   spin_unlock(&ip->i_flags_lock);
>  
> - /*
> -  * Updating the timestamps will grab the ilock again from
> -  * xfs_fs_dirty_inode, so we have to call it after dropping the
> -  * lock above.  Eventually we should look into a way to avoid
> -  * the pointless lock roundtrip.
> -  */
>   return file_modified(file);
>  }
>  
> -- 
> 2.30.0
> 


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH v7 0/8] add support for direct I/O with fscrypt using blk-crypto

2020-11-17 Thread Darrick J. Wong

On Tue, Nov 17, 2020 at 12:15:26PM -0500, Theodore Y. Ts'o wrote:
> What is the expected use case for Direct I/O using fscrypt?  This
> isn't a problem which is unique to fscrypt, but one of the really
> unfortunate aspects of the DIO interface is the silent fallback to
> buffered I/O.  We've lived with this because DIO goes back decades,
> and the original use case was to keep enterprise databases happy, and
> the rules around what is necessary for DIO to work was relatively well
> understood.
> 
> But with fscrypt, there are going to be some additional requirements
> (e.g., using inline crypto), or else DIO will silently fall back to
> buffered I/O for encrypted files.  Depending on the intended use case
> of DIO with fscrypt, this caveat might or might not be unfortunately
> surprising for applications.
> 
> I wonder if we should have some kind of interface so we can more
> explicitly allow applications to query exactly what the requirements
> might be for a particular file vis-a-vis Direct I/O.  What are the
> memory alignment requirements, what are the file offset alignment
> requirements, what are the write size requirements, for a particular
> file.

In Ye Olde days there was XFS_IOC_DIOINFO to communicate all that (xfs
hardcodes 512b file offset alignment), but in this modern era perhaps
it's time to shovel that into statx...
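
As a rough illustration of what an application could do with such a query
result, here is a sketch that checks one request against a set of reported
limits.  struct demo_dio_limits and demo_dio_ok() are hypothetical names made
up for this example; they only resemble the kind of information (memory
alignment, offset alignment, I/O size) being discussed and are not an existing
kernel or libc interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one file's direct I/O constraints. */
struct demo_dio_limits {
	uint32_t mem_align;	/* required buffer address alignment, in bytes */
	uint32_t offset_align;	/* required file offset and length alignment */
	uint32_t max_io;	/* largest single request, in bytes */
};

/* Return true if the request could be issued as direct I/O as-is. */
static bool
demo_dio_ok(const struct demo_dio_limits *lim, const void *buf,
	    uint64_t offset, size_t len)
{
	/* The mask tests below require nonzero power-of-two alignments. */
	assert(lim->mem_align && !(lim->mem_align & (lim->mem_align - 1)));
	assert(lim->offset_align &&
	       !(lim->offset_align & (lim->offset_align - 1)));

	if ((uintptr_t)buf & (lim->mem_align - 1))
		return false;
	if ((offset | len) & (lim->offset_align - 1))
		return false;
	return len != 0 && len <= lim->max_io;
}
```

An application that gets "false" here knows up front that it has to realign or
split the request, instead of finding out through a silent fallback to
buffered I/O.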

--D

> 
>   - Ted



Re: [f2fs-dev] [PATCH v4 3/7] iomap: support direct I/O with fscrypt using blk-crypto

2020-07-22 Thread Darrick J. Wong

On Wed, Jul 22, 2020 at 04:26:25PM -0700, Eric Biggers wrote:
> On Wed, Jul 22, 2020 at 03:34:04PM -0700, Eric Biggers wrote:
> > So, something like this:
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 44bad4bb8831..2816194db46c 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3437,6 +3437,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> >   EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> >  
> > +   /*
> > +* When inline encryption is enabled, sometimes I/O to an encrypted file
> > +* has to be broken up to guarantee DUN contiguity.  Handle this by
> > +* limiting the length of the mapping returned.
> > +*/
> > +   if (!(flags & IOMAP_REPORT))
> > +   map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk,
> > +   map.m_len);
> > +
> > if (flags & IOMAP_WRITE)
> > ret = ext4_iomap_alloc(inode, &map, flags);
> > else
> > 
> > 
> > That also avoids any confusion between pages and blocks, which is nice.
> 
> Correction: for fiemap, ext4 actually uses ext4_iomap_begin_report() instead of
> ext4_iomap_begin().  So we don't need to check for !IOMAP_REPORT.
> 
> Also it could make sense to limit map.m_len after ext4_iomap_alloc() rather than
> before, so that we don't limit the length of the extent that gets allocated but
> rather just the length that gets returned to iomap.

Naïve question here -- if the decision to truncate the bio depends on
the file block offset, can you achieve the same thing by capping the
length of the iovec prior to iomap_dio_rw?

Granted that probably only makes sense if the LBLK IV thing is only
supposed to be used infrequently, and having to opencode a silly loop
might be more hassle than it's worth...
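
For what it's worth, the "silly loop" could look roughly like the sketch
below.  It assumes, purely for illustration, that the limit can be expressed
as "never cross a multiple of some boundary"; the real fscrypt constraint is
about DUN contiguity and is computed by fscrypt itself.  demo_cap_len() and
demo_count_chunks() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return how many bytes starting at pos can be issued before reaching
 * the next multiple of boundary, capped at len.
 */
static uint64_t
demo_cap_len(uint64_t pos, uint64_t len, uint64_t boundary)
{
	uint64_t room;

	assert(boundary != 0);
	room = boundary - (pos % boundary);
	return len < room ? len : room;
}

/*
 * Walk [pos, pos + len) in chunks that never cross a boundary and
 * return the number of chunks that were needed.
 */
static unsigned int
demo_count_chunks(uint64_t pos, uint64_t len, uint64_t boundary)
{
	unsigned int chunks = 0;

	while (len > 0) {
		uint64_t n = demo_cap_len(pos, len, boundary);

		/* A real caller would issue the I/O for [pos, pos + n) here. */
		pos += n;
		len -= n;
		chunks++;
	}
	return chunks;
}
```

With a 4096-byte boundary, a 5000-byte request starting at offset 4000 is
split into pieces of 96, 4096 and 808 bytes.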

--D

> - Eric



Re: [f2fs-dev] [PATCH 1/2] writeback: avoid double-writing the inode on a lazytime expiration

2020-03-25 Thread Darrick J. Wong

On Wed, Mar 25, 2020 at 11:21:13AM -0400, Theodore Y. Ts'o wrote:
> On Wed, Mar 25, 2020 at 02:20:57AM -0700, Christoph Hellwig wrote:
> > >   spin_unlock(&inode->i_lock);
> > >  
> > > - if (dirty & I_DIRTY_TIME)
> > > - mark_inode_dirty_sync(inode);
> > > + /* This was a lazytime expiration; we need to tell the file system */
> > > + if (dirty & I_DIRTY_TIME_EXPIRED && inode->i_sb->s_op->dirty_inode)
> > > + inode->i_sb->s_op->dirty_inode(inode, I_DIRTY_SYNC);
> > 
> > I think this needs a very clear comment explaining why we don't go
> > through __mark_inode_dirty.
> 
> I can take the explanation which is in the git commit description and
> move it into the comment.
> 
> > But as said before I'd rather have a new lazytime_expired operation that
> > makes it very clear what is happening.  We currently have 4 file systems
> > (ext4, f2fs, ubifs and xfs) that support lazytime, so this won't really
> > be a major churn.
> 
> Again, I believe patch #2 does what you want; if it doesn't can you
> explain why passing I_DIRTY_TIME_EXPIRED to s_op->dirty_inode() isn't
> "a new lazytime expired operation that makes very clear what is
> happening"?
> 
> I separated out patch #1 and patch #2 because patch #1 preserves
> current behavior, and patch #2 modifies XFS code, which I don't want
> to push Linus without an XFS reviewed-by.
> 
> N.b.  None of the other file systems required a change for patch #2,
> so if you want, we can have the XFS tree carry patch #2, and/or
> combine that with whatever other simplifying changes that you want.
> Or I can combine patch #1 and patch #2, with an XFS Reviewed-by, and
> send it through the ext4 tree.
> 
> What's your pleasure?

TBH while I'm pretty sure this does actually maintain more or less the
same behavior on xfs, I prefer Christoph's explicit ->lazytime_expired
approach[1] over squinting at bitflag manipulations.
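
To show the difference in shape between the two approaches, here is a small
userspace model of an explicit, optional hook next to a flag passed through a
generic callback.  All demo_* names and the flag value are hypothetical, and
the fallback to the generic callback when the hook is absent is an assumption
of this sketch, not something either patch is known to do.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DIRTY_SYNC 0x1	/* hypothetical flag value */

/* Counters so the example can show which callback ran. */
struct demo_inode {
	int expired_calls;
	int dirty_calls;
};

struct demo_super_ops {
	void (*dirty_inode)(struct demo_inode *inode, int flags);
	void (*lazytime_expired)(struct demo_inode *inode);	/* optional */
};

static void demo_dirty_inode(struct demo_inode *inode, int flags)
{
	assert(flags == DEMO_DIRTY_SYNC);
	inode->dirty_calls++;
}

static void demo_lazytime_expired(struct demo_inode *inode)
{
	inode->expired_calls++;
}

/* Tell the filesystem that lazily updated timestamps have expired. */
static void demo_expire_timestamps(const struct demo_super_ops *ops,
				   struct demo_inode *inode)
{
	if (ops->lazytime_expired)
		ops->lazytime_expired(inode);		/* explicit hook */
	else if (ops->dirty_inode)
		ops->dirty_inode(inode, DEMO_DIRTY_SYNC); /* flag-based path */
}
```

The explicit hook lets a reader see at the call site what event is being
reported, without having to decode which bits were set.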

(It also took me a while to realize that this patch duo even existed, as
it was kinda buried in its parent thread...)

--D

[1] 
https://lore.kernel.org/linux-fsdevel/20200325122825.1086872-1-...@lst.de/T/#t

> 
>   - Ted
> 



Re: [f2fs-dev] [PATCH v8 25/25] iomap: Convert from readpages to readahead

2020-02-26 Thread Darrick J. Wong

On Wed, Feb 26, 2020 at 09:07:28AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 26, 2020 at 09:04:25AM -0800, Darrick J. Wong wrote:
> > > @@ -456,15 +435,8 @@ iomap_readpages(struct address_space *mapping, struct list_head *pages,
> > >   unlock_page(ctx.cur_page);
> > >   put_page(ctx.cur_page);
> > >   }
> > > -
> > > - /*
> > > -  * Check that we didn't lose a page due to the arcance calling
> > > -  * conventions..
> > > -  */
> > > - WARN_ON_ONCE(!ret && !list_empty(ctx.pages));
> > > - return ret;
> > 
> > After all the discussion about "if we still have ctx.cur_page we should
> > just stop" in v7, I'm surprised that this patch now doesn't say much of
> > anything, not even a WARN_ON()?
> 
> The code quoted above puts the cur_page reference.  By dropping the
> odd refactoring patch there is no need to check for cur_page being
> left as a special condition as that still is the normal loop exit
> state and properly handled, just as in the original iomap code.

DOH.  Yes, yes it does.  Thanks for pointing that out. :)

/me hands himself another cup of coffee,
Reviewed-by: Darrick J. Wong 

--D



Re: [f2fs-dev] [PATCH v8 25/25] iomap: Convert from readpages to readahead

2020-02-26 Thread Darrick J. Wong

On Tue, Feb 25, 2020 at 01:48:38PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
> use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 90 +++---
>  fs/iomap/trace.h   |  2 +-
>  fs/xfs/xfs_aops.c  | 13 +++---
>  fs/zonefs/super.c  |  7 ++--
>  include/linux/iomap.h  |  3 +-
>  5 files changed, 41 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..83438b3257de 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -214,9 +214,8 @@ iomap_read_end_io(struct bio *bio)
>  struct iomap_readpage_ctx {
>   struct page *cur_page;
>   bool cur_page_in_bio;
> - bool is_readahead;
>   struct bio  *bio;
> - struct list_head *pages;
> + struct readahead_control *rac;
>  };
>  
>  static void
> @@ -307,11 +306,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   if (ctx->bio)
>   submit_bio(ctx->bio);
>  
> - if (ctx->is_readahead) /* same as readahead_gfp_mask */
> + if (ctx->rac) /* same as readahead_gfp_mask */
>   gfp |= __GFP_NORETRY | __GFP_NOWARN;
>   ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs));
>   ctx->bio->bi_opf = REQ_OP_READ;
> - if (ctx->is_readahead)
> + if (ctx->rac)
>   ctx->bio->bi_opf |= REQ_RAHEAD;
>   ctx->bio->bi_iter.bi_sector = sector;
>   bio_set_dev(ctx->bio, iomap->bdev);
> @@ -367,36 +366,8 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  }
>  EXPORT_SYMBOL_GPL(iomap_readpage);
>  
> -static struct page *
> -iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
> - loff_t length, loff_t *done)
> -{
> - while (!list_empty(pages)) {
> - struct page *page = lru_to_page(pages);
> -
> - if (page_offset(page) >= (u64)pos + length)
> - break;
> -
> - list_del(&page->lru);
> - if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
> - GFP_NOFS))
> - return page;
> -
> - /*
> -  * If we already have a page in the page cache at index we are
> -  * done.  Upper layers don't care if it is uptodate after the
> -  * readpages call itself as every page gets checked again once
> -  * actually needed.
> -  */
> - *done += PAGE_SIZE;
> - put_page(page);
> - }
> -
> - return NULL;
> -}
> -
>  static loff_t
> -iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> +iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
>   void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct iomap_readpage_ctx *ctx = data;
> @@ -410,10 +381,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>   ctx->cur_page = NULL;
>   }
>   if (!ctx->cur_page) {
> - ctx->cur_page = iomap_next_page(inode, ctx->pages,
> - pos, length, &done);
> - if (!ctx->cur_page)
> - break;
> + ctx->cur_page = readahead_page(ctx->rac);
>   ctx->cur_page_in_bio = false;
>   }
>   ret = iomap_readpage_actor(inode, pos + done, length - done,
> @@ -423,32 +391,43 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>   return done;
>  }
>  
> -int
> -iomap_readpages(struct address_space *mapping, struct list_head *pages,
> - unsigned nr_pages, const struct iomap_ops *ops)
> +/**
> + * iomap_readahead - Attempt to read pages from a file.
> + * @rac: Describes the pages to be read.
> + * @ops: The operations vector for the filesystem.
> + *
> + * This function is for filesystems to call to implement their readahead
> + * address_space operation.
> + *
> + * Context: The @ops callbacks may submit I/O (eg to read the addresses of
> + * blocks from disc), and may wait for it.  The caller may be trying to
> + * access a different page, and so sleeping excessively should be avoided.
> + * It may allocate memory, but should avoid costly allocations.  This
> + * function is called with memalloc_nofs set, so allocations will not cause
> + * the filesystem to be reentered.
> + */
> +void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
>  {
> + struct inode *inode = rac->mapping->host;
> + loff_t pos = readahead_pos(rac);
> + loff_t length =

Re: [f2fs-dev] [PATCH v7 22/24] iomap: Convert from readpages to readahead

2020-02-24 Thread Darrick J. Wong

On Sun, Feb 23, 2020 at 08:33:55PM -0800, Matthew Wilcox wrote:
> On Fri, Feb 21, 2020 at 05:00:13PM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 20, 2020 at 08:57:34AM -0800, Matthew Wilcox wrote:
> > > On Thu, Feb 20, 2020 at 07:49:12AM -0800, Christoph Hellwig wrote:
> > > +/**
> > > + * iomap_readahead - Attempt to read pages from a file.
> > > + * @rac: Describes the pages to be read.
> > > + * @ops: The operations vector for the filesystem.
> > > + *
> > > + * This function is for filesystems to call to implement their readahead
> > > + * address_space operation.
> > > + *
> > > > + * Context: The file is pinned by the caller, and the pages to be read are
> > > > + * all locked and have an elevated refcount.  This function will unlock
> > > > + * the pages (once I/O has completed on them, or I/O has been determined to
> > > > + * not be necessary).  It will also decrease the refcount once the pages
> > > > + * have been submitted for I/O.  After this point, the page may be removed
> > > + * from the page cache, and should not be referenced.
> > > + */
> > > 
> > > > Isn't the context documentation something that belongs into the aop
> > > > documentation?  I've never really seen the value of duplicating this
> > > > information in method instances, as it is just bound to be out of date
> > > > rather sooner than later.
> > > 
> > > I'm in two minds about it as well.  There's definitely no value in
> > > providing kernel-doc for implementations of a common interface ... so
> > > rather than fixing the nilfs2 kernel-doc, I just deleted it.  But this
> > > isn't just the implementation, like nilfs2_readahead() is, it's a library
> > > function for filesystems to call, so it deserves documentation.  On the
> > > other hand, there's no real thought to this on the part of the filesystem;
> > > the implementation just calls this with the appropriate ops pointer.
> > > 
> > > Then again, I kind of feel like we need more documentation of iomap to
> > > help filesystems convert to using it.  But maybe kernel-doc isn't the
> > > mechanism to provide that.
> > 
> > I think we need more documentation of the parts of iomap where it can
> > call back into the filesystem (looking at you, iomap_dio_ops).
> > 
> > I'm not opposed to letting this comment stay, though I don't see it as
> > all that necessary since iomap_readahead implements a callout that's
> > documented in vfs.rst and is thus subject to all the constraints listed
> > in the (*readahead) documentation.
> 
> Right.  And that's not currently in kernel-doc format, but should be.
> Something for a different patchset, IMO.
> 
> What we need documenting _here_ is the conditions under which the
> iomap_ops are called so the filesystem author doesn't need to piece them
> together from three different places.  Here's what I currently have:
> 
>  * Context: The @ops callbacks may submit I/O (eg to read the addresses of
>  * blocks from disc), and may wait for it.  The caller may be trying to
>  * access a different page, and so sleeping excessively should be avoided.
>  * It may allocate memory, but should avoid large allocations.  This
>  * function is called with memalloc_nofs set, so allocations will not cause
>  * the filesystem to be reentered.

How large? :)

--D



Re: [f2fs-dev] [PATCH v7 22/24] iomap: Convert from readpages to readahead

2020-02-21 Thread Darrick J. Wong

On Wed, Feb 19, 2020 at 01:01:01PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> Use the new readahead operation in iomap.  Convert XFS and ZoneFS to
> use it.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Ok... so from what I saw in the mm patches, this series changes
readahead to shove the locked pages into the page cache before calling
the filesystem's ->readahead function.  Therefore, this (and the
previous patch) are more or less just getting rid of all the iomap
machinery to put pages in the cache and instead pulling them out of the
mapping prior to submitting a read bio?
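
If that reading is right, the filesystem side reduces to a loop that pulls
already-cached pages out of the request one at a time.  Here is a userspace
model of that loop; struct demo_rac and demo_readahead_page() are hypothetical
stand-ins that only mimic the shape of the interface and are not the kernel's
struct readahead_control or readahead_page().

```c
#include <assert.h>
#include <stddef.h>

struct demo_page {
	unsigned long index;
};

/* Hypothetical readahead request: pages were already added by the caller. */
struct demo_rac {
	struct demo_page *pages;
	size_t nr_pages;
	size_t next;		/* position of the next page to hand out */
};

/* Hand out the next page of the request, or NULL when none are left. */
static struct demo_page *
demo_readahead_page(struct demo_rac *rac)
{
	if (rac->next >= rac->nr_pages)
		return NULL;
	return &rac->pages[rac->next++];
}

/* Consume every remaining page and return how many were processed. */
static size_t
demo_readahead(struct demo_rac *rac)
{
	struct demo_page *page;
	size_t submitted = 0;

	while ((page = demo_readahead_page(rac)) != NULL) {
		/* A real filesystem would attach the page to a read bio here. */
		submitted++;
	}
	return submitted;
}
```

Note that nothing in the loop inserts pages into a cache or handles an
"already present" case, which is the machinery the patch removes from iomap.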

If so,

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 90 +++---
>  fs/iomap/trace.h   |  2 +-
>  fs/xfs/xfs_aops.c  | 13 +++---
>  fs/zonefs/super.c  |  7 ++--
>  include/linux/iomap.h  |  3 +-
>  5 files changed, 41 insertions(+), 74 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 31899e6cb0f8..66cf453f4bb7 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -214,9 +214,8 @@ iomap_read_end_io(struct bio *bio)
>  struct iomap_readpage_ctx {
>   struct page *cur_page;
>   bool cur_page_in_bio;
> - bool is_readahead;
>   struct bio  *bio;
> - struct list_head *pages;
> + struct readahead_control *rac;
>  };
>  
>  static void
> @@ -307,11 +306,11 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   if (ctx->bio)
>   submit_bio(ctx->bio);
>  
> - if (ctx->is_readahead) /* same as readahead_gfp_mask */
> + if (ctx->rac) /* same as readahead_gfp_mask */
>   gfp |= __GFP_NORETRY | __GFP_NOWARN;
>   ctx->bio = bio_alloc(gfp, min(BIO_MAX_PAGES, nr_vecs));
>   ctx->bio->bi_opf = REQ_OP_READ;
> - if (ctx->is_readahead)
> + if (ctx->rac)
>   ctx->bio->bi_opf |= REQ_RAHEAD;
>   ctx->bio->bi_iter.bi_sector = sector;
>   bio_set_dev(ctx->bio, iomap->bdev);
> @@ -367,36 +366,8 @@ iomap_readpage(struct page *page, const struct iomap_ops *ops)
>  }
>  EXPORT_SYMBOL_GPL(iomap_readpage);
>  
> -static struct page *
> -iomap_next_page(struct inode *inode, struct list_head *pages, loff_t pos,
> - loff_t length, loff_t *done)
> -{
> - while (!list_empty(pages)) {
> - struct page *page = lru_to_page(pages);
> -
> - if (page_offset(page) >= (u64)pos + length)
> - break;
> -
> - list_del(&page->lru);
> - if (!add_to_page_cache_lru(page, inode->i_mapping, page->index,
> - GFP_NOFS))
> - return page;
> -
> - /*
> -  * If we already have a page in the page cache at index we are
> -  * done.  Upper layers don't care if it is uptodate after the
> -  * readpages call itself as every page gets checked again once
> -  * actually needed.
> -  */
> - *done += PAGE_SIZE;
> - put_page(page);
> - }
> -
> - return NULL;
> -}
> -
>  static loff_t
> -iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
> +iomap_readahead_actor(struct inode *inode, loff_t pos, loff_t length,
>   void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct iomap_readpage_ctx *ctx = data;
> @@ -404,10 +375,7 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>   while (done < length) {
>   if (!ctx->cur_page) {
> - ctx->cur_page = iomap_next_page(inode, ctx->pages,
> - pos, length, &done);
> - if (!ctx->cur_page)
> - break;
> + ctx->cur_page = readahead_page(ctx->rac);
>   ctx->cur_page_in_bio = false;
>   }
>   ret = iomap_readpage_actor(inode, pos + done, length - done,
> @@ -431,44 +399,48 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, loff_t length,
>   return done;
>  }
>  
> -int
> -iomap_readpages(struct address_space *mapping, struct list_head *pages,
> - unsigned nr_pages, const struct iomap_ops *ops)
> +/**
> + * iomap_readahead - Attempt to read pages from a file.
> + * @rac: Describes the pages to be read.
> + * @op

Re: [f2fs-dev] [PATCH v7 22/24] iomap: Convert from readpages to readahead

2020-02-21 Thread Darrick J. Wong

On Thu, Feb 20, 2020 at 08:57:34AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 20, 2020 at 07:49:12AM -0800, Christoph Hellwig wrote:
> > > +/**
> > > + * iomap_readahead - Attempt to read pages from a file.
> > > + * @rac: Describes the pages to be read.
> > > + * @ops: The operations vector for the filesystem.
> > > + *
> > > + * This function is for filesystems to call to implement their readahead
> > > + * address_space operation.
> > > + *
> > > + * Context: The file is pinned by the caller, and the pages to be read are
> > > + * all locked and have an elevated refcount.  This function will unlock
> > > + * the pages (once I/O has completed on them, or I/O has been determined to
> > > + * not be necessary).  It will also decrease the refcount once the pages
> > > + * have been submitted for I/O.  After this point, the page may be removed
> > > + * from the page cache, and should not be referenced.
> > > + */
> > 
> > Isn't the context documentation something that belongs into the aop
> > documentation?  I've never really seen the value of duplicating this
> > information in method instances, as it is just bound to be out of date
> > rather sooner than later.
> 
> I'm in two minds about it as well.  There's definitely no value in
> providing kernel-doc for implementations of a common interface ... so
> rather than fixing the nilfs2 kernel-doc, I just deleted it.  But this
> isn't just the implementation, like nilfs2_readahead() is, it's a library
> function for filesystems to call, so it deserves documentation.  On the
> other hand, there's no real thought to this on the part of the filesystem;
> the implementation just calls this with the appropriate ops pointer.
> 
> Then again, I kind of feel like we need more documentation of iomap to
> help filesystems convert to using it.  But maybe kernel-doc isn't the
> mechanism to provide that.

I think we need more documentation of the parts of iomap where it can
call back into the filesystem (looking at you, iomap_dio_ops).

I'm not opposed to letting this comment stay, though I don't see it as
all that necessary since iomap_readahead implements a callout that's
documented in vfs.rst and is thus subject to all the constraints listed
in the (*readahead) documentation.

--D


___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

Re: [f2fs-dev] [PATCH v7 21/24] iomap: Restructure iomap_readpages_actor

2020-02-21 Thread Darrick J. Wong

On Wed, Feb 19, 2020 at 01:01:00PM -0800, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> By putting the 'have we reached the end of the page' condition at the end
> of the loop instead of the beginning, we can remove the 'submit the last
> page' code from iomap_readpages().  Also check that iomap_readpage_actor()
> didn't return 0, which would lead to an endless loop.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 32 ++--
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb3511eb152a..31899e6cb0f8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -400,15 +400,9 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   void *data, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct iomap_readpage_ctx *ctx = data;
> - loff_t done, ret;
> -
> - for (done = 0; done < length; done += ret) {
> - if (ctx->cur_page && offset_in_page(pos + done) == 0) {
> - if (!ctx->cur_page_in_bio)
> - unlock_page(ctx->cur_page);
> - put_page(ctx->cur_page);
> - ctx->cur_page = NULL;
> - }
> + loff_t ret, done = 0;
> +
> + while (done < length) {
>   if (!ctx->cur_page) {
>   ctx->cur_page = iomap_next_page(inode, ctx->pages,
>   pos, length, &done);
> @@ -418,6 +412,20 @@ iomap_readpages_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   }
>   ret = iomap_readpage_actor(inode, pos + done, length - done,
>   ctx, iomap, srcmap);
> + done += ret;
> +
> + /* Keep working on a partial page */
> + if (ret && offset_in_page(pos + done))
> + continue;
> +
> + if (!ctx->cur_page_in_bio)
> + unlock_page(ctx->cur_page);
> + put_page(ctx->cur_page);
> + ctx->cur_page = NULL;
> +
> + /* Don't loop forever if we made no progress */
> + if (WARN_ON(!ret))
> + break;
>   }
>  
>   return done;
> @@ -451,11 +459,7 @@ iomap_readpages(struct address_space *mapping, struct 
> list_head *pages,
>  done:
>   if (ctx.bio)
>   submit_bio(ctx.bio);
> - if (ctx.cur_page) {
> - if (!ctx.cur_page_in_bio)
> - unlock_page(ctx.cur_page);
> - put_page(ctx.cur_page);
> - }
> + BUG_ON(ctx.cur_page);

Whoah, is the system totally unrecoverably hosed at this point?

I get that this /shouldn't/ happen, but should we somehow end up with a
page here, are we unable either to release it or even just leak it?  I'd
have thought a WARN_ON would be just fine here.

--D
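
A minimal userspace sketch of the WARN_ON alternative raised above. This
is not the kernel code: WARN_ON_SKETCH, struct fake_page, struct read_ctx
and finish_read() are hypothetical stand-ins, and it assumes a leftover
page can simply be unlocked and released, as the removed lines did.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's WARN_ON(): report and carry on. */
#define WARN_ON_SKETCH(cond) \
	((cond) ? (fprintf(stderr, "WARNING: %s\n", #cond), true) : false)

/* Hypothetical model of a page: just a lock bit and a reference count. */
struct fake_page {
	bool locked;
	int refcount;
};

/* Hypothetical model of the readpage context used in the patch. */
struct read_ctx {
	struct fake_page *cur_page;
	bool cur_page_in_bio;
};

/*
 * End-of-read cleanup.  A leftover page is unexpected, but instead of
 * halting we warn and release it the way the old code did.
 * Returns true if a stray page had to be cleaned up.
 */
static bool finish_read(struct read_ctx *ctx)
{
	if (!WARN_ON_SKETCH(ctx->cur_page != NULL))
		return false;

	if (!ctx->cur_page_in_bio)
		ctx->cur_page->locked = false;	/* models unlock_page() */
	ctx->cur_page->refcount--;		/* models put_page() */
	ctx->cur_page = NULL;
	return true;
}
```

The point of the sketch is only that the unexpected state is reported
while the page is still released, so nothing stays locked or leaked.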

>  
>   /*
>* Check that we didn't lose a page due to the arcance calling
> -- 
> 2.25.0
> 



Re: [f2fs-dev] [PATCH v4] fs: Fix page_mkwrite off-by-one errors

2020-01-15 Thread Darrick J. Wong

On Wed, Jan 08, 2020 at 08:57:10AM -0800, Christoph Hellwig wrote:
> I don't want to be the party pooper, but shouldn't this be a series
> with one patch to add the helper, and then once for each fs / piece
> of common code switched over?

The current patch in the iomap branch contains the chunks that add the
helper function, fix iomap, and whatever chunks for other filesystems
that don't cause /any/ merge complaints in for-next.  That means btrfs,
ceph, ext4, and ubifs will get fixed this time around.

Seeing as it's been floating around in for-next for a week now I'd
rather not rebase the branch just to rip out the four parts that haven't
given me any headaches so that they can be applied separately. :)

The acks from the other fs maintainers were very helpful, but at the
same time, I don't want to become a shadow vfs maintainer.

Therefore, whatever's in this v4 patch that isn't in [1] will have to be
sent separately.

[1] https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=iomap-5.6-merge&id=62e298db3fc3ebf41d996f3c86b44cbbdd3286bc

> On Wed, Jan 08, 2020 at 02:15:28PM +0100, Andreas Gruenbacher wrote:
> > Hi Darrick,
> > 
> > here's an updated version with the latest feedback incorporated.  Hope
> > you find that useful.
> > 
> > As far as the f2fs merge conflict goes, I've been told by Linus not to
> > resolve those kinds of conflicts but to point them out when sending the
> > merge request.  So this shouldn't be a big deal.
> 
> Also this isn't really the proper way to write a commit message.  This
> text would go into the cover letter if it was a series..

 Yeah.

--D


Re: [f2fs-dev] [PATCH v3] fs: Fix page_mkwrite off-by-one errors

2020-01-07 Thread Darrick J. Wong

On Wed, Dec 18, 2019 at 02:09:35PM +0100, Andreas Gruenbacher wrote:
> Hi Darrick,
> 
> can this fix go in via the xfs tree?
> 
> Thanks,
> Andreas
> 
> --
> 
> The check in block_page_mkwrite that is meant to determine whether an
> offset is within the inode size is off by one.  This bug has been copied
> into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs,
> ceph).
> 
> Fix that by introducing a new page_mkwrite_check_truncate helper that
> checks for truncate and computes the bytes in the page up to EOF.  Use
> the helper in the above mentioned filesystems.
> 
> In addition, use the new helper in btrfs as well.
> 
> Signed-off-by: Andreas Gruenbacher 
> Acked-by: David Sterba  (btrfs part)
> Acked-by: Richard Weinberger  (ubifs part)
> ---
>  fs/btrfs/inode.c| 15 ---
>  fs/buffer.c | 16 +++-
>  fs/ceph/addr.c  |  2 +-
>  fs/ext4/inode.c | 14 --
>  fs/f2fs/file.c  | 19 +++

Well, the f2fs developers never acked this and there was a conflict when
I put this into for-next, so I removed the f2fs part (and fixed the
unused variable warning in the ext4 part)...

--D

>  fs/iomap/buffered-io.c  | 18 +-
>  fs/ubifs/file.c |  3 +--
>  include/linux/pagemap.h | 28 
>  8 files changed, 53 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 56032c518b26..86c6fcd8139d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9016,13 +9016,11 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
>  again:
>   lock_page(page);
> - size = i_size_read(inode);
>  
> - if ((page->mapping != inode->i_mapping) ||
> - (page_start >= size)) {
> - /* page got truncated out from underneath us */
> + ret2 = page_mkwrite_check_truncate(page, inode);
> + if (ret2 < 0)
>   goto out_unlock;
> - }
> + zero_start = ret2;
>   wait_on_page_writeback(page);
>  
>   lock_extent_bits(io_tree, page_start, page_end, _state);
> @@ -9043,6 +9041,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   goto again;
>   }
>  
> + size = i_size_read(inode);
>   if (page->index == ((size - 1) >> PAGE_SHIFT)) {
>   reserved_space = round_up(size - page_start,
> fs_info->sectorsize);
> @@ -9075,12 +9074,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   }
>   ret2 = 0;
>  
> - /* page is wholly or partially inside EOF */
> - if (page_start + PAGE_SIZE > size)
> - zero_start = offset_in_page(size);
> - else
> - zero_start = PAGE_SIZE;
> -
>   if (zero_start != PAGE_SIZE) {
>   kaddr = kmap(page);
>   memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d8c7242426bb..53aabde57ca7 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2499,23 +2499,13 @@ int block_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf,
>   struct page *page = vmf->page;
>   struct inode *inode = file_inode(vma->vm_file);
>   unsigned long end;
> - loff_t size;
>   int ret;
>  
>   lock_page(page);
> - size = i_size_read(inode);
> - if ((page->mapping != inode->i_mapping) ||
> - (page_offset(page) > size)) {
> - /* We overload EFAULT to mean page got truncated */
> - ret = -EFAULT;
> + ret = page_mkwrite_check_truncate(page, inode);
> + if (ret < 0)
>   goto out_unlock;
> - }
> -
> - /* page is wholly or partially inside EOF */
> - if (((page->index + 1) << PAGE_SHIFT) > size)
> - end = size & ~PAGE_MASK;
> - else
> - end = PAGE_SIZE;
> + end = ret;
>  
>   ret = __block_write_begin(page, 0, end, get_block);
>   if (!ret)
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 7ab616601141..ef958aa4adb4 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1575,7 +1575,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault 
> *vmf)
>   do {
>   lock_page(page);
>  
> - if ((off > size) || (page->mapping != inode->i_mapping)) {
> + if (page_mkwrite_check_truncate(page, inode) < 0) {
>   unlock_page(page);
>   ret = VM_FAULT_NOPAGE;
>   break;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 28f28de0c1b6..51ab1d2cac80 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5871,7 +5871,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = vmf->page;
> - loff_t size;
>   unsigned long len;
>   int err;
>   vm_fault_t ret;
> @@ -5907,18 +5906,13 @@ vm_fault_t

Re: [f2fs-dev] [PATCH v6 2/9] block: Add encryption context to struct bio

2019-12-18 Thread Darrick J. Wong

On Wed, Dec 18, 2019 at 06:51:29AM -0800, Satya Tangirala wrote:
> We must have some way of letting a storage device driver know what
> encryption context it should use for en/decrypting a request. However,
> it's the filesystem/fscrypt that knows about and manages encryption
> contexts. As such, when the filesystem layer submits a bio to the block
> layer, and this bio eventually reaches a device driver with support for
> inline encryption, the device driver will need to have been told the
> encryption context for that bio.
> 
> We want to communicate the encryption context from the filesystem layer
> to the storage device along with the bio, when the bio is submitted to the
> block layer. To do this, we add a struct bio_crypt_ctx to struct bio, which
> can represent an encryption context (note that we can't use the bi_private
> field in struct bio to do this because that field does not function to pass
> information across layers in the storage stack). We also introduce various
> functions to manipulate the bio_crypt_ctx and make the bio/request merging
> logic aware of the bio_crypt_ctx.
> 
> Signed-off-by: Satya Tangirala 
> ---
>  block/Makefile|   2 +-
>  block/bio-crypt-ctx.c | 131 ++
>  block/bio.c   |  16 ++--
>  block/blk-core.c  |   3 +
>  block/blk-merge.c |  11 +++
>  block/bounce.c|  12 ++-
>  drivers/md/dm.c   |   3 +-
>  include/linux/bio-crypt-ctx.h | 146 +-
>  include/linux/blk_types.h |   6 ++
>  9 files changed, 312 insertions(+), 18 deletions(-)
>  create mode 100644 block/bio-crypt-ctx.c
> 
> diff --git a/block/Makefile b/block/Makefile
> index 7c603669f216..79f2b8b3fc5d 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -37,4 +37,4 @@ obj-$(CONFIG_BLK_DEBUG_FS)  += blk-mq-debugfs.o
>  obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
>  obj-$(CONFIG_BLK_SED_OPAL)   += sed-opal.o
>  obj-$(CONFIG_BLK_PM) += blk-pm.o
> -obj-$(CONFIG_BLK_INLINE_ENCRYPTION)  += keyslot-manager.o
> \ No newline at end of file
> +obj-$(CONFIG_BLK_INLINE_ENCRYPTION)  += keyslot-manager.o bio-crypt-ctx.o
> \ No newline at end of file
> diff --git a/block/bio-crypt-ctx.c b/block/bio-crypt-ctx.c
> new file mode 100644
> index ..dadf0da3c21b
> --- /dev/null
> +++ b/block/bio-crypt-ctx.c
> @@ -0,0 +1,131 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2019 Google LLC
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +
> +static int num_prealloc_crypt_ctxs = 128;
> +
> +module_param(num_prealloc_crypt_ctxs, int, 0444);
> +MODULE_PARM_DESC(num_prealloc_crypt_ctxs,
> + "Number of bio crypto contexts to preallocate");
> +
> +static struct kmem_cache *bio_crypt_ctx_cache;
> +static mempool_t *bio_crypt_ctx_pool;
> +
> +int __init bio_crypt_ctx_init(void)
> +{
> + bio_crypt_ctx_cache = KMEM_CACHE(bio_crypt_ctx, 0);
> + if (!bio_crypt_ctx_cache)
> + return -ENOMEM;
> +
> + bio_crypt_ctx_pool = mempool_create_slab_pool(num_prealloc_crypt_ctxs,
> +   bio_crypt_ctx_cache);
> + if (!bio_crypt_ctx_pool)
> + return -ENOMEM;
> +
> + /* This is assumed in various places. */
> + BUILD_BUG_ON(BLK_ENCRYPTION_MODE_INVALID != 0);
> +
> + return 0;
> +}
> +
> +struct bio_crypt_ctx *bio_crypt_alloc_ctx(gfp_t gfp_mask)
> +{
> + return mempool_alloc(bio_crypt_ctx_pool, gfp_mask);
> +}
> +
> +void bio_crypt_free_ctx(struct bio *bio)
> +{
> + mempool_free(bio->bi_crypt_context, bio_crypt_ctx_pool);
> + bio->bi_crypt_context = NULL;
> +}
> +
> +void bio_crypt_clone(struct bio *dst, struct bio *src, gfp_t gfp_mask)
> +{
> + const struct bio_crypt_ctx *src_bc = src->bi_crypt_context;
> +
> + /*
> +  * If a bio is swhandled, then it will be decrypted when bio_endio
> +  * is called. As we only want the data to be decrypted once, copies
> +  * of the bio must not have have a crypt context.
> +  */
> + if (!src_bc)
> + return;
> +
> + dst->bi_crypt_context = bio_crypt_alloc_ctx(gfp_mask);
> + *dst->bi_crypt_context = *src_bc;
> +
> + if (src_bc->bc_keyslot >= 0)
> + keyslot_manager_get_slot(src_bc->bc_ksm, src_bc->bc_keyslot);
> +}
> +EXPORT_SYMBOL_GPL(bio_crypt_clone);
> +
> +bool bio_crypt_should_process(struct request *rq)
> +{
> + struct bio *bio = rq->bio;
> +
> + if (!bio || !bio->bi_crypt_context)
> + return false;
> +
> + return rq->q->ksm == bio->bi_crypt_context->bc_ksm;
> +}
> +EXPORT_SYMBOL_GPL(bio_crypt_should_process);
> +
> +/*
> + * Checks that two bio crypt contexts are compatible - i.e. that
> + * they are mergeable except for data_unit_num continuity.
> + */
> +bool bio_crypt_ctx_compatible(struct bio *b_1, struct bio *b_2)
> +{
> + struct bio_crypt_ctx *bc1 =

Re: [f2fs-dev] [PATCH v3] fs: Fix page_mkwrite off-by-one errors

2019-12-18 Thread Darrick J. Wong

On Wed, Dec 18, 2019 at 08:15:36PM +0100, Andreas Gruenbacher wrote:
> On Wed, Dec 18, 2019 at 7:55 PM Darrick J. Wong wrote:
> > On Wed, Dec 18, 2019 at 02:09:35PM +0100, Andreas Gruenbacher wrote:
> > > Hi Darrick,
> > >
> > > can this fix go in via the xfs tree?
> >
> > Er, I'd rather not touch five other filesystems via the XFS tree.
> > However, a more immediate problem that I think I see is...
> >
> > > Thanks,
> > > Andreas
> > >
> > > --
> > >
> > > The check in block_page_mkwrite that is meant to determine whether an
> > > offset is within the inode size is off by one.  This bug has been copied
> > > into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs,
> > > ceph).
> > >
> > > Fix that by introducing a new page_mkwrite_check_truncate helper that
> > > checks for truncate and computes the bytes in the page up to EOF.  Use
> > > the helper in the above mentioned filesystems.
> > >
> > > In addition, use the new helper in btrfs as well.
> > >
> > > Signed-off-by: Andreas Gruenbacher 
> > > Acked-by: David Sterba  (btrfs part)
> > > Acked-by: Richard Weinberger  (ubifs part)
> > > ---
> > >  fs/btrfs/inode.c| 15 ---
> > >  fs/buffer.c | 16 +++-
> > >  fs/ceph/addr.c  |  2 +-
> > >  fs/ext4/inode.c | 14 --
> > >  fs/f2fs/file.c  | 19 +++
> > >  fs/iomap/buffered-io.c  | 18 +-
> > >  fs/ubifs/file.c |  3 +--
> > >  include/linux/pagemap.h | 28 
> > >  8 files changed, 53 insertions(+), 62 deletions(-)
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 56032c518b26..86c6fcd8139d 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -9016,13 +9016,11 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault 
> > > *vmf)
> > >   ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
> > >  again:
> > >   lock_page(page);
> > > - size = i_size_read(inode);
> > >
> > > - if ((page->mapping != inode->i_mapping) ||
> > > - (page_start >= size)) {
> > > - /* page got truncated out from underneath us */
> > > + ret2 = page_mkwrite_check_truncate(page, inode);
> > > + if (ret2 < 0)
> > >   goto out_unlock;
> >
> > ...here we try to return -EFAULT as vm_fault_t.  Notice how btrfs returns
> > VM_FAULT_* values directly and never calls block_page_mkwrite_return?  I
> > know dsterba acked this, but I cannot see how this is correct?
> 
> Well, page_mkwrite_check_truncate can only fail with -EFAULT, in which
> case btrfs_page_mkwrite will return VM_FAULT_NOPAGE. It would be
> cleaner not to discard page_mkwrite_check_truncate's return value
> though.

*OH*, because we're stuffing the value in ret2, not ret.  Ok, that makes
more sense.  Er, I guess I don't mind pushing via iomap tree, but could
we get some acks from Ted and any of the ceph maintainers?

--D
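
A small userspace sketch of the ret/ret2 pattern discussed above, under
stated assumptions: the *_SKETCH constants and both functions are
hypothetical, and the success value is an assumption since the quoted
hunk does not show it.  It only models why the errno never escapes as a
vm_fault_t.

```c
#include <errno.h>

/* Hypothetical stand-ins for VM_FAULT_* codes; the values are arbitrary. */
#define VM_FAULT_NOPAGE_SKETCH 0x100u
#define VM_FAULT_LOCKED_SKETCH 0x200u

/* Models the helper: -EFAULT if the page was truncated, else a byte count. */
static int check_truncate_sketch(int truncated)
{
	return truncated ? -EFAULT : 4096;
}

/* Models the shape of the fault handler: ret is preset, errno goes in ret2. */
static unsigned int page_mkwrite_sketch(int truncated)
{
	unsigned int ret = VM_FAULT_NOPAGE_SKETCH;	/* make the VM retry */
	int ret2;

	ret2 = check_truncate_sketch(truncated);
	if (ret2 < 0)
		return ret;	/* the errno stays in ret2 and is discarded */

	return VM_FAULT_LOCKED_SKETCH;	/* assumed success value */
}
```

Because the negative value only ever lives in ret2, the caller sees the
preset fault code on the truncation path.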

> > > - }
> > > + zero_start = ret2;
> > >   wait_on_page_writeback(page);
> > >
> > >   lock_extent_bits(io_tree, page_start, page_end, _state);
> > > @@ -9043,6 +9041,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> > >   goto again;
> > >   }
> > >
> > > + size = i_size_read(inode);
> > >   if (page->index == ((size - 1) >> PAGE_SHIFT)) {
> > >   reserved_space = round_up(size - page_start,
> > > fs_info->sectorsize);
> > > @@ -9075,12 +9074,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
> > >   }
> > >   ret2 = 0;
> > >
> > > - /* page is wholly or partially inside EOF */
> > > - if (page_start + PAGE_SIZE > size)
> > > - zero_start = offset_in_page(size);
> > > - else
> > > - zero_start = PAGE_SIZE;
> > > -
> > >   if (zero_start != PAGE_SIZE) {
> > >   kaddr = kmap(page);
> > >   memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index d8c7242426bb..53aabde57ca7 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -2499,23 +2499,13 @@

Re: [f2fs-dev] [PATCH v3] fs: Fix page_mkwrite off-by-one errors

2019-12-18 Thread Darrick J. Wong

On Wed, Dec 18, 2019 at 02:09:35PM +0100, Andreas Gruenbacher wrote:
> Hi Darrick,
> 
> can this fix go in via the xfs tree?

Er, I'd rather not touch five other filesystems via the XFS tree.
However, a more immediate problem that I think I see is...

> Thanks,
> Andreas
> 
> --
> 
> The check in block_page_mkwrite that is meant to determine whether an
> offset is within the inode size is off by one.  This bug has been copied
> into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs,
> ceph).
> 
> Fix that by introducing a new page_mkwrite_check_truncate helper that
> checks for truncate and computes the bytes in the page up to EOF.  Use
> the helper in the above mentioned filesystems.
> 
> In addition, use the new helper in btrfs as well.
> 
> Signed-off-by: Andreas Gruenbacher 
> Acked-by: David Sterba  (btrfs part)
> Acked-by: Richard Weinberger  (ubifs part)
> ---
>  fs/btrfs/inode.c| 15 ---
>  fs/buffer.c | 16 +++-
>  fs/ceph/addr.c  |  2 +-
>  fs/ext4/inode.c | 14 --
>  fs/f2fs/file.c  | 19 +++
>  fs/iomap/buffered-io.c  | 18 +-
>  fs/ubifs/file.c |  3 +--
>  include/linux/pagemap.h | 28 
>  8 files changed, 53 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 56032c518b26..86c6fcd8139d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9016,13 +9016,11 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
>  again:
>   lock_page(page);
> - size = i_size_read(inode);
>  
> - if ((page->mapping != inode->i_mapping) ||
> - (page_start >= size)) {
> - /* page got truncated out from underneath us */
> + ret2 = page_mkwrite_check_truncate(page, inode);
> + if (ret2 < 0)
>   goto out_unlock;

...here we try to return -EFAULT as vm_fault_t.  Notice how btrfs returns
VM_FAULT_* values directly and never calls block_page_mkwrite_return?  I
know dsterba acked this, but I cannot see how this is correct?

--D

> - }
> + zero_start = ret2;
>   wait_on_page_writeback(page);
>  
>   lock_extent_bits(io_tree, page_start, page_end, _state);
> @@ -9043,6 +9041,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   goto again;
>   }
>  
> + size = i_size_read(inode);
>   if (page->index == ((size - 1) >> PAGE_SHIFT)) {
>   reserved_space = round_up(size - page_start,
> fs_info->sectorsize);
> @@ -9075,12 +9074,6 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
>   }
>   ret2 = 0;
>  
> - /* page is wholly or partially inside EOF */
> - if (page_start + PAGE_SIZE > size)
> - zero_start = offset_in_page(size);
> - else
> - zero_start = PAGE_SIZE;
> -
>   if (zero_start != PAGE_SIZE) {
>   kaddr = kmap(page);
>   memset(kaddr + zero_start, 0, PAGE_SIZE - zero_start);
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d8c7242426bb..53aabde57ca7 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2499,23 +2499,13 @@ int block_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf,
>   struct page *page = vmf->page;
>   struct inode *inode = file_inode(vma->vm_file);
>   unsigned long end;
> - loff_t size;
>   int ret;
>  
>   lock_page(page);
> - size = i_size_read(inode);
> - if ((page->mapping != inode->i_mapping) ||
> - (page_offset(page) > size)) {
> - /* We overload EFAULT to mean page got truncated */
> - ret = -EFAULT;
> + ret = page_mkwrite_check_truncate(page, inode);
> + if (ret < 0)
>   goto out_unlock;
> - }
> -
> - /* page is wholly or partially inside EOF */
> - if (((page->index + 1) << PAGE_SHIFT) > size)
> - end = size & ~PAGE_MASK;
> - else
> - end = PAGE_SIZE;
> + end = ret;
>  
>   ret = __block_write_begin(page, 0, end, get_block);
>   if (!ret)
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 7ab616601141..ef958aa4adb4 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1575,7 +1575,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault 
> *vmf)
>   do {
>   lock_page(page);
>  
> - if ((off > size) || (page->mapping != inode->i_mapping)) {
> + if (page_mkwrite_check_truncate(page, inode) < 0) {
>   unlock_page(page);
>   ret = VM_FAULT_NOPAGE;
>   break;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 28f28de0c1b6..51ab1d2cac80 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5871,7 +5871,6 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page =

Re: [f2fs-dev] [PATCH] fs: introduce is_dot_dotdot helper for cleanup

2019-12-02 Thread Darrick J. Wong

On Tue, Dec 03, 2019 at 10:07:41AM +0800, Tiezhu Yang wrote:
> On 12/03/2019 04:03 AM, Matthew Wilcox wrote:
> > On Mon, Dec 02, 2019 at 06:10:13PM +0800, Tiezhu Yang wrote:
> > > There exists many similar and duplicate codes to check "." and "..",
> > > so introduce is_dot_dotdot helper to make the code more clean.
> > The idea is good.  The implementation is, I'm afraid, badly chosen.
> > Did you benchmark this change at all?  In general, you should prefer the
> > core kernel implementation to that of some less-interesting filesystems.
> > I measured the performance with the attached test program on my laptop
> > (Core-i7 Kaby Lake):
> > 
> > qstr . time_1 0.020531 time_2 0.005786
> > qstr .. time_1 0.017892 time_2 0.008798
> > qstr a time_1 0.017633 time_2 0.003634
> > qstr matthew time_1 0.011820 time_2 0.003605
> > qstr .a time_1 0.017909 time_2 0.008710
> > qstr , time_1 0.017631 time_2 0.003619
> > 
> > The results are quite stable:
> > 
> > qstr . time_1 0.021137 time_2 0.005780
> > qstr .. time_1 0.017964 time_2 0.008675
> > qstr a time_1 0.017899 time_2 0.003654
> > qstr matthew time_1 0.011821 time_2 0.003620
> > qstr .a time_1 0.017889 time_2 0.008662
> > qstr , time_1 0.017764 time_2 0.003613
> > 
> > Feel free to suggest some different strings we could use for testing.
> > These seemed like interesting strings to test with.  It's always possible
> > I've messed up something with this benchmark that causes it to not
> > accurately represent the performance of each algorithm, so please check
> > that too.
> 
> [Sorry to resend this email because the mail list server
> was denied due to it is not plain text.]
> 
> Hi Matthew,
> 
> Thanks for your reply and suggestion. I measured the
> performance with the test program, the following
> implementation is better for various of test cases:
> 
> bool is_dot_dotdot(const struct qstr *str)
> {
> if (unlikely(str->name[0] == '.')) {
> if (str->len < 2 || (str->len == 2 && str->name[1] == '.'))
> return true;
> }
> 
> return false;
> }
> 
> I will send a v2 patch used with this implementation.

Can you make it a static inline since it's such a short function?

--D
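
A minimal userspace sketch of what a static inline version might look
like.  struct qstr_sketch is a simplified stand-in for the kernel's
struct qstr, name_is_dot_dotdot() is a hypothetical wrapper added for
illustration, and the explicit zero-length check is an addition not
present in the quoted version.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct qstr (name plus length). */
struct qstr_sketch {
	const char *name;
	unsigned int len;
};

/* Header-style helper: true only for the names "." and "..". */
static inline bool is_dot_dotdot(const struct qstr_sketch *str)
{
	if (str->len == 0 || str->name[0] != '.')
		return false;
	return str->len == 1 || (str->len == 2 && str->name[1] == '.');
}

/* Convenience wrapper for NUL-terminated strings, for illustration only. */
static inline bool name_is_dot_dotdot(const char *name)
{
	struct qstr_sketch q = {
		.name = name,
		.len = (unsigned int)strlen(name),
	};

	return is_dot_dotdot(&q);
}
```

Defined this way in a header, the helper needs no EXPORT_SYMBOL and the
compiler can fold the two comparisons into each caller.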

> Thanks,
> 
> Tiezhu Yang
> 
> > 
> > > +bool is_dot_dotdot(const struct qstr *str)
> > > +{
> > > + if (str->len == 1 && str->name[0] == '.')
> > > + return true;
> > > +
> > > + if (str->len == 2 && str->name[0] == '.' && str->name[1] == '.')
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +EXPORT_SYMBOL(is_dot_dotdot);
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 2dda552..7730a3b 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2458,10 +2458,8 @@ static int lookup_one_len_common(const char *name, 
> > > struct dentry *base,
> > >   if (!len)
> > >   return -EACCES;
> > > - if (unlikely(name[0] == '.')) {
> > > - if (len < 2 || (len == 2 && name[1] == '.'))
> > > - return -EACCES;
> > > - }
> > > + if (unlikely(is_dot_dotdot(this)))
> > > + return -EACCES;
> > >   while (len--) {
> > >   unsigned int c = *(const unsigned char *)name++;
> 



Re: [f2fs-dev] [PATCH] fs: Fix page_mkwrite off-by-one errors

2019-11-27 Thread Darrick J. Wong

On Wed, Nov 27, 2019 at 04:18:11PM +0100, Andreas Gruenbacher wrote:
> Fix a check in block_page_mkwrite meant to determine whether an offset
> is within the inode size.  This error has spread to several filesystems
> and to iomap_page_mkwrite, so fix those instances as well.

Seeing how this has gotten screwed up at least six times in the kernel,
maybe we need a static inline helper to do this for us?
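
A rough userspace sketch of such a helper, assuming 4 KiB pages.  The
name mkwrite_bytes_in_page() and the SKETCH_* constants are hypothetical,
and the page->mapping check is left out because it needs kernel types;
only the size comparison is modelled.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1 << SKETCH_PAGE_SHIFT)

/*
 * Returns the number of bytes of page `index` that lie below `size`
 * (1..SKETCH_PAGE_SIZE), or -EFAULT if the page starts at or beyond EOF.
 */
static inline int mkwrite_bytes_in_page(uint64_t index, int64_t size)
{
	int64_t page_start = (int64_t)(index << SKETCH_PAGE_SHIFT);

	/* ">=" rather than ">": a page starting exactly at EOF holds no data. */
	if (page_start >= size)
		return -EFAULT;
	if (size - page_start >= SKETCH_PAGE_SIZE)
		return SKETCH_PAGE_SIZE;	/* page wholly inside EOF */
	return (int)(size - page_start);	/* page straddles EOF */
}
```

With a 4096-byte file, page index 1 starts at offset 4096; the old ">"
comparison would accept it, while the sketch reports -EFAULT.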

> Signed-off-by: Andreas Gruenbacher 

The iomap part looks ok,
Reviewed-by: Darrick J. Wong 

(I might just extract the iomap part and put it in the iomap tree if
someone doesn't merge this one before I get to it...)

--D

> 
> ---
> 
> This patch has a trivial conflict with commit "iomap: Fix overflow in
> iomap_page_mkwrite" in Darrick's iomap pull request for 5.5:
> 
>   https://lore.kernel.org/lkml/20191125190907.GN6219@magnolia/
> ---
>  fs/buffer.c| 2 +-
>  fs/ceph/addr.c | 2 +-
>  fs/ext4/inode.c| 2 +-
>  fs/f2fs/file.c | 2 +-
>  fs/iomap/buffered-io.c | 2 +-
>  fs/ubifs/file.c| 2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 86a38b979323..152d391858d4 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2465,7 +2465,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, 
> struct vm_fault *vmf,
>   lock_page(page);
>   size = i_size_read(inode);
>   if ((page->mapping != inode->i_mapping) ||
> - (page_offset(page) > size)) {
> + (page_offset(page) >= size)) {
>   /* We overload EFAULT to mean page got truncated */
>   ret = -EFAULT;
>   goto out_unlock;
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 7ab616601141..9fa0729ece41 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1575,7 +1575,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault 
> *vmf)
>   do {
>   lock_page(page);
>  
> - if ((off > size) || (page->mapping != inode->i_mapping)) {
> + if ((off >= size) || (page->mapping != inode->i_mapping)) {
>   unlock_page(page);
>   ret = VM_FAULT_NOPAGE;
>   break;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 516faa280ced..6dd4efe2fb63 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -6224,7 +6224,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>   lock_page(page);
>   size = i_size_read(inode);
>   /* Page got truncated from under us? */
> - if (page->mapping != mapping || page_offset(page) > size) {
> + if (page->mapping != mapping || page_offset(page) >= size) {
>   unlock_page(page);
>   ret = VM_FAULT_NOPAGE;
>   goto out;
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 29bc0a542759..3436be01af45 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -71,7 +71,7 @@ static vm_fault_t f2fs_vm_page_mkwrite(struct vm_fault *vmf)
>   down_read(_I(inode)->i_mmap_sem);
>   lock_page(page);
>   if (unlikely(page->mapping != inode->i_mapping ||
> - page_offset(page) > i_size_read(inode) ||
> + page_offset(page) >= i_size_read(inode) ||
>   !PageUptodate(page))) {
>   unlock_page(page);
>   err = -EFAULT;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e25901ae3ff4..d454dbab5133 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1041,7 +1041,7 @@ vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf, 
> const struct iomap_ops *ops)
>   lock_page(page);
>   size = i_size_read(inode);
>   if ((page->mapping != inode->i_mapping) ||
> - (page_offset(page) > size)) {
> + (page_offset(page) >= size)) {
>   /* We overload EFAULT to mean page got truncated */
>   ret = -EFAULT;
>   goto out_unlock;
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index cd52585c8f4f..ca0148ec77e6 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1564,7 +1564,7 @@ static vm_fault_t ubifs_vm_page_mkwrite(struct vm_fault 
> *vmf)
>  
>   lock_page(page);
>   if (unlikely(page->mapping != inode->i_mapping ||
> -  page_offset(page) > i_size_read(inode))) {
> +  page_offset(page) >= i_size_read(inode))) {
>   /* Page got truncated out from underneath us */
>   goto sigbus;
>   }
> -- 
> 2.20.1
> 



Re: [f2fs-dev] [PATCH] fscrypt: support passing a keyring key to FS_IOC_ADD_ENCRYPTION_KEY

2019-11-17 Thread Darrick J. Wong

On Fri, Nov 15, 2019 at 07:01:39PM -0500, Theodore Y. Ts'o wrote:
> On Sat, Nov 16, 2019 at 12:53:19AM +0200, Jarkko Sakkinen wrote:
> > > I'm working on an xfstest for this:
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/xfstests-dev.git/commit/?h=fscrypt-provisioning=24ab6abb7cf6a80be44b7c72b73f0519ccaa5a97
> > > 
> > > It's not quite ready, though.  I'll post it for review when it is.
> > > 
> > > > Someone is also planning to update Android userspace to use this.  So if there
> > > > are any issues from that, I'll hear about it.
> > 
> > Cool. Can you combine this patch and matching test (once it is done) to
> > a patch set?
> 
> That's generally not done since the test goes to a different repo
> (xfstests.git) which has a different review process from the kernel
> change.

FWIW I generally send one series per git tree (kernel, *progs, fstests)
one right after another so that they'll all land more or less together
in everybody's inboxes.

--D

>   - Ted

