Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir

2023-08-29 Thread Matthew Wilcox
On Tue, Aug 29, 2023 at 03:41:43PM +0800, Hao Xu wrote:
> On 8/28/23 04:44, Matthew Wilcox wrote:
> > > @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
> > >   bp = NULL;
> > >   }
> > > - if (*lock_mode == 0)
> > > - *lock_mode = xfs_ilock_data_map_shared(dp);
> > > + if (*lock_mode == 0) {
> > > + *lock_mode =
> > > + xfs_ilock_data_map_shared_generic(dp,
> > > + ctx->flags & DIR_CONTEXT_F_NOWAIT);
> > > + if (!*lock_mode) {
> > > + error = -EAGAIN;
> > > + break;
> > > + }
> > > + }
> > 
> > 'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
> > And this is far too far indented.
> > 
> > xfs_dir2_lock(dp, ctx, lock_mode);
> > 
> > with:
> > 
> > STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
> > 		unsigned int *lock_mode)
> > {
> > 	if (*lock_mode)
> > 		return;
> > 	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> > 		*lock_mode = xfs_ilock_data_map_shared_nowait(dp);
> > 	else
> > 		*lock_mode = xfs_ilock_data_map_shared(dp);
> > }
> > 
> > ... which I think you can use elsewhere in this patch (reformat it to
> > XFS coding style, of course).  And then you don't need
> > xfs_ilock_data_map_shared_generic().
> 
> How about renaming xfs_ilock_data_map_shared() to xfs_ilock_data_map_block()
> and renaming xfs_ilock_data_map_shared_generic() to
> xfs_ilock_data_map_shared()?
> 
> STATIC void xfs_ilock_data_map_shared(struct xfs_inode *dp,
> 		struct dir_context *ctx, unsigned int *lock_mode)
> {
> 	if (*lock_mode)
> 		return;
> 	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
> 		*lock_mode = xfs_ilock_data_map_shared_nowait(dp);
> 	else
> 		*lock_mode = xfs_ilock_data_map_shared_block(dp);
> }

xfs_ilock_data_map_shared() is used for a lot of things which are not
directories.  I think a new function name is appropriate, and that
function name should include the word 'dir' in it somewhere.
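For illustration, a directory-specific wrapper might look something like
this (only a sketch; the name is a placeholder, and it assumes the
xfs_ilock_data_map_shared_nowait() helper this series introduces):

STATIC void
xfs_dir2_ilock_data_map_shared(
	struct xfs_inode	*dp,
	struct dir_context	*ctx,
	unsigned int		*lock_mode)
{
	/* Already locked by a previous iteration of the caller's loop. */
	if (*lock_mode)
		return;
	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		*lock_mode = xfs_ilock_data_map_shared_nowait(dp);
	else
		*lock_mode = xfs_ilock_data_map_shared(dp);
}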



Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-08-29 Thread Matthew Wilcox
On Tue, Aug 29, 2023 at 03:46:13PM +0800, Hao Xu wrote:
> On 8/28/23 05:32, Matthew Wilcox wrote:
> > On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> > > From: Hao Xu 
> > > 
> > > Add a boolean parameter for file_accessed() to support nowait semantics.
> > > Currently it is true only with io_uring as its initial caller.
> > 
> > So why do we need to do this as part of this series?  Apparently it
> > hasn't caused any problems for filemap_read().
> > 
> 
> We need this parameter to indicate whether nowait semantics should be
> enforced in touch_atime(); there are locks and possibly I/O in it.

That's not my point.  We currently call file_accessed() and
touch_atime() for nowait reads and nowait writes.  You haven't done
anything to fix those.

I suspect you can trim this patchset down significantly by avoiding
fixing the file_accessed() problem.  And then come back with a later
patchset that fixes it for all nowait i/o.  Or do a separate prep series
first that fixes it for the existing nowait users, and then a second
series to do all the directory stuff.

I'd do the first thing.  Just ignore the problem.  Directory atime
updates cause I/O so rarely that you can afford to ignore it.  Almost
everyone uses relatime or nodiratime.



Re: [Cluster-devel] [PATCH 07/11] vfs: add nowait parameter for file_accessed()

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:31PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> Add a boolean parameter for file_accessed() to support nowait semantics.
> Currently it is true only with io_uring as its initial caller.

So why do we need to do this as part of this series?  Apparently it
hasn't caused any problems for filemap_read().

> +++ b/mm/filemap.c
> @@ -2723,7 +2723,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct 
> iov_iter *iter,
>   folio_batch_init(&fbatch);
>   } while (iov_iter_count(iter) && iocb->ki_pos < isize && !error);
>  
> - file_accessed(filp);
> + file_accessed(filp, false);
>  
>   return already_read ? already_read : error;
>  }
> @@ -2809,7 +2809,7 @@ generic_file_read_iter(struct kiocb *iocb, struct 
> iov_iter *iter)
>   retval = kiocb_write_and_wait(iocb, count);
>   if (retval < 0)
>   return retval;
> - file_accessed(file);
> + file_accessed(file, false);
>  
>   retval = mapping->a_ops->direct_IO(iocb, iter);
>   if (retval >= 0) {
> @@ -2978,7 +2978,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t 
> *ppos,
>  
>  out:
>   folio_batch_release(&fbatch);
> - file_accessed(in);
> + file_accessed(in, false);
>  
>   return total_spliced ? total_spliced : error;
>  }



Re: [Cluster-devel] [PATCH 09/11] vfs: error out -EAGAIN if atime needs to be updated

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:33PM +0800, Hao Xu wrote:
> From: Hao Xu 
> 
> To enforce nowait semantics, error out -EAGAIN if atime needs to be
> updated.

Squash this into patch 6.  Otherwise patch 6 makes no sense.



Re: [Cluster-devel] [PATCH 04/11] vfs: add a vfs helper for io_uring file pos lock

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:28PM +0800, Hao Xu wrote:
> +++ b/include/linux/file.h
> @@ -81,6 +81,8 @@ static inline void fdput_pos(struct fd f)
>   fdput(f);
>  }
>  
> +extern int file_pos_lock_nowait(struct file *file, bool nowait);

No extern on function declarations.
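I.e. just:

	int file_pos_lock_nowait(struct file *file, bool nowait);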



Re: [Cluster-devel] [PATCH 02/11] xfs: add NOWAIT semantics for readdir

2023-08-27 Thread Matthew Wilcox
On Sun, Aug 27, 2023 at 09:28:26PM +0800, Hao Xu wrote:
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2643,16 +2643,32 @@ xfs_da_read_buf(
>   struct xfs_buf_map  map, *mapp = &map;
>   int nmap = 1;
>   int error;
> + int buf_flags = 0;
>  
>   *bpp = NULL;
>   error = xfs_dabuf_map(dp, bno, flags, whichfork, &mapp, &nmap);
>   if (error || !nmap)
>   goto out_free;
>  
> + /*
> +  * NOWAIT semantics mean we don't wait on the buffer lock nor do we
> +  * issue IO for this buffer if it is not already in memory. Caller will
> +  * retry. This will return -EAGAIN if the buffer is in memory and cannot
> +  * be locked, and no buffer and no error if it isn't in memory.  We
> +  * translate both of those into a return state of -EAGAIN and *bpp =
> +  * NULL.
> +  */

I would not include this comment.

> + if (flags & XFS_DABUF_NOWAIT)
> + buf_flags |= XBF_TRYLOCK | XBF_INCORE;
>   error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap, 0,
>   &bp, ops);

What testing did you do with this?  Because you don't actually _use_
buf_flags anywhere in this patch (presumably it should be the
sixth argument to xfs_trans_read_buf_map() instead of 0).  So I can only
conclude that either you didn't test, or your testing was inadequate.
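Presumably the intent was something along these lines (an untested
sketch, simply moving buf_flags into the flags slot of the quoted call):

	if (flags & XFS_DABUF_NOWAIT)
		buf_flags |= XBF_TRYLOCK | XBF_INCORE;
	error = xfs_trans_read_buf_map(mp, tp, mp->m_ddev_targp, mapp, nmap,
			buf_flags, &bp, ops);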

>   if (error)
>   goto out_free;
> + if (!bp) {
> + ASSERT(flags & XFS_DABUF_NOWAIT);

I don't think this ASSERT is appropriate.

> @@ -391,10 +401,17 @@ xfs_dir2_leaf_getdents(
>   bp = NULL;
>   }
>  
> - if (*lock_mode == 0)
> - *lock_mode = xfs_ilock_data_map_shared(dp);
> + if (*lock_mode == 0) {
> + *lock_mode =
> + xfs_ilock_data_map_shared_generic(dp,
> + ctx->flags & DIR_CONTEXT_F_NOWAIT);
> + if (!*lock_mode) {
> + error = -EAGAIN;
> + break;
> + }
> + }

'generic' doesn't seem like a great suffix to mean 'takes nowait flag'.
And this is far too far indented.

xfs_dir2_lock(dp, ctx, lock_mode);

with:

STATIC void xfs_dir2_lock(struct xfs_inode *dp, struct dir_context *ctx,
		unsigned int *lock_mode)
{
	if (*lock_mode)
		return;
	if (ctx->flags & DIR_CONTEXT_F_NOWAIT)
		*lock_mode = xfs_ilock_data_map_shared_nowait(dp);
	else
		*lock_mode = xfs_ilock_data_map_shared(dp);
}

... which I think you can use elsewhere in this patch (reformat it to
XFS coding style, of course).  And then you don't need
xfs_ilock_data_map_shared_generic().



Re: [Cluster-devel] [PATCH 22/29] xfs: comment page allocation for nowait case in xfs_buf_find_insert()

2023-08-25 Thread Matthew Wilcox
On Fri, Aug 25, 2023 at 09:54:24PM +0800, Hao Xu wrote:
> @@ -633,6 +633,8 @@ xfs_buf_find_insert(
>* allocate the memory from the heap to minimise memory usage. If we
>* can't get heap memory for these small buffers, we fall back to using
>* the page allocator.
> +  * xfs_buf_alloc_kmem may return -EAGAIN, let's not return it but turn
> +  * to page allocator as well.

This new sentence seems like it says exactly the same thing as the
previous sentence.  What am I missing?



Re: [Cluster-devel] [PATCH 12/29] xfs: enforce GFP_NOIO implicitly during nowait time update

2023-08-25 Thread Matthew Wilcox
On Fri, Aug 25, 2023 at 09:54:14PM +0800, Hao Xu wrote:
> +++ b/fs/xfs/xfs_iops.c
> @@ -1037,6 +1037,8 @@ xfs_vn_update_time(
>   int log_flags = XFS_ILOG_TIMESTAMP;
>   struct xfs_trans*tp;
>   int error;
> + int old_pflags;
> + bool nowait = flags & S_NOWAIT;
>  
>   trace_xfs_update_time(ip);
>  
> @@ -1049,13 +1051,18 @@ xfs_vn_update_time(
>   log_flags |= XFS_ILOG_CORE;
>   }
>  
> + if (nowait)
> + old_pflags = memalloc_noio_save();
> +
>   error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);

This is an abuse of the memalloc_noio_save() interface.  You shouldn't
be setting it around individual allocations; it's the part of the kernel
which decides "I can't afford to do I/O" that should be setting it.
In this case, it should probably be set by io_uring, way way way up at
the top.
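For comparison, the scoped API is meant to bracket a whole region that
must not recurse into I/O, roughly like this (a sketch; ret, req and
do_the_nowait_operation() are made-up placeholders for wherever io_uring
decides it cannot block):

	unsigned int noio_flags;

	noio_flags = memalloc_noio_save();
	/* every allocation in this region is implicitly GFP_NOIO */
	ret = do_the_nowait_operation(req);
	memalloc_noio_restore(noio_flags);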

But Jens didn't actually answer my question about that:

https://lore.kernel.org/all/zmhzh2eypmh1w...@casper.infradead.org/



Re: [Cluster-devel] gfs2 write bandwidth regression on 6.4-rc3 compare to 5.15.y

2023-07-10 Thread Matthew Wilcox
On Mon, Jul 10, 2023 at 03:19:54PM +0200, Andreas Gruenbacher wrote:
> Hi Wang Yugui,
> 
> On Sun, May 28, 2023 at 5:53 PM Wang Yugui  wrote:
> > Hi,
> >
> > > Hi,
> > >
> > > gfs2 write bandwidth regression on 6.4-rc3 compare to 5.15.y.
> > >
> > > we added linux-xfs@ and linux-fsdevel@ because of a related problem[1]
> > > and related patches[2].
> > >
> > > we compared 6.4-rc3 (rather than 6.1.y) to 5.15.y because some related
> > > patches[2] currently only work on 6.4.
> > >
> > > [1] 
> > > https://lore.kernel.org/linux-xfs/20230508172406.1cf3.40950...@e16-tech.com/
> > > [2] 
> > > https://lore.kernel.org/linux-xfs/20230520163603.1794256-1-wi...@infradead.org/
> > >
> > >
> > > test case:
> > > 1) PCIe3 SSD *4 with LVM
> > > 2) gfs2 lock_nolock
> > > gfs2 attr(T) GFS2_AF_ORLOV
> > ># chattr +T /mnt/test
> > > 3) fio
> > > fio --name=global --rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
> > > -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=1 \
> > >   -name write-bandwidth-1 -filename=/mnt/test/sub1/1.txt \
> > >   -name write-bandwidth-2 -filename=/mnt/test/sub2/1.txt \
> > >   -name write-bandwidth-3 -filename=/mnt/test/sub3/1.txt \
> > >   -name write-bandwidth-4 -filename=/mnt/test/sub4/1.txt
> > > 4) patches[2] are applied to 6.4-rc3.
> > >
> > >
> > > 5.15.y result
> > >   fio WRITE: bw=5139MiB/s (5389MB/s),
> > > 6.4-rc3 result
> > >   fio  WRITE: bw=2599MiB/s (2725MB/s)
> >
> > more test result:
> >
> > 5.17.0  WRITE: bw=4988MiB/s (5231MB/s)
> > 5.18.0  WRITE: bw=5165MiB/s (5416MB/s)
> > 5.19.0  WRITE: bw=5511MiB/s (5779MB/s)
> > 6.0.5   WRITE: bw=3055MiB/s (3203MB/s), WRITE: bw=3225MiB/s (3382MB/s)
> > 6.1.30  WRITE: bw=2579MiB/s (2705MB/s)
> >
> > so this regression happened in some code introduced in 6.0,
> > and maybe there is some minor regression in 6.1 too?
> 
> thanks for this bug report. Bob has noticed a similar looking
> performance regression recently, and it turned out that commit
> e1fa9ea85ce8 ("gfs2: Stop using glock holder auto-demotion for now")
> inadvertently caused buffered writes to fall back to writing single
> pages instead of multiple pages at once. That patch was added in
> v5.18, so it doesn't perfectly align with the regression history
> you're reporting, but maybe there's something else going on that we're
> not aware of.

Dave gave a good explanation of the problem here:

https://lore.kernel.org/linux-xfs/zkybxcxzmui1t...@dread.disaster.area/

It's a pagecache locking contention problem rather than an individual
filesystem problem.

... are you interested in supporting large folios in gfs2?  ;-)
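For reference, the opt-in itself is a single call when the inode's
address_space is initialised, something like the line below; the real
work is auditing gfs2's address_space operations for PAGE_SIZE
assumptions (a sketch, not a claim that gfs2 is otherwise ready):

	mapping_set_large_folios(inode->i_mapping);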



[Cluster-devel] [PATCH v3 04/14] buffer: Convert __block_write_full_page() to __block_write_full_folio()

2023-06-12 Thread Matthew Wilcox (Oracle)
Remove nine hidden calls to compound_head() by using a folio instead
of a page.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/buffer.c | 53 +++--
 fs/gfs2/aops.c  |  5 ++--
 fs/ntfs/aops.c  |  2 +-
 fs/reiserfs/inode.c |  2 +-
 include/linux/buffer_head.h |  2 +-
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1..4d518df50fab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1764,7 +1764,7 @@ static struct buffer_head *folio_create_buffers(struct 
folio *folio,
  * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
-int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_folio(struct inode *inode, struct folio *folio,
get_block_t *get_block, struct writeback_control *wbc,
bh_end_io_t *handler)
 {
@@ -1776,14 +1776,14 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
-   head = folio_create_buffers(page_folio(page), inode,
+   head = folio_create_buffers(folio, inode,
(1 << BH_Dirty) | (1 << BH_Uptodate));
 
/*
 * Be very careful.  We have no exclusion from block_dirty_folio
 * here, and the (potentially unmapped) buffers may become dirty at
 * any time.  If a buffer becomes dirty here after we've inspected it
-* then we just miss that fact, and the page stays dirty.
+* then we just miss that fact, and the folio stays dirty.
 *
 * Buffers outside i_size may be dirtied by block_dirty_folio;
 * handle that here by just cleaning them.
@@ -1793,7 +1793,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
blocksize = bh->b_size;
bbits = block_size_bits(blocksize);
 
-   block = (sector_t)page->index << (PAGE_SHIFT - bbits);
+   block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
last_block = (i_size_read(inode) - 1) >> bbits;
 
/*
@@ -1804,7 +1804,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (block > last_block) {
/*
 * mapped buffers outside i_size will occur, because
-* this page can be outside i_size when there is a
+* this folio can be outside i_size when there is a
 * truncate in progress.
 */
/*
@@ -1834,7 +1834,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
continue;
/*
 * If it's a fully non-blocking write attempt and we cannot
-* lock the buffer then redirty the page.  Note that this can
+* lock the buffer then redirty the folio.  Note that this can
 * potentially cause a busy-wait loop from writeback threads
 * and kswapd activity, but those code paths have their own
 * higher-level throttling.
@@ -1842,7 +1842,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (wbc->sync_mode != WB_SYNC_NONE) {
lock_buffer(bh);
} else if (!trylock_buffer(bh)) {
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
continue;
}
if (test_clear_buffer_dirty(bh)) {
@@ -1853,11 +1853,11 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
} while ((bh = bh->b_this_page) != head);
 
/*
-* The page and its buffers are protected by PageWriteback(), so we can
-* drop the bh refcounts early.
+* The folio and its buffers are protected by the writeback flag,
+* so we can drop the bh refcounts early.
 */
-   BUG_ON(PageWriteback(page));
-   set_page_writeback(page);
+   BUG_ON(folio_test_writeback(folio));
+   folio_start_writeback(folio);
 
do {
struct buffer_head *next = bh->b_this_page;
@@ -1867,20 +1867,20 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
}
bh = next;
} while (bh != head);
-   unlock_page(page);
+   folio_unlock(folio);
 
err = 0;
 done:
if (nr_underway == 0) {
/*
-* The page was marked dirty, but the buffers were
+* The folio was marked dirty, but the buffers were
   

[Cluster-devel] [PATCH v3 05/14] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-06-12 Thread Matthew Wilcox (Oracle)
We may someday support folios larger than 4GB, so use a size_t for
the byte count within a folio to prevent unpleasant truncations.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 6 +++---
 fs/gfs2/aops.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 3a2be1901e1e..1c407eba1e30 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -38,13 +38,13 @@
 
 
 void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-unsigned int from, unsigned int len)
+size_t from, size_t len)
 {
struct buffer_head *head = folio_buffers(folio);
unsigned int bsize = head->b_size;
struct buffer_head *bh;
-   unsigned int to = from + len;
-   unsigned int start, end;
+   size_t to = from + len;
+   size_t start, end;
 
for (bh = head, start = 0; bh != head || !start;
 bh = bh->b_this_page, start = end) {
diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
index 09db1914425e..f08322ef41cf 100644
--- a/fs/gfs2/aops.h
+++ b/fs/gfs2/aops.h
@@ -10,6 +10,6 @@
 
 extern void adjust_fs_space(struct inode *inode);
 extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-   unsigned int from, unsigned int len);
+   size_t from, size_t len);
 
 #endif /* __AOPS_DOT_H__ */
-- 
2.39.2



[Cluster-devel] [PATCH v3 08/14] buffer: Convert __block_commit_write() to take a folio

2023-06-12 Thread Matthew Wilcox (Oracle)
This removes a hidden call to compound_head() inside
__block_commit_write() and moves it to those callers which are still
page based.  Also make block_write_end() safe for large folios.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 0af167e8a9c6..97c64b05151f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2116,15 +2116,15 @@ int __block_write_begin(struct page *page, loff_t pos, 
unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
-static int __block_commit_write(struct inode *inode, struct page *page,
-   unsigned from, unsigned to)
+static int __block_commit_write(struct inode *inode, struct folio *folio,
+   size_t from, size_t to)
 {
-   unsigned block_start, block_end;
-   int partial = 0;
+   size_t block_start, block_end;
+   bool partial = false;
unsigned blocksize;
struct buffer_head *bh, *head;
 
-   bh = head = page_buffers(page);
+   bh = head = folio_buffers(folio);
blocksize = bh->b_size;
 
block_start = 0;
@@ -2132,7 +2132,7 @@ static int __block_commit_write(struct inode *inode, 
struct page *page,
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
if (!buffer_uptodate(bh))
-   partial = 1;
+   partial = true;
} else {
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
@@ -2147,11 +2147,11 @@ static int __block_commit_write(struct inode *inode, 
struct page *page,
/*
 * If this is a partial write which happened to make all buffers
 * uptodate then we can optimize away a bogus read_folio() for
-* the next read(). Here we 'discover' whether the page went
+* the next read(). Here we 'discover' whether the folio went
 * uptodate as a result of this (potentially partial) write.
 */
if (!partial)
-   SetPageUptodate(page);
+   folio_mark_uptodate(folio);
return 0;
 }
 
@@ -2188,10 +2188,9 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = mapping->host;
-   unsigned start;
-
-   start = pos & (PAGE_SIZE - 1);
+   size_t start = pos - folio_pos(folio);
 
if (unlikely(copied < len)) {
/*
@@ -2203,18 +2202,18 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
 * read_folio might come in and destroy our partial write.
 *
 * Do the simplest thing, and just treat any short write to a
-* non uptodate page as a zero-length write, and force the
+* non uptodate folio as a zero-length write, and force the
 * caller to redo the whole thing.
 */
-   if (!PageUptodate(page))
+   if (!folio_test_uptodate(folio))
copied = 0;
 
-   page_zero_new_buffers(page, start+copied, start+len);
+   page_zero_new_buffers(&folio->page, start+copied, start+len);
}
-   flush_dcache_page(page);
+   flush_dcache_folio(folio);
 
/* This could be a short (even 0-length) commit */
-   __block_commit_write(inode, page, start, start+copied);
+   __block_commit_write(inode, folio, start, start + copied);
 
return copied;
 }
@@ -2537,8 +2536,9 @@ EXPORT_SYMBOL(cont_write_begin);
 
 int block_commit_write(struct page *page, unsigned from, unsigned to)
 {
-   struct inode *inode = page->mapping->host;
-   __block_commit_write(inode,page,from,to);
+   struct folio *folio = page_folio(page);
+   struct inode *inode = folio->mapping->host;
+   __block_commit_write(inode, folio, from, to);
return 0;
 }
 EXPORT_SYMBOL(block_commit_write);
@@ -2586,7 +2586,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
if (!ret)
-   ret = block_commit_write(&folio->page, 0, end);
+   ret = __block_commit_write(inode, folio, 0, end);
 
if (unlikely(ret < 0))
goto out_unlock;
-- 
2.39.2



[Cluster-devel] [PATCH v3 00/14] gfs2/buffer folio changes for 6.5

2023-06-12 Thread Matthew Wilcox (Oracle)
This kind of started off as a gfs2 patch series, then became entwined
with buffer heads once I realised that gfs2 was the only remaining
caller of __block_write_full_page().  For those not in the gfs2 world,
the big point of this series is that block_write_full_page() should now
handle large folios correctly.

Andrew, if you want, I'll drop it into the pagecache tree, or you
can just take it.

v3:
 - Fix a patch title
 - Fix some checks against i_size to be >= instead of >
 - Call folio_mark_dirty() instead of folio_set_dirty()

Matthew Wilcox (Oracle) (14):
  gfs2: Use a folio inside gfs2_jdata_writepage()
  gfs2: Pass a folio to __gfs2_jdata_write_folio()
  gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
  buffer: Convert __block_write_full_page() to
__block_write_full_folio()
  gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()
  buffer: Make block_write_full_page() handle large folios correctly
  buffer: Convert block_page_mkwrite() to use a folio
  buffer: Convert __block_commit_write() to take a folio
  buffer: Convert page_zero_new_buffers() to folio_zero_new_buffers()
  buffer: Convert grow_dev_page() to use a folio
  buffer: Convert init_page_buffers() to folio_init_buffers()
  buffer: Convert link_dev_buffers to take a folio
  buffer: Use a folio in __find_get_block_slow()
  buffer: Convert block_truncate_page() to use a folio

 fs/buffer.c | 257 ++--
 fs/ext4/inode.c |   4 +-
 fs/gfs2/aops.c  |  69 +-
 fs/gfs2/aops.h  |   2 +-
 fs/ntfs/aops.c  |   2 +-
 fs/reiserfs/inode.c |   9 +-
 include/linux/buffer_head.h |   4 +-
 7 files changed, 172 insertions(+), 175 deletions(-)

-- 
2.39.2



[Cluster-devel] [PATCH v3 01/14] gfs2: Use a folio inside gfs2_jdata_writepage()

2023-06-12 Thread Matthew Wilcox (Oracle)
Replace a few implicit calls to compound_head() with one explicit one.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..0518861df783 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -150,20 +150,21 @@ static int __gfs2_jdata_writepage(struct page *page, 
struct writeback_control *w
 
 static int gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = page->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
 
if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
goto out;
-   if (PageChecked(page) || current->journal_info)
+   if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(page, wbc);
+   return __gfs2_jdata_writepage(&folio->page, wbc);
 
 out_ignore:
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
 out:
-   unlock_page(page);
+   folio_unlock(folio);
return 0;
 }
 
-- 
2.39.2



[Cluster-devel] [PATCH v3 02/14] gfs2: Pass a folio to __gfs2_jdata_write_folio()

2023-06-12 Thread Matthew Wilcox (Oracle)
Remove a couple of folio->page conversions in the callers, and two
calls to compound_head() in the function itself.  Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0518861df783..749135252d52 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -113,30 +113,31 @@ static int gfs2_write_jdata_page(struct page *page,
 }
 
 /**
- * __gfs2_jdata_writepage - The core of jdata writepage
- * @page: The page to write
+ * __gfs2_jdata_write_folio - The core of jdata writepage
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is shared between writepage and writepages and implements the
  * core of the writepage operation. If a transaction is required then
- * PageChecked will have been set and the transaction will have
+ * the checked flag will have been set and the transaction will have
  * already been started before this is called.
  */
-
-static int __gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
+static int __gfs2_jdata_write_folio(struct folio *folio,
+   struct writeback_control *wbc)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = folio->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
 
-   if (PageChecked(page)) {
-   ClearPageChecked(page);
-   if (!page_has_buffers(page)) {
-   create_empty_buffers(page, inode->i_sb->s_blocksize,
-BIT(BH_Dirty)|BIT(BH_Uptodate));
+   if (folio_test_checked(folio)) {
+   folio_clear_checked(folio);
+   if (!folio_buffers(folio)) {
+   folio_create_empty_buffers(folio,
+   inode->i_sb->s_blocksize,
+   BIT(BH_Dirty)|BIT(BH_Uptodate));
}
-   gfs2_trans_add_databufs(ip, page_folio(page), 0, PAGE_SIZE);
+   gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(page, wbc);
+   return gfs2_write_jdata_page(&folio->page, wbc);
 }
 
 /**
@@ -159,7 +160,7 @@ static int gfs2_jdata_writepage(struct page *page, struct 
writeback_control *wbc
goto out;
if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(&folio->page, wbc);
+   return __gfs2_jdata_write_folio(folio, wbc);
 
 out_ignore:
folio_redirty_for_writepage(wbc, folio);
@@ -256,7 +257,7 @@ static int gfs2_write_jdata_batch(struct address_space 
*mapping,
 
trace_wbc_writepage(wbc, inode_to_bdi(inode));
 
-   ret = __gfs2_jdata_writepage(&folio->page, wbc);
+   ret = __gfs2_jdata_write_folio(folio, wbc);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
folio_unlock(folio);
-- 
2.39.2



[Cluster-devel] [PATCH v3 07/14] buffer: Convert block_page_mkwrite() to use a folio

2023-06-12 Thread Matthew Wilcox (Oracle)
If any page in a folio is dirtied, dirty the entire folio.  Removes a
number of hidden calls to compound_head() and references to page->mapping
and page->index.  Fixes a pre-existing bug where we could mark a folio
as dirty if the file is truncated to a multiple of the page size just
as we take the page fault.  I don't believe this bug has any bad effect,
it's just inefficient.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 34ecf55d2f12..0af167e8a9c6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2564,38 +2564,37 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 get_block_t get_block)
 {
-   struct page *page = vmf->page;
+   struct folio *folio = page_folio(vmf->page);
struct inode *inode = file_inode(vma->vm_file);
unsigned long end;
loff_t size;
int ret;
 
-   lock_page(page);
+   folio_lock(folio);
size = i_size_read(inode);
-   if ((page->mapping != inode->i_mapping) ||
-   (page_offset(page) > size)) {
+   if ((folio->mapping != inode->i_mapping) ||
+   (folio_pos(folio) >= size)) {
/* We overload EFAULT to mean page got truncated */
ret = -EFAULT;
goto out_unlock;
}
 
-   /* page is wholly or partially inside EOF */
-   if (((page->index + 1) << PAGE_SHIFT) > size)
-   end = size & ~PAGE_MASK;
-   else
-   end = PAGE_SIZE;
+   end = folio_size(folio);
+   /* folio is wholly or partially inside EOF */
+   if (folio_pos(folio) + end > size)
+   end = size - folio_pos(folio);
 
-   ret = __block_write_begin(page, 0, end, get_block);
+   ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
if (!ret)
-   ret = block_commit_write(page, 0, end);
+   ret = block_commit_write(&folio->page, 0, end);
 
if (unlikely(ret < 0))
goto out_unlock;
-   set_page_dirty(page);
-   wait_for_stable_page(page);
+   folio_mark_dirty(folio);
+   folio_wait_stable(folio);
return 0;
 out_unlock:
-   unlock_page(page);
+   folio_unlock(folio);
return ret;
 }
 EXPORT_SYMBOL(block_page_mkwrite);
-- 
2.39.2



[Cluster-devel] [PATCH v3 06/14] buffer: Make block_write_full_page() handle large folios correctly

2023-06-12 Thread Matthew Wilcox (Oracle)
Keep the interface as struct page, but work entirely on the folio
internally.  Removes several PAGE_SIZE assumptions and removes
some references to page->index and page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/buffer.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d518df50fab..34ecf55d2f12 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2678,33 +2678,31 @@ int block_write_full_page(struct page *page, 
get_block_t *get_block,
struct writeback_control *wbc)
 {
struct folio *folio = page_folio(page);
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
-   /* Is the page fully inside i_size? */
-   if (page->index < end_index)
+   /* Is the folio fully inside i_size? */
+   if (folio_pos(folio) + folio_size(folio) <= i_size)
return __block_write_full_folio(inode, folio, get_block, wbc,
   end_buffer_async_write);
 
-   /* Is the page fully outside i_size? (truncate in progress) */
-   offset = i_size & (PAGE_SIZE-1);
-   if (page->index >= end_index+1 || !offset) {
+   /* Is the folio fully outside i_size? (truncate in progress) */
+   if (folio_pos(folio) >= i_size) {
folio_unlock(folio);
return 0; /* don't care */
}
 
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   zero_user_segment(page, offset, PAGE_SIZE);
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
return __block_write_full_folio(inode, folio, get_block, wbc,
-   end_buffer_async_write);
+   end_buffer_async_write);
 }
 EXPORT_SYMBOL(block_write_full_page);
 
-- 
2.39.2



[Cluster-devel] [PATCH v3 13/14] buffer: Use a folio in __find_get_block_slow()

2023-06-12 Thread Matthew Wilcox (Oracle)
Saves a call to compound_head() and may be needed to support
block size > PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4ca2eb2b3dca..c38fdcaa32ff 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -195,19 +195,19 @@ __find_get_block_slow(struct block_device *bdev, sector_t 
block)
pgoff_t index;
struct buffer_head *bh;
struct buffer_head *head;
-   struct page *page;
+   struct folio *folio;
int all_mapped = 1;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
 
index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
-   page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
-   if (!page)
+   folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
+   if (IS_ERR(folio))
goto out;
 
spin_lock(&bd_mapping->private_lock);
-   if (!page_has_buffers(page))
+   head = folio_buffers(folio);
+   if (!head)
goto out_unlock;
-   head = page_buffers(page);
bh = head;
do {
if (!buffer_mapped(bh))
@@ -237,7 +237,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t 
block)
}
 out_unlock:
spin_unlock(&bd_mapping->private_lock);
-   put_page(page);
+   folio_put(folio);
 out:
return ret;
 }
-- 
2.39.2



[Cluster-devel] [PATCH v3 03/14] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-06-12 Thread Matthew Wilcox (Oracle)
Add support for large folios and remove some accesses to page->mapping
and page->index.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 749135252d52..ec5b5c1ea634 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -82,33 +82,33 @@ static int gfs2_get_block_noalloc(struct inode *inode, 
sector_t lblock,
 }
 
 /**
- * gfs2_write_jdata_page - gfs2 jdata-specific version of block_write_full_page
- * @page: The page to write
+ * gfs2_write_jdata_folio - gfs2 jdata-specific version of 
block_write_full_page
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is the same as calling block_write_full_page, but it also
  * writes pages outside of i_size
  */
-static int gfs2_write_jdata_page(struct page *page,
+static int gfs2_write_jdata_folio(struct folio *folio,
 struct writeback_control *wbc)
 {
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   offset = i_size & (PAGE_SIZE - 1);
-   if (page->index == end_index && offset)
-   zero_user_segment(page, offset, PAGE_SIZE);
+   if (folio_pos(folio) < i_size &&
+   i_size < folio_pos(folio) + folio_size(folio))
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
 
-   return __block_write_full_page(inode, page, gfs2_get_block_noalloc, wbc,
+   return __block_write_full_page(inode, &folio->page,
+  gfs2_get_block_noalloc, wbc,
   end_buffer_async_write);
 }
 
@@ -137,7 +137,7 @@ static int __gfs2_jdata_write_folio(struct folio *folio,
}
gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(&folio->page, wbc);
+   return gfs2_write_jdata_folio(folio, wbc);
 }
 
 /**
-- 
2.39.2



[Cluster-devel] [PATCH v3 09/14] buffer: Convert page_zero_new_buffers() to folio_zero_new_buffers()

2023-06-12 Thread Matthew Wilcox (Oracle)
Most of the callers already have a folio; convert reiserfs_write_end()
to have a folio.  Removes a couple of hidden calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 27 ++-
 fs/ext4/inode.c |  4 ++--
 fs/reiserfs/inode.c |  7 ---
 include/linux/buffer_head.h |  2 +-
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 97c64b05151f..e4bd465ecee8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1927,33 +1927,34 @@ int __block_write_full_folio(struct inode *inode, 
struct folio *folio,
 EXPORT_SYMBOL(__block_write_full_folio);
 
 /*
- * If a page has any new buffers, zero them out here, and mark them uptodate
+ * If a folio has any new buffers, zero them out here, and mark them uptodate
  * and dirty so they'll be written out (in order to prevent uninitialised
  * block data from leaking). And clear the new bit.
  */
-void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+void folio_zero_new_buffers(struct folio *folio, size_t from, size_t to)
 {
-   unsigned int block_start, block_end;
+   size_t block_start, block_end;
struct buffer_head *head, *bh;
 
-   BUG_ON(!PageLocked(page));
-   if (!page_has_buffers(page))
+   BUG_ON(!folio_test_locked(folio));
+   head = folio_buffers(folio);
+   if (!head)
return;
 
-   bh = head = page_buffers(page);
+   bh = head;
block_start = 0;
do {
block_end = block_start + bh->b_size;
 
if (buffer_new(bh)) {
if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (!folio_test_uptodate(folio)) {
+   size_t start, xend;
 
start = max(from, block_start);
-   size = min(to, block_end) - start;
+   xend = min(to, block_end);
 
-   zero_user(page, start, size);
+   folio_zero_segment(folio, start, xend);
set_buffer_uptodate(bh);
}
 
@@ -1966,7 +1967,7 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
bh = bh->b_this_page;
} while (bh != head);
 }
-EXPORT_SYMBOL(page_zero_new_buffers);
+EXPORT_SYMBOL(folio_zero_new_buffers);
 
 static void
 iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
@@ -2104,7 +2105,7 @@ int __block_write_begin_int(struct folio *folio, loff_t 
pos, unsigned len,
err = -EIO;
}
if (unlikely(err))
-   page_zero_new_buffers(&folio->page, from, to);
+   folio_zero_new_buffers(folio, from, to);
return err;
 }
 
@@ -2208,7 +2209,7 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
if (!folio_test_uptodate(folio))
copied = 0;
 
-   page_zero_new_buffers(&folio->page, start+copied, start+len);
+   folio_zero_new_buffers(folio, start+copied, start+len);
}
flush_dcache_folio(folio);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 02de439bf1f0..9ca583360166 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1093,7 +1093,7 @@ static int ext4_block_write_begin(struct folio *folio, 
loff_t pos, unsigned len,
err = -EIO;
}
if (unlikely(err)) {
-   page_zero_new_buffers(&folio->page, from, to);
+   folio_zero_new_buffers(folio, from, to);
} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
for (i = 0; i < nr_wait; i++) {
int err2;
@@ -1339,7 +1339,7 @@ static int ext4_write_end(struct file *file,
 }
 
 /*
- * This is a private version of page_zero_new_buffers() which doesn't
+ * This is a private version of folio_zero_new_buffers() which doesn't
  * set the buffer to be dirty, since in data=journalled mode we need
  * to call ext4_dirty_journalled_data() instead.
  */
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ff34ee49106f..77bd3b27059f 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -2872,6 +2872,7 @@ static int reiserfs_write_end(struct file *file, struct 
address_space *mapping,
  loff_t pos, unsigned len, unsigned copied,
  struct page *page, void *fsdata)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = page->mapping->host;
int ret = 0;
int update_sd = 0;
@@ -2887,12 +2888,12 @@ static int reiserfs_write_end(struct file

[Cluster-devel] [PATCH v3 11/14] buffer: Convert init_page_buffers() to folio_init_buffers()

2023-06-12 Thread Matthew Wilcox (Oracle)
Use the folio API and pass the folio from both callers.
Saves a hidden call to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 06d031e28bee..9b9dee417467 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -934,15 +934,14 @@ static sector_t blkdev_max_block(struct block_device 
*bdev, unsigned int size)
 }
 
 /*
- * Initialise the state of a blockdev page's buffers.
+ * Initialise the state of a blockdev folio's buffers.
  */ 
-static sector_t
-init_page_buffers(struct page *page, struct block_device *bdev,
-   sector_t block, int size)
+static sector_t folio_init_buffers(struct folio *folio,
+   struct block_device *bdev, sector_t block, int size)
 {
-   struct buffer_head *head = page_buffers(page);
+   struct buffer_head *head = folio_buffers(folio);
struct buffer_head *bh = head;
-   int uptodate = PageUptodate(page);
+   bool uptodate = folio_test_uptodate(folio);
sector_t end_block = blkdev_max_block(bdev, size);
 
do {
@@ -998,9 +997,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
bh = folio_buffers(folio);
if (bh) {
if (bh->b_size == size) {
-   end_block = init_page_buffers(&folio->page, bdev,
-   (sector_t)index << sizebits,
-   size);
+   end_block = folio_init_buffers(folio, bdev,
+   (sector_t)index << sizebits, size);
goto done;
}
if (!try_to_free_buffers(folio))
@@ -1016,7 +1014,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
spin_lock(&inode->i_mapping->private_lock);
link_dev_buffers(&folio->page, bh);
-   end_block = init_page_buffers(&folio->page, bdev,
+   end_block = folio_init_buffers(folio, bdev,
(sector_t)index << sizebits, size);
spin_unlock(&inode->i_mapping->private_lock);
 done:
-- 
2.39.2



[Cluster-devel] [PATCH v3 14/14] buffer: Convert block_truncate_page() to use a folio

2023-06-12 Thread Matthew Wilcox (Oracle)
Support large folios in block_truncate_page() and avoid three hidden
calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c38fdcaa32ff..5a5b0c9d9769 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2598,17 +2598,16 @@ int block_truncate_page(struct address_space *mapping,
loff_t from, get_block_t *get_block)
 {
pgoff_t index = from >> PAGE_SHIFT;
-   unsigned offset = from & (PAGE_SIZE-1);
unsigned blocksize;
sector_t iblock;
-   unsigned length, pos;
+   size_t offset, length, pos;
struct inode *inode = mapping->host;
-   struct page *page;
+   struct folio *folio;
struct buffer_head *bh;
int err = 0;
 
blocksize = i_blocksize(inode);
-   length = offset & (blocksize - 1);
+   length = from & (blocksize - 1);
 
/* Block boundary? Nothing to do */
if (!length)
@@ -2617,15 +2616,18 @@ int block_truncate_page(struct address_space *mapping,
length = blocksize - length;
iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);

-   page = grab_cache_page(mapping, index);
-   if (!page)
+   folio = filemap_grab_folio(mapping, index);
+   if (!folio)
return -ENOMEM;
 
-   if (!page_has_buffers(page))
-   create_empty_buffers(page, blocksize, 0);
+   bh = folio_buffers(folio);
+   if (!bh) {
+   folio_create_empty_buffers(folio, blocksize, 0);
+   bh = folio_buffers(folio);
+   }
 
/* Find the buffer that contains "offset" */
-   bh = page_buffers(page);
+   offset = offset_in_folio(folio, from);
pos = blocksize;
while (offset >= pos) {
bh = bh->b_this_page;
@@ -2644,7 +2646,7 @@ int block_truncate_page(struct address_space *mapping,
}
 
/* Ok, it's mapped. Make sure it's up-to-date */
-   if (PageUptodate(page))
+   if (folio_test_uptodate(folio))
set_buffer_uptodate(bh);
 
if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) 
{
@@ -2654,12 +2656,12 @@ int block_truncate_page(struct address_space *mapping,
goto unlock;
}
 
-   zero_user(page, offset, length);
+   folio_zero_range(folio, offset, length);
mark_buffer_dirty(bh);
 
 unlock:
-   unlock_page(page);
-   put_page(page);
+   folio_unlock(folio);
+   folio_put(folio);
 
return err;
 }
-- 
2.39.2



[Cluster-devel] [PATCH v3 12/14] buffer: Convert link_dev_buffers to take a folio

2023-06-12 Thread Matthew Wilcox (Oracle)
Its one caller already has a folio, so switch it to use the
folio API.  Removes a hidden call to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9b9dee417467..4ca2eb2b3dca 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -907,8 +907,8 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
 }
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
-static inline void
-link_dev_buffers(struct page *page, struct buffer_head *head)
+static inline void link_dev_buffers(struct folio *folio,
+   struct buffer_head *head)
 {
struct buffer_head *bh, *tail;
 
@@ -918,7 +918,7 @@ link_dev_buffers(struct page *page, struct buffer_head 
*head)
bh = bh->b_this_page;
} while (bh);
tail->b_this_page = head;
-   attach_page_private(page, head);
+   folio_attach_private(folio, head);
 }
 
 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
@@ -1013,7 +1013,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 * run under the folio lock.
 */
spin_lock(&inode->i_mapping->private_lock);
-   link_dev_buffers(&folio->page, bh);
+   link_dev_buffers(folio, bh);
end_block = folio_init_buffers(folio, bdev,
(sector_t)index << sizebits, size);
spin_unlock(&inode->i_mapping->private_lock);
-- 
2.39.2



[Cluster-devel] [PATCH v3 10/14] buffer: Convert grow_dev_page() to use a folio

2023-06-12 Thread Matthew Wilcox (Oracle)
Get a folio from the page cache instead of a page, then use the
folio API throughout.  Removes a few calls to compound_head()
and may be needed to support block size > PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e4bd465ecee8..06d031e28bee 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -976,7 +976,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
  pgoff_t index, int size, int sizebits, gfp_t gfp)
 {
struct inode *inode = bdev->bd_inode;
-   struct page *page;
+   struct folio *folio;
struct buffer_head *bh;
sector_t end_block;
int ret = 0;
@@ -992,42 +992,38 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
gfp_mask |= __GFP_NOFAIL;
 
-   page = find_or_create_page(inode->i_mapping, index, gfp_mask);
-
-   BUG_ON(!PageLocked(page));
+   folio = __filemap_get_folio(inode->i_mapping, index,
+   FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
 
-   if (page_has_buffers(page)) {
-   bh = page_buffers(page);
+   bh = folio_buffers(folio);
+   if (bh) {
if (bh->b_size == size) {
-   end_block = init_page_buffers(page, bdev,
+   end_block = init_page_buffers(&folio->page, bdev,
(sector_t)index << sizebits,
size);
goto done;
}
-   if (!try_to_free_buffers(page_folio(page)))
+   if (!try_to_free_buffers(folio))
goto failed;
}
 
-   /*
-* Allocate some buffers for this page
-*/
-   bh = alloc_page_buffers(page, size, true);
+   bh = folio_alloc_buffers(folio, size, true);
 
/*
-* Link the page to the buffers and initialise them.  Take the
+* Link the folio to the buffers and initialise them.  Take the
 * lock to be atomic wrt __find_get_block(), which does not
-* run under the page lock.
+* run under the folio lock.
 */
spin_lock(&inode->i_mapping->private_lock);
-   link_dev_buffers(page, bh);
-   end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
-   size);
+   link_dev_buffers(&folio->page, bh);
+   end_block = init_page_buffers(&folio->page, bdev,
+   (sector_t)index << sizebits, size);
spin_unlock(&inode->i_mapping->private_lock);
 done:
ret = (block < end_block) ? 1 : -ENXIO;
 failed:
-   unlock_page(page);
-   put_page(page);
+   folio_unlock(folio);
+   folio_put(folio);
return ret;
 }
 
-- 
2.39.2



[Cluster-devel] [PATCH v2 06/14] buffer: Make block_write_full_page() handle large folios correctly

2023-06-06 Thread Matthew Wilcox (Oracle)
Keep the interface as struct page, but work entirely on the folio
internally.  Removes several PAGE_SIZE assumptions and removes
some references to page->index and page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/buffer.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d518df50fab..d8c2c000676b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2678,33 +2678,31 @@ int block_write_full_page(struct page *page, 
get_block_t *get_block,
struct writeback_control *wbc)
 {
struct folio *folio = page_folio(page);
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
-   /* Is the page fully inside i_size? */
-   if (page->index < end_index)
+   /* Is the folio fully inside i_size? */
+   if (folio_pos(folio) + folio_size(folio) <= i_size)
return __block_write_full_folio(inode, folio, get_block, wbc,
   end_buffer_async_write);
 
-   /* Is the page fully outside i_size? (truncate in progress) */
-   offset = i_size & (PAGE_SIZE-1);
-   if (page->index >= end_index+1 || !offset) {
+   /* Is the folio fully outside i_size? (truncate in progress) */
+   if (folio_pos(folio) > i_size) {
folio_unlock(folio);
return 0; /* don't care */
}
 
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   zero_user_segment(page, offset, PAGE_SIZE);
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
return __block_write_full_folio(inode, folio, get_block, wbc,
-   end_buffer_async_write);
+   end_buffer_async_write);
 }
 EXPORT_SYMBOL(block_write_full_page);
 
-- 
2.39.2



[Cluster-devel] [PATCH v2 08/14] buffer: Convert __block_commit_write() to take a folio

2023-06-06 Thread Matthew Wilcox (Oracle)
This removes a hidden call to compound_head() inside
__block_commit_write() and moves it to those callers which are still
page based.  Also make block_write_end() safe for large folios.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 38 +++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index f34ed29b1085..8ea9edd86519 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2116,15 +2116,15 @@ int __block_write_begin(struct page *page, loff_t pos, 
unsigned len,
 }
 EXPORT_SYMBOL(__block_write_begin);
 
-static int __block_commit_write(struct inode *inode, struct page *page,
-   unsigned from, unsigned to)
+static int __block_commit_write(struct inode *inode, struct folio *folio,
+   size_t from, size_t to)
 {
-   unsigned block_start, block_end;
-   int partial = 0;
+   size_t block_start, block_end;
+   bool partial = false;
unsigned blocksize;
struct buffer_head *bh, *head;
 
-   bh = head = page_buffers(page);
+   bh = head = folio_buffers(folio);
blocksize = bh->b_size;
 
block_start = 0;
@@ -2132,7 +2132,7 @@ static int __block_commit_write(struct inode *inode, 
struct page *page,
block_end = block_start + blocksize;
if (block_end <= from || block_start >= to) {
if (!buffer_uptodate(bh))
-   partial = 1;
+   partial = true;
} else {
set_buffer_uptodate(bh);
mark_buffer_dirty(bh);
@@ -2147,11 +2147,11 @@ static int __block_commit_write(struct inode *inode, 
struct page *page,
/*
 * If this is a partial write which happened to make all buffers
 * uptodate then we can optimize away a bogus read_folio() for
-* the next read(). Here we 'discover' whether the page went
+* the next read(). Here we 'discover' whether the folio went
 * uptodate as a result of this (potentially partial) write.
 */
if (!partial)
-   SetPageUptodate(page);
+   folio_mark_uptodate(folio);
return 0;
 }
 
@@ -2188,10 +2188,9 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = mapping->host;
-   unsigned start;
-
-   start = pos & (PAGE_SIZE - 1);
+   size_t start = pos - folio_pos(folio);
 
if (unlikely(copied < len)) {
/*
@@ -2203,18 +2202,18 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
 * read_folio might come in and destroy our partial write.
 *
 * Do the simplest thing, and just treat any short write to a
-* non uptodate page as a zero-length write, and force the
+* non uptodate folio as a zero-length write, and force the
 * caller to redo the whole thing.
 */
-   if (!PageUptodate(page))
+   if (!folio_test_uptodate(folio))
copied = 0;
 
-   page_zero_new_buffers(page, start+copied, start+len);
+   page_zero_new_buffers(&folio->page, start+copied, start+len);
}
-   flush_dcache_page(page);
+   flush_dcache_folio(folio);
 
/* This could be a short (even 0-length) commit */
-   __block_commit_write(inode, page, start, start+copied);
+   __block_commit_write(inode, folio, start, start + copied);
 
return copied;
 }
@@ -2537,8 +2536,9 @@ EXPORT_SYMBOL(cont_write_begin);
 
 int block_commit_write(struct page *page, unsigned from, unsigned to)
 {
-   struct inode *inode = page->mapping->host;
-   __block_commit_write(inode,page,from,to);
+   struct folio *folio = page_folio(page);
+   struct inode *inode = folio->mapping->host;
+   __block_commit_write(inode, folio, from, to);
return 0;
 }
 EXPORT_SYMBOL(block_commit_write);
@@ -2586,7 +2586,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct 
vm_fault *vmf,
 
ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
if (!ret)
-   ret = block_commit_write(&folio->page, 0, end);
+   ret = __block_commit_write(inode, folio, 0, end);
 
if (unlikely(ret < 0))
goto out_unlock;
-- 
2.39.2



[Cluster-devel] [PATCH v2 05/14] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-06-06 Thread Matthew Wilcox (Oracle)
We may someday support folios larger than 4GB, so use a size_t for
the byte count within a folio to prevent unpleasant truncations.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 6 +++---
 fs/gfs2/aops.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 3a2be1901e1e..1c407eba1e30 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -38,13 +38,13 @@
 
 
 void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-unsigned int from, unsigned int len)
+size_t from, size_t len)
 {
struct buffer_head *head = folio_buffers(folio);
unsigned int bsize = head->b_size;
struct buffer_head *bh;
-   unsigned int to = from + len;
-   unsigned int start, end;
+   size_t to = from + len;
+   size_t start, end;
 
for (bh = head, start = 0; bh != head || !start;
 bh = bh->b_this_page, start = end) {
diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
index 09db1914425e..f08322ef41cf 100644
--- a/fs/gfs2/aops.h
+++ b/fs/gfs2/aops.h
@@ -10,6 +10,6 @@
 
 extern void adjust_fs_space(struct inode *inode);
 extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-   unsigned int from, unsigned int len);
+   size_t from, size_t len);
 
 #endif /* __AOPS_DOT_H__ */
-- 
2.39.2



[Cluster-devel] [PATCH v2 04/14] buffer: Convert __block_write_full_page() to __block_write_full_folio()

2023-06-06 Thread Matthew Wilcox (Oracle)
Remove nine hidden calls to compound_head() by using a folio instead
of a page.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/buffer.c | 53 +++--
 fs/gfs2/aops.c  |  5 ++--
 fs/ntfs/aops.c  |  2 +-
 fs/reiserfs/inode.c |  2 +-
 include/linux/buffer_head.h |  2 +-
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1..4d518df50fab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1764,7 +1764,7 @@ static struct buffer_head *folio_create_buffers(struct 
folio *folio,
  * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
-int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_folio(struct inode *inode, struct folio *folio,
get_block_t *get_block, struct writeback_control *wbc,
bh_end_io_t *handler)
 {
@@ -1776,14 +1776,14 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
-   head = folio_create_buffers(page_folio(page), inode,
+   head = folio_create_buffers(folio, inode,
(1 << BH_Dirty) | (1 << BH_Uptodate));
 
/*
 * Be very careful.  We have no exclusion from block_dirty_folio
 * here, and the (potentially unmapped) buffers may become dirty at
 * any time.  If a buffer becomes dirty here after we've inspected it
-* then we just miss that fact, and the page stays dirty.
+* then we just miss that fact, and the folio stays dirty.
 *
 * Buffers outside i_size may be dirtied by block_dirty_folio;
 * handle that here by just cleaning them.
@@ -1793,7 +1793,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
blocksize = bh->b_size;
bbits = block_size_bits(blocksize);
 
-   block = (sector_t)page->index << (PAGE_SHIFT - bbits);
+   block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
last_block = (i_size_read(inode) - 1) >> bbits;
 
/*
@@ -1804,7 +1804,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (block > last_block) {
/*
 * mapped buffers outside i_size will occur, because
-* this page can be outside i_size when there is a
+* this folio can be outside i_size when there is a
 * truncate in progress.
 */
/*
@@ -1834,7 +1834,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
continue;
/*
 * If it's a fully non-blocking write attempt and we cannot
-* lock the buffer then redirty the page.  Note that this can
+* lock the buffer then redirty the folio.  Note that this can
 * potentially cause a busy-wait loop from writeback threads
 * and kswapd activity, but those code paths have their own
 * higher-level throttling.
@@ -1842,7 +1842,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (wbc->sync_mode != WB_SYNC_NONE) {
lock_buffer(bh);
} else if (!trylock_buffer(bh)) {
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
continue;
}
if (test_clear_buffer_dirty(bh)) {
@@ -1853,11 +1853,11 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
} while ((bh = bh->b_this_page) != head);
 
/*
-* The page and its buffers are protected by PageWriteback(), so we can
-* drop the bh refcounts early.
+* The folio and its buffers are protected by the writeback flag,
+* so we can drop the bh refcounts early.
 */
-   BUG_ON(PageWriteback(page));
-   set_page_writeback(page);
+   BUG_ON(folio_test_writeback(folio));
+   folio_start_writeback(folio);
 
do {
struct buffer_head *next = bh->b_this_page;
@@ -1867,20 +1867,20 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
}
bh = next;
} while (bh != head);
-   unlock_page(page);
+   folio_unlock(folio);
 
err = 0;
 done:
if (nr_underway == 0) {
/*
-* The page was marked dirty, but the buffers were
+* The folio was marked dirty, but the buffers were
   

[Cluster-devel] [PATCH v2 02/14] gfs2: Pass a folio to __gfs2_jdata_write_folio()

2023-06-06 Thread Matthew Wilcox (Oracle)
Remove a couple of folio->page conversions in the callers, and two
calls to compound_head() in the function itself.  Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0518861df783..749135252d52 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -113,30 +113,31 @@ static int gfs2_write_jdata_page(struct page *page,
 }
 
 /**
- * __gfs2_jdata_writepage - The core of jdata writepage
- * @page: The page to write
+ * __gfs2_jdata_write_folio - The core of jdata writepage
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is shared between writepage and writepages and implements the
  * core of the writepage operation. If a transaction is required then
- * PageChecked will have been set and the transaction will have
+ * the checked flag will have been set and the transaction will have
  * already been started before this is called.
  */
-
-static int __gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
+static int __gfs2_jdata_write_folio(struct folio *folio,
+   struct writeback_control *wbc)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = folio->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
 
-   if (PageChecked(page)) {
-   ClearPageChecked(page);
-   if (!page_has_buffers(page)) {
-   create_empty_buffers(page, inode->i_sb->s_blocksize,
-BIT(BH_Dirty)|BIT(BH_Uptodate));
+   if (folio_test_checked(folio)) {
+   folio_clear_checked(folio);
+   if (!folio_buffers(folio)) {
+   folio_create_empty_buffers(folio,
+   inode->i_sb->s_blocksize,
+   BIT(BH_Dirty)|BIT(BH_Uptodate));
}
-   gfs2_trans_add_databufs(ip, page_folio(page), 0, PAGE_SIZE);
+   gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(page, wbc);
+   return gfs2_write_jdata_page(&folio->page, wbc);
 }
 
 /**
@@ -159,7 +160,7 @@ static int gfs2_jdata_writepage(struct page *page, struct 
writeback_control *wbc
goto out;
if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(&folio->page, wbc);
+   return __gfs2_jdata_write_folio(folio, wbc);
 
 out_ignore:
folio_redirty_for_writepage(wbc, folio);
@@ -256,7 +257,7 @@ static int gfs2_write_jdata_batch(struct address_space 
*mapping,
 
trace_wbc_writepage(wbc, inode_to_bdi(inode));
 
-   ret = __gfs2_jdata_writepage(&folio->page, wbc);
+   ret = __gfs2_jdata_write_folio(folio, wbc);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
folio_unlock(folio);
-- 
2.39.2



[Cluster-devel] [PATCH v2 09/14] buffer: Convert page_zero_new_buffers() to folio_zero_new_buffers()

2023-06-06 Thread Matthew Wilcox (Oracle)
Most of the callers already have a folio; convert reiserfs_write_end()
to have a folio.  Removes a couple of hidden calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 27 ++-
 fs/ext4/inode.c |  4 ++--
 fs/reiserfs/inode.c |  7 ---
 include/linux/buffer_head.h |  2 +-
 4 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8ea9edd86519..5f758bab5bcb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1927,33 +1927,34 @@ int __block_write_full_folio(struct inode *inode, 
struct folio *folio,
 EXPORT_SYMBOL(__block_write_full_folio);
 
 /*
- * If a page has any new buffers, zero them out here, and mark them uptodate
+ * If a folio has any new buffers, zero them out here, and mark them uptodate
  * and dirty so they'll be written out (in order to prevent uninitialised
  * block data from leaking). And clear the new bit.
  */
-void page_zero_new_buffers(struct page *page, unsigned from, unsigned to)
+void folio_zero_new_buffers(struct folio *folio, size_t from, size_t to)
 {
-   unsigned int block_start, block_end;
+   size_t block_start, block_end;
struct buffer_head *head, *bh;
 
-   BUG_ON(!PageLocked(page));
-   if (!page_has_buffers(page))
+   BUG_ON(!folio_test_locked(folio));
+   head = folio_buffers(folio);
+   if (!head)
return;
 
-   bh = head = page_buffers(page);
+   bh = head;
block_start = 0;
do {
block_end = block_start + bh->b_size;
 
if (buffer_new(bh)) {
if (block_end > from && block_start < to) {
-   if (!PageUptodate(page)) {
-   unsigned start, size;
+   if (!folio_test_uptodate(folio)) {
+   size_t start, xend;
 
start = max(from, block_start);
-   size = min(to, block_end) - start;
+   xend = min(to, block_end);
 
-   zero_user(page, start, size);
+   folio_zero_segment(folio, start, xend);
set_buffer_uptodate(bh);
}
 
@@ -1966,7 +1967,7 @@ void page_zero_new_buffers(struct page *page, unsigned 
from, unsigned to)
bh = bh->b_this_page;
} while (bh != head);
 }
-EXPORT_SYMBOL(page_zero_new_buffers);
+EXPORT_SYMBOL(folio_zero_new_buffers);
 
 static void
 iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
@@ -2104,7 +2105,7 @@ int __block_write_begin_int(struct folio *folio, loff_t 
pos, unsigned len,
err = -EIO;
}
if (unlikely(err))
-   page_zero_new_buffers(&folio->page, from, to);
+   folio_zero_new_buffers(folio, from, to);
return err;
 }
 
@@ -2208,7 +2209,7 @@ int block_write_end(struct file *file, struct 
address_space *mapping,
if (!folio_test_uptodate(folio))
copied = 0;
 
-   page_zero_new_buffers(&folio->page, start+copied, start+len);
+   folio_zero_new_buffers(folio, start+copied, start+len);
}
flush_dcache_folio(folio);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 02de439bf1f0..9ca583360166 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1093,7 +1093,7 @@ static int ext4_block_write_begin(struct folio *folio, 
loff_t pos, unsigned len,
err = -EIO;
}
if (unlikely(err)) {
-   page_zero_new_buffers(&folio->page, from, to);
+   folio_zero_new_buffers(folio, from, to);
} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
for (i = 0; i < nr_wait; i++) {
int err2;
@@ -1339,7 +1339,7 @@ static int ext4_write_end(struct file *file,
 }
 
 /*
- * This is a private version of page_zero_new_buffers() which doesn't
+ * This is a private version of folio_zero_new_buffers() which doesn't
  * set the buffer to be dirty, since in data=journalled mode we need
  * to call ext4_dirty_journalled_data() instead.
  */
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ff34ee49106f..77bd3b27059f 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -2872,6 +2872,7 @@ static int reiserfs_write_end(struct file *file, struct 
address_space *mapping,
  loff_t pos, unsigned len, unsigned copied,
  struct page *page, void *fsdata)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = page->mapping->host;
int ret = 0;
int update_sd = 0;
@@ -2887,12 +2888,12 @@ static int reiserfs_write_end(struct file

[Cluster-devel] [PATCH v2 07/14] buffer: Convert block_page_mkwrite() to use a folio

2023-06-06 Thread Matthew Wilcox (Oracle)
If any page in a folio is dirtied, dirty the entire folio.  Removes a
number of hidden calls to compound_head() and references to page->mapping
and page->index.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d8c2c000676b..f34ed29b1085 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2564,38 +2564,37 @@ EXPORT_SYMBOL(block_commit_write);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 get_block_t get_block)
 {
-   struct page *page = vmf->page;
+   struct folio *folio = page_folio(vmf->page);
struct inode *inode = file_inode(vma->vm_file);
unsigned long end;
loff_t size;
int ret;
 
-   lock_page(page);
+   folio_lock(folio);
size = i_size_read(inode);
-   if ((page->mapping != inode->i_mapping) ||
-   (page_offset(page) > size)) {
+   if ((folio->mapping != inode->i_mapping) ||
+   (folio_pos(folio) > size)) {
/* We overload EFAULT to mean page got truncated */
ret = -EFAULT;
goto out_unlock;
}
 
-   /* page is wholly or partially inside EOF */
-   if (((page->index + 1) << PAGE_SHIFT) > size)
-   end = size & ~PAGE_MASK;
-   else
-   end = PAGE_SIZE;
+   end = folio_size(folio);
+   /* folio is wholly or partially inside EOF */
+   if (folio_pos(folio) + end > size)
+   end = size - folio_pos(folio);
 
-   ret = __block_write_begin(page, 0, end, get_block);
+   ret = __block_write_begin_int(folio, 0, end, get_block, NULL);
if (!ret)
-   ret = block_commit_write(page, 0, end);
+   ret = block_commit_write(&folio->page, 0, end);
 
if (unlikely(ret < 0))
goto out_unlock;
-   set_page_dirty(page);
-   wait_for_stable_page(page);
+   folio_set_dirty(folio);
+   folio_wait_stable(folio);
return 0;
 out_unlock:
-   unlock_page(page);
+   folio_unlock(folio);
return ret;
 }
 EXPORT_SYMBOL(block_page_mkwrite);
-- 
2.39.2



[Cluster-devel] [PATCH v2 03/14] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-06-06 Thread Matthew Wilcox (Oracle)
Add support for large folios and remove some accesses to page->mapping
and page->index.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 749135252d52..ec5b5c1ea634 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -82,33 +82,33 @@ static int gfs2_get_block_noalloc(struct inode *inode, 
sector_t lblock,
 }
 
 /**
- * gfs2_write_jdata_page - gfs2 jdata-specific version of block_write_full_page
- * @page: The page to write
+ * gfs2_write_jdata_folio - gfs2 jdata-specific version of block_write_full_page
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is the same as calling block_write_full_page, but it also
  * writes pages outside of i_size
  */
-static int gfs2_write_jdata_page(struct page *page,
+static int gfs2_write_jdata_folio(struct folio *folio,
 struct writeback_control *wbc)
 {
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   offset = i_size & (PAGE_SIZE - 1);
-   if (page->index == end_index && offset)
-   zero_user_segment(page, offset, PAGE_SIZE);
+   if (folio_pos(folio) < i_size &&
+   i_size < folio_pos(folio) + folio_size(folio))
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
 
-   return __block_write_full_page(inode, page, gfs2_get_block_noalloc, wbc,
+   return __block_write_full_page(inode, &folio->page,
+  gfs2_get_block_noalloc, wbc,
   end_buffer_async_write);
 }
 
@@ -137,7 +137,7 @@ static int __gfs2_jdata_write_folio(struct folio *folio,
}
gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(&folio->page, wbc);
+   return gfs2_write_jdata_folio(folio, wbc);
 }
 
 /**
-- 
2.39.2



[Cluster-devel] [PATCH v2 01/14] gfs2: Use a folio inside gfs2_jdata_writepage()

2023-06-06 Thread Matthew Wilcox (Oracle)
Replace a few implicit calls to compound_head() with one explicit one.

Signed-off-by: Matthew Wilcox (Oracle) 
Tested-by: Bob Peterson 
Reviewed-by: Bob Peterson 
---
 fs/gfs2/aops.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..0518861df783 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -150,20 +150,21 @@ static int __gfs2_jdata_writepage(struct page *page, 
struct writeback_control *w
 
 static int gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = page->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
 
if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
goto out;
-   if (PageChecked(page) || current->journal_info)
+   if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(page, wbc);
+   return __gfs2_jdata_writepage(&folio->page, wbc);
 
 out_ignore:
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
 out:
-   unlock_page(page);
+   folio_unlock(folio);
return 0;
 }
 
-- 
2.39.2



[Cluster-devel] [PATCH v2 13/14] buffer: Use a folio in __find_get_block_slow()

2023-06-06 Thread Matthew Wilcox (Oracle)
Saves a call to compound_head() and may be needed to support
block size > PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c81b8b20ad64..9f761a201e32 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -195,19 +195,19 @@ __find_get_block_slow(struct block_device *bdev, sector_t 
block)
pgoff_t index;
struct buffer_head *bh;
struct buffer_head *head;
-   struct page *page;
+   struct folio *folio;
int all_mapped = 1;
static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
 
index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
-   page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
-   if (!page)
+   folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
+   if (IS_ERR(folio))
goto out;
 
	spin_lock(&bd_mapping->private_lock);
-   if (!page_has_buffers(page))
+   head = folio_buffers(folio);
+   if (!head)
goto out_unlock;
-   head = page_buffers(page);
bh = head;
do {
if (!buffer_mapped(bh))
@@ -237,7 +237,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t 
block)
}
 out_unlock:
	spin_unlock(&bd_mapping->private_lock);
-   put_page(page);
+   folio_put(folio);
 out:
return ret;
 }
-- 
2.39.2



[Cluster-devel] [PATCH v2 10/14] buffer: Convert grow_dev_page() to use a folio

2023-06-06 Thread Matthew Wilcox (Oracle)
Get a folio from the page cache instead of a page, then use the
folio API throughout.  Removes a few calls to compound_head()
and may be needed to support block size > PAGE_SIZE.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 34 +++---
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 5f758bab5bcb..c4fc4b3b8aab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -976,7 +976,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
  pgoff_t index, int size, int sizebits, gfp_t gfp)
 {
struct inode *inode = bdev->bd_inode;
-   struct page *page;
+   struct folio *folio;
struct buffer_head *bh;
sector_t end_block;
int ret = 0;
@@ -992,42 +992,38 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
gfp_mask |= __GFP_NOFAIL;
 
-   page = find_or_create_page(inode->i_mapping, index, gfp_mask);
-
-   BUG_ON(!PageLocked(page));
+   folio = __filemap_get_folio(inode->i_mapping, index,
+   FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
 
-   if (page_has_buffers(page)) {
-   bh = page_buffers(page);
+   bh = folio_buffers(folio);
+   if (bh) {
if (bh->b_size == size) {
-   end_block = init_page_buffers(page, bdev,
+   end_block = init_page_buffers(&folio->page, bdev,
(sector_t)index << sizebits,
size);
goto done;
}
-   if (!try_to_free_buffers(page_folio(page)))
+   if (!try_to_free_buffers(folio))
goto failed;
}
 
-   /*
-* Allocate some buffers for this page
-*/
-   bh = alloc_page_buffers(page, size, true);
+   bh = folio_alloc_buffers(folio, size, true);
 
/*
-* Link the page to the buffers and initialise them.  Take the
+* Link the folio to the buffers and initialise them.  Take the
 * lock to be atomic wrt __find_get_block(), which does not
-* run under the page lock.
+* run under the folio lock.
 */
	spin_lock(&inode->i_mapping->private_lock);
-   link_dev_buffers(page, bh);
-   end_block = init_page_buffers(page, bdev, (sector_t)index << sizebits,
-   size);
+   link_dev_buffers(&folio->page, bh);
+   end_block = init_page_buffers(&folio->page, bdev,
+   (sector_t)index << sizebits, size);
	spin_unlock(&inode->i_mapping->private_lock);
 done:
ret = (block < end_block) ? 1 : -ENXIO;
 failed:
-   unlock_page(page);
-   put_page(page);
+   folio_unlock(folio);
+   folio_put(folio);
return ret;
 }
 
-- 
2.39.2



[Cluster-devel] [PATCH v2 11/14] buffer: Convert init_page_buffers() to folio_init_buffers()

2023-06-06 Thread Matthew Wilcox (Oracle)
Use the folio API and pass the folio from both callers.
Saves a hidden call to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c4fc4b3b8aab..9b789f109a57 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -934,15 +934,14 @@ static sector_t blkdev_max_block(struct block_device 
*bdev, unsigned int size)
 }
 
 /*
- * Initialise the state of a blockdev page's buffers.
+ * Initialise the state of a blockdev folio's buffers.
  */ 
-static sector_t
-init_page_buffers(struct page *page, struct block_device *bdev,
-   sector_t block, int size)
+static sector_t folio_init_buffers(struct folio *folio,
+   struct block_device *bdev, sector_t block, int size)
 {
-   struct buffer_head *head = page_buffers(page);
+   struct buffer_head *head = folio_buffers(folio);
struct buffer_head *bh = head;
-   int uptodate = PageUptodate(page);
+   bool uptodate = folio_test_uptodate(folio);
sector_t end_block = blkdev_max_block(bdev, size);
 
do {
@@ -998,9 +997,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
bh = folio_buffers(folio);
if (bh) {
if (bh->b_size == size) {
-   end_block = init_page_buffers(&folio->page, bdev,
-   (sector_t)index << sizebits,
-   size);
+   end_block = folio_init_buffers(folio, bdev,
+   (sector_t)index << sizebits, size);
goto done;
}
if (!try_to_free_buffers(folio))
@@ -1016,7 +1014,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 */
	spin_lock(&inode->i_mapping->private_lock);
	link_dev_buffers(&folio->page, bh);
-   end_block = init_page_buffers(&folio->page, bdev,
+   end_block = folio_init_buffers(folio, bdev,
(sector_t)index << sizebits, size);
	spin_unlock(&inode->i_mapping->private_lock);
 done:
-- 
2.39.2



[Cluster-devel] [PATCH v2 12/14] buffer: Convert link_dev_buffers to take a folio

2023-06-06 Thread Matthew Wilcox (Oracle)
Its one caller already has a folio, so switch it to use the
folio API.  Removes a hidden call to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9b789f109a57..c81b8b20ad64 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -907,8 +907,8 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
 }
 EXPORT_SYMBOL_GPL(alloc_page_buffers);
 
-static inline void
-link_dev_buffers(struct page *page, struct buffer_head *head)
+static inline void link_dev_buffers(struct folio *folio,
+   struct buffer_head *head)
 {
struct buffer_head *bh, *tail;
 
@@ -918,7 +918,7 @@ link_dev_buffers(struct page *page, struct buffer_head 
*head)
bh = bh->b_this_page;
} while (bh);
tail->b_this_page = head;
-   attach_page_private(page, head);
+   folio_attach_private(folio, head);
 }
 
 static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
@@ -1013,7 +1013,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 * run under the folio lock.
 */
	spin_lock(&inode->i_mapping->private_lock);
-   link_dev_buffers(&folio->page, bh);
+   link_dev_buffers(folio, bh);
end_block = folio_init_buffers(folio, bdev,
(sector_t)index << sizebits, size);
	spin_unlock(&inode->i_mapping->private_lock);
-- 
2.39.2



[Cluster-devel] [PATCH v2 14/14] buffer: Convert block_truncate_page() to use a folio

2023-06-06 Thread Matthew Wilcox (Oracle)
Support large folios in block_truncate_page() and avoid three hidden
calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 28 +++-
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9f761a201e32..9e1c33f7e02c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2598,17 +2598,16 @@ int block_truncate_page(struct address_space *mapping,
loff_t from, get_block_t *get_block)
 {
pgoff_t index = from >> PAGE_SHIFT;
-   unsigned offset = from & (PAGE_SIZE-1);
unsigned blocksize;
sector_t iblock;
-   unsigned length, pos;
+   size_t offset, length, pos;
struct inode *inode = mapping->host;
-   struct page *page;
+   struct folio *folio;
struct buffer_head *bh;
int err = 0;
 
blocksize = i_blocksize(inode);
-   length = offset & (blocksize - 1);
+   length = from & (blocksize - 1);
 
/* Block boundary? Nothing to do */
if (!length)
@@ -2617,15 +2616,18 @@ int block_truncate_page(struct address_space *mapping,
length = blocksize - length;
iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);

-   page = grab_cache_page(mapping, index);
-   if (!page)
+   folio = filemap_grab_folio(mapping, index);
+   if (!folio)
return -ENOMEM;
 
-   if (!page_has_buffers(page))
-   create_empty_buffers(page, blocksize, 0);
+   bh = folio_buffers(folio);
+   if (!bh) {
+   folio_create_empty_buffers(folio, blocksize, 0);
+   bh = folio_buffers(folio);
+   }
 
/* Find the buffer that contains "offset" */
-   bh = page_buffers(page);
+   offset = offset_in_folio(folio, from);
pos = blocksize;
while (offset >= pos) {
bh = bh->b_this_page;
@@ -2644,7 +2646,7 @@ int block_truncate_page(struct address_space *mapping,
}
 
/* Ok, it's mapped. Make sure it's up-to-date */
-   if (PageUptodate(page))
+   if (folio_test_uptodate(folio))
set_buffer_uptodate(bh);
 
if (!buffer_uptodate(bh) && !buffer_delay(bh) && !buffer_unwritten(bh)) 
{
@@ -2654,12 +2656,12 @@ int block_truncate_page(struct address_space *mapping,
goto unlock;
}
 
-   zero_user(page, offset, length);
+   folio_zero_range(folio, offset, length);
mark_buffer_dirty(bh);
 
 unlock:
-   unlock_page(page);
-   put_page(page);
+   folio_unlock(folio);
+   folio_put(folio);
 
return err;
 }
-- 
2.39.2



[Cluster-devel] [PATCH v2 00/14] gfs2/buffer folio changes for 6.5

2023-06-06 Thread Matthew Wilcox (Oracle)
This kind of started off as a gfs2 patch series, then became entwined
with buffer heads once I realised that gfs2 was the only remaining
caller of __block_write_full_page().  For those not in the gfs2 world,
the big point of this series is that block_write_full_page() should now
handle large folios correctly.

Andrew, if you want, I'll drop it into the pagecache tree, or you
can just take it.

Matthew Wilcox (Oracle) (14):
  gfs2: Use a folio inside gfs2_jdata_writepage()
  gfs2: Pass a folio to __gfs2_jdata_write_folio()
  gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
  buffer: Convert __block_write_full_page() to
__block_write_full_folio()
  gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()
  buffer: Make block_write_full_page() handle large folios correctly
  buffer: Convert block_page_mkwrite() to use a folio
  buffer: Convert __block_commit_write() to take a folio
  buffer: Convert page_zero_new_buffers() to folio_zero_new_buffers()
  buffer: Convert grow_dev_page() to use a folio
  buffer: Convert init_page_buffers() to folio_init_buffers()
  buffer: Convert link_dev_buffers to take a folio
  buffer: Use a folio in __find_get_block_slow()
  buffer: Convert block_truncate_page() to use a folio

 fs/buffer.c | 257 ++--
 fs/ext4/inode.c |   4 +-
 fs/gfs2/aops.c  |  69 +-
 fs/gfs2/aops.h  |   2 +-
 fs/ntfs/aops.c  |   2 +-
 fs/reiserfs/inode.c |   9 +-
 include/linux/buffer_head.h |   4 +-
 7 files changed, 172 insertions(+), 175 deletions(-)

-- 
2.39.2



Re: [Cluster-devel] [PATCH 3/6] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-06-03 Thread Matthew Wilcox
On Sat, Jun 03, 2023 at 11:34:14AM +0200, Andreas Gruenbacher wrote:
> >   * This is the same as calling block_write_full_page, but it also
> >   * writes pages outside of i_size
> >   */
> > -static int gfs2_write_jdata_page(struct page *page,
> > +static int gfs2_write_jdata_folio(struct folio *folio,
> >  struct writeback_control *wbc)
> >  {
> > -   struct inode * const inode = page->mapping->host;
> > +   struct inode * const inode = folio->mapping->host;
> > loff_t i_size = i_size_read(inode);
> > -   const pgoff_t end_index = i_size >> PAGE_SHIFT;
> > -   unsigned offset;
> >
> > +   if (folio_pos(folio) >= i_size)
> > +   return 0;
> 
> Function gfs2_write_jdata_page was originally introduced as
> gfs2_write_full_page in commit fd4c5748b8d3 ("gfs2: writeout truncated
> pages") to allow writing pages even when they are beyond EOF, as the
> function description documents.

Well, that was stupid of me.

> This hack was added because simply skipping journaled pages isn't
> enough on gfs2; before a journaled page can be freed, it needs to be
> marked as "revoked" in the journal. Journal recovery will then skip
> the revoked blocks, which allows them to be reused for regular,
> non-journaled data. We can end up here in contexts in which we cannot
> "revoke" pages, so instead, we write the original pages even when they
> are beyond EOF. This hack could be revisited, but it's pretty nasty
> code to pick apart.
> 
> So at least the above if needs to go for now.

Understood.  So we probably don't want to waste time zeroing the folio
if it is entirely beyond i_size, right?  Because at the moment we'd
zero some essentially random part of the folio if I just take out the
check.  Should it look like this?

	if (folio_pos(folio) < i_size &&
	    i_size < folio_pos(folio) + folio_size(folio))
		folio_zero_segment(folio, offset_in_folio(folio, i_size),
				folio_size(folio));



Re: [Cluster-devel] [PATCH v6 20/20] block: mark bio_add_folio as __must_check

2023-05-30 Thread Matthew Wilcox
On Tue, May 30, 2023 at 08:49:23AM -0700, Johannes Thumshirn wrote:
> Now that all callers of bio_add_folio() check the return value, mark it as
> __must_check.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH v6 19/20] fs: iomap: use __bio_add_folio where possible

2023-05-30 Thread Matthew Wilcox
On Tue, May 30, 2023 at 08:49:22AM -0700, Johannes Thumshirn wrote:
> When the iomap buffered-io code can't add a folio to a bio, it allocates a
> new bio and adds the folio to that one. This is done using bio_add_folio(),
> but doesn't check for errors.
> 
> As adding a folio to a newly created bio can't fail, use the newly
> introduced __bio_add_folio() function.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH v6 18/20] block: add __bio_add_folio

2023-05-30 Thread Matthew Wilcox
On Tue, May 30, 2023 at 08:49:21AM -0700, Johannes Thumshirn wrote:
> Just like for bio_add_pages() add a no-fail variant for bio_add_folio().
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH 16/17] block: use iomap for writes to block devices

2023-05-24 Thread Matthew Wilcox
On Wed, May 24, 2023 at 08:27:13AM +1000, Dave Chinner wrote:
> On Fri, May 19, 2023 at 04:22:01PM +0200, Hannes Reinecke wrote:
> > I'm hitting this during booting:
> > [5.016324]  <TASK>
> > [5.030256]  iomap_iter+0x11a/0x350
> > [5.030264]  iomap_readahead+0x1eb/0x2c0
> > [5.030272]  read_pages+0x5d/0x220
> > [5.030279]  page_cache_ra_unbounded+0x131/0x180
> > [5.030284]  filemap_get_pages+0xff/0x5a0
> 
> Why is filemap_get_pages() using unbounded readahead? Surely
> readahead should be limited to reading within EOF

It isn't using unbounded readahead; that's an artifact of this
incomplete stack trace.  Actual call stack:

page_cache_ra_unbounded
do_page_cache_ra
ondemand_readahead
page_cache_sync_ra
page_cache_sync_readahead
filemap_get_pages

As you can see, do_page_cache_ra() does limit readahead to i_size.
Is ractl->mapping->host the correct way to find the inode?  I always
get confused.
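
For reference, the clamp in do_page_cache_ra() looks roughly like this
(paraphrased from mm/readahead.c rather than quoted verbatim, so don't
hold me to the exact lines):

static void do_page_cache_ra(struct readahead_control *ractl,
		unsigned long nr_to_read, unsigned long lookahead_size)
{
	struct inode *inode = ractl->mapping->host;
	unsigned long index = readahead_index(ractl);
	loff_t isize = i_size_read(inode);
	pgoff_t end_index;	/* The last page we want to read */

	if (isize == 0)
		return;

	end_index = (isize - 1) >> PAGE_SHIFT;
	if (index > end_index)
		return;
	/* Don't read past the page containing the last byte of the file */
	if (nr_to_read > end_index - index)
		nr_to_read = end_index - index + 1;

	page_cache_ra_unbounded(ractl, nr_to_read, lookahead_size);
}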

> I think Christoph's code is correct. IMO, any attempt to read beyond
> the end of the device should throw out a warning and return an
> error, not silently return zeros.
> 
> If readahead is trying to read beyond the end of the device, then it
> really seems to me like the problem here is readahead, not the iomap
> code detecting the OOB read request
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com



Re: [Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-05-23 Thread Matthew Wilcox
On Tue, May 23, 2023 at 02:46:07PM +0200, Andreas Gruenbacher wrote:
> >  void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> > -unsigned int from, unsigned int len)
> > +size_t from, size_t len)
> >  {
> > struct buffer_head *head = folio_buffers(folio);
> > unsigned int bsize = head->b_size;
> 
> This only makes sense if the to, start, and end variables in
> gfs2_trans_add_databufs() are changed from unsigned int to size_t as
> well.

The history of this patch is that I started doing conversions from page
-> folio in gfs2, then you came out with a very similar series.  This
patch is the remainder after rebasing my patches on yours.  So we can
either drop this patch or just apply it.  I wasn't making a concerted
effort to make gfs2 support 4GB+ sized folios, it's just part of the
conversion that I do.

> >  extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio 
> > *folio,
> > -   unsigned int from, unsigned int len);
> > +   size_t from, size_t len);



Re: [Cluster-devel] [PATCH 08/13] iomap: assign current->backing_dev_info in iomap_file_buffered_write

2023-05-22 Thread Matthew Wilcox
On Mon, May 22, 2023 at 06:06:27PM -0700, Darrick J. Wong wrote:
> On Fri, May 19, 2023 at 11:35:16AM +0200, Christoph Hellwig wrote:
> > Move the assignment to current->backing_dev_info from the callers into
> > iomap_file_buffered_write to reduce boiler plate code and reduce the
> > scope to just around the page dirtying loop.
> > 
> > Note that zonefs was missing this assignment before.
> 
> I'm still wondering (a) what the hell current->backing_dev_info is for,
> and (b) if we need it around the iomap_unshare operation.
> 
> $ git grep current..backing_dev_info
[results show it only set, never used]
> 
> AFAICT nobody uses it at all?  Unless there's some bizarre user that
> isn't extracting it from @current?
> 
> Oh, hey, new question (c) isn't this set incorrectly for xfs realtime
> files?

Some git archaelogy ...

This was first introduced in commit 2f45a06517a62 (in the
linux-fullhistory tree) in 2002 by one Andrew Morton.  At the time,
it added this check to the page scanner:

+   if (page->pte.direct ||
+   page->mapping->backing_dev_info ==
+   current->backing_dev_info) {
+   wait_on_page_writeback(page);
+   }

AFAICT (the code went through some metamorphoses in the intervening
twenty years), the last use of it ended up in current_may_throttle(),
and it was removed in March 2022 by Neil Brown in commit b9b1335e6403.
Since then, there have been no users of task->backing_dev_info, and I'm
pretty sure it can go away.



Re: [Cluster-devel] [PATCH 4/6] buffer: Convert __block_write_full_page() to __block_write_full_folio()

2023-05-17 Thread Matthew Wilcox
On Wed, May 17, 2023 at 04:47:01PM +0200, Pankaj Raghav wrote:
> > @@ -1793,7 +1793,7 @@ int __block_write_full_page(struct inode *inode, 
> > struct page *page,
> > blocksize = bh->b_size;
> > bbits = block_size_bits(blocksize);
> >  
> > -   block = (sector_t)page->index << (PAGE_SHIFT - bbits);
> > +   block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
> 
> Shouldn't the PAGE_SHIFT be folio_shift(folio) as you allow larger
> folios to be passed to this function in the later patches?

No, the folio->index is expressed in multiples of PAGE_SIZE.
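
To spell out why, here's a little sketch of the equivalence
(first_block() is a made-up name, purely for illustration):

/*
 * folio->index is a page cache index in PAGE_SIZE units even for a
 * large folio, and folio_pos() is folio->index * PAGE_SIZE whatever
 * the folio's order, so both expressions compute the same start block.
 */
static sector_t first_block(struct folio *folio, unsigned int bbits)
{
	sector_t a = (sector_t)folio->index << (PAGE_SHIFT - bbits);
	sector_t b = folio_pos(folio) >> bbits;

	WARN_ON_ONCE(a != b);
	return a;
}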

> > last_block = (i_size_read(inode) - 1) >> bbits;
> >  




[Cluster-devel] [PATCH 4/6] buffer: Convert __block_write_full_page() to __block_write_full_folio()

2023-05-16 Thread Matthew Wilcox (Oracle)
Remove nine hidden calls to compound_head() by using a folio instead
of a page.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 53 +++--
 fs/gfs2/aops.c  |  5 ++--
 fs/ntfs/aops.c  |  2 +-
 fs/reiserfs/inode.c |  2 +-
 include/linux/buffer_head.h |  2 +-
 5 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a7fc561758b1..4d518df50fab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1764,7 +1764,7 @@ static struct buffer_head *folio_create_buffers(struct 
folio *folio,
  * WB_SYNC_ALL, the writes are posted using REQ_SYNC; this
  * causes the writes to be flagged as synchronous writes.
  */
-int __block_write_full_page(struct inode *inode, struct page *page,
+int __block_write_full_folio(struct inode *inode, struct folio *folio,
get_block_t *get_block, struct writeback_control *wbc,
bh_end_io_t *handler)
 {
@@ -1776,14 +1776,14 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
int nr_underway = 0;
blk_opf_t write_flags = wbc_to_write_flags(wbc);
 
-   head = folio_create_buffers(page_folio(page), inode,
+   head = folio_create_buffers(folio, inode,
(1 << BH_Dirty) | (1 << BH_Uptodate));
 
/*
 * Be very careful.  We have no exclusion from block_dirty_folio
 * here, and the (potentially unmapped) buffers may become dirty at
 * any time.  If a buffer becomes dirty here after we've inspected it
-* then we just miss that fact, and the page stays dirty.
+* then we just miss that fact, and the folio stays dirty.
 *
 * Buffers outside i_size may be dirtied by block_dirty_folio;
 * handle that here by just cleaning them.
@@ -1793,7 +1793,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
blocksize = bh->b_size;
bbits = block_size_bits(blocksize);
 
-   block = (sector_t)page->index << (PAGE_SHIFT - bbits);
+   block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
last_block = (i_size_read(inode) - 1) >> bbits;
 
/*
@@ -1804,7 +1804,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (block > last_block) {
/*
 * mapped buffers outside i_size will occur, because
-* this page can be outside i_size when there is a
+* this folio can be outside i_size when there is a
 * truncate in progress.
 */
/*
@@ -1834,7 +1834,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
continue;
/*
 * If it's a fully non-blocking write attempt and we cannot
-* lock the buffer then redirty the page.  Note that this can
+* lock the buffer then redirty the folio.  Note that this can
 * potentially cause a busy-wait loop from writeback threads
 * and kswapd activity, but those code paths have their own
 * higher-level throttling.
@@ -1842,7 +1842,7 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
if (wbc->sync_mode != WB_SYNC_NONE) {
lock_buffer(bh);
} else if (!trylock_buffer(bh)) {
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
continue;
}
if (test_clear_buffer_dirty(bh)) {
@@ -1853,11 +1853,11 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
} while ((bh = bh->b_this_page) != head);
 
/*
-* The page and its buffers are protected by PageWriteback(), so we can
-* drop the bh refcounts early.
+* The folio and its buffers are protected by the writeback flag,
+* so we can drop the bh refcounts early.
 */
-   BUG_ON(PageWriteback(page));
-   set_page_writeback(page);
+   BUG_ON(folio_test_writeback(folio));
+   folio_start_writeback(folio);
 
do {
struct buffer_head *next = bh->b_this_page;
@@ -1867,20 +1867,20 @@ int __block_write_full_page(struct inode *inode, struct 
page *page,
}
bh = next;
} while (bh != head);
-   unlock_page(page);
+   folio_unlock(folio);
 
err = 0;
 done:
if (nr_underway == 0) {
/*
-* The page was marked dirty, but the buffers were
+* The folio was marked dirty, but the buffers were
 * clean.  Someone wrote them back by hand with
  

[Cluster-devel] [PATCH 1/6] gfs2: Use a folio inside gfs2_jdata_writepage()

2023-05-16 Thread Matthew Wilcox (Oracle)
Replace a few implicit calls to compound_head() with one explicit one.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/gfs2/aops.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index a5f4be6b9213..0518861df783 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -150,20 +150,21 @@ static int __gfs2_jdata_writepage(struct page *page, 
struct writeback_control *w
 
 static int gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
 {
+   struct folio *folio = page_folio(page);
struct inode *inode = page->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
 
if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
goto out;
-   if (PageChecked(page) || current->journal_info)
+   if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(page, wbc);
+   return __gfs2_jdata_writepage(&folio->page, wbc);
 
 out_ignore:
-   redirty_page_for_writepage(wbc, page);
+   folio_redirty_for_writepage(wbc, folio);
 out:
-   unlock_page(page);
+   folio_unlock(folio);
return 0;
 }
 
-- 
2.39.2



[Cluster-devel] [PATCH 6/6] buffer: Make block_write_full_page() handle large folios correctly

2023-05-16 Thread Matthew Wilcox (Oracle)
Keep the interface as struct page, but work entirely on the folio
internally.  Removes several PAGE_SIZE assumptions and removes
some references to page->index and page->mapping.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/buffer.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d518df50fab..d8c2c000676b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2678,33 +2678,31 @@ int block_write_full_page(struct page *page, 
get_block_t *get_block,
struct writeback_control *wbc)
 {
struct folio *folio = page_folio(page);
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
-   /* Is the page fully inside i_size? */
-   if (page->index < end_index)
+   /* Is the folio fully inside i_size? */
+   if (folio_pos(folio) + folio_size(folio) <= i_size)
return __block_write_full_folio(inode, folio, get_block, wbc,
   end_buffer_async_write);
 
-   /* Is the page fully outside i_size? (truncate in progress) */
-   offset = i_size & (PAGE_SIZE-1);
-   if (page->index >= end_index+1 || !offset) {
+   /* Is the folio fully outside i_size? (truncate in progress) */
+   if (folio_pos(folio) > i_size) {
folio_unlock(folio);
return 0; /* don't care */
}
 
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   zero_user_segment(page, offset, PAGE_SIZE);
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
return __block_write_full_folio(inode, folio, get_block, wbc,
-   end_buffer_async_write);
+   end_buffer_async_write);
 }
 EXPORT_SYMBOL(block_write_full_page);
 
-- 
2.39.2



[Cluster-devel] [PATCH 0/6] gfs2/buffer folio changes

2023-05-16 Thread Matthew Wilcox (Oracle)
This kind of started off as a gfs2 patch series, then became entwined
with buffer heads once I realised that gfs2 was the only remaining
caller of __block_write_full_page().  For those not in the gfs2 world,
the big point of this series is that block_write_full_page() should now
handle large folios correctly.

It probably makes most sense to take this through Andrew's tree, once
enough people have signed off on it?

Matthew Wilcox (Oracle) (6):
  gfs2: Use a folio inside gfs2_jdata_writepage()
  gfs2: Pass a folio to __gfs2_jdata_write_folio()
  gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
  buffer: Convert __block_write_full_page() to
__block_write_full_folio()
  gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()
  buffer: Make block_write_full_page() handle large folios correctly

 fs/buffer.c | 75 ++---
 fs/gfs2/aops.c  | 66 
 fs/gfs2/aops.h  |  2 +-
 fs/ntfs/aops.c  |  2 +-
 fs/reiserfs/inode.c |  2 +-
 include/linux/buffer_head.h |  2 +-
 6 files changed, 75 insertions(+), 74 deletions(-)

-- 
2.39.2



[Cluster-devel] [PATCH 2/6] gfs2: Pass a folio to __gfs2_jdata_write_folio()

2023-05-16 Thread Matthew Wilcox (Oracle)
Remove a couple of folio->page conversions in the callers, and two
calls to compound_head() in the function itself.  Rename it from
__gfs2_jdata_writepage() to __gfs2_jdata_write_folio().

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/gfs2/aops.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0518861df783..749135252d52 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -113,30 +113,31 @@ static int gfs2_write_jdata_page(struct page *page,
 }
 
 /**
- * __gfs2_jdata_writepage - The core of jdata writepage
- * @page: The page to write
+ * __gfs2_jdata_write_folio - The core of jdata writepage
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is shared between writepage and writepages and implements the
  * core of the writepage operation. If a transaction is required then
- * PageChecked will have been set and the transaction will have
+ * the checked flag will have been set and the transaction will have
  * already been started before this is called.
  */
-
-static int __gfs2_jdata_writepage(struct page *page, struct writeback_control 
*wbc)
+static int __gfs2_jdata_write_folio(struct folio *folio,
+   struct writeback_control *wbc)
 {
-   struct inode *inode = page->mapping->host;
+   struct inode *inode = folio->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
 
-   if (PageChecked(page)) {
-   ClearPageChecked(page);
-   if (!page_has_buffers(page)) {
-   create_empty_buffers(page, inode->i_sb->s_blocksize,
-BIT(BH_Dirty)|BIT(BH_Uptodate));
+   if (folio_test_checked(folio)) {
+   folio_clear_checked(folio);
+   if (!folio_buffers(folio)) {
+   folio_create_empty_buffers(folio,
+   inode->i_sb->s_blocksize,
+   BIT(BH_Dirty)|BIT(BH_Uptodate));
}
-   gfs2_trans_add_databufs(ip, page_folio(page), 0, PAGE_SIZE);
+   gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(page, wbc);
+   return gfs2_write_jdata_page(&folio->page, wbc);
 }
 
 /**
@@ -159,7 +160,7 @@ static int gfs2_jdata_writepage(struct page *page, struct 
writeback_control *wbc
goto out;
if (folio_test_checked(folio) || current->journal_info)
goto out_ignore;
-   return __gfs2_jdata_writepage(>page, wbc);
+   return __gfs2_jdata_write_folio(folio, wbc);
 
 out_ignore:
folio_redirty_for_writepage(wbc, folio);
@@ -256,7 +257,7 @@ static int gfs2_write_jdata_batch(struct address_space 
*mapping,
 
trace_wbc_writepage(wbc, inode_to_bdi(inode));
 
-   ret = __gfs2_jdata_writepage(&folio->page, wbc);
+   ret = __gfs2_jdata_write_folio(folio, wbc);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
folio_unlock(folio);
-- 
2.39.2



[Cluster-devel] [PATCH 3/6] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-05-16 Thread Matthew Wilcox (Oracle)
This function now supports large folios, even if nothing around it does.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/gfs2/aops.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 749135252d52..0f92e3e117da 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -82,33 +82,34 @@ static int gfs2_get_block_noalloc(struct inode *inode, 
sector_t lblock,
 }
 
 /**
- * gfs2_write_jdata_page - gfs2 jdata-specific version of block_write_full_page
- * @page: The page to write
+ * gfs2_write_jdata_folio - gfs2 jdata-specific version of block_write_full_page
+ * @folio: The folio to write
  * @wbc: The writeback control
  *
  * This is the same as calling block_write_full_page, but it also
  * writes pages outside of i_size
  */
-static int gfs2_write_jdata_page(struct page *page,
+static int gfs2_write_jdata_folio(struct folio *folio,
 struct writeback_control *wbc)
 {
-   struct inode * const inode = page->mapping->host;
+   struct inode * const inode = folio->mapping->host;
loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_SHIFT;
-   unsigned offset;
 
+   if (folio_pos(folio) >= i_size)
+   return 0;
/*
-* The page straddles i_size.  It must be zeroed out on each and every
+* The folio straddles i_size.  It must be zeroed out on each and every
 * writepage invocation because it may be mmapped.  "A file is mapped
 * in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
+* the page size, the remaining memory is zeroed when mapped, and
 * writes to that region are not written out to the file."
 */
-   offset = i_size & (PAGE_SIZE - 1);
-   if (page->index == end_index && offset)
-   zero_user_segment(page, offset, PAGE_SIZE);
+   if (i_size < folio_pos(folio) + folio_size(folio))
+   folio_zero_segment(folio, offset_in_folio(folio, i_size),
+   folio_size(folio));
 
-   return __block_write_full_page(inode, page, gfs2_get_block_noalloc, wbc,
+   return __block_write_full_page(inode, &folio->page,
+  gfs2_get_block_noalloc, wbc,
   end_buffer_async_write);
 }
 
@@ -137,7 +138,7 @@ static int __gfs2_jdata_write_folio(struct folio *folio,
}
gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
}
-   return gfs2_write_jdata_page(&folio->page, wbc);
+   return gfs2_write_jdata_folio(folio, wbc);
 }
 
 /**
-- 
2.39.2



[Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-05-16 Thread Matthew Wilcox (Oracle)
We may someday support folios larger than 4GB, so use a size_t for
the byte count within a folio to prevent unpleasant truncations.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/gfs2/aops.c | 2 +-
 fs/gfs2/aops.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index e97462a5302e..8da4397aafc6 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -38,7 +38,7 @@
 
 
 void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-unsigned int from, unsigned int len)
+size_t from, size_t len)
 {
struct buffer_head *head = folio_buffers(folio);
unsigned int bsize = head->b_size;
diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
index 09db1914425e..f08322ef41cf 100644
--- a/fs/gfs2/aops.h
+++ b/fs/gfs2/aops.h
@@ -10,6 +10,6 @@
 
 extern void adjust_fs_space(struct inode *inode);
 extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
-   unsigned int from, unsigned int len);
+   size_t from, size_t len);
 
 #endif /* __AOPS_DOT_H__ */
-- 
2.39.2



Re: [Cluster-devel] [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD

2023-05-01 Thread Matthew Wilcox
On Sun, Apr 30, 2023 at 08:14:03PM -0700, Luis Chamberlain wrote:
> On Sat, Apr 29, 2023 at 02:20:17AM +0100, Matthew Wilcox wrote:
> > > [   11.322212] Call Trace:
> > > [   11.323224]  
> > > [   11.323224]  <TASK>
> > > [   11.325694]  iomap_readahead+0x174/0x2d0
> > > [   11.327129]  read_pages+0x69/0x1f0
> > > [   11.329751]  page_cache_ra_unbounded+0x187/0x1d0
> > 
> > ... that shouldn't be possible.  read_pages() allocates pages, puts them
> > in the page cache and tells the filesystem to fill them in.
> > 
> > In your patches, did you call mapping_set_large_folios() anywhere?
> 
> No but the only place to add that would be in the block cache. Adding
> that alone to the block cache doesn't fix the issue. The below patch
> however does get us by.

That's "working around the error", not fixing it ... probably the same
root cause as your other errors; at least I'm not diving into them until
the obvious one is fixed.

> >From my readings it does't seem like readahead_folio() should always
> return non-NULL, and also I couldn't easily verify the math is right.

readahead_folio() always returns non-NULL.  That's guaranteed by how
page_cache_ra_unbounded() and page_cache_ra_order() work.  It allocates
folios, until it can't (already-present folio, ENOMEM, EOF, max batch
size) and then calls the filesystem to make those folios uptodate,
telling it how many folios it put in the page cache, where they start.

Hm.  The fact that it's coming from page_cache_ra_unbounded() makes
me wonder if you updated this line:

folio = filemap_alloc_folio(gfp_mask, 0);

without updating this line:

ractl->_nr_pages++;

This is actually number of pages, not number of folios, so needs to be
ractl->_nr_pages += 1 << order;

various other parts of page_cache_ra_unbounded() need to be examined
carefully for assumptions of order-0; it's never been used for that
before.  all the large folio work has concentrated on
page_cache_ra_order()
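
To be concrete, here's a sketch of what the inside of that allocation
loop has to turn into once the order is non-zero (hypothetical and with
the error handling elided; the real loop also has to cope with
already-present folios):

	folio = filemap_alloc_folio(gfp_mask, order);
	if (!folio)
		break;
	if (filemap_add_folio(mapping, folio, index + i, gfp_mask) < 0) {
		folio_put(folio);
		break;
	}
	/* one folio is 1 << order pages; every counter must reflect that */
	ractl->_nr_pages += folio_nr_pages(folio);
	i += folio_nr_pages(folio);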



Re: [Cluster-devel] [PATCH 17/17] fs: add CONFIG_BUFFER_HEAD

2023-04-28 Thread Matthew Wilcox
On Fri, Apr 28, 2023 at 05:11:57PM -0700, Luis Chamberlain wrote:
> [   11.245248] BUG: kernel NULL pointer dereference, address: 
> [   11.254581] #PF: supervisor read access in kernel mode
> [   11.257387] #PF: error_code(0x) - not-present page
> [   11.260921] PGD 0 P4D 0
> [   11.262600] Oops:  [#1] PREEMPT SMP PTI
> [   11.264993] CPU: 7 PID: 198 Comm: (udev-worker) Not tainted 
> 6.3.0-large-block-20230426 #2
> [   11.269385] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.16.0-debian-1.16.0-5 04/01/2014
> [   11.275054] RIP: 0010:iomap_page_create.isra.0+0xc/0xd0
> [   11.277924] Code: 41 5e 41 5f c3 cc cc cc cc 0f 1f 44 00 00 90 90 90 90 90 
> 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 54 55 48 89 f5 53 <48> 8b 
> 06 48 c1 e8 0d 89 c6 83 e6 01 0f 84 a1 00 00 00 4c 8b 65 28
> [   11.287293] RSP: 0018:b0f0805ef9d8 EFLAGS: 00010293
> [   11.289964] RAX: 9de3c1fa8388 RBX: b0f0805efa78 RCX: 
> 00037ffe
> [   11.293212] RDX:  RSI:  RDI: 
> 000d
> [   11.296485] RBP:  R08: 00021000 R09: 
> 9c733b20
> [   11.299724] R10: 0001 R11: c000 R12: 
> 
> [   11.302974] R13: 9be96260 R14: b0f0805efa58 R15: 
> 

RSI is argument 2, which is folio.

Code starting with the faulting instruction
===
   0:   48 8b 06mov(%rsi),%rax
   3:   48 c1 e8 0d shr$0xd,%rax

Looks to me like a NULL folio was passed into iomap_page_create().

> [   11.306206] FS:  7f03ea8368c0() GS:9de43bdc() 
> knlGS:
> [   11.309949] CS:  0010 DS:  ES:  CR0: 80050033
> [   11.312464] CR2:  CR3: 000117ec6006 CR4: 
> 00770ee0
> [   11.315442] DR0:  DR1:  DR2: 
> 
> [   11.318310] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   11.321010] PKRU: 5554
> [   11.322212] Call Trace:
> [   11.323224]  <TASK>
> [   11.324146]  iomap_readpage_iter+0x96/0x300
> [   11.325694]  iomap_readahead+0x174/0x2d0
> [   11.327129]  read_pages+0x69/0x1f0
> [   11.329751]  page_cache_ra_unbounded+0x187/0x1d0

... that shouldn't be possible.  read_pages() allocates pages, puts them
in the page cache and tells the filesystem to fill them in.

In your patches, did you call mapping_set_large_folios() anywhere?



Re: [Cluster-devel] [PATCH 03/17] fs: rename and move block_page_mkwrite_return

2023-04-24 Thread Matthew Wilcox
On Mon, Apr 24, 2023 at 07:49:12AM +0200, Christoph Hellwig wrote:
> block_page_mkwrite_return is neither block nor mkwrite specific, and
> should not be under CONFIG_BLOCK.  Move it to mm.h and rename it to
> errno_to_vmfault.

Could you move it about 300 lines down and put it near vmf_error()
so we think about how to unify the two at some point?

Perhaps it should better be called vmf_fs_error() for now since the
errnos it handles are the kind generated by filesystems.

> +++ b/include/linux/mm.h
> @@ -3061,6 +3061,19 @@ extern vm_fault_t filemap_map_pages(struct vm_fault 
> *vmf,
>   pgoff_t start_pgoff, pgoff_t end_pgoff);
>  extern vm_fault_t filemap_page_mkwrite(struct vm_fault *vmf);
>  
> +/* Convert errno to return value from ->page_mkwrite() call */
> +static inline vm_fault_t errno_to_vmfault(int err)
> +{
> + if (err == 0)
> + return VM_FAULT_LOCKED;
> + if (err == -EFAULT || err == -EAGAIN)
> + return VM_FAULT_NOPAGE;
> + if (err == -ENOMEM)
> + return VM_FAULT_OOM;
> + /* -ENOSPC, -EDQUOT, -EIO ... */
> + return VM_FAULT_SIGBUS;
> +}
> +
>  extern unsigned long stack_guard_gap;
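
Whichever name wins, the typical caller is a ->page_mkwrite() handler
sitting on an errno, something like this sketch (example_get_block and
example_page_mkwrite are made-up names, not anything in-tree):

static vm_fault_t example_page_mkwrite(struct vm_fault *vmf)
{
	struct super_block *sb = file_inode(vmf->vma->vm_file)->i_sb;
	int err;

	sb_start_pagefault(sb);
	err = block_page_mkwrite(vmf->vma, vmf, example_get_block);
	sb_end_pagefault(sb);

	/* 0 becomes VM_FAULT_LOCKED, -ENOMEM becomes VM_FAULT_OOM, ... */
	return errno_to_vmfault(err);
}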



Re: [Cluster-devel] [PATCH v3 19/19] block: mark bio_add_page as __must_check

2023-04-19 Thread Matthew Wilcox
On Wed, Apr 19, 2023 at 04:09:29PM +0200, Johannes Thumshirn wrote:
> Now that all users of bio_add_page check for the return value, mark
> bio_add_page as __must_check.

Should probably add __must_check to bio_add_folio too?  If this is
really the way you want to go ... means we also need a
__bio_add_folio().
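
Something like this minimal sketch, presumably (not a final signature,
just mirroring the bio_add_page()/__bio_add_page() split):

/* add a folio that is known to fit; the checked variant stays __must_check */
static inline void __bio_add_folio(struct bio *bio, struct folio *folio,
		size_t len, size_t off)
{
	__bio_add_page(bio, &folio->page, len, off);
}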



Re: [Cluster-devel] [PATCH 02/19] drbd: use __bio_add_page to add page to bio

2023-03-29 Thread Matthew Wilcox
On Wed, Mar 29, 2023 at 10:05:48AM -0700, Johannes Thumshirn wrote:
> +++ b/drivers/block/drbd/drbd_bitmap.c
> @@ -1043,9 +1043,11 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx 
> *ctx, int page_nr) __must_ho
>   bio = bio_alloc_bioset(device->ldev->md_bdev, 1, op, GFP_NOIO,
>   _md_io_bio_set);
>   bio->bi_iter.bi_sector = on_disk_sector;
> - /* bio_add_page of a single page to an empty bio will always succeed,
> -  * according to api.  Do we want to assert that? */
> - bio_add_page(bio, page, len, 0);
> + /*
> +  * __bio_add_page of a single page to an empty bio will always succeed,
> +  * according to api.  Do we want to assert that?
> +  */
> + __bio_add_page(bio, page, len, 0);

Surely the comment should just be deleted?  With no return value to
check, what would you assert?



Re: [Cluster-devel] [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler

2023-01-31 Thread Matthew Wilcox
On Sun, Jan 08, 2023 at 08:40:29PM +0100, Andreas Gruenbacher wrote:
> +static struct folio *
> +gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
>  {
> + struct inode *inode = iter->inode;
>   unsigned int blockmask = i_blocksize(inode) - 1;
>   struct gfs2_sbd *sdp = GFS2_SB(inode);
>   unsigned int blocks;
> + struct folio *folio;
> + int status;
>  
>   blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
> - return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> + status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> + if (status)
> + return ERR_PTR(status);
> +
> + folio = iomap_get_folio(iter, pos);
> + if (IS_ERR(folio))
> + gfs2_trans_end(sdp);
> + return folio;
>  }

Hi Andreas,

I didn't think to mention this at the time, but I was reading through
buffered-io.c and this jumped out at me.  For filesystems which support
folios, we pass the entire length of the write (or at least the length
of the remaining iomap length).  That's intended to allow us to decide
how large a folio to allocate at some point in the future.

For GFS2, we do this:

if (!mapping_large_folio_support(iter->inode->i_mapping))
len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));

I'd like to drop that and pass the full length of the write to
->get_folio().  It looks like you'll have to clamp it yourself at this
point.  I am kind of curious why you do one transaction per page --
I would have thought you'd rather do one transaction for the entire write.
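
(Roughly speaking, the clamping iomap does today would then move into the
->page_prepare() handler itself; a sketch, not the final code:)

        /* Clamp to a single page until gfs2 handles large folios. */
        if (!mapping_large_folio_support(iter->inode->i_mapping))
                len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));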



Re: [Cluster-devel] [PATCH 8/9] btrfs: handle a NULL folio in extent_range_redirty_for_io

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:28AM +0100, Christoph Hellwig wrote:
> filemap_get_folio can return NULL, skip those cases.

Hmm, I'm not sure that's true.  We have one place that calls
extent_range_redirty_for_io(), and it previously calls
extent_range_clear_dirty_for_io() which has an explicit

BUG_ON(!page); /* Pages should be in the extent_io_tree */

so I'm going to say this one can't happen either.  I haven't delved far
enough into btrfs to figure out why it can't happen.

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/btrfs/extent_io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d55e4531ffd212..a54d2cf74ba020 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -230,6 +230,8 @@ void extent_range_redirty_for_io(struct inode *inode, u64 
> start, u64 end)
>  
>   while (index <= end_index) {
>   folio = filemap_get_folio(mapping, index);
> + if (!folio)
> + continue;
>   filemap_dirty_folio(mapping, folio);
>   folio_account_redirty(folio);
>   index += folio_nr_pages(folio);
> -- 
> 2.39.0
> 



Re: [Cluster-devel] [PATCH 7/9] gfs2: handle a NULL folio in gfs2_jhead_process_page

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:27AM +0100, Christoph Hellwig wrote:
> filemap_get_folio can return NULL, so exit early for that case.

I'm not sure it can return NULL in this specific case.  As I understand
this code, we're scanning the journal looking for the log head.  We've
just submitted the bio to read this page.  I suppose memory pressure
could theoretically push the page out, but if it does, we're doing the
wrong thing by just returning here; we need to retry reading the page.

Assuming we're not willing to do the work to add that case, I think I'd
rather see the crash in folio_wait_locked() than get data corruption
from failing to find the head of the log.

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/lops.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
> index 1902413d5d123e..51d4b610127cdb 100644
> --- a/fs/gfs2/lops.c
> +++ b/fs/gfs2/lops.c
> @@ -472,6 +472,8 @@ static void gfs2_jhead_process_page(struct gfs2_jdesc 
> *jd, unsigned long index,
>   struct folio *folio;
>  
>   folio = filemap_get_folio(jd->jd_inode->i_mapping, index);
> + if (!folio)
> + return;
>  
>   folio_wait_locked(folio);
>   if (folio_test_error(folio))
> -- 
> 2.39.0
> 



Re: [Cluster-devel] [PATCH 3/9] mm: use filemap_get_entry in filemap_get_incore_folio

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:23AM +0100, Christoph Hellwig wrote:
> filemap_get_incore_folio wants to look at the details of xa_is_value
> entries, but doesn't need any of the other logic in filemap_get_folio.
> Switch it to use the lower-level filemap_get_entry interface.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH 2/9] mm: make mapping_get_entry available outside of filemap.c

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:22AM +0100, Christoph Hellwig wrote:
> mapping_get_entry is useful for page cache API users that need to know
> about xa_value internals.  Rename it and make it available in pagemap.h.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH 1/9] mm: don't look at xarray value entries in split_huge_pages_in_file

2023-01-18 Thread Matthew Wilcox
On Wed, Jan 18, 2023 at 10:43:21AM +0100, Christoph Hellwig wrote:
> split_huge_pages_in_file never wants to do anything with the special
> value entries.  Switch to using filemap_get_folio to not even see them.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-16 Thread Matthew Wilcox
On Sun, Jan 15, 2023 at 11:34:26PM -0800, Christoph Hellwig wrote:
> We could do that.  But while reading what Darrick wrote I came up with
> another idea I quite like.  Just split the FGP_ENTRY handling into
> a separate helper.  The logic and use cases are quite different from
> the normal page cache lookup, and the returning of the xarray entry
> is exactly the kind of layering violation that Dave is complaining
> about.  So what about just splitting that use case into a separate
> self contained helper?

Essentially reverting 44835d20b2a0.  Although we retain the merging of
the lock & get functions via the use of FGP flags.  Let me think about
it for a day.

> ---
> From b4d10f98ea57f8480c03c0b00abad6f2b7186f56 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Mon, 16 Jan 2023 08:26:57 +0100
> Subject: mm: replace FGP_ENTRY with a new __filemap_get_folio_entry helper
> 
> Split the xarray entry returning logic into a separate helper.  This will
> allow returning ERR_PTRs from __filemap_get_folio, and also isolates the
> logic that needs to know about xarray internals into a separate
> function.  This causes some code duplication, but as most flags to
> __filemap_get_folio are not applicable for the users that care about an
> entry, that amount is very limited.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/pagemap.h |  6 +++--
>  mm/filemap.c| 50 -
>  mm/huge_memory.c|  4 ++--
>  mm/shmem.c  |  5 ++---
>  mm/swap_state.c |  2 +-
>  5 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 4b3a7124c76712..e06c14b610caf2 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -504,8 +504,7 @@ pgoff_t page_cache_prev_miss(struct address_space 
> *mapping,
>  #define FGP_NOFS 0x0010
>  #define FGP_NOWAIT   0x0020
>  #define FGP_FOR_MMAP 0x0040
> -#define FGP_ENTRY0x0080
> -#define FGP_STABLE   0x0100
> +#define FGP_STABLE   0x0080
>  
>  struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t 
> index,
>   int fgp_flags, gfp_t gfp);
> @@ -546,6 +545,9 @@ static inline struct folio *filemap_lock_folio(struct 
> address_space *mapping,
>   return __filemap_get_folio(mapping, index, FGP_LOCK, 0);
>  }
>  
> +struct folio *__filemap_get_folio_entry(struct address_space *mapping,
> + pgoff_t index, int fgp_flags);
> +
>  /**
>   * find_get_page - find and get a page reference
>   * @mapping: the address_space to search
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c4d4ace9cc7003..d04613347b3e71 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1887,8 +1887,6 @@ static void *mapping_get_entry(struct address_space 
> *mapping, pgoff_t index)
>   *
>   * * %FGP_ACCESSED - The folio will be marked accessed.
>   * * %FGP_LOCK - The folio is returned locked.
> - * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
> - *   instead of allocating a new folio to replace it.
>   * * %FGP_CREAT - If no page is present then a new page is allocated using
>   *   @gfp and added to the page cache and the VM's LRU list.
>   *   The page is returned locked and with an increased refcount.
> @@ -1914,11 +1912,8 @@ struct folio *__filemap_get_folio(struct address_space 
> *mapping, pgoff_t index,
>  
>  repeat:
>   folio = mapping_get_entry(mapping, index);
> - if (xa_is_value(folio)) {
> - if (fgp_flags & FGP_ENTRY)
> - return folio;
> + if (xa_is_value(folio))
>   folio = NULL;
> - }
>   if (!folio)
>   goto no_page;
>  
> @@ -1994,6 +1989,49 @@ struct folio *__filemap_get_folio(struct address_space 
> *mapping, pgoff_t index,
>  }
>  EXPORT_SYMBOL(__filemap_get_folio);
>  
> +
> +/**
> + * __filemap_get_folio_entry - Find and get a reference to a folio.
> + * @mapping: The address_space to search.
> + * @index: The page index.
> + * @fgp_flags: %FGP flags modify how the folio is returned.
> + *
> + * Looks up the page cache entry at @mapping & @index.  If there is a shadow 
> /
> + * swap / DAX entry, return it instead of allocating a new folio to replace 
> it.
> + *
> + * @fgp_flags can be zero or more of these flags:
> + *
> + * * %FGP_LOCK - The folio is returned locked.
> + *
> + * If there is a page cache page, it is returned with an increased refcount.
> + *
> + * Return: The found folio or %NULL otherwise.
> + */
> +struct folio *__filemap_get_folio_entry(struct address_space *mapping,
> + pgoff_t index, int fgp_flags)
> +{
> + struct folio *folio;
> +
> + if (WARN_ON_ONCE(fgp_flags & ~FGP_LOCK))
> + return NULL;
> +
> +repeat:
> + folio = mapping_get_entry(mapping, index);
> + if (folio && !xa_is_value(folio) && (fgp_flags & FGP_LOCK)) {
> + 

Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-15 Thread Matthew Wilcox
On Sun, Jan 15, 2023 at 09:06:50AM -0800, Darrick J. Wong wrote:
> On Sun, Jan 15, 2023 at 09:01:22AM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 10, 2023 at 01:34:16PM +, Matthew Wilcox wrote:
> > > On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> > > > On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > > > > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > > > > checking for that in iomap_get_folio().  Your patch then turns into
> > > > > the below.
> > > > 
> > > > Exactly.  And as I already pointed out in reply to Dave's original
> > > > patch what we really should be doing is returning an ERR_PTR from
> > > > __filemap_get_folio instead of reverse-engineering the expected
> > > > error code.
> > > 
> > > Ouch, we have a nasty problem.
> > > 
> > > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > > meaning that some shadow entries will look like errors.  The way I
> > > solved this in the XArray code is by shifting the error values by
> > > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > > 
> > > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > > but so far we haven't, and I'd like to make that decision intentionally.
> > 
> > Sorry, I'm not following this at all -- where in buffered-io.c does
> > anyone pass FGP_ENTRY?  Andreas' code doesn't seem to introduce it
> > either...?
> 
> Oh, never mind, I worked out that the conflict is between iomap not
> passing FGP_ENTRY and wanting a pointer or a negative errno; and someone
> who does FGP_ENTRY, in which case the xarray value can be confused for a
> negative errno.
> 
> OFC now I wonder, can we simply say that the return value is "The found
> folio or NULL if you set FGP_ENTRY; or the found folio or a negative
> errno if you don't" ?

Erm ... I would rather not!

Part of me remembers that x86-64 has the rather nice calling convention
of being able to return a struct containing two values in two registers:

: Integer return values up to 64 bits in size are stored in RAX while
: values up to 128 bit are stored in RAX and RDX.

so maybe we can return:

struct OptionFolio {
int err;
struct folio *folio;
};
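
(A hypothetical caller, purely to illustrate the calling convention --
nothing here is an existing API:)

        struct OptionFolio ret = __filemap_get_folio(mapping, index,
                                                     FGP_LOCK, gfp);

        if (ret.err)
                return ret.err;
        folio = ret.folio;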



Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-11 Thread Matthew Wilcox
On Tue, Jan 10, 2023 at 07:24:27AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 10, 2023 at 01:34:16PM +0000, Matthew Wilcox wrote:
> > > Exactly.  And as I already pointed out in reply to Dave's original
> > > patch what we really should be doing is returning an ERR_PTR from
> > > __filemap_get_folio instead of reverse-engineering the expected
> > > error code.
> > 
> > Ouch, we have a nasty problem.
> > 
> > If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
> > encodings for shadow entries overlap with the encodings for ERR_PTR,
> > meaning that some shadow entries will look like errors.  The way I
> > solved this in the XArray code is by shifting the error values by
> > two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).
> > 
> > I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
> > but so far we haven't, and I'd like to make that decision intentionally.
> 
> So what would be an alternative way to tell the callers why no folio
> was found instead of trying to reverse engineer that?  Return an errno
> and the folio by reference?  The would work, but the calling conventions
> would be awful.

Agreed.  How about an xa_filemap_get_folio()?

(there are a number of things to fix here; haven't decided if XA_ERROR
should return void *, or whether i should use a separate 'entry' and
'folio' until I know the entry is actually a folio ...)

Usage would seem pretty straightforward:

folio = xa_filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
fgp, mapping_gfp_mask(iter->inode->i_mapping));
status = xa_err(folio);
if (status)
goto out_no_page;

diff --git a/mm/filemap.c b/mm/filemap.c
index 7bf8442bcfaa..7d489f96c690 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1800,40 +1800,25 @@ static void *mapping_get_entry(struct address_space 
*mapping, pgoff_t index)
 }
 
 /**
- * __filemap_get_folio - Find and get a reference to a folio.
+ * xa_filemap_get_folio - Find and get a reference to a folio.
  * @mapping: The address_space to search.
  * @index: The page index.
  * @fgp_flags: %FGP flags modify how the folio is returned.
  * @gfp: Memory allocation flags to use if %FGP_CREAT is specified.
  *
- * Looks up the page cache entry at @mapping & @index.
- *
- * @fgp_flags can be zero or more of these flags:
- *
- * * %FGP_ACCESSED - The folio will be marked accessed.
- * * %FGP_LOCK - The folio is returned locked.
- * * %FGP_ENTRY - If there is a shadow / swap / DAX entry, return it
- *   instead of allocating a new folio to replace it.
- * * %FGP_CREAT - If no page is present then a new page is allocated using
- *   @gfp and added to the page cache and the VM's LRU list.
- *   The page is returned locked and with an increased refcount.
- * * %FGP_FOR_MMAP - The caller wants to do its own locking dance if the
- *   page is already in cache.  If the page was allocated, unlock it before
- *   returning so the caller can do the same dance.
- * * %FGP_WRITE - The page will be written to by the caller.
- * * %FGP_NOFS - __GFP_FS will get cleared in gfp.
- * * %FGP_NOWAIT - Don't get blocked by page lock.
- * * %FGP_STABLE - Wait for the folio to be stable (finished writeback)
- *
- * If %FGP_LOCK or %FGP_CREAT are specified then the function may sleep even
- * if the %GFP flags specified for %FGP_CREAT are atomic.
+ * Looks up the page cache entry at @mapping & @index.  See
+ * __filemap_get_folio() for a detailed description.
  *
- * If there is a page cache page, it is returned with an increased refcount.
+ * This differs from __filemap_get_folio() in that it will return an
+ * XArray error instead of NULL if something goes wrong, allowing the
+ * advanced user to distinguish why the failure happened.  We can't use an
+ * ERR_PTR() because its encodings overlap with shadow/swap/dax entries.
  *
- * Return: The found folio or %NULL otherwise.
+ * Return: The entry in the page cache or an xa_err() if there is no entry
+ * or it could not be appropiately locked.
  */
-struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-   int fgp_flags, gfp_t gfp)
+struct folio *xa_filemap_get_folio(struct address_space *mapping,
+   pgoff_t index, int fgp_flags, gfp_t gfp)
 {
struct folio *folio;
 
@@ -1851,7 +1836,7 @@ struct folio *__filemap_get_folio(struct address_space 
*mapping, pgoff_t index,
if (fgp_flags & FGP_NOWAIT) {
if (!folio_trylock(folio)) {
folio_put(folio);
-   return NULL;
+   return (struct folio *)XA_ERROR(-EAGAIN);
}
} else {
folio_lock(folio);
@@ -1890,7 +1875,7 @@ struct folio

Re: [Cluster-devel] [RFC v6 04/10] iomap: Add iomap_get_folio helper

2023-01-10 Thread Matthew Wilcox
On Tue, Jan 10, 2023 at 12:46:45AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 09, 2023 at 01:46:42PM +0100, Andreas Gruenbacher wrote:
> > We can handle that by adding a new IOMAP_NOCREATE iterator flag and
> > checking for that in iomap_get_folio().  Your patch then turns into
> > the below.
> 
> Exactly.  And as I already pointed out in reply to Dave's original
> patch what we really should be doing is returning an ERR_PTR from
> __filemap_get_folio instead of reverse-engineering the expected
> error code.

Ouch, we have a nasty problem.

If somebody passes FGP_ENTRY, we can return a shadow entry.  And the
encodings for shadow entries overlap with the encodings for ERR_PTR,
meaning that some shadow entries will look like errors.  The way I
solved this in the XArray code is by shifting the error values by
two bits and encoding errors as XA_ERROR(-ENOMEM) (for example).

I don't _object_ to introducing XA_ERROR() / xa_err() into the VFS,
but so far we haven't, and I'd like to make that decision intentionally.



Re: [Cluster-devel] [PATCH v5 7/9] iomap/xfs: Eliminate the iomap_valid handler

2023-01-04 Thread Matthew Wilcox
On Wed, Jan 04, 2023 at 09:53:17AM -0800, Darrick J. Wong wrote:
> I wonder if this should be reworked a bit to reduce indenting:
> 
>   if (PTR_ERR(folio) == -ESTALE) {

FYI this is a bad habit to be in.  The compiler can optimise

if (folio == ERR_PTR(-ESTALE))

better than it can optimise the other way around.
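
(Purely illustrative; handle_stale() is a made-up helper:)

        /* Preferred: compare against a compile-time constant pointer. */
        if (folio == ERR_PTR(-ESTALE))
                return handle_stale();

        /* Equivalent, but PTR_ERR() forces the conversion before the test. */
        if (PTR_ERR(folio) == -ESTALE)
                return handle_stale();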



Re: [Cluster-devel] [RFC v3 4/7] iomap: Add iomap_folio_prepare helper

2022-12-25 Thread Matthew Wilcox
On Fri, Dec 23, 2022 at 11:23:34PM -0800, Christoph Hellwig wrote:
> On Fri, Dec 23, 2022 at 10:05:05PM +0100, Andreas Grünbacher wrote:
> > > I'd name this __iomap_get_folio to match __filemap_get_folio.
> > 
> > I was looking at it from the filesystem point of view: this helper is
> > meant to be used in ->folio_prepare() handlers, so
> > iomap_folio_prepare() seemed to be a better name than
> > __iomap_get_folio().
> 
> Well, I think the right name for the methods that gets a folio is
> probably ->folio_get anyway.

For the a_ops, the convention I've been following is:

folio_mark_dirty()
 -> aops->dirty_folio()
   -> iomap_dirty_folio()

ie VERB_folio() as the name of the operation, and MODULE_VERB_folio()
as the implementation.  Seems to work pretty well.
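
(A rough illustration of how that reads when wired up; iomap_dirty_folio()
is just the name from the chain above, the other two are existing iomap
helpers:)

        static const struct address_space_operations example_aops = {
                .dirty_folio      = iomap_dirty_folio,
                .release_folio    = iomap_release_folio,
                .invalidate_folio = iomap_invalidate_folio,
        };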



Re: [Cluster-devel] [RFC v3 5/7] iomap: Get page in page_prepare handler

2022-12-16 Thread Matthew Wilcox
On Fri, Dec 16, 2022 at 04:06:24PM +0100, Andreas Gruenbacher wrote:
> + if (page_ops && page_ops->page_prepare)
> + folio = page_ops->page_prepare(iter, pos, len);
> + else
> + folio = iomap_folio_prepare(iter, pos);
> + if (IS_ERR_OR_NULL(folio)) {
> + if (!folio)
> + return (iter->flags & IOMAP_NOWAIT) ? -EAGAIN : -ENOMEM;
> + return PTR_ERR(folio);

Wouldn't it be cleaner if iomap_folio_prepare() always
returned an ERR_PTR on failure?
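
(The caller would then collapse to something like this sketch, with the
-EAGAIN/-ENOMEM choice pushed down into the helper:)

        if (page_ops && page_ops->page_prepare)
                folio = page_ops->page_prepare(iter, pos, len);
        else
                folio = iomap_folio_prepare(iter, pos);
        if (IS_ERR(folio))
                return PTR_ERR(folio);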



Re: [Cluster-devel] BUG: unable to handle kernel NULL pointer dereference in gfs2_evict_inode

2022-11-21 Thread Matthew Wilcox
On Fri, Nov 18, 2022 at 10:33:21AM +0100, Dmitry Vyukov wrote:
> On Fri, 18 Nov 2022 at 09:06, Wei Chen  wrote:
> >
> > Dear Linux developers,
> >
> > The bug persists in upstream Linux v6.0-rc5.
> 
> If you fix this, please also add the syzbot tag:
> 
> Reported-by: syzbot+8a5fc6416c175cece...@syzkaller.appspotmail.com
> https://lore.kernel.org/all/ab092305e268a...@google.com/

Hey Dmitri, does Wei Chen work with you?  They're not responding to
requests to understand what they're doing.  eg:

https://lore.kernel.org/all/ytvhvkpafzgmh...@casper.infradead.org/

https://lore.kernel.org/all/y0sat5grkumuw...@casper.infradead.org/

I'm just ignoring their reports now.



Re: [Cluster-devel] [PATCH] filelock: move file locking definitions to separate header file

2022-11-21 Thread Matthew Wilcox
On Sun, Nov 20, 2022 at 03:59:57PM -0500, Jeff Layton wrote:
> Move the file locking definitions to a new header file, and add the
> appropriate #include directives to the source files that need them. By
> doing this we trim down fs.h a bit and limit the amount of rebuilding
> that has to be done when we make changes to the file locking APIs.

I'm in favour of this in general, but I think there's a few implicit
includes.  Can you create a test.c that contains only
#include <linux/filelock.h> and see if there's anything missing?
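
(Something as small as this sketch would do; any type the new header uses
but doesn't pull in itself then shows up as a compile error:)

        /* test.c - deliberately includes nothing else */
        #include <linux/filelock.h>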

> + wait_queue_head_t fl_wait;
> + struct file *fl_file;

These two seem undefined at this point.

> + struct fasync_struct *  fl_fasync; /* for lease break notifications */

Likewise.



Re: [Cluster-devel] [PATCH] filelock: move file locking definitions to separate header file

2022-11-21 Thread Matthew Wilcox
On Mon, Nov 21, 2022 at 09:26:16AM +0800, Xiubo Li wrote:
[1300+ lines snipped]
> LGTM.
> 
> Reviewed-by: Xiubo Li 

You really don't need to quote the whole thing.  Please be more
considerate.



Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-04 Thread Matthew Wilcox
On Fri, Nov 04, 2022 at 11:32:35AM +1100, Dave Chinner wrote:
> At minimum, it needs to be documented, though I'd much prefer that
> we explicitly duplicate write_cache_pages() as write_cache_folios()
> with a callback that takes a folio and change the code to be fully
> multi-page folio safe. Then filesystems that support folios (and
> large folios) natively can be passed folios without going through
> this crappy "folio->page, page->folio" dance because the writepage
> APIs are unaware of multi-page folio constructs.

There are a lot of places which go through the folio->page->folio
dance, and this one wasn't even close to the top of my list.  That
said, it has a fairly small number of callers -- ext4, fuse, iomap,
mpage, nfs, orangefs.  So Vishal, this seems like a good project for you
to take on next -- convert write_cache_pages() to write_cache_folios()
and writepage_t to write_folio_t.
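
(Hypothetical shape of that conversion, just to pin the suggestion down:)

        typedef int (*write_folio_t)(struct folio *folio,
                        struct writeback_control *wbc, void *data);

        int write_cache_folios(struct address_space *mapping,
                        struct writeback_control *wbc,
                        write_folio_t write_folio, void *data);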



Re: [Cluster-devel] [PATCH 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-11-04 Thread Matthew Wilcox
On Wed, Oct 19, 2022 at 08:01:52AM +1100, Dave Chinner wrote:
> On Thu, Sep 01, 2022 at 03:01:19PM -0700, Vishal Moola (Oracle) wrote:
> > @@ -2313,17 +2313,18 @@ int write_cache_pages(struct address_space *mapping,
> > while (!done && (index <= end)) {
> > int i;
> >  
> > -   nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
> > -   tag);
> > -   if (nr_pages == 0)
> > +   nr_folios = filemap_get_folios_tag(mapping, &index, end,
> > +   tag, &fbatch);
> 
> This can find and return dirty multi-page folios if the filesystem
> enables them in the mapping at instantiation time, right?

Correct.  Just like before the patch.  pagevec_lookup_range_tag() has
only ever returned head pages, never tail pages.  This is probably
because shmem (which was our only fs that supported compound pages)
never supported writeback, so never looked up pages by tag.

> > trace_wbc_writepage(wbc, inode_to_bdi(mapping->host));
> > -   error = (*writepage)(page, wbc, data);
> > +   error = writepage(&folio->page, wbc, data);
> 
> Yet, IIUC, this treats all folios as if they are single page folios.
> i.e. it passes the head page of a multi-page folio to a callback
> that will treat it as a single PAGE_SIZE page, because that's all
> the writepage callbacks are currently expected to be passed...
> 
> So won't this break writeback of dirty multipage folios?

No.  A filesystem only sets the flag to create multipage folios once its
writeback callback handles multipage folios correctly (amongst many other
things that have to be fixed and tested).  I haven't written down all
the things that a filesystem maintainer needs to check at least partly
because I don't know how representative XFS/iomap are of all filesystems.



Re: [Cluster-devel] [PATCH v3 04/23] page-writeback: Convert write_cache_pages() to use filemap_get_folios_tag()

2022-10-24 Thread Matthew Wilcox
On Mon, Oct 17, 2022 at 01:24:32PM -0700, Vishal Moola (Oracle) wrote:
> Converted function to use folios throughout. This is in preparation for
> the removal of find_get_pages_range_tag().

And removes eight calls to compound_head(), saving 296 bytes of kernel
text (!)  It also adds support for large folios to this function.

> Signed-off-by: Vishal Moola (Oracle) 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH v3 03/23] filemap: Convert __filemap_fdatawait_range() to use filemap_get_folios_tag()

2022-10-24 Thread Matthew Wilcox
On Mon, Oct 17, 2022 at 01:24:31PM -0700, Vishal Moola (Oracle) wrote:
> Converted function to use folios. This is in preparation for the removal
> of find_get_pages_range_tag().

Yes, it is, but this patch also has some nice advantages of its own:

 - Removes a call to wait_on_page_writeback(), which removes a call
   to compound_head()
 - Removes a call to ClearPageError(), which removes another call
   to compound_head()
 - Removes a call to pagevec_release(), which will eventually
   remove a third call to compound_head() (it doesn't today, but
   one day ...)

So you can definitely say that it removes 50 bytes of text and two
calls to compound_head().  And that way, this patch justifies its
existence by itself ;-)

> Signed-off-by: Vishal Moola (Oracle) 

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH v3 01/23] pagemap: Add filemap_grab_folio()

2022-10-24 Thread Matthew Wilcox
On Mon, Oct 17, 2022 at 01:24:29PM -0700, Vishal Moola (Oracle) wrote:
> Add function filemap_grab_folio() to grab a folio from the page cache.
> This function is meant to serve as a folio replacement for
> grab_cache_page, and is used to facilitate the removal of
> find_get_pages_range_tag().

I'm still not loving the name, but it does have historical precedent
and I can't think of a better one.

Reviewed-by: Matthew Wilcox (Oracle) 



Re: [Cluster-devel] [PATCH v3 02/23] filemap: Added filemap_get_folios_tag()

2022-10-24 Thread Matthew Wilcox
On Mon, Oct 17, 2022 at 01:24:30PM -0700, Vishal Moola (Oracle) wrote:
> This is the equivalent of find_get_pages_range_tag(), except for folios
> instead of pages.
> 
> One notable difference is filemap_get_folios_tag() does not take in a
> maximum pages argument. It instead tries to fill a folio batch and stops
> either once full (15 folios) or reaching the end of the search range.
> 
> The new function supports large folios, the initial function did not
> since all callers don't use large folios.

Reviewed-by: Matthew Wilcox (Oracle) 

> +/**
> + * filemap_get_folios_tag - Get a batch of folios matching @tag.
> + * @mapping:	The address_space to search
> + * @start:	The starting page index
> + * @end:	The final page index (inclusive)
> + * @tag:	The tag index
> + * @fbatch:	The batch to fill
> + *
> + * Same as filemap_get_folios, but only returning folios tagged with @tag

If you add () after filemap_get_folios, it turns into a nice link in
the html documentation.

> + *
> + * Return: The number of folios found

Missing full stop at the end of this line.

> + * Also update @start to index the next folio for traversal

Ditto.

> + */
> +unsigned filemap_get_folios_tag(struct address_space *mapping, pgoff_t 
> *start,
> + pgoff_t end, xa_mark_t tag, struct folio_batch *fbatch)
> +{
> + XA_STATE(xas, &mapping->i_pages, *start);
> + struct folio *folio;
> +
> + rcu_read_lock();
> + while ((folio = find_get_entry(&xas, end, tag)) != NULL) {
> + /* Shadow entries should never be tagged, but this iteration
> +  * is lockless so there is a window for page reclaim to evict
> +  * a page we saw tagged. Skip over it.
> +  */

For multiline comments, the "/*" should be on a line by itself.
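
That is, something like:

        /*
         * Shadow entries should never be tagged, but this iteration
         * is lockless so there is a window for page reclaim to evict
         * a page we saw tagged.  Skip over it.
         */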



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Matthew Wilcox
On Wed, Aug 10, 2022 at 11:32:06PM +0200, Andreas Grünbacher wrote:
> On Wed, 10 Aug 2022 at 22:57, Matthew Wilcox
> wrote:
> > On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> > > XFS hasn't had a ->writepage call for a while. After LSF I internally
> > > tested dropping btrfs' callback, and the results looked good: no OOM
> > > kills with dirty/writeback pages remaining, performance parity. Then I
> > > went on vacation and Christoph beat me to the patch :)
> >
> > To avoid duplicating work with you or Christoph ... it seems like the
> > plan is to kill ->writepage entirely soon, so there's no point in me
> > doing a sweep of all the filesystems to convert ->writepage to
> > ->write_folio, correct?
> >
> > I assume the plan for filesystems which have a writepage but don't have
> > a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
> > nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
> > them a writepages, modelled on iomap_writepages().  Seems that adding
> > a block_writepages() might be a useful thing for me to do?
> 
> Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages()
> functions, so it should probably be fine.

Ah, it's gfs2_aspace_writepage which doesn't have a writepages
counterpart.  I haven't looked at it to understand why it's needed.
(gfs2_meta_aops and gfs2_rgrp_aops)



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Matthew Wilcox
On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> XFS hasn't had a ->writepage call for a while. After LSF I internally
> tested dropping btrfs' callback, and the results looked good: no OOM
> kills with dirty/writeback pages remaining, performance parity. Then I
> went on vacation and Christoph beat me to the patch :)

To avoid duplicating work with you or Christoph ... it seems like the
plan is to kill ->writepage entirely soon, so there's no point in me
doing a sweep of all the filesystems to convert ->writepage to
->write_folio, correct?

I assume the plan for filesystems which have a writepage but don't have
a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
them a writepages, modelled on iomap_writepages().  Seems that adding
a block_writepages() might be a useful thing for me to do?
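
(A rough sketch of what such a block_writepages() could look like, built on
write_cache_pages(); the wrapper name and the way get_block is threaded
through are hypothetical:)

        static int block_writepage_fn(struct page *page,
                        struct writeback_control *wbc, void *data)
        {
                get_block_t *get_block = data;

                return block_write_full_page(page, get_block, wbc);
        }

        int block_writepages(struct address_space *mapping,
                        struct writeback_control *wbc, get_block_t *get_block)
        {
                return write_cache_pages(mapping, wbc, block_writepage_fn,
                                get_block);
        }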



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-28 Thread Matthew Wilcox
On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote:
> Hi Christoph!
> 
> On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> > this series removes iomap_writepage and it's callers, following what xfs
> > has been doing for a long time.
> 
> So this effectively means "no writeback from page reclaim for these
> filesystems" AFAICT (page migration of dirty pages seems to be handled by
> iomap_migrate_page()) which is going to make life somewhat harder for
> memory reclaim when memory pressure is high enough that dirty pages are
> reaching end of the LRU list. I don't expect this to be a problem on big
> machines but it could have some undesirable effects for small ones
> (embedded, small VMs). I agree per-page writeback has been a bad idea for
> efficiency reasons for at least last 10-15 years and most filesystems
> stopped dealing with more complex situations (like block allocation) from
> ->writepage() already quite a few years ago without any bug reports AFAIK.
> So it all seems like a sensible idea from FS POV but are MM people on board
> or at least aware of this movement in the fs land?

I mentioned it during my folio session at LSFMM, but didn't put a huge
emphasis on it.

For XFS, writeback should already be in progress on other pages if
we're getting to the point of trying to call ->writepage() in vmscan.
Surely this is also true for other filesystems?



Re: [Cluster-devel] [PATCH v2 07/19] mm/migrate: Convert expected_page_refs() to folio_expected_refs()

2022-07-07 Thread Matthew Wilcox
On Thu, Jul 07, 2022 at 07:50:17PM -0700, Hugh Dickins wrote:
> On Wed, 8 Jun 2022, Matthew Wilcox (Oracle) wrote:
> 
> > Now that both callers have a folio, convert this function to
> > take a folio & rename it.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Reviewed-by: Christoph Hellwig 
> > ---
> >  mm/migrate.c | 19 ---
> >  1 file changed, 12 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 2975f0c4d7cf..2e2f41572066 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -336,13 +336,18 @@ void pmd_migration_entry_wait(struct mm_struct *mm, 
> > pmd_t *pmd)
> >  }
> >  #endif
> >  
> > -static int expected_page_refs(struct address_space *mapping, struct page 
> > *page)
> > +static int folio_expected_refs(struct address_space *mapping,
> > +   struct folio *folio)
> >  {
> > -   int expected_count = 1;
> > +   int refs = 1;
> > +   if (!mapping)
> > +   return refs;
> >  
> > -   if (mapping)
> > -   expected_count += compound_nr(page) + page_has_private(page);
> > -   return expected_count;
> > +   refs += folio_nr_pages(folio);
> > +   if (folio_get_private(folio))
> > +   refs++;
> > +
> > +   return refs;
> >  }
> >  
> >  /*
> > @@ -359,7 +364,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> > XA_STATE(xas, &mapping->i_pages, folio_index(folio));
> > struct zone *oldzone, *newzone;
> > int dirty;
> > -   int expected_count = expected_page_refs(mapping, &folio->page) + 
> > extra_count;
> > +   int expected_count = folio_expected_refs(mapping, folio) + extra_count;
> > long nr = folio_nr_pages(folio);
> >  
> > if (!mapping) {
> > @@ -669,7 +674,7 @@ static int __buffer_migrate_folio(struct address_space 
> > *mapping,
> > return migrate_page(mapping, &dst->page, &src->page, mode);
> >  
> > /* Check whether page does not have extra refs before we do more work */
> > -   expected_count = expected_page_refs(mapping, &src->page);
> > +   expected_count = folio_expected_refs(mapping, src);
> > if (folio_ref_count(src) != expected_count)
> > return -EAGAIN;
> >  
> > -- 
> > 2.35.1
> 
> This commit (742e89c9e352d38df1a5825fe40c4de73a5d5f7a in pagecache.git
> folio/for-next and recent linux-next) is dangerously wrong, at least
> for swapcache, and probably for some others.
> 
> I say "dangerously" because it tells page migration a swapcache page
> is safe for migration when it certainly is not.
> 
> The fun that typically ensues is kernel BUG at include/linux/mm.h:750!
> put_page_testzero() VM_BUG_ON_PAGE(page_ref_count(page) == 0, page),
> if CONFIG_DEBUG_VM=y (bisecting for that is what brought me to this).
> But I guess you might get silent data corruption too.
> 
> I assumed at first that you'd changed the rules, and were now expecting
> any subsystem that puts a non-zero value into folio->private to raise
> its refcount - whereas the old convention (originating with buffer heads)
> is that setting PG_private says an extra refcount has been taken, please
> call try_to_release_page() to lower it, and maybe that will use data in
> page->private to do so; but page->private free for the subsystem owning
> the page to use as it wishes, no refcount implication.  But that you
> had missed updating swapcache.
> 
> So I got working okay with the patch below; but before turning it into
> a proper patch, noticed that there were still plenty of other places
> applying the test for PG_private: so now think that maybe you set out
> with intention as above, realized it wouldn't work, but got distracted
> before cleaning up some places you'd already changed.  And patch below
> now goes in the wrong direction.
> 
> Or maybe you didn't intend any change, but the PG_private test just got
> missed in a few places.  I don't know, hope you remember, but current
> linux-next badly inconsistent.
> Over to you, thanks,

Ugh.  The problem I'm trying to solve is that we're short on page flags.
We _seemed_ to have correlation between "page->private != NULL" and
"PG_private is set", and so I thought I could make progress towards
removing PG_private.  But the rule you set out above wasn't written down
anywhere that I was able to find it.

I'm about to go to sleep, but I'll think on this some more tomorrow.

> Hugh
> 
> --- a/mm/migrate.c2022-07-06 14:24:44.499941975 -0700
> +++ b/mm/migrate.c2022-07-06 15:49:25.0 -0700
> @@ -351,6 +351,10 

Re: [Cluster-devel] gfs2 is unhappy on pagecache/for-next

2022-06-19 Thread Matthew Wilcox
On Sun, Jun 19, 2022 at 09:05:59AM +0200, Christoph Hellwig wrote:
> When trying to run xfstests on gfs2 (locally with the lock_nolock
> cluster managed) the first mount already hits this warning in
> inode_to_wb called from mark_buffer_dirty.  This all seems standard
> code from folio_account_dirtied, so not sure what is going there.

I don't think this is new to pagecache/for-next.
https://lore.kernel.org/linux-mm/cf8bc8dd-8e16-3590-a714-51203e6f4...@redhat.com/

> 
> [   30.440408] [ cut here ]
> [   30.440409] WARNING: CPU: 1 PID: 931 at include/linux/backing-dev.h:261 
> __folio_mark_dirty+0x2f0/0x380
> [   30.446424] Modules linked in:
> [   30.446828] CPU: 1 PID: 931 Comm: kworker/1:2 Not tainted 5.19.0-rc2+ #1702
> [   30.447714] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.14.0-2 04/01/2014
> [   30.448770] Workqueue: gfs_recovery gfs2_recover_func
> [   30.449441] RIP: 0010:__folio_mark_dirty+0x2f0/0x380
> [   30.450113] Code: e8 b5 69 12 01 85 c0 0f 85 6a fe ff ff 48 8b 83 a8 01 00 
> 00 be ff ff ff ff 48 8d 78 2
> [   30.452490] RSP: 0018:c90001b77bd0 EFLAGS: 00010046
> [   30.453141] RAX:  RBX: 8881004a3d00 RCX: 
> 0001
> [   30.454067] RDX:  RSI: 82f592db RDI: 
> 830380ae
> [   30.454970] RBP: ea000455f680 R08: 0001 R09: 
> 84747570
> [   30.455921] R10: 0017 R11: 88810260b1c0 R12: 
> 0282
> [   30.456910] R13: 88810dd92170 R14: 0001 R15: 
> 0001
> [   30.457871] FS:  () GS:88813bd0() 
> knlGS:
> [   30.458912] CS:  0010 DS:  ES:  CR0: 80050033
> [   30.459608] CR2: 7efc1d5adc80 CR3: 000116416000 CR4: 
> 06e0
> [   30.460564] Call Trace:
> [   30.460871]  
> [   30.461130]  mark_buffer_dirty+0x173/0x1d0
> [   30.461687]  update_statfs_inode+0x146/0x187
> [   30.462276]  gfs2_recover_func.cold+0x48f/0x864
> [   30.462875]  ? add_lock_to_list+0x8b/0xf0
> [   30.463337]  ? __lock_acquire+0xf7e/0x1e30
> [   30.463812]  ? lock_acquire+0xd4/0x300
> [   30.464267]  ? lock_acquire+0xe4/0x300
> [   30.464715]  ? gfs2_recover_func.cold+0x217/0x864
> [   30.465334]  process_one_work+0x239/0x550
> [   30.465920]  ? process_one_work+0x550/0x550
> [   30.466485]  worker_thread+0x4d/0x3a0
> [   30.466966]  ? process_one_work+0x550/0x550
> [   30.467509]  kthread+0xe2/0x110
> [   30.467941]  ? kthread_complete_and_exit+0x20/0x20
> [   30.468558]  ret_from_fork+0x22/0x30
> [   30.469047]  
> [   30.469346] irq event stamp: 36146
> [   30.469796] hardirqs last  enabled at (36145): [] 
> folio_memcg_lock+0x8c/0x180
> [   30.470919] hardirqs last disabled at (36146): [] 
> _raw_spin_lock_irqsave+0x59/0x60
> [   30.472024] softirqs last  enabled at (33630): [] 
> __irq_exit_rcu+0xd7/0x130
> [   30.473051] softirqs last disabled at (33619): [] 
> __irq_exit_rcu+0xd7/0x130
> [   30.474107] ---[ end trace  ]---
> [   30.475367] [ cut here ]
> 
> 



Re: [Cluster-devel] [PATCH v2 12/19] btrfs: Convert btrfs_migratepage to migrate_folio

2022-06-09 Thread Matthew Wilcox
On Thu, Jun 09, 2022 at 06:33:23PM +0200, David Sterba wrote:
> On Wed, Jun 08, 2022 at 04:02:42PM +0100, Matthew Wilcox (Oracle) wrote:
> > Use filemap_migrate_folio() to do the bulk of the work, and then copy
> > the ordered flag across if needed.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) 
> > Reviewed-by: Christoph Hellwig 
> 
> Acked-by: David Sterba 
> 
> > +static int btrfs_migrate_folio(struct address_space *mapping,
> > +struct folio *dst, struct folio *src,
> >  enum migrate_mode mode)
> >  {
> > -   int ret;
> > +   int ret = filemap_migrate_folio(mapping, dst, src, mode);
> >  
> > -   ret = migrate_page_move_mapping(mapping, newpage, page, 0);
> > if (ret != MIGRATEPAGE_SUCCESS)
> > return ret;
> >  
> > -   if (page_has_private(page))
> > -   attach_page_private(newpage, detach_page_private(page));
> 
> If I'm reading it correctly, the private pointer does not need to be set
> like that anymore because it's done somewhere during the
> filemap_migrate_folio() call.

That's correct.  Everything except moving the ordered flag across is
done for you, and I'm kind of tempted to modify folio_migrate_flags()
to copy the ordered flag across as well.  Then you could just use
filemap_migrate_folio() directly.
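
(Sketch of that tweak, assuming the folio_*_ordered() helpers used in the
hunk below are visible in folio_migrate_flags(); hypothetical, not a real
patch:)

        /* In folio_migrate_flags(), next to the other flag copies: */
        if (folio_test_ordered(folio)) {
                folio_clear_ordered(folio);
                folio_set_ordered(newfolio);
        }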

> > -
> > -   if (PageOrdered(page)) {
> > -   ClearPageOrdered(page);
> > -   SetPageOrdered(newpage);
> > +   if (folio_test_ordered(src)) {
> > +   folio_clear_ordered(src);
> > +   folio_set_ordered(dst);
> > }



Re: [Cluster-devel] [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-09 Thread Matthew Wilcox
On Thu, Jun 09, 2022 at 02:50:20PM +0200, David Hildenbrand wrote:
> On 08.06.22 17:02, Matthew Wilcox (Oracle) wrote:
> > diff --git a/Documentation/filesystems/locking.rst 
> > b/Documentation/filesystems/locking.rst
> > index c0fe711f14d3..3d28b23676bd 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -253,7 +253,8 @@ prototypes::
> > void (*free_folio)(struct folio *);
> > int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
> > bool (*isolate_page) (struct page *, isolate_mode_t);
> > -   int (*migratepage)(struct address_space *, struct page *, struct page 
> > *);
> > +   int (*migrate_folio)(struct address_space *, struct folio *dst,
> > +   struct folio *src, enum migrate_mode);
> > void (*putback_page) (struct page *);
> 
> isolate_page/putback_page are leftovers from the previous patch, no?

Argh, right, I completely forgot I needed to update the documentation in
that patch.

> > +++ b/Documentation/vm/page_migration.rst
> > @@ -181,22 +181,23 @@ which are function pointers of struct 
> > address_space_operations.
> > Once page is successfully isolated, VM uses page.lru fields so driver
> > shouldn't expect to preserve values in those fields.
> >  
> > -2. ``int (*migratepage) (struct address_space *mapping,``
> > -|  ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
> > -
> > -   After isolation, VM calls migratepage() of driver with the isolated 
> > page.
> > -   The function of migratepage() is to move the contents of the old page 
> > to the
> > -   new page
> > -   and set up fields of struct page newpage. Keep in mind that you should
> > -   indicate to the VM the oldpage is no longer movable via 
> > __ClearPageMovable()
> > -   under page_lock if you migrated the oldpage successfully and returned
> > -   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, 
> > driver
> > -   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short 
> > time
> > -   because VM interprets -EAGAIN as "temporary migration failure". On 
> > returning
> > -   any error except -EAGAIN, VM will give up the page migration without
> > -   retrying.
> > -
> > -   Driver shouldn't touch the page.lru field while in the migratepage() 
> > function.
> > +2. ``int (*migrate_folio) (struct address_space *mapping,``
> > +|  ``struct folio *dst, struct folio *src, enum migrate_mode);``
> > +
> > +   After isolation, VM calls the driver's migrate_folio() with the
> > +   isolated folio.  The purpose of migrate_folio() is to move the contents
> > +   of the source folio to the destination folio and set up the fields
> > +   of destination folio.  Keep in mind that you should indicate to the
> > +   VM the source folio is no longer movable via __ClearPageMovable()
> > +   under folio if you migrated the source successfully and returned
> > +   MIGRATEPAGE_SUCCESS.  If driver cannot migrate the folio at the
> > +   moment, driver can return -EAGAIN. On -EAGAIN, VM will retry folio
> > +   migration in a short time because VM interprets -EAGAIN as "temporary
> > +   migration failure".  On returning any error except -EAGAIN, VM will
> > +   give up the folio migration without retrying.
> > +
> > +   Driver shouldn't touch the folio.lru field while in the migrate_folio()
> > +   function.
> >  
> >  3. ``void (*putback_page)(struct page *);``
> 
> Hmm, here it's a bit more complicated now, because we essentially have
> two paths: LRU+migrate_folio or !LRU+movable_ops
> (isolate/migrate/putback page)

Oh ... actually, this is just documenting the driver side of things.
I don't really like how it's written.  Here, have some rewritten
documentation (which is now part of the previous patch):

+++ b/Documentation/vm/page_migration.rst
@@ -152,110 +152,15 @@ Steps:
 Non-LRU page migration
 ==

-Although migration originally aimed for reducing the latency of memory accesses
-for NUMA, compaction also uses migration to create high-order pages.
+Although migration originally aimed for reducing the latency of memory
+accesses for NUMA, compaction also uses migration to create high-order
+pages.  For compaction purposes, it is also useful to be able to move
+non-LRU pages, such as zsmalloc and virtio-balloon pages.

-Current problem of the implementation is that it is designed to migrate only
-*LRU* pages. However, there are potential non-LRU pages which can be migrated
-in drivers, for example, zsmalloc, virtio-balloon pages.
-
-For virtio-balloon pages,

[Cluster-devel] [PATCH v2 15/19] aio: Convert to migrate_folio

2022-06-08 Thread Matthew Wilcox (Oracle)
Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 fs/aio.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 3c249b938632..a1911e86859c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -400,8 +400,8 @@ static const struct file_operations aio_ring_fops = {
 };
 
 #if IS_ENABLED(CONFIG_MIGRATION)
-static int aio_migratepage(struct address_space *mapping, struct page *new,
-   struct page *old, enum migrate_mode mode)
+static int aio_migrate_folio(struct address_space *mapping, struct folio *dst,
+   struct folio *src, enum migrate_mode mode)
 {
struct kioctx *ctx;
unsigned long flags;
@@ -435,10 +435,10 @@ static int aio_migratepage(struct address_space *mapping, 
struct page *new,
goto out;
}
 
-   idx = old->index;
+   idx = src->index;
if (idx < (pgoff_t)ctx->nr_pages) {
-   /* Make sure the old page hasn't already been changed */
-   if (ctx->ring_pages[idx] != old)
+   /* Make sure the old folio hasn't already been changed */
+   if (ctx->ring_pages[idx] != &src->page)
rc = -EAGAIN;
} else
rc = -EINVAL;
@@ -447,27 +447,27 @@ static int aio_migratepage(struct address_space *mapping, 
struct page *new,
goto out_unlock;
 
/* Writeback must be complete */
-   BUG_ON(PageWriteback(old));
-   get_page(new);
+   BUG_ON(folio_test_writeback(src));
+   folio_get(dst);
 
-   rc = migrate_page_move_mapping(mapping, new, old, 1);
+   rc = folio_migrate_mapping(mapping, dst, src, 1);
if (rc != MIGRATEPAGE_SUCCESS) {
-   put_page(new);
+   folio_put(dst);
goto out_unlock;
}
 
/* Take completion_lock to prevent other writes to the ring buffer
-* while the old page is copied to the new.  This prevents new
+* while the old folio is copied to the new.  This prevents new
 * events from being lost.
 */
 	spin_lock_irqsave(&ctx->completion_lock, flags);
-   migrate_page_copy(new, old);
-   BUG_ON(ctx->ring_pages[idx] != old);
-   ctx->ring_pages[idx] = new;
+   folio_migrate_copy(dst, src);
+   BUG_ON(ctx->ring_pages[idx] != &src->page);
+   ctx->ring_pages[idx] = &dst->page;
 	spin_unlock_irqrestore(&ctx->completion_lock, flags);
 
-   /* The old page is no longer accessible. */
-   put_page(old);
+   /* The old folio is no longer accessible. */
+   folio_put(src);
 
 out_unlock:
 	mutex_unlock(&ctx->ring_lock);
@@ -475,13 +475,13 @@ static int aio_migratepage(struct address_space *mapping, 
struct page *new,
 	spin_unlock(&mapping->private_lock);
return rc;
 }
+#else
+#define aio_migrate_folio NULL
 #endif
 
 static const struct address_space_operations aio_ctx_aops = {
.dirty_folio= noop_dirty_folio,
-#if IS_ENABLED(CONFIG_MIGRATION)
-   .migratepage= aio_migratepage,
-#endif
+   .migrate_folio  = aio_migrate_folio,
 };
 
 static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
-- 
2.35.1



[Cluster-devel] [PATCH v2 04/19] mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()

2022-06-08 Thread Matthew Wilcox (Oracle)
Use a folio throughout.  migrate_page() will be converted to
migrate_folio() later.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 mm/migrate.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index e064b998ead0..1878de817a01 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -805,11 +805,11 @@ static int writeout(struct address_space *mapping, struct 
page *page)
 /*
  * Default handling if a filesystem does not provide a migration function.
  */
-static int fallback_migrate_page(struct address_space *mapping,
-   struct page *newpage, struct page *page, enum migrate_mode mode)
+static int fallback_migrate_folio(struct address_space *mapping,
+   struct folio *dst, struct folio *src, enum migrate_mode mode)
 {
-   if (PageDirty(page)) {
-   /* Only writeback pages in full synchronous migration */
+   if (folio_test_dirty(src)) {
+   /* Only writeback folios in full synchronous migration */
switch (mode) {
case MIGRATE_SYNC:
case MIGRATE_SYNC_NO_COPY:
@@ -817,18 +817,18 @@ static int fallback_migrate_page(struct address_space 
*mapping,
default:
return -EBUSY;
}
-   return writeout(mapping, page);
+   return writeout(mapping, &src->page);
}
 
/*
 * Buffers may be managed in a filesystem specific way.
 * We must have no buffers or drop them.
 */
-   if (page_has_private(page) &&
-   !try_to_release_page(page, GFP_KERNEL))
+   if (folio_test_private(src) &&
+   !filemap_release_folio(src, GFP_KERNEL))
return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;
 
-   return migrate_page(mapping, newpage, page, mode);
+   return migrate_page(mapping, &dst->page, &src->page, mode);
 }
 
 /*
@@ -870,8 +870,7 @@ static int move_to_new_folio(struct folio *dst, struct 
folio *src,
 	rc = mapping->a_ops->migratepage(mapping, &dst->page,
 			&src->page, mode);
else
-   rc = fallback_migrate_page(mapping, &dst->page,
-   &src->page, mode);
+   rc = fallback_migrate_folio(mapping, dst, src, mode);
} else {
const struct movable_operations *mops;
 
-- 
2.35.1



[Cluster-devel] [PATCH v2 13/19] ubifs: Convert to filemap_migrate_folio()

2022-06-08 Thread Matthew Wilcox (Oracle)
filemap_migrate_folio() is a little more general than ubifs really needs,
but it's better to share the code.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 fs/ubifs/file.c | 29 ++---
 1 file changed, 2 insertions(+), 27 deletions(-)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 04ced154960f..f2353dd676ef 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1461,29 +1461,6 @@ static bool ubifs_dirty_folio(struct address_space 
*mapping,
return ret;
 }
 
-#ifdef CONFIG_MIGRATION
-static int ubifs_migrate_page(struct address_space *mapping,
-   struct page *newpage, struct page *page, enum migrate_mode mode)
-{
-   int rc;
-
-   rc = migrate_page_move_mapping(mapping, newpage, page, 0);
-   if (rc != MIGRATEPAGE_SUCCESS)
-   return rc;
-
-   if (PagePrivate(page)) {
-   detach_page_private(page);
-   attach_page_private(newpage, (void *)1);
-   }
-
-   if (mode != MIGRATE_SYNC_NO_COPY)
-   migrate_page_copy(newpage, page);
-   else
-   migrate_page_states(newpage, page);
-   return MIGRATEPAGE_SUCCESS;
-}
-#endif
-
 static bool ubifs_release_folio(struct folio *folio, gfp_t unused_gfp_flags)
 {
struct inode *inode = folio->mapping->host;
@@ -1649,10 +1626,8 @@ const struct address_space_operations 
ubifs_file_address_operations = {
.write_end  = ubifs_write_end,
.invalidate_folio = ubifs_invalidate_folio,
.dirty_folio= ubifs_dirty_folio,
-#ifdef CONFIG_MIGRATION
-   .migratepage= ubifs_migrate_page,
-#endif
-   .release_folio= ubifs_release_folio,
+   .migrate_folio  = filemap_migrate_folio,
+   .release_folio  = ubifs_release_folio,
 };
 
 const struct inode_operations ubifs_file_inode_operations = {
-- 
2.35.1



[Cluster-devel] [PATCH v2 01/19] secretmem: Remove isolate_page

2022-06-08 Thread Matthew Wilcox (Oracle)
The isolate_page operation is never called for filesystems, only
for device drivers which call SetPageMovable.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/secretmem.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 206ed6b40c1d..1c7f1775b56e 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -133,11 +133,6 @@ static const struct file_operations secretmem_fops = {
.mmap   = secretmem_mmap,
 };
 
-static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
-{
-   return false;
-}
-
 static int secretmem_migratepage(struct address_space *mapping,
 struct page *newpage, struct page *page,
 enum migrate_mode mode)
@@ -155,7 +150,6 @@ const struct address_space_operations secretmem_aops = {
.dirty_folio= noop_dirty_folio,
.free_folio = secretmem_free_folio,
.migratepage= secretmem_migratepage,
-   .isolate_page   = secretmem_isolate_page,
 };
 
 static int secretmem_setattr(struct user_namespace *mnt_userns,
-- 
2.35.1



[Cluster-devel] [PATCH v2 18/19] fs: Remove aops->migratepage()

2022-06-08 Thread Matthew Wilcox (Oracle)
With all users converted to migrate_folio(), remove this operation.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 include/linux/fs.h | 2 --
 mm/compaction.c| 5 ++---
 mm/migrate.c   | 3 ---
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9e6b17da4e11..7e06919b8f60 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -367,8 +367,6 @@ struct address_space_operations {
 */
int (*migrate_folio)(struct address_space *, struct folio *dst,
struct folio *src, enum migrate_mode);
-   int (*migratepage) (struct address_space *,
-   struct page *, struct page *, enum migrate_mode);
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate) (struct folio *, size_t from,
size_t count);
diff --git a/mm/compaction.c b/mm/compaction.c
index 458f49f9ab09..a2c53fcf933e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1031,7 +1031,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
/*
 * Only pages without mappings or that have a
-* ->migratepage callback are possible to migrate
+* ->migrate_folio callback are possible to migrate
 * without blocking. However, we can be racing with
 * truncation so it's necessary to lock the page
 * to stabilise the mapping as truncation holds
@@ -1043,8 +1043,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 
mapping = page_mapping(page);
migrate_dirty = !mapping ||
-   mapping->a_ops->migrate_folio ||
-   mapping->a_ops->migratepage;
+   mapping->a_ops->migrate_folio;
unlock_page(page);
if (!migrate_dirty)
goto isolate_fail_put;
diff --git a/mm/migrate.c b/mm/migrate.c
index bed0de86f3ae..767e41800d15 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -909,9 +909,6 @@ static int move_to_new_folio(struct folio *dst, struct 
folio *src,
 */
rc = mapping->a_ops->migrate_folio(mapping, dst, src,
mode);
-   else if (mapping->a_ops->migratepage)
-   rc = mapping->a_ops->migratepage(mapping, &dst->page,
-   &src->page, mode);
else
rc = fallback_migrate_folio(mapping, dst, src, mode);
} else {
-- 
2.35.1



[Cluster-devel] [PATCH v2 06/19] mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()

2022-06-08 Thread Matthew Wilcox (Oracle)
Use a folio throughout __buffer_migrate_folio(), add kernel-doc for
buffer_migrate_folio() and buffer_migrate_folio_norefs(), move their
declarations to buffer.h and switch all filesystems that have wired
them up.
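
As a rough illustration (a hypothetical filesystem, not part of this patch),
the per-filesystem change is a single line in the aops table:

	static const struct address_space_operations examplefs_aops = {
		/* ... read/write methods unchanged ... */
		.migrate_folio	= buffer_migrate_folio,
	};

Mappings whose buffer_heads can be looked up and referenced without the folio
lock held, such as the block device mapping in block/fops.c below, use the
stricter buffer_migrate_folio_norefs variant instead.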

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 block/fops.c|  2 +-
 fs/ext2/inode.c |  4 +-
 fs/ext4/inode.c |  4 +-
 fs/ntfs/aops.c  |  6 +--
 fs/ocfs2/aops.c |  2 +-
 include/linux/buffer_head.h | 10 +
 include/linux/fs.h  | 12 --
 mm/migrate.c| 76 ++---
 8 files changed, 65 insertions(+), 51 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index d6b3276a6c68..743fc46d0aad 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -417,7 +417,7 @@ const struct address_space_operations def_blk_aops = {
.write_end  = blkdev_write_end,
.writepages = blkdev_writepages,
.direct_IO  = blkdev_direct_IO,
-   .migratepage= buffer_migrate_page_norefs,
+   .migrate_folio  = buffer_migrate_folio_norefs,
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 360ce3604a2d..84570c6265aa 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -973,7 +973,7 @@ const struct address_space_operations ext2_aops = {
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
 };
@@ -989,7 +989,7 @@ const struct address_space_operations ext2_nobh_aops = {
.bmap   = ext2_bmap,
.direct_IO  = ext2_direct_IO,
.writepages = ext2_writepages,
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.error_remove_page  = generic_error_remove_page,
 };
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1aaea53e67b5..53877ffe3c41 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3633,7 +3633,7 @@ static const struct address_space_operations ext4_aops = {
.invalidate_folio   = ext4_invalidate_folio,
.release_folio  = ext4_release_folio,
.direct_IO  = noop_direct_IO,
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
.swap_activate  = ext4_iomap_swap_activate,
@@ -3668,7 +3668,7 @@ static const struct address_space_operations ext4_da_aops 
= {
.invalidate_folio   = ext4_invalidate_folio,
.release_folio  = ext4_release_folio,
.direct_IO  = noop_direct_IO,
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
.swap_activate  = ext4_iomap_swap_activate,
diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c
index 9e3964ea2ea0..5f4fb6ca6f2e 100644
--- a/fs/ntfs/aops.c
+++ b/fs/ntfs/aops.c
@@ -1659,7 +1659,7 @@ const struct address_space_operations ntfs_normal_aops = {
.dirty_folio= block_dirty_folio,
 #endif /* NTFS_RW */
.bmap   = ntfs_bmap,
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
 };
@@ -1673,7 +1673,7 @@ const struct address_space_operations 
ntfs_compressed_aops = {
.writepage  = ntfs_writepage,
.dirty_folio= block_dirty_folio,
 #endif /* NTFS_RW */
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
 };
@@ -1688,7 +1688,7 @@ const struct address_space_operations ntfs_mst_aops = {
.writepage  = ntfs_writepage,   /* Write dirty page to disk. */
.dirty_folio= filemap_dirty_folio,
 #endif /* NTFS_RW */
-   .migratepage= buffer_migrate_page,
+   .migrate_folio  = buffer_migrate_folio,
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
 };
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 767df51f8657..1d489003f99d 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2462,7 +2462,7 @@ const struct address_space_operations ocfs2_aops = {
.direct_IO

[Cluster-devel] [PATCH v2 02/19] mm: Convert all PageMovable users to movable_operations

2022-06-08 Thread Matthew Wilcox (Oracle)
These drivers are rather uncomfortably hammered into the
address_space_operations hole.  They aren't filesystems and don't behave
like filesystems.  They just need their own movable_operations structure,
which we can point to directly from page->mapping.
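
In rough outline (the exact declaration lives in include/linux/migrate.h in
this patch, and example_mops is a placeholder name), the three hooks move into
their own ops table that a driver registers per page:

	struct movable_operations {
		bool (*isolate_page)(struct page *, isolate_mode_t);
		int (*migrate_page)(struct page *dst, struct page *src,
				enum migrate_mode);
		void (*putback_page)(struct page *);
	};

	/* a page is opted in by pointing page->mapping at the ops */
	__SetPageMovable(page, &example_mops);

which is what lets the pseudo-filesystem mounts and anonymous inodes these
drivers carried purely to have an address_space go away.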

Signed-off-by: Matthew Wilcox (Oracle) 
---
 arch/powerpc/platforms/pseries/cmm.c |  60 +---
 drivers/misc/vmw_balloon.c   |  61 +---
 drivers/virtio/virtio_balloon.c  |  47 +---
 include/linux/balloon_compaction.h   |   6 +-
 include/linux/fs.h   |   2 -
 include/linux/migrate.h  |  26 +--
 include/linux/page-flags.h   |   2 +-
 include/uapi/linux/magic.h   |   4 --
 mm/balloon_compaction.c  |  10 ++-
 mm/compaction.c  |  29 
 mm/migrate.c |  24 +++
 mm/util.c|   4 +-
 mm/z3fold.c  |  82 +++--
 mm/zsmalloc.c| 102 ++-
 14 files changed, 94 insertions(+), 365 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/cmm.c 
b/arch/powerpc/platforms/pseries/cmm.c
index 15ed8206c463..5f4037c1d7fe 100644
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -19,9 +19,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -500,19 +497,6 @@ static struct notifier_block cmm_mem_nb = {
 };
 
 #ifdef CONFIG_BALLOON_COMPACTION
-static struct vfsmount *balloon_mnt;
-
-static int cmm_init_fs_context(struct fs_context *fc)
-{
-   return init_pseudo(fc, PPC_CMM_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type balloon_fs = {
-   .name = "ppc-cmm",
-   .init_fs_context = cmm_init_fs_context,
-   .kill_sb = kill_anon_super,
-};
-
 static int cmm_migratepage(struct balloon_dev_info *b_dev_info,
   struct page *newpage, struct page *page,
   enum migrate_mode mode)
@@ -564,47 +548,13 @@ static int cmm_migratepage(struct balloon_dev_info 
*b_dev_info,
return MIGRATEPAGE_SUCCESS;
 }
 
-static int cmm_balloon_compaction_init(void)
+static void cmm_balloon_compaction_init(void)
 {
-   int rc;
-
balloon_devinfo_init(&b_dev_info);
b_dev_info.migratepage = cmm_migratepage;
-
-   balloon_mnt = kern_mount(&balloon_fs);
-   if (IS_ERR(balloon_mnt)) {
-   rc = PTR_ERR(balloon_mnt);
-   balloon_mnt = NULL;
-   return rc;
-   }
-
-   b_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
-   if (IS_ERR(b_dev_info.inode)) {
-   rc = PTR_ERR(b_dev_info.inode);
-   b_dev_info.inode = NULL;
-   kern_unmount(balloon_mnt);
-   balloon_mnt = NULL;
-   return rc;
-   }
-
-   b_dev_info.inode->i_mapping->a_ops = &balloon_aops;
-   return 0;
-}
-static void cmm_balloon_compaction_deinit(void)
-{
-   if (b_dev_info.inode)
-   iput(b_dev_info.inode);
-   b_dev_info.inode = NULL;
-   kern_unmount(balloon_mnt);
-   balloon_mnt = NULL;
 }
 #else /* CONFIG_BALLOON_COMPACTION */
-static int cmm_balloon_compaction_init(void)
-{
-   return 0;
-}
-
-static void cmm_balloon_compaction_deinit(void)
+static void cmm_balloon_compaction_init(void)
 {
 }
 #endif /* CONFIG_BALLOON_COMPACTION */
@@ -622,9 +572,7 @@ static int cmm_init(void)
if (!firmware_has_feature(FW_FEATURE_CMO) && !simulate)
return -EOPNOTSUPP;
 
-   rc = cmm_balloon_compaction_init();
-   if (rc)
-   return rc;
+   cmm_balloon_compaction_init();
 
rc = register_oom_notifier(&cmm_oom_nb);
if (rc < 0)
@@ -658,7 +606,6 @@ static int cmm_init(void)
 out_oom_notifier:
unregister_oom_notifier(&cmm_oom_nb);
 out_balloon_compaction:
-   cmm_balloon_compaction_deinit();
return rc;
 }
 
@@ -677,7 +624,6 @@ static void cmm_exit(void)
unregister_memory_notifier(&cmm_mem_nb);
cmm_free_pages(atomic_long_read(&loaned_pages));
cmm_unregister_sysfs(&cmm_dev);
-   cmm_balloon_compaction_deinit();
 }
 
 /**
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 086ce77d9074..85dd6aa33df6 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -29,8 +29,6 @@
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
@@ -1730,20 +1728,6 @@ static inline void vmballoon_debugfs_exit(struct 
vmballoon *b)
 
 
 #ifdef CONFIG_BALLOON_COMPACTION
-
-static int vmballoon_init_fs_context(struct fs_context *fc)
-{
-   return init_pseudo(fc, BALLOON_VMW_MAGIC) ? 0 : -ENOMEM;
-}
-
-static struct file_system_type vmballoon_fs = {
-   .name   = "balloon-vmware",
-   .init_fs_context= vmballoon_init_fs_context,
-   .kill_sb

[Cluster-devel] [PATCH v2 17/19] secretmem: Convert to migrate_folio

2022-06-08 Thread Matthew Wilcox (Oracle)
This is little more than changing the types over; there's no real work
being done in this function.

Signed-off-by: Matthew Wilcox (Oracle) 
---
 mm/secretmem.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1c7f1775b56e..658a7486efa9 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -133,9 +133,8 @@ static const struct file_operations secretmem_fops = {
.mmap   = secretmem_mmap,
 };
 
-static int secretmem_migratepage(struct address_space *mapping,
-struct page *newpage, struct page *page,
-enum migrate_mode mode)
+static int secretmem_migrate_folio(struct address_space *mapping,
+   struct folio *dst, struct folio *src, enum migrate_mode mode)
 {
return -EBUSY;
 }
@@ -149,7 +148,7 @@ static void secretmem_free_folio(struct folio *folio)
 const struct address_space_operations secretmem_aops = {
.dirty_folio= noop_dirty_folio,
.free_folio = secretmem_free_folio,
-   .migratepage= secretmem_migratepage,
+   .migrate_folio  = secretmem_migrate_folio,
 };
 
 static int secretmem_setattr(struct user_namespace *mnt_userns,
-- 
2.35.1



[Cluster-devel] [PATCH v2 09/19] nfs: Convert to migrate_folio

2022-06-08 Thread Matthew Wilcox (Oracle)
Use a folio throughout this function.  migrate_page() will be converted
later.

Signed-off-by: Matthew Wilcox (Oracle) 
Acked-by: Anna Schumaker 
Reviewed-by: Christoph Hellwig 
---
 fs/nfs/file.c |  4 +---
 fs/nfs/internal.h |  6 --
 fs/nfs/write.c| 16 
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 2d72b1b7ed74..549baed76351 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -533,9 +533,7 @@ const struct address_space_operations nfs_file_aops = {
.write_end = nfs_write_end,
.invalidate_folio = nfs_invalidate_folio,
.release_folio = nfs_release_folio,
-#ifdef CONFIG_MIGRATION
-   .migratepage = nfs_migrate_page,
-#endif
+   .migrate_folio = nfs_migrate_folio,
.launder_folio = nfs_launder_folio,
.is_dirty_writeback = nfs_check_dirty_writeback,
.error_remove_page = generic_error_remove_page,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 8f8cd6e2d4db..437ebe544aaf 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -578,8 +578,10 @@ void nfs_clear_pnfs_ds_commit_verifiers(struct 
pnfs_ds_commit_info *cinfo)
 #endif
 
 #ifdef CONFIG_MIGRATION
-extern int nfs_migrate_page(struct address_space *,
-   struct page *, struct page *, enum migrate_mode);
+int nfs_migrate_folio(struct address_space *, struct folio *dst,
+   struct folio *src, enum migrate_mode);
+#else
+#define nfs_migrate_folio NULL
 #endif
 
 static inline int
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1c706465d090..649b9e633459 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2119,27 +2119,27 @@ int nfs_wb_page(struct inode *inode, struct page *page)
 }
 
 #ifdef CONFIG_MIGRATION
-int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-   struct page *page, enum migrate_mode mode)
+int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
+   struct folio *src, enum migrate_mode mode)
 {
/*
-* If PagePrivate is set, then the page is currently associated with
+* If the private flag is set, the folio is currently associated with
 * an in-progress read or write request. Don't try to migrate it.
 *
 * FIXME: we could do this in principle, but we'll need a way to ensure
 *that we can safely release the inode reference while holding
-*the page lock.
+*the folio lock.
 */
-   if (PagePrivate(page))
+   if (folio_test_private(src))
return -EBUSY;
 
-   if (PageFsCache(page)) {
+   if (folio_test_fscache(src)) {
if (mode == MIGRATE_ASYNC)
return -EBUSY;
-   wait_on_page_fscache(page);
+   folio_wait_fscache(src);
}
 
-   return migrate_page(mapping, newpage, page, mode);
+   return migrate_page(mapping, &dst->page, &src->page, mode);
 }
 #endif
 
-- 
2.35.1



[Cluster-devel] [PATCH v2 19/19] mm/folio-compat: Remove migration compatibility functions

2022-06-08 Thread Matthew Wilcox (Oracle)
migrate_page_move_mapping(), migrate_page_copy() and migrate_page_states()
are all now unused after converting all the filesystems from
aops->migratepage() to aops->migrate_folio().
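
Any remaining caller of the compat names converts mechanically to the folio
API; the equivalences below are exactly what the removed wrappers did
(newpage/page being the struct page pair a legacy caller already holds):

	/* was: migrate_page_move_mapping(mapping, newpage, page, extra_count); */
	rc = folio_migrate_mapping(mapping, page_folio(newpage),
			page_folio(page), extra_count);

	/* was: migrate_page_states(newpage, page); */
	folio_migrate_flags(page_folio(newpage), page_folio(page));

	/* was: migrate_page_copy(newpage, page); */
	folio_migrate_copy(page_folio(newpage), page_folio(page));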

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 include/linux/migrate.h | 11 ---
 mm/folio-compat.c   | 22 --
 mm/ksm.c|  2 +-
 3 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 59d64a1e6b4b..3e18c7048506 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -40,12 +40,8 @@ extern int migrate_pages(struct list_head *l, new_page_t 
new, free_page_t free,
 extern struct page *alloc_migration_target(struct page *page, unsigned long 
private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 
-extern void migrate_page_states(struct page *newpage, struct page *page);
-extern void migrate_page_copy(struct page *newpage, struct page *page);
 int migrate_huge_page_move_mapping(struct address_space *mapping,
struct folio *dst, struct folio *src);
-extern int migrate_page_move_mapping(struct address_space *mapping,
-   struct page *newpage, struct page *page, int extra_count);
 void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
spinlock_t *ptl);
 void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
@@ -66,13 +62,6 @@ static inline struct page *alloc_migration_target(struct 
page *page,
 static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
 
-static inline void migrate_page_states(struct page *newpage, struct page *page)
-{
-}
-
-static inline void migrate_page_copy(struct page *newpage,
-struct page *page) {}
-
 static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
  struct folio *dst, struct folio *src)
 {
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index 20bc15b57d93..458618c7302c 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -51,28 +51,6 @@ void mark_page_accessed(struct page *page)
 }
 EXPORT_SYMBOL(mark_page_accessed);
 
-#ifdef CONFIG_MIGRATION
-int migrate_page_move_mapping(struct address_space *mapping,
-   struct page *newpage, struct page *page, int extra_count)
-{
-   return folio_migrate_mapping(mapping, page_folio(newpage),
-   page_folio(page), extra_count);
-}
-EXPORT_SYMBOL(migrate_page_move_mapping);
-
-void migrate_page_states(struct page *newpage, struct page *page)
-{
-   folio_migrate_flags(page_folio(newpage), page_folio(page));
-}
-EXPORT_SYMBOL(migrate_page_states);
-
-void migrate_page_copy(struct page *newpage, struct page *page)
-{
-   folio_migrate_copy(page_folio(newpage), page_folio(page));
-}
-EXPORT_SYMBOL(migrate_page_copy);
-#endif
-
 bool set_page_writeback(struct page *page)
 {
return folio_start_writeback(page_folio(page));
diff --git a/mm/ksm.c b/mm/ksm.c
index 54f78c9eecae..e8f8c1a2bb39 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -712,7 +712,7 @@ static struct page *get_ksm_page(struct stable_node 
*stable_node,
 * however, it might mean that the page is under page_ref_freeze().
 * The __remove_mapping() case is easy, again the node is now stale;
 * the same is in reuse_ksm_page() case; but if page is swapcache
-* in migrate_page_move_mapping(), it might still be our page,
+* in folio_migrate_mapping(), it might still be our page,
 * in which case it's essential to keep the node.
 */
while (!get_page_unless_zero(page)) {
-- 
2.35.1



[Cluster-devel] [PATCH v2 05/19] mm/migrate: Convert writeout() to take a folio

2022-06-08 Thread Matthew Wilcox (Oracle)
Use a folio throughout this function.

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 mm/migrate.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1878de817a01..6b6fec26f4d0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -761,11 +761,10 @@ int buffer_migrate_page_norefs(struct address_space 
*mapping,
 #endif
 
 /*
- * Writeback a page to clean the dirty state
+ * Writeback a folio to clean the dirty state
  */
-static int writeout(struct address_space *mapping, struct page *page)
+static int writeout(struct address_space *mapping, struct folio *folio)
 {
-   struct folio *folio = page_folio(page);
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
.nr_to_write = 1,
@@ -779,25 +778,25 @@ static int writeout(struct address_space *mapping, struct 
page *page)
/* No write method for the address space */
return -EINVAL;
 
-   if (!clear_page_dirty_for_io(page))
+   if (!folio_clear_dirty_for_io(folio))
/* Someone else already triggered a write */
return -EAGAIN;
 
/*
-* A dirty page may imply that the underlying filesystem has
-* the page on some queue. So the page must be clean for
-* migration. Writeout may mean we loose the lock and the
-* page state is no longer what we checked for earlier.
+* A dirty folio may imply that the underlying filesystem has
+* the folio on some queue. So the folio must be clean for
+* migration. Writeout may mean we lose the lock and the
+* folio state is no longer what we checked for earlier.
 * At this point we know that the migration attempt cannot
 * be successful.
 */
remove_migration_ptes(folio, folio, false);
 
-   rc = mapping->a_ops->writepage(page, &wbc);
+   rc = mapping->a_ops->writepage(&folio->page, &wbc);
 
if (rc != AOP_WRITEPAGE_ACTIVATE)
/* unlocked. Relock */
-   lock_page(page);
+   folio_lock(folio);
 
return (rc < 0) ? -EIO : -EAGAIN;
 }
@@ -817,7 +816,7 @@ static int fallback_migrate_folio(struct address_space 
*mapping,
default:
return -EBUSY;
}
-   return writeout(mapping, &src->page);
+   return writeout(mapping, src);
}
 
/*
-- 
2.35.1



[Cluster-devel] [PATCH v2 03/19] fs: Add aops->migrate_folio

2022-06-08 Thread Matthew Wilcox (Oracle)
Provide a folio-based replacement for aops->migratepage.  Update the
documentation to document migrate_folio instead of migratepage.
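
A filesystem whose folios carry no state that needs fixing up by hand can
point the new hook straight at a generic helper (as the ubifs conversion
earlier in this series does with filemap_migrate_folio); one that must refuse
migration in some states wires up a thin handler of its own, roughly like this
sketch (examplefs_* is a placeholder, not code from this patch):

	static int examplefs_migrate_folio(struct address_space *mapping,
			struct folio *dst, struct folio *src,
			enum migrate_mode mode)
	{
		/* an in-flight private request pins the folio; refuse the attempt */
		if (folio_test_private(src))
			return -EBUSY;
		return filemap_migrate_folio(mapping, dst, src, mode);
	}

	/* in the filesystem's address_space_operations: */
	.migrate_folio	= examplefs_migrate_folio,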

Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: Christoph Hellwig 
---
 Documentation/filesystems/locking.rst |  5 ++--
 Documentation/filesystems/vfs.rst | 13 ++-
 Documentation/vm/page_migration.rst   | 33 ++-
 include/linux/fs.h|  4 +++-
 mm/compaction.c   |  4 +++-
 mm/migrate.c  | 11 +
 6 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/Documentation/filesystems/locking.rst 
b/Documentation/filesystems/locking.rst
index c0fe711f14d3..3d28b23676bd 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -253,7 +253,8 @@ prototypes::
void (*free_folio)(struct folio *);
int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
bool (*isolate_page) (struct page *, isolate_mode_t);
-   int (*migratepage)(struct address_space *, struct page *, struct page 
*);
+   int (*migrate_folio)(struct address_space *, struct folio *dst,
+   struct folio *src, enum migrate_mode);
void (*putback_page) (struct page *);
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t 
count);
@@ -281,7 +282,7 @@ release_folio:  yes
 free_folio:yes
 direct_IO:
 isolate_page:  yes
-migratepage:   yes (both)
+migrate_folio: yes (both)
 putback_page:  yes
 launder_folio: yes
 is_partially_uptodate: yes
diff --git a/Documentation/filesystems/vfs.rst 
b/Documentation/filesystems/vfs.rst
index a08c652467d7..3ae1b039b03f 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -740,7 +740,8 @@ cache in your filesystem.  The following members are 
defined:
/* isolate a page for migration */
bool (*isolate_page) (struct page *, isolate_mode_t);
/* migrate the contents of a page to the specified target */
-   int (*migratepage) (struct page *, struct page *);
+   int (*migrate_folio)(struct mapping *, struct folio *dst,
+   struct folio *src, enum migrate_mode);
/* put migration-failed page back to right list */
void (*putback_page) (struct page *);
int (*launder_folio) (struct folio *);
@@ -935,12 +936,12 @@ cache in your filesystem.  The following members are 
defined:
is successfully isolated, VM marks the page as PG_isolated via
__SetPageIsolated.
 
-``migrate_page``
+``migrate_folio``
This is used to compact the physical memory usage.  If the VM
-   wants to relocate a page (maybe off a memory card that is
-   signalling imminent failure) it will pass a new page and an old
-   page to this function.  migrate_page should transfer any private
-   data across and update any references that it has to the page.
+   wants to relocate a folio (maybe from a memory device that is
+   signalling imminent failure) it will pass a new folio and an old
+   folio to this function.  migrate_folio should transfer any private
+   data across and update any references that it has to the folio.
 
 ``putback_page``
Called by the VM when isolated page's migration fails.
diff --git a/Documentation/vm/page_migration.rst 
b/Documentation/vm/page_migration.rst
index 8c5cb8147e55..e0f73ddfabb1 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -181,22 +181,23 @@ which are function pointers of struct 
address_space_operations.
Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in those fields.
 
-2. ``int (*migratepage) (struct address_space *mapping,``
-|  ``struct page *newpage, struct page *oldpage, enum migrate_mode);``
-
-   After isolation, VM calls migratepage() of driver with the isolated page.
-   The function of migratepage() is to move the contents of the old page to the
-   new page
-   and set up fields of struct page newpage. Keep in mind that you should
-   indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
-   under page_lock if you migrated the oldpage successfully and returned
-   MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver
-   can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time
-   because VM interprets -EAGAIN as "temporary migration failure". On returning
-   any error except -EAGAIN, VM will give up the page migration without
-   retrying.
-
-   Driver shouldn't touch the page.lru field while in the migratepage() 
function.
+2. ``int (*migrate_folio) (struct address_space *mapping,``
+|  ``struct folio *dst, stru
