Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Christoph Hellwig
On Wed, Aug 10, 2022 at 09:43:58PM +0100, Matthew Wilcox wrote:
> To avoid duplicating work with you or Christoph ... it seems like the
> plan is to kill ->writepage entirely soon, so there's no point in me
> doing a sweep of all the filesystems to convert ->writepage to
> ->write_folio, correct?

While I won't commit to concrete definition of "soon" I think that is
the general plan, and I don't think converting ->writepage to
->write_folio is needed.

> I assume the plan for filesystems which have a writepage but don't have
> a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
> nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
> them a writepages,

and  ->migrate_folio if missing, yes.

> modelled on iomap_writepages().  Seems that adding
> a block_writepages() might be a useful thing for me to do?

We have mpage_writepages which basically is the ->writepages counterpart
to block_write_full_page.  The only caveat is of course that it can
use multi-block mappings from the get_bock callback, so we need to make
sure the file systems don't do anything stupid there.



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Matthew Wilcox
On Wed, Aug 10, 2022 at 11:32:06PM +0200, Andreas Grünbacher wrote:
> Am Mi., 10. Aug. 2022 um 22:57 Uhr schrieb Matthew Wilcox 
> :
> > On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> > > XFS hasn't had a ->writepage call for a while. After LSF I internally
> > > tested dropping btrfs' callback, and the results looked good: no OOM
> > > kills with dirty/writeback pages remaining, performance parity. Then I
> > > went on vacation and Christoph beat me to the patch :)
> >
> > To avoid duplicating work with you or Christoph ... it seems like the
> > plan is to kill ->writepage entirely soon, so there's no point in me
> > doing a sweep of all the filesystems to convert ->writepage to
> > ->write_folio, correct?
> >
> > I assume the plan for filesystems which have a writepage but don't have
> > a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
> > nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
> > them a writepages, modelled on iomap_writepages().  Seems that adding
> > a block_writepages() might be a useful thing for me to do?
> 
> Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages()
> functions, so it should probably be fine.

Ah, it's gfs2_aspace_writepage which doesn't have a writepages
counterpart.  I haven't looked at it to understand why it's needed.
(gfs2_meta_aops and gfs2_rgrp_aops)



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Andreas Grünbacher
Am Mi., 10. Aug. 2022 um 22:57 Uhr schrieb Matthew Wilcox :
> On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> > XFS hasn't had a ->writepage call for a while. After LSF I internally
> > tested dropping btrfs' callback, and the results looked good: no OOM
> > kills with dirty/writeback pages remaining, performance parity. Then I
> > went on vacation and Christoph beat me to the patch :)
>
> To avoid duplicating work with you or Christoph ... it seems like the
> plan is to kill ->writepage entirely soon, so there's no point in me
> doing a sweep of all the filesystems to convert ->writepage to
> ->write_folio, correct?
>
> I assume the plan for filesystems which have a writepage but don't have
> a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
> nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
> them a writepages, modelled on iomap_writepages().  Seems that adding
> a block_writepages() might be a useful thing for me to do?

Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages()
functions, so it should probably be fine.

Thanks,
Andreas



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-10 Thread Matthew Wilcox
On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote:
> XFS hasn't had a ->writepage call for a while. After LSF I internally
> tested dropping btrfs' callback, and the results looked good: no OOM
> kills with dirty/writeback pages remaining, performance parity. Then I
> went on vacation and Christoph beat me to the patch :)

To avoid duplicating work with you or Christoph ... it seems like the
plan is to kill ->writepage entirely soon, so there's no point in me
doing a sweep of all the filesystems to convert ->writepage to
->write_folio, correct?

I assume the plan for filesystems which have a writepage but don't have
a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix,
nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give
them a writepages, modelled on iomap_writepages().  Seems that adding
a block_writepages() might be a useful thing for me to do?



Re: [Cluster-devel] remove iomap_writepage v2

2022-08-01 Thread Johannes Weiner
On Fri, Jul 29, 2022 at 04:11:45PM +0200, Christoph Hellwig wrote:
> On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote:
> > There is some context missing because it's not clear what the full impact is
> > but it is definitly the case that writepage is ignored in some contexts for
> > common filesystems so lets assume that writepage from reclaim context always
> > failed as a worst case scenario. Certainly this type of change is something
> > linux-mm needs to be aware of because we've been blind-sided before.
> 
> Between willy and Johannes pushing or it I was under the strong assumption
> that linux-mm knows of it..

Yes, the context was an FS session at LSFMM. FS folks complained about
the MM relying on single-page writeouts. On the MM side we've invested
a lot into eliminating this dependency over the last decade or so, and
I would argue it's gone today. But the calls are still in the code,
and so FS folks continue to operate under the old assumption. You
can't blame them. I suggested we remove the callbacks to clarify
things and eliminate that murky corner from the FS/MM interface.

Compaction/migration is easy. It simply never calls writepage when
there is a migratepage callback - which major filesystems have.

Reclaim may still call it, but the invocation rules are so restrictive
nowadays that it's unlikely to actually help when it matters (and we
know it makes things worse in many cases).

For example, cgroup reclaim isn't ever allowed to call writepage.
This covers the small system scenario.

Whether writepage helps under OOM is not clear. OOMing systems tend to
have thundering herds of direct reclaimers, any one of which can
declare OOM if they fail, yet none of them can write pages. They
already rely on another thread to make progress. That thread can be
the flushers writing in offset order, or kswapd writing in LRU
order. You could argue that LRU order, while less efficient IO,
launders pages closer to the scanner. From an OOM perspective that
doesn't matter, though, because scanners will work through the entire
LRU list several times before giving up. They won't miss flusher
progress. That leaves reclaim efficiency - having to scan fewer pages
before potentially finding clean ones. But it's an IO bound scenario,
so arguably efficient IO would seem more important than efficient CPU.

The risk of this change, IMO, is exposing reclaim to flat out bugs in
the flusher code, or bugs in the code that matches reclaim to flushing
speed. However, a) cgroup has been relying on those for a decade. And
b) we've been treating writepage calls like bugs due to the latency
they inject into workloads, and tuned the MM to rely more on flushers
(e.g. c55e8d035b28 ("mm: vmscan: move dirty pages out of the way until
they're flushed")). So we know this stuff works at scale and with real
workloads. I think the risk of dragons there is quite low.

XFS hasn't had a ->writepage call for a while. After LSF I internally
tested dropping btrfs' callback, and the results looked good: no OOM
kills with dirty/writeback pages remaining, performance parity. Then I
went on vacation and Christoph beat me to the patch :)

I think it's a really good cleanup that makes things cleaner and more
predictable in both the fs and the mm.

> > I don't think it would be incredibly damaging although there *might* be
> > issues with small systems or cgroups. 
> 
> Johannes specifically mentioned that cgroup writeback will never call
> into ->writepage anyway.

Yes, cgroup has relied on the flushers since

commit ee72886d8ed5d9de3fa0ed3b99a7ca7702576a96
Author: Mel Gorman 
Date:   Mon Oct 31 17:07:38 2011 -0700

mm: vmscan: do not writeback filesystem pages in direct reclaim

since cgroup reclaim == direct reclaim.



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-29 Thread Christoph Hellwig
On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote:
> There is some context missing because it's not clear what the full impact is
> but it is definitly the case that writepage is ignored in some contexts for
> common filesystems so lets assume that writepage from reclaim context always
> failed as a worst case scenario. Certainly this type of change is something
> linux-mm needs to be aware of because we've been blind-sided before.

Between willy and Johannes pushing or it I was under the strong assumption
that linux-mm knows of it..

> I don't think it would be incredibly damaging although there *might* be
> issues with small systems or cgroups. 

Johannes specifically mentioned that cgroup writeback will never call
into ->writepage anyway.



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-29 Thread Mel Gorman
On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote:
> Hi Christoph!
> 
> On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> > this series removes iomap_writepage and it's callers, following what xfs
> > has been doing for a long time.
> 
> So this effectively means "no writeback from page reclaim for these
> filesystems" AFAICT (page migration of dirty pages seems to be handled by
> iomap_migrate_page()) which is going to make life somewhat harder for
> memory reclaim when memory pressure is high enough that dirty pages are
> reaching end of the LRU list. I don't expect this to be a problem on big
> machines but it could have some undesirable effects for small ones
> (embedded, small VMs). I agree per-page writeback has been a bad idea for
> efficiency reasons for at least last 10-15 years and most filesystems
> stopped dealing with more complex situations (like block allocation) from
> ->writepage() already quite a few years ago without any bug reports AFAIK.
> So it all seems like a sensible idea from FS POV but are MM people on board
> or at least aware of this movement in the fs land?
> 
> Added a few CC's for that.
> 

There is some context missing because it's not clear what the full impact is
but it is definitly the case that writepage is ignored in some contexts for
common filesystems so lets assume that writepage from reclaim context always
failed as a worst case scenario. Certainly this type of change is something
linux-mm needs to be aware of because we've been blind-sided before.

I don't think it would be incredibly damaging although there *might* be
issues with small systems or cgroups. In many respects, vmscan has been
moving in this direction for a long time e.g. f84f6e2b0868 ("mm: vmscan:
do not writeback filesystem pages in kswapd except in high priority") and
e2be15f6c3ee ("mm: vmscan: stall page reclaim and writeback pages based on
dirty/writepage pages encountered"). This was roughly 10 years ago when
it was clear that FS writeback from reclaim context was fragile (iirc it
was partially due to concerns about stack depth and later concerns that a
filesystem would simply ignore the writeback request). There also is less
reliance on stalling reclaim by queueing and waiting on writeback but we now
should explicitly throttle if no progress is being made e.g. 69392a403f49
("mm/vmscan: throttle reclaim when no progress is being made") and some
follow-up fixes.

There still is a reliance on swap and shmem pages will not ignore writepage
but I assume that is still ok.

One potential caveat is if wakeup_flusher_threads() is ignored because
there is a reliance in reclaim that if all the pages at the tail of the
LRU are dirty pages not queued for writeback then wakeup_flusher_threads()
will do something so pages get marked for immediate reclaim when writeback
completes. Of course there is no guarantee that flusher threads will start
writeback on the old pages and the pages could be backed by a slow BDI
but wakeup_flusher_threads() should not be ignored.

Another caveat is that comments become misleading. Take for example the
comment "Only kswapd can writeback filesystem folios to avoid risk of
stack overflow." The wording should change to note that writepage may
do nothing at all.  There also might need to be some adjustment on when
pages get marked for immediate reclaim when they are dirty but not under
writeback. pageout() might need a tracepoint for "mapping->a_ops->writepage
== NULL" to help debug problems around a failure to queue pages for writback
although that could be done as a test patch for a bug.

There would need to be some changes made if writepage always or often failed
and there might be some premature throttling based on "NOPROGRESS" on small
systems due to dirty-not-writepage pages at the tail of the LRU but I don't
think it would be an immediate disaster, Reclaim throttling is no longer
based on the ability to queue for writeback or "congestion" state of BDIs
and some care is taken to not prematurely stall on NOPROGRESS. However,
if there was a bug related to premature stalls or excessive CPU usage
from direct reclaimers or kswapd that bisected to a change in writepage
then it should be fixed on the vmscan-side to put an emphasis on handling
"reclaim is in trouble when the bulk of reclaimable pages are FS-only,
dirty and not under writeback but ->writepage is a no-op".

-- 
Mel Gorman
SUSE Labs



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-29 Thread Yang Shi
On Thu, Jul 28, 2022 at 3:48 PM Dave Chinner  wrote:
>
> On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote:
> > On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote:
> > > Hi Christoph!
> > >
> > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> > > > this series removes iomap_writepage and it's callers, following what xfs
> > > > has been doing for a long time.
> > >
> > > So this effectively means "no writeback from page reclaim for these
> > > filesystems" AFAICT (page migration of dirty pages seems to be handled by
> > > iomap_migrate_page()) which is going to make life somewhat harder for
> > > memory reclaim when memory pressure is high enough that dirty pages are
> > > reaching end of the LRU list. I don't expect this to be a problem on big
> > > machines but it could have some undesirable effects for small ones
> > > (embedded, small VMs). I agree per-page writeback has been a bad idea for
> > > efficiency reasons for at least last 10-15 years and most filesystems
> > > stopped dealing with more complex situations (like block allocation) from
> > > ->writepage() already quite a few years ago without any bug reports AFAIK.
> > > So it all seems like a sensible idea from FS POV but are MM people on 
> > > board
> > > or at least aware of this movement in the fs land?
> >
> > I mentioned it during my folio session at LSFMM, but didn't put a huge
> > emphasis on it.
> >
> > For XFS, writeback should already be in progress on other pages if
> > we're getting to the point of trying to call ->writepage() in vmscan.
> > Surely this is also true for other filesystems?
>
> Yes.
>
> It's definitely true for btrfs, too, because btrfs_writepage does:
>
> static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> struct inode *inode = page->mapping->host;
> int ret;
>
> if (current->flags & PF_MEMALLOC) {
> redirty_page_for_writepage(wbc, page);
> unlock_page(page);
> return 0;
> }
> 
>
> It also rejects all calls to write dirty pages from memory reclaim
> contexts.

Aha, it seems even kswapd (it has PF_MEMALLOC set) is rejected too.

>
> ext4 will also reject writepage calls from memory allocation if
> block allocation is required (due to delayed allocation) or
> unwritten extents need converting to written. i.e. if it has to run
> blocking transactions.
>
> So all three major filesystems will either partially or wholly
> reject ->writepage calls from memory reclaim context.
>
> IOWs, if memory reclaim is depending on ->writepage() to make
> reclaim progress, it's not working as advertised on the vast
> majority of production Linux systems
>
> The reality is that ->writepage is a relic of a bygone era of OS and
> filesystem design. It was useful in the days where writing a dirty
> page just involved looking up the bufferhead attached to the page to
> get the disk mapping and then submitting it for IO.
>
> Those days are long gone - filesystems have complex IO submission
> paths now that have to handle delayed allocation, copy-on-write,
> unwritten extents, have unbound memory demand, etc. All the
> filesystems that support these 1990s era filesystem technologies
> simply turn off ->writepage in memory reclaim contexts.
>
> Hence for the vast majority of linux users (i.e. everyone using
> ext4, btrfs and XFS), ->writepage no longer plays any part in memory
> reclaim on their systems.
>
> So why should we try to maintain the fiction that ->writepage is
> required functionality in a filesystem when it clearly isn't?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-28 Thread Dave Chinner
On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote:
> > Hi Christoph!
> > 
> > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> > > this series removes iomap_writepage and it's callers, following what xfs
> > > has been doing for a long time.
> > 
> > So this effectively means "no writeback from page reclaim for these
> > filesystems" AFAICT (page migration of dirty pages seems to be handled by
> > iomap_migrate_page()) which is going to make life somewhat harder for
> > memory reclaim when memory pressure is high enough that dirty pages are
> > reaching end of the LRU list. I don't expect this to be a problem on big
> > machines but it could have some undesirable effects for small ones
> > (embedded, small VMs). I agree per-page writeback has been a bad idea for
> > efficiency reasons for at least last 10-15 years and most filesystems
> > stopped dealing with more complex situations (like block allocation) from
> > ->writepage() already quite a few years ago without any bug reports AFAIK.
> > So it all seems like a sensible idea from FS POV but are MM people on board
> > or at least aware of this movement in the fs land?
> 
> I mentioned it during my folio session at LSFMM, but didn't put a huge
> emphasis on it.
> 
> For XFS, writeback should already be in progress on other pages if
> we're getting to the point of trying to call ->writepage() in vmscan.
> Surely this is also true for other filesystems?

Yes.

It's definitely true for btrfs, too, because btrfs_writepage does:

static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
{
struct inode *inode = page->mapping->host;
int ret;

if (current->flags & PF_MEMALLOC) {
redirty_page_for_writepage(wbc, page);
unlock_page(page);
return 0;
}


It also rejects all calls to write dirty pages from memory reclaim
contexts.

ext4 will also reject writepage calls from memory allocation if
block allocation is required (due to delayed allocation) or
unwritten extents need converting to written. i.e. if it has to run
blocking transactions.

So all three major filesystems will either partially or wholly
reject ->writepage calls from memory reclaim context.

IOWs, if memory reclaim is depending on ->writepage() to make
reclaim progress, it's not working as advertised on the vast
majority of production Linux systems

The reality is that ->writepage is a relic of a bygone era of OS and
filesystem design. It was useful in the days where writing a dirty
page just involved looking up the bufferhead attached to the page to
get the disk mapping and then submitting it for IO.

Those days are long gone - filesystems have complex IO submission
paths now that have to handle delayed allocation, copy-on-write,
unwritten extents, have unbound memory demand, etc. All the
filesystems that support these 1990s era filesystem technologies
simply turn off ->writepage in memory reclaim contexts.

Hence for the vast majority of linux users (i.e. everyone using
ext4, btrfs and XFS), ->writepage no longer plays any part in memory
reclaim on their systems.

So why should we try to maintain the fiction that ->writepage is
required functionality in a filesystem when it clearly isn't?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-28 Thread Matthew Wilcox
On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote:
> Hi Christoph!
> 
> On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> > this series removes iomap_writepage and it's callers, following what xfs
> > has been doing for a long time.
> 
> So this effectively means "no writeback from page reclaim for these
> filesystems" AFAICT (page migration of dirty pages seems to be handled by
> iomap_migrate_page()) which is going to make life somewhat harder for
> memory reclaim when memory pressure is high enough that dirty pages are
> reaching end of the LRU list. I don't expect this to be a problem on big
> machines but it could have some undesirable effects for small ones
> (embedded, small VMs). I agree per-page writeback has been a bad idea for
> efficiency reasons for at least last 10-15 years and most filesystems
> stopped dealing with more complex situations (like block allocation) from
> ->writepage() already quite a few years ago without any bug reports AFAIK.
> So it all seems like a sensible idea from FS POV but are MM people on board
> or at least aware of this movement in the fs land?

I mentioned it during my folio session at LSFMM, but didn't put a huge
emphasis on it.

For XFS, writeback should already be in progress on other pages if
we're getting to the point of trying to call ->writepage() in vmscan.
Surely this is also true for other filesystems?



Re: [Cluster-devel] remove iomap_writepage v2

2022-07-28 Thread Jan Kara
Hi Christoph!

On Tue 19-07-22 06:13:07, Christoph Hellwig wrote:
> this series removes iomap_writepage and it's callers, following what xfs
> has been doing for a long time.

So this effectively means "no writeback from page reclaim for these
filesystems" AFAICT (page migration of dirty pages seems to be handled by
iomap_migrate_page()) which is going to make life somewhat harder for
memory reclaim when memory pressure is high enough that dirty pages are
reaching end of the LRU list. I don't expect this to be a problem on big
machines but it could have some undesirable effects for small ones
(embedded, small VMs). I agree per-page writeback has been a bad idea for
efficiency reasons for at least last 10-15 years and most filesystems
stopped dealing with more complex situations (like block allocation) from
->writepage() already quite a few years ago without any bug reports AFAIK.
So it all seems like a sensible idea from FS POV but are MM people on board
or at least aware of this movement in the fs land?

Added a few CC's for that.

Honza

> Changes since v1:
>  - clean up a printk in gfs2
> 
> Diffstat:
>  fs/gfs2/aops.c |   26 --
>  fs/gfs2/log.c  |5 ++---
>  fs/iomap/buffered-io.c |   15 ---
>  fs/zonefs/super.c  |8 
>  include/linux/iomap.h  |3 ---
>  5 files changed, 2 insertions(+), 55 deletions(-)
-- 
Jan Kara 
SUSE Labs, CR



[Cluster-devel] remove iomap_writepage v2

2022-07-18 Thread Christoph Hellwig
Hi all,

this series removes iomap_writepage and it's callers, following what xfs
has been doing for a long time.

Changes since v1:
 - clean up a printk in gfs2

Diffstat:
 fs/gfs2/aops.c |   26 --
 fs/gfs2/log.c  |5 ++---
 fs/iomap/buffered-io.c |   15 ---
 fs/zonefs/super.c  |8 
 include/linux/iomap.h  |3 ---
 5 files changed, 2 insertions(+), 55 deletions(-)