Re: [Cluster-devel] remove iomap_writepage v2
On Wed, Aug 10, 2022 at 09:43:58PM +0100, Matthew Wilcox wrote: > To avoid duplicating work with you or Christoph ... it seems like the > plan is to kill ->writepage entirely soon, so there's no point in me > doing a sweep of all the filesystems to convert ->writepage to > ->write_folio, correct? While I won't commit to concrete definition of "soon" I think that is the general plan, and I don't think converting ->writepage to ->write_folio is needed. > I assume the plan for filesystems which have a writepage but don't have > a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix, > nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give > them a writepages, and ->migrate_folio if missing, yes. > modelled on iomap_writepages(). Seems that adding > a block_writepages() might be a useful thing for me to do? We have mpage_writepages which basically is the ->writepages counterpart to block_write_full_page. The only caveat is of course that it can use multi-block mappings from the get_bock callback, so we need to make sure the file systems don't do anything stupid there.
Re: [Cluster-devel] remove iomap_writepage v2
On Wed, Aug 10, 2022 at 11:32:06PM +0200, Andreas Grünbacher wrote: > Am Mi., 10. Aug. 2022 um 22:57 Uhr schrieb Matthew Wilcox > : > > On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote: > > > XFS hasn't had a ->writepage call for a while. After LSF I internally > > > tested dropping btrfs' callback, and the results looked good: no OOM > > > kills with dirty/writeback pages remaining, performance parity. Then I > > > went on vacation and Christoph beat me to the patch :) > > > > To avoid duplicating work with you or Christoph ... it seems like the > > plan is to kill ->writepage entirely soon, so there's no point in me > > doing a sweep of all the filesystems to convert ->writepage to > > ->write_folio, correct? > > > > I assume the plan for filesystems which have a writepage but don't have > > a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix, > > nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give > > them a writepages, modelled on iomap_writepages(). Seems that adding > > a block_writepages() might be a useful thing for me to do? > > Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages() > functions, so it should probably be fine. Ah, it's gfs2_aspace_writepage which doesn't have a writepages counterpart. I haven't looked at it to understand why it's needed. (gfs2_meta_aops and gfs2_rgrp_aops)
Re: [Cluster-devel] remove iomap_writepage v2
Am Mi., 10. Aug. 2022 um 22:57 Uhr schrieb Matthew Wilcox : > On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote: > > XFS hasn't had a ->writepage call for a while. After LSF I internally > > tested dropping btrfs' callback, and the results looked good: no OOM > > kills with dirty/writeback pages remaining, performance parity. Then I > > went on vacation and Christoph beat me to the patch :) > > To avoid duplicating work with you or Christoph ... it seems like the > plan is to kill ->writepage entirely soon, so there's no point in me > doing a sweep of all the filesystems to convert ->writepage to > ->write_folio, correct? > > I assume the plan for filesystems which have a writepage but don't have > a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix, > nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give > them a writepages, modelled on iomap_writepages(). Seems that adding > a block_writepages() might be a useful thing for me to do? Hmm, gfs2 does have gfs2_writepages() and gfs2_jdata_writepages() functions, so it should probably be fine. Thanks, Andreas
Re: [Cluster-devel] remove iomap_writepage v2
On Mon, Aug 01, 2022 at 11:31:50AM -0400, Johannes Weiner wrote: > XFS hasn't had a ->writepage call for a while. After LSF I internally > tested dropping btrfs' callback, and the results looked good: no OOM > kills with dirty/writeback pages remaining, performance parity. Then I > went on vacation and Christoph beat me to the patch :) To avoid duplicating work with you or Christoph ... it seems like the plan is to kill ->writepage entirely soon, so there's no point in me doing a sweep of all the filesystems to convert ->writepage to ->write_folio, correct? I assume the plan for filesystems which have a writepage but don't have a ->writepages (9p, adfs, affs, bfs, ecryptfs, gfs2, hostfs, jfs, minix, nilfs2, ntfs, ocfs2, reiserfs, sysv, ubifs, udf, ufs, vboxsf) is to give them a writepages, modelled on iomap_writepages(). Seems that adding a block_writepages() might be a useful thing for me to do?
Re: [Cluster-devel] remove iomap_writepage v2
On Fri, Jul 29, 2022 at 04:11:45PM +0200, Christoph Hellwig wrote: > On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote: > > There is some context missing because it's not clear what the full impact is > > but it is definitly the case that writepage is ignored in some contexts for > > common filesystems so lets assume that writepage from reclaim context always > > failed as a worst case scenario. Certainly this type of change is something > > linux-mm needs to be aware of because we've been blind-sided before. > > Between willy and Johannes pushing or it I was under the strong assumption > that linux-mm knows of it.. Yes, the context was an FS session at LSFMM. FS folks complained about the MM relying on single-page writeouts. On the MM side we've invested a lot into eliminating this dependency over the last decade or so, and I would argue it's gone today. But the calls are still in the code, and so FS folks continue to operate under the old assumption. You can't blame them. I suggested we remove the callbacks to clarify things and eliminate that murky corner from the FS/MM interface. Compaction/migration is easy. It simply never calls writepage when there is a migratepage callback - which major filesystems have. Reclaim may still call it, but the invocation rules are so restrictive nowadays that it's unlikely to actually help when it matters (and we know it makes things worse in many cases). For example, cgroup reclaim isn't ever allowed to call writepage. This covers the small system scenario. Whether writepage helps under OOM is not clear. OOMing systems tend to have thundering herds of direct reclaimers, any one of which can declare OOM if they fail, yet none of them can write pages. They already rely on another thread to make progress. That thread can be the flushers writing in offset order, or kswapd writing in LRU order. You could argue that LRU order, while less efficient IO, launders pages closer to the scanner. From an OOM perspective that doesn't matter, though, because scanners will work through the entire LRU list several times before giving up. They won't miss flusher progress. That leaves reclaim efficiency - having to scan fewer pages before potentially finding clean ones. But it's an IO bound scenario, so arguably efficient IO would seem more important than efficient CPU. The risk of this change, IMO, is exposing reclaim to flat out bugs in the flusher code, or bugs in the code that matches reclaim to flushing speed. However, a) cgroup has been relying on those for a decade. And b) we've been treating writepage calls like bugs due to the latency they inject into workloads, and tuned the MM to rely more on flushers (e.g. c55e8d035b28 ("mm: vmscan: move dirty pages out of the way until they're flushed")). So we know this stuff works at scale and with real workloads. I think the risk of dragons there is quite low. XFS hasn't had a ->writepage call for a while. After LSF I internally tested dropping btrfs' callback, and the results looked good: no OOM kills with dirty/writeback pages remaining, performance parity. Then I went on vacation and Christoph beat me to the patch :) I think it's a really good cleanup that makes things cleaner and more predictable in both the fs and the mm. > > I don't think it would be incredibly damaging although there *might* be > > issues with small systems or cgroups. > > Johannes specifically mentioned that cgroup writeback will never call > into ->writepage anyway. Yes, cgroup has relied on the flushers since commit ee72886d8ed5d9de3fa0ed3b99a7ca7702576a96 Author: Mel Gorman Date: Mon Oct 31 17:07:38 2011 -0700 mm: vmscan: do not writeback filesystem pages in direct reclaim since cgroup reclaim == direct reclaim.
Re: [Cluster-devel] remove iomap_writepage v2
On Fri, Jul 29, 2022 at 10:22:16AM +0100, Mel Gorman wrote: > There is some context missing because it's not clear what the full impact is > but it is definitly the case that writepage is ignored in some contexts for > common filesystems so lets assume that writepage from reclaim context always > failed as a worst case scenario. Certainly this type of change is something > linux-mm needs to be aware of because we've been blind-sided before. Between willy and Johannes pushing or it I was under the strong assumption that linux-mm knows of it.. > I don't think it would be incredibly damaging although there *might* be > issues with small systems or cgroups. Johannes specifically mentioned that cgroup writeback will never call into ->writepage anyway.
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > Hi Christoph! > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > this series removes iomap_writepage and it's callers, following what xfs > > has been doing for a long time. > > So this effectively means "no writeback from page reclaim for these > filesystems" AFAICT (page migration of dirty pages seems to be handled by > iomap_migrate_page()) which is going to make life somewhat harder for > memory reclaim when memory pressure is high enough that dirty pages are > reaching end of the LRU list. I don't expect this to be a problem on big > machines but it could have some undesirable effects for small ones > (embedded, small VMs). I agree per-page writeback has been a bad idea for > efficiency reasons for at least last 10-15 years and most filesystems > stopped dealing with more complex situations (like block allocation) from > ->writepage() already quite a few years ago without any bug reports AFAIK. > So it all seems like a sensible idea from FS POV but are MM people on board > or at least aware of this movement in the fs land? > > Added a few CC's for that. > There is some context missing because it's not clear what the full impact is but it is definitly the case that writepage is ignored in some contexts for common filesystems so lets assume that writepage from reclaim context always failed as a worst case scenario. Certainly this type of change is something linux-mm needs to be aware of because we've been blind-sided before. I don't think it would be incredibly damaging although there *might* be issues with small systems or cgroups. In many respects, vmscan has been moving in this direction for a long time e.g. f84f6e2b0868 ("mm: vmscan: do not writeback filesystem pages in kswapd except in high priority") and e2be15f6c3ee ("mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered"). This was roughly 10 years ago when it was clear that FS writeback from reclaim context was fragile (iirc it was partially due to concerns about stack depth and later concerns that a filesystem would simply ignore the writeback request). There also is less reliance on stalling reclaim by queueing and waiting on writeback but we now should explicitly throttle if no progress is being made e.g. 69392a403f49 ("mm/vmscan: throttle reclaim when no progress is being made") and some follow-up fixes. There still is a reliance on swap and shmem pages will not ignore writepage but I assume that is still ok. One potential caveat is if wakeup_flusher_threads() is ignored because there is a reliance in reclaim that if all the pages at the tail of the LRU are dirty pages not queued for writeback then wakeup_flusher_threads() will do something so pages get marked for immediate reclaim when writeback completes. Of course there is no guarantee that flusher threads will start writeback on the old pages and the pages could be backed by a slow BDI but wakeup_flusher_threads() should not be ignored. Another caveat is that comments become misleading. Take for example the comment "Only kswapd can writeback filesystem folios to avoid risk of stack overflow." The wording should change to note that writepage may do nothing at all. There also might need to be some adjustment on when pages get marked for immediate reclaim when they are dirty but not under writeback. pageout() might need a tracepoint for "mapping->a_ops->writepage == NULL" to help debug problems around a failure to queue pages for writback although that could be done as a test patch for a bug. There would need to be some changes made if writepage always or often failed and there might be some premature throttling based on "NOPROGRESS" on small systems due to dirty-not-writepage pages at the tail of the LRU but I don't think it would be an immediate disaster, Reclaim throttling is no longer based on the ability to queue for writeback or "congestion" state of BDIs and some care is taken to not prematurely stall on NOPROGRESS. However, if there was a bug related to premature stalls or excessive CPU usage from direct reclaimers or kswapd that bisected to a change in writepage then it should be fixed on the vmscan-side to put an emphasis on handling "reclaim is in trouble when the bulk of reclaimable pages are FS-only, dirty and not under writeback but ->writepage is a no-op". -- Mel Gorman SUSE Labs
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 3:48 PM Dave Chinner wrote: > > On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote: > > On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > > > Hi Christoph! > > > > > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > > > this series removes iomap_writepage and it's callers, following what xfs > > > > has been doing for a long time. > > > > > > So this effectively means "no writeback from page reclaim for these > > > filesystems" AFAICT (page migration of dirty pages seems to be handled by > > > iomap_migrate_page()) which is going to make life somewhat harder for > > > memory reclaim when memory pressure is high enough that dirty pages are > > > reaching end of the LRU list. I don't expect this to be a problem on big > > > machines but it could have some undesirable effects for small ones > > > (embedded, small VMs). I agree per-page writeback has been a bad idea for > > > efficiency reasons for at least last 10-15 years and most filesystems > > > stopped dealing with more complex situations (like block allocation) from > > > ->writepage() already quite a few years ago without any bug reports AFAIK. > > > So it all seems like a sensible idea from FS POV but are MM people on > > > board > > > or at least aware of this movement in the fs land? > > > > I mentioned it during my folio session at LSFMM, but didn't put a huge > > emphasis on it. > > > > For XFS, writeback should already be in progress on other pages if > > we're getting to the point of trying to call ->writepage() in vmscan. > > Surely this is also true for other filesystems? > > Yes. > > It's definitely true for btrfs, too, because btrfs_writepage does: > > static int btrfs_writepage(struct page *page, struct writeback_control *wbc) > { > struct inode *inode = page->mapping->host; > int ret; > > if (current->flags & PF_MEMALLOC) { > redirty_page_for_writepage(wbc, page); > unlock_page(page); > return 0; > } > > > It also rejects all calls to write dirty pages from memory reclaim > contexts. Aha, it seems even kswapd (it has PF_MEMALLOC set) is rejected too. > > ext4 will also reject writepage calls from memory allocation if > block allocation is required (due to delayed allocation) or > unwritten extents need converting to written. i.e. if it has to run > blocking transactions. > > So all three major filesystems will either partially or wholly > reject ->writepage calls from memory reclaim context. > > IOWs, if memory reclaim is depending on ->writepage() to make > reclaim progress, it's not working as advertised on the vast > majority of production Linux systems > > The reality is that ->writepage is a relic of a bygone era of OS and > filesystem design. It was useful in the days where writing a dirty > page just involved looking up the bufferhead attached to the page to > get the disk mapping and then submitting it for IO. > > Those days are long gone - filesystems have complex IO submission > paths now that have to handle delayed allocation, copy-on-write, > unwritten extents, have unbound memory demand, etc. All the > filesystems that support these 1990s era filesystem technologies > simply turn off ->writepage in memory reclaim contexts. > > Hence for the vast majority of linux users (i.e. everyone using > ext4, btrfs and XFS), ->writepage no longer plays any part in memory > reclaim on their systems. > > So why should we try to maintain the fiction that ->writepage is > required functionality in a filesystem when it clearly isn't? > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 03:18:03PM +0100, Matthew Wilcox wrote: > On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > > Hi Christoph! > > > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > > this series removes iomap_writepage and it's callers, following what xfs > > > has been doing for a long time. > > > > So this effectively means "no writeback from page reclaim for these > > filesystems" AFAICT (page migration of dirty pages seems to be handled by > > iomap_migrate_page()) which is going to make life somewhat harder for > > memory reclaim when memory pressure is high enough that dirty pages are > > reaching end of the LRU list. I don't expect this to be a problem on big > > machines but it could have some undesirable effects for small ones > > (embedded, small VMs). I agree per-page writeback has been a bad idea for > > efficiency reasons for at least last 10-15 years and most filesystems > > stopped dealing with more complex situations (like block allocation) from > > ->writepage() already quite a few years ago without any bug reports AFAIK. > > So it all seems like a sensible idea from FS POV but are MM people on board > > or at least aware of this movement in the fs land? > > I mentioned it during my folio session at LSFMM, but didn't put a huge > emphasis on it. > > For XFS, writeback should already be in progress on other pages if > we're getting to the point of trying to call ->writepage() in vmscan. > Surely this is also true for other filesystems? Yes. It's definitely true for btrfs, too, because btrfs_writepage does: static int btrfs_writepage(struct page *page, struct writeback_control *wbc) { struct inode *inode = page->mapping->host; int ret; if (current->flags & PF_MEMALLOC) { redirty_page_for_writepage(wbc, page); unlock_page(page); return 0; } It also rejects all calls to write dirty pages from memory reclaim contexts. ext4 will also reject writepage calls from memory allocation if block allocation is required (due to delayed allocation) or unwritten extents need converting to written. i.e. if it has to run blocking transactions. So all three major filesystems will either partially or wholly reject ->writepage calls from memory reclaim context. IOWs, if memory reclaim is depending on ->writepage() to make reclaim progress, it's not working as advertised on the vast majority of production Linux systems The reality is that ->writepage is a relic of a bygone era of OS and filesystem design. It was useful in the days where writing a dirty page just involved looking up the bufferhead attached to the page to get the disk mapping and then submitting it for IO. Those days are long gone - filesystems have complex IO submission paths now that have to handle delayed allocation, copy-on-write, unwritten extents, have unbound memory demand, etc. All the filesystems that support these 1990s era filesystem technologies simply turn off ->writepage in memory reclaim contexts. Hence for the vast majority of linux users (i.e. everyone using ext4, btrfs and XFS), ->writepage no longer plays any part in memory reclaim on their systems. So why should we try to maintain the fiction that ->writepage is required functionality in a filesystem when it clearly isn't? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [Cluster-devel] remove iomap_writepage v2
On Thu, Jul 28, 2022 at 01:10:16PM +0200, Jan Kara wrote: > Hi Christoph! > > On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > > this series removes iomap_writepage and it's callers, following what xfs > > has been doing for a long time. > > So this effectively means "no writeback from page reclaim for these > filesystems" AFAICT (page migration of dirty pages seems to be handled by > iomap_migrate_page()) which is going to make life somewhat harder for > memory reclaim when memory pressure is high enough that dirty pages are > reaching end of the LRU list. I don't expect this to be a problem on big > machines but it could have some undesirable effects for small ones > (embedded, small VMs). I agree per-page writeback has been a bad idea for > efficiency reasons for at least last 10-15 years and most filesystems > stopped dealing with more complex situations (like block allocation) from > ->writepage() already quite a few years ago without any bug reports AFAIK. > So it all seems like a sensible idea from FS POV but are MM people on board > or at least aware of this movement in the fs land? I mentioned it during my folio session at LSFMM, but didn't put a huge emphasis on it. For XFS, writeback should already be in progress on other pages if we're getting to the point of trying to call ->writepage() in vmscan. Surely this is also true for other filesystems?
Re: [Cluster-devel] remove iomap_writepage v2
Hi Christoph! On Tue 19-07-22 06:13:07, Christoph Hellwig wrote: > this series removes iomap_writepage and it's callers, following what xfs > has been doing for a long time. So this effectively means "no writeback from page reclaim for these filesystems" AFAICT (page migration of dirty pages seems to be handled by iomap_migrate_page()) which is going to make life somewhat harder for memory reclaim when memory pressure is high enough that dirty pages are reaching end of the LRU list. I don't expect this to be a problem on big machines but it could have some undesirable effects for small ones (embedded, small VMs). I agree per-page writeback has been a bad idea for efficiency reasons for at least last 10-15 years and most filesystems stopped dealing with more complex situations (like block allocation) from ->writepage() already quite a few years ago without any bug reports AFAIK. So it all seems like a sensible idea from FS POV but are MM people on board or at least aware of this movement in the fs land? Added a few CC's for that. Honza > Changes since v1: > - clean up a printk in gfs2 > > Diffstat: > fs/gfs2/aops.c | 26 -- > fs/gfs2/log.c |5 ++--- > fs/iomap/buffered-io.c | 15 --- > fs/zonefs/super.c |8 > include/linux/iomap.h |3 --- > 5 files changed, 2 insertions(+), 55 deletions(-) -- Jan Kara SUSE Labs, CR
[Cluster-devel] remove iomap_writepage v2
Hi all, this series removes iomap_writepage and it's callers, following what xfs has been doing for a long time. Changes since v1: - clean up a printk in gfs2 Diffstat: fs/gfs2/aops.c | 26 -- fs/gfs2/log.c |5 ++--- fs/iomap/buffered-io.c | 15 --- fs/zonefs/super.c |8 include/linux/iomap.h |3 --- 5 files changed, 2 insertions(+), 55 deletions(-)