On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsiang...@linux.alibaba.com> wrote:
>
>
>
> On 2025/7/16 07:32, Gao Xiang wrote:
> > Hi Matthew,
> >
> > On 2025/7/16 04:40, Matthew Wilcox wrote:
> >> I've started looking at how the page cache can help filesystems handle
> >> compressed data better.  Feedback would be appreciated!  I'll probably
> >> say a few things which are obvious to anyone who knows how compressed
> >> files work, but I'm trying to be explicit about my assumptions.
> >>
> >> First, I believe that all filesystems work by compressing fixed-size
> >> plaintext into variable-sized compressed blocks.  This would be a good
> >> point to stop reading and tell me about counterexamples.
> >
> > At least the typical EROFS case compresses variable-sized plaintext (at
> > least one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized
> > compressed blocks for efficient I/Os, which is really useful for small
> > compression granularity (e.g. 4KiB, 8KiB) because use cases like Android
> > are usually under memory pressure, so large compression granularity is
> > almost unacceptable in low-memory scenarios; see:
> > https://erofs.docs.kernel.org/en/latest/design.html
> >
> > Currently EROFS works pretty well on these devices and has been
> > successfully deployed in billions of real devices.
> >
> >>
> >> From what I've been reading, all your filesystems want to allocate
> >> extra pages in the page cache in order to store the excess data
> >> retrieved along with the page that you're actually trying to read.  That's
> >> because compressing in larger chunks leads to better compression.
> >>
> >> There's some discrepancy between filesystems whether you need scratch
> >> space for decompression.  Some filesystems read the compressed data into
> >> the pagecache and decompress in-place, while other filesystems read the
> >> compressed data into scratch pages and decompress into the page cache.
> >>
> >> There also seems to be some discrepancy between filesystems whether the
> >> decompression involves vmap() of all the memory allocated or whether the
> >> decompression routines can handle doing kmap_local() on individual pages.
> >>
> >> So, my proposal is that filesystems tell the page cache that their minimum
> >> folio size is the compression block size.  That seems to be around 64k,
> >> so not an unreasonable minimum allocation size.  That removes all the
> >> extra code in filesystems to allocate extra memory in the page cache.
> >> It means we don't attempt to track dirtiness at a sub-folio granularity
> >> (there's no point, we have to write back the entire compressed block
> >> at once).  We also get a single virtually contiguous block ... if you're
> >> willing to ditch HIGHMEM support.  Or there's a proposal to introduce a
> >> vmap_file() which would give us a virtually contiguous chunk of memory
> >> (and could be trivially turned into a noop for the case of trying to
> >> vmap a single large folio).
> >
> > I don't see how this will work for EROFS, because EROFS always supports
> > variable uncompressed extent lengths, and that would break typical
> > EROFS use cases and on-disk formats.
> >
> > Another thing is that large order folios (physically consecutive) can
> > cause "increase[d] latency on UX task with filemap_fault()"
> > because of high-order direct reclaim, see:
> > https://android-review.googlesource.com/c/kernel/common/+/3692333
> > so EROFS will not set a min-order and will always support order-0 folios.
> >
> > I don't think EROFS will use this new approach; the vmap() interface
> > remains the right fit for us.
>
> ... high-order folios can cause side effects on embedded devices
> like routers and IoT devices, which still have only MiBs of memory (and
> I believe this won't change, given their use cases) but have also run
> the Linux kernel for quite a long time.  In short, I don't think
> enabling large folios for those devices is very useful, let alone
> requiring a minimum folio order for them (that would make the
> filesystem unsuitable for those users, which is something I never
> want to do).  And I believe this is different from the current LBS
> support, which exists to match hardware characteristics or LBS atomic
> write requirements.

Given the difficulty of allocating large folios, it's always a good
idea to have order-0 as a fallback. While I agree with your point,
I have a slightly different perspective: enabling large folios for
those devices might still be beneficial, as long as the maximum order
remains small. I'm referring to "small" large folios.

Still, even with those, allocation can be difficult, especially
since so many other allocations (which aren't large folios) can cause
fragmentation. So having order-0 as a fallback remains important.

It seems we're missing a mechanism to enable "small" large folios
for files. For anon large folios, we do have sysfs knobs, though they
don't seem to be universally appreciated. :-)
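For reference, the anon-side knobs I mean are the per-size mTHP
entries under sysfs (the 64KiB path below is just one example order on
a 4KiB-page kernel); there's no per-file or per-mapping equivalent for
the page cache today:

```
# Per-size ("small" large folio) controls for anonymous memory (mTHP):
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
#   accepted values: always inherit madvise never
# No equivalent knob exists yet for file-backed (page cache) folios.
```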

Thanks
Barry


_______________________________________________
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
