Re: Large block sizes support in Linux

Thomas Munro Fri, 22 Mar 2024 21:54:13 -0700

On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
<ker...@pankajraghav.com> wrote:
> My team and I have been working on adding Large block size (LBS)
> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> able to create XFS with FS block size > page size of the system on Linux.
> We also gave a talk about it at the Linux Plumbers Conference recently[2]
> for more context. The initial support is only for XFS but more FSs will
> follow later.


Very cool!

(I used XFS on IRIX in the 90s, and it had large blocks then, a
feature lost in the port to Linux AFAIK.)

> On an x86_64 system, fs block size was limited to 4k, but traditionally
> Postgres uses 8k as its default internal page size. With LBS support,
> fs block size can be set to 8K, thereby matching the Postgres page size.
>
> If the file system block size == DB page size, then Postgres can have
> guarantees that a single DB page will be written as a single unit during
> kernel write back and not split.
>
> My knowledge of Postgres internals is limited, so I'm wondering if there
> are any optimizations or potential optimizations that Postgres could
> leverage once we have LBS support on Linux?

FWIW here are a couple of things I wrote about our storage atomicity
problem, for non-PostgreSQL hackers who may not understand our project
jargon:

https://wiki.postgresql.org/wiki/Full_page_writes
https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf

The short version is that we (and MySQL, via a different scheme with
different tradeoffs) could avoid writing all our stuff out twice if we
could count on atomic writes of a suitable size on power failure, so
the benefits are very large.  As far as I know, there are two things
we need from the kernel and storage to do that on "overwrite"
filesystems like XFS:

1.  The disk must promise that its atomicity-on-power-failure is a
multiple of our block size -- something like NVMe AWUPF, right?  My
devices seem to say 0 :-(  Or I guess the filesystem has to
compensate, but then it's not exactly an overwrite filesystem
anymore...

2.  The kernel must promise that there is no code path in either
buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
other configured block size) writes on some smaller boundary, most
likely sector I guess, on their way to the device, as you were saying.
Not just in happy cases, but even under memory pressure, if
interrupted, etc etc.
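
To make the failure mode concrete for non-PostgreSQL readers, here is a toy
simulation of a torn page. It is only a sketch under stated assumptions: the
page layout (a 4-byte CRC32 in front of the payload), the 4096-byte split
boundary, and all function names are made up for illustration and are not
PostgreSQL's real page format or checksum algorithm.

```python
import zlib

PAGE_SIZE = 8192   # PostgreSQL's default page size
CHUNK_SIZE = 4096  # hypothetical boundary at which the write gets chopped up


def make_page(fill: bytes) -> bytes:
    """Build a toy page: 4-byte CRC32 header followed by a repeated fill byte."""
    body = fill * (PAGE_SIZE - 4)
    return zlib.crc32(body).to_bytes(4, "little") + body


def page_is_valid(page: bytes) -> bool:
    """Check that the stored CRC32 matches the page body."""
    return int.from_bytes(page[:4], "little") == zlib.crc32(page[4:])


def torn_write(old: bytes, new: bytes, chunks_written: int) -> bytes:
    """On-disk image if power fails after `chunks_written` chunks of `new`
    have replaced the corresponding chunks of `old`."""
    cut = chunks_written * CHUNK_SIZE
    return new[:cut] + old[cut:]


old_page = make_page(b"A")
new_page = make_page(b"B")

# Power fails after the first 4096-byte chunk: half new, half old.
torn = torn_write(old_page, new_page, 1)
print(page_is_valid(old_page), page_is_valid(new_page), page_is_valid(torn))
```

The torn image is neither the old page nor the new one, which is the case
that has to be repaired from a second copy of the data today. If the whole
8KB write were atomic, only the outcomes with 0 or 2 chunks written could
be observed.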

Sounds like you're working on problem #2 which is great news.

I've been wondering for a while how a Unixoid kernel should report
these properties to userspace where it knows them, especially on
non-overwrite filesystems like ZFS where this sort of thing works
already, without stuff like AWUPF working the way one might hope.
Here was one throw-away idea on the back of a napkin about that, for
what little it's worth:

https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO
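
For contrast, here is roughly what userspace can ask today. This is a
minimal sketch, assuming a Unix-like system where os.statvfs() is available
and where f_bsize reflects the filesystem block size (that is typical on
Linux, but not something every filesystem promises). The helper names and
the 8192-byte default are my own choices for illustration. Note that it only
tells you about block size, and nothing at all about power-failure
atomicity, which is the missing piece.

```python
import os

DB_PAGE_SIZE = 8192  # PostgreSQL's default page size


def covers_page(fs_block_size: int, page_size: int = DB_PAGE_SIZE) -> bool:
    """True if one filesystem block can hold a whole number of DB pages,
    so that a page never straddles two filesystem blocks."""
    return fs_block_size >= page_size and fs_block_size % page_size == 0


def fs_block_size(path: str) -> int:
    """Block size reported by statvfs() for the filesystem holding `path`."""
    return os.statvfs(path).f_bsize


if __name__ == "__main__":
    bsize = fs_block_size(".")
    print(f"fs block size: {bsize}, covers an 8KB page: {covers_page(bsize)}")
```

On a typical x86_64 filesystem with 4KB blocks this reports that the block
size does not cover an 8KB page. Even when it does, that is a necessary
condition rather than a guarantee.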
