On Fri, Mar 22, 2024 at 10:56 PM Pankaj Raghav (Samsung)
<ker...@pankajraghav.com> wrote:
> My team and I have been working on adding Large block size(LBS)
> support to XFS in Linux[1]. Once this feature lands upstream, we will be
> able to create XFS with FS block size > page size of the system on Linux.
> We also gave a talk about it in Linux Plumbers conference recently[2]
> for more context. The initial support is only for XFS but more FSs will
> follow later.

Very cool! (I used XFS on IRIX in the 90s, and it had large blocks
then, a feature lost in the port to Linux AFAIK.)

> On an x86_64 system, fs block size was limited to 4k, but traditionally
> Postgres uses 8k as its default internal page size. With LBS support,
> fs block size can be set to 8K, thereby matching the Postgres page size.
>
> If the file system block size == DB page size, then Postgres can have
> guarantees that a single DB page will be written as a single unit during
> kernel write back and not split.
>
> My knowledge of Postgres internals is limited, so I'm wondering if there
> are any optimizations or potential optimizations that Postgres could
> leverage once we have LBS support on Linux?

FWIW here are a couple of things I wrote about our storage atomicity
problem, for non-PostgreSQL hackers who may not understand our project
jargon:

https://wiki.postgresql.org/wiki/Full_page_writes
https://freebsdfoundation.org/wp-content/uploads/2023/02/munro_ZFS.pdf

The short version is that we (and MySQL, via a different scheme with
different tradeoffs) could avoid writing all our stuff out twice if we
could count on atomic writes of a suitable size on power failure, so
the benefits are very large. As far as I know, there are two things
we need from the kernel and storage to do that on "overwrite"
filesystems like XFS:

1. The disk must promise that its atomicity-on-power-failure is a
multiple of our block size -- something like NVMe AWUPF, right? My
devices seem to say 0 :-( Or I guess the filesystem has to
compensate, but then it's not exactly an overwrite filesystem
anymore...

2. The kernel must promise that there is no code path in either
buffered I/O or direct I/O that will arbitrarily chop up our 8KB (or
other configured block size) writes on some smaller boundary, most
likely sector I guess, on their way to the device, as you were saying.
Not just in happy cases, but even under memory pressure, if
interrupted, etc.

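To make those two conditions concrete, here is a minimal Python sketch
of the check being described. Everything in it is illustrative: the
function names are hypothetical, the AWUPF conversion assumes the field
is a 0's-based count of logical blocks (my reading of the NVMe spec),
and treating "filesystem block size is a multiple of the page size" as
sufficient for condition 2 is an assumption, not a guarantee any kernel
makes today.

```python
import os

PG_PAGE_SIZE = 8192  # PostgreSQL's default page size, per the thread


def awupf_to_bytes(awupf: int, lba_size: int) -> int:
    """Convert an NVMe AWUPF value to bytes.

    Assumption: AWUPF is a 0's-based count of logical blocks, so a
    reported value of 0 means one logical block.
    """
    return (awupf + 1) * lba_size


def fs_block_size(path: str) -> int:
    """Return the block size the filesystem at `path` reports via statvfs."""
    return os.statvfs(path).f_bsize


def could_skip_double_write(page_size: int, fs_block: int, atomic_bytes: int) -> bool:
    """Model the two conditions from the list above (hypothetical helper).

    1. Device: power-fail atomicity is a multiple of the page size.
    2. Kernel/filesystem: the block size is a multiple of the page size,
       standing in for "no code path splits a page-sized write".
    """
    device_ok = atomic_bytes >= page_size and atomic_bytes % page_size == 0
    kernel_ok = fs_block >= page_size and fs_block % page_size == 0
    return device_ok and kernel_ok


if __name__ == "__main__":
    # Example: a device reporting AWUPF = 0 with 4096-byte logical blocks.
    atomic = awupf_to_bytes(0, 4096)
    print(could_skip_double_write(PG_PAGE_SIZE, 8192, atomic))
```

Under those assumptions, a device reporting AWUPF = 0 with 4KB logical
blocks fails condition 1 even if the filesystem block size is 8KB, which
is why both promises are needed together.
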
Sounds like you're working on problem #2 which is great news.

I've been wondering for a while how a Unixoid kernel should report
these properties to userspace where it knows them, especially on
non-overwrite filesystems like ZFS where this sort of thing works
already, without stuff like AWUPF working the way one might hope.
Here was one throw-away idea on the back of a napkin about that, for
what little it's worth:

https://wiki.postgresql.org/wiki/FreeBSD/AtomicIO