On Wed, May 8, 2024 at 6:54 AM Justin Pryzby <pry...@telsasoft.com> wrote:
> On Tue, May 07, 2024 at 10:55:28AM +1200, Thomas Munro wrote:
> > https://github.com/openzfs/zfs/issues/11641
> >
> > I don't know enough to say anything useful about that but it certainly
> > smells similar...
>
> Wow - I'd completely forgotten about that problem report.
> The symptoms are the same, even with a zfs version 3+ years newer.
> I wish the ZFS people would do more with their problem reports.
If I had to guess, my first idea would be that your 1MB or ginormous 16MB recordsize (a relatively new option) combined with PostgreSQL's 8KB block-at-a-time random-order I/O patterns is tickling strange corners and finding a bug that no one has seen before. I would imagine that almost everyone in the galaxy who uses very large records does so with 'settled' data that gets streamed out once sequentially (for example, I think some of the OpenZFS maintainers are at Lawrence Livermore National Lab, where I guess they might pump around petabytes of data produced by particle physics research or whatever it might be, which is probably why they are also adding direct I/O to bypass caches completely...). But for us, if we have lots of backends reading, writing and extending random 8KB fragments of a 16MB record concurrently (2048 pages per record!), maybe we hit some broken edge... I'd be sure to include that sort of detail in any future reports.

Normally I suppress urges to blame problems on kernels, file systems etc, and in the past, accusations that ZFS was buggy turned out to be bugs in PostgreSQL IIRC, but user space sure seems to be off the hook for this one...
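For anyone wanting to reproduce the numbers above: the 2048 figure is just the ratio of the ZFS recordsize to PostgreSQL's block size. A quick sketch (the 16MB recordsize is the setting from this thread, not a default; 8KB is PostgreSQL's default BLCKSZ):

```python
# How many PostgreSQL pages share a single ZFS record at the
# recordsize discussed in this thread.
ZFS_RECORDSIZE = 16 * 1024 * 1024   # 16MB recordsize from the report
PG_BLCKSZ = 8 * 1024                # PostgreSQL's default 8KB block size

pages_per_record = ZFS_RECORDSIZE // PG_BLCKSZ
print(pages_per_record)  # -> 2048
```

So every random 8KB read or write lands somewhere inside a 16MB record that 2047 other pages also live in, which is the kind of access pattern a large-record workload tuned for sequential streaming would rarely see.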