Hi, On 2025-11-20 14:08:57 -0500, Greg Burd wrote: > After talking to you about these ideas at PGConf in NYC I've been > anxiously awaiting this patch set. Thanks for dedicating the energy and > time to get it to this stage.
Thanks! > High level feedback after reading the patches/email/commit messages is > that it looks to get you to where you wanted to be, unblocking AIO > writes. I think they'll end up being faster than what's in place now, > even before you get to the AIO piece. Certainly removing the copy of > each page to do a checksum will help. Opening the door to a future > where we can have super-pinned/locked pages is also a net win. I actually had planned to write about the performance effects, in particular of 0012, a bit more: It's worth pointing out that the new way of setting hint bits is inherently more expensive than what we did before - upgrading a lock to a different lock level isn't free, compared to doing, well, nothing. For paths that set the hint bits of a whole page, like a seqscan, that cost is more than amortized by the batched approach introduced in 0011. Those get faster with the patch, both when already hinted and when not. However, there are paths that aren't easily amenable to that approach, like e.g. an ordered index scan referencing unhinted tuples. There we only ever access a single tuple and release the upgraded lock after every tuple. If the index scan is perfectly correlated with the table and every tuple is unhinted, that's a decent amount of additional work. I've spent a lot of time micro-optimizing that workload, to avoid any significiant regressions. An extreme stress-test started out being about 20% slower than today, as of my current local version, it's a bit faster (~1%) on one of my machines and a bit slower (~2%) on another. Partially that was achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast path for updating a single hint bit, avoiding redundant reads of the lock state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the locking code. The latter is a bit of a cheat though - things would be even faster if we went with the old way of setting hint bits, but with the independent optimizations applied. I think that's ok though: 1) the old way of setting hint bits is a pretty dirty hack that causes issues in quite a few places. 2) by definition, having to set hint bits is an ephemeral state, once the hint bits are set, the difference vanishes 3) no normal workload shows the difference - my stress test does SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000; on a perfectly correlated table with very narrow rows, i.e. an index scan of the whole table, where none of the scan results are ever used. Once one actually uses the resulting rows, the performance difference completely vanishes. 4) as part of the index prefetching work, we might get the infrastructure to actually batch the hint-bit setting in this case too. I see some mild performance gains in workloads like pgbench [-S], but nothing to write home about. Which I think is about what I would expect - there's a minor efficiency gain due to the private refcount changes and not having state tracking for two error recovery mechanisms (lwlocks' and private refcount). Non-all-visible seqscans do see some performance gain due to 0011, whether the table is hinted or not. But it's again something that's mostly noticeable in microbenchmarks, as e.g. tuple deforming or qual evaluation has a much bigger impact. > Everything before/after 0008 was rather easy to digest and understand > and I found nothing really to call out at this stage > > 0008 is understandable too, it's just sizable. While it is large, I find > it well laid out and more readable than before. I gave the locking code > a good look, it seems correct AFAICT. I hope so :), the locking logic it's largely the same as lwlock.c, with some exceptions due to the added lock level and differences in error handling state. I don't really see a good way to split 0008 unfortunately... I previously had split 0012 into four patches (core changes, heapam changes, the rest of the adaptions to setting hint bits in various places, adding assertions relying on the different lock levels), but I found it pretty unwieldly from a "comment management" perspective, because comments that need to be rewritten are temporarily wrong, or would need to be modified in yet another path. But I'm open to going back to that approach. I guess I could pull out the addition of UnlockBuffer() and the "redirection" to it from LockBuffer() into a separate patch. > Keep going, I'll be happy to dedicate time to testing and digging into > the commits as you get this into a final state. I look forward to > extending/enhancing this code once integrated. Cool! I think 0001, 0002, 0003, 0005 and 0009 should be mergeable pretty soon. I've some further polishing to do for 0006 and 0007, but I think they could go in well ahead of the rest. For 0008, in addition to what's noted in the commit message, I think there needs to be an additional section in src/backend/storage/buffer/README (or such), explaining that buffer content locks used to be lwlocks but aren't anymore for xyz reasons. I suspect it'd also be good to have a few references from lwlock.c to bufmgr.c to make sure the code is co-evolved. For 0012, I think it might make sense to pull out some of the changes to fsm_vacuum_page() out into a separate commit. Basically changing the code to acquire a lock on the page - I still can't quite believe that somebody thought it's sane to update the page without even bothering with a share lock. 0014 needs a lot more polishing, there's references to the hint bit / locking interactions all over. Pretty hard to find all the references :(. Greetings, Andres Freund
