Hi,

On 2023-10-02 10:42:37 -0400, Robert Haas wrote:
> I was trying to think of a test case where XLogInsertRecord would be
> exercised as heavily as possible, so I really wanted to generate a lot
> of WAL while doing as little real work as possible. The best idea that
> I had was to run pg_create_restore_point() in a loop.

What I use for that is pg_logical_emit_message(). Something like

  SELECT count(*)
  FROM (
      SELECT pg_logical_emit_message(false, '1', 'short'), generate_series(1, 10000)
  );

run via pgbench does seem to exercise that path nicely.

> One possible conclusion is that the differences here aren't actually
> big enough to get stressed about, but I don't want to jump to that
> conclusion without investigating the competing hypothesis that this
> isn't the right way to test this, and that some better test would show
> clearer results. Suggestions?

I saw some small differences in runtime running pgbench with the above query,
with a single client. Comparing profiles showed a surprising degree of
difference. That turns out to be mostly a consequence of the fact that
ReserveXLogInsertLocation() isn't inlined anymore, because there now are two
callers of the function in XLogInsertRecord().

Unfortunately, I still see a small performance difference after that. To get
the most reproducible numbers, I disabled turbo boost, bound postgres to one
CPU core, and bound pgbench to another core. Over a few runs I quite
reproducibly get ~319.323 tps with your patches applied (+ always inline),
and ~324.674 with master.

If I add an unlikely() around if (rechdr->xl_rmid == RM_XLOG_ID), the
performance does improve. But that "only" brings it up to 322.406. Not sure
what the rest is.

One thing that's notable, but not related to the patch, is that we waste a
fair bit of CPU time below XLogInsertRecord() on divisions. I think they're
all due to the use of UsableBytesInSegment in
XLogBytePosToRecPtr/XLogBytePosToEndRecPtr. The multiplication in
XLogSegNoOffsetToRecPtr() also shows up.

Greetings,

Andres Freund
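
PS: As a rough illustration of the division point above, here is a simplified, self-contained sketch of a byte-position to record-pointer conversion. It is not the actual xlog.c code. Everything prefixed with sketch_ or SKETCH_ is a hypothetical name, and the header sizes (24 and 40 bytes), the 8kB page size and the 16MB segment size are assumed values for illustration. The point is only that the usable-bytes-per-segment divisor lives in a runtime variable, so the compiler presumably has to emit real 64-bit division instructions for it.

```c
#include <stdint.h>

/* Assumed values for illustration only; PostgreSQL takes these from its
 * headers and configuration. */
#define SKETCH_BLCKSZ               8192u  /* assumed WAL page size */
#define SKETCH_SHORT_PHD            24u    /* assumed short page header size */
#define SKETCH_LONG_PHD             40u    /* assumed long page header size */
#define SKETCH_USABLE_BYTES_IN_PAGE (SKETCH_BLCKSZ - SKETCH_SHORT_PHD)

/* Runtime variables: the segment size is not a compile-time constant here,
 * so divisions by the derived value below cannot be folded into shifts or
 * multiplications at compile time. */
static uint64_t sketch_segment_size = 16u * 1024 * 1024;
static uint64_t sketch_usable_bytes_in_segment;

/* Derive the usable payload bytes per segment: every page loses a short
 * header, and the first page of a segment has a long header instead. */
static void
sketch_init(void)
{
    sketch_usable_bytes_in_segment =
        (sketch_segment_size / SKETCH_BLCKSZ) * SKETCH_USABLE_BYTES_IN_PAGE
        - (SKETCH_LONG_PHD - SKETCH_SHORT_PHD);
}

/* Map a "usable byte position" (headers excluded) to a byte offset in the
 * WAL stream (headers included). */
static uint64_t
sketch_bytepos_to_recptr(uint64_t bytepos)
{
    /* Division and modulo by a runtime value. */
    uint64_t fullsegs = bytepos / sketch_usable_bytes_in_segment;
    uint64_t bytesleft = bytepos % sketch_usable_bytes_in_segment;
    uint64_t seg_offset;

    if (bytesleft < SKETCH_BLCKSZ - SKETCH_LONG_PHD)
    {
        /* Fits on the first page of the segment. */
        seg_offset = bytesleft + SKETCH_LONG_PHD;
    }
    else
    {
        /* Skip the first page, then count full pages with short headers. */
        uint64_t fullpages;

        seg_offset = SKETCH_BLCKSZ;
        bytesleft -= SKETCH_BLCKSZ - SKETCH_LONG_PHD;

        /* Divisor is a compile-time constant in this sketch. */
        fullpages = bytesleft / SKETCH_USABLE_BYTES_IN_PAGE;
        bytesleft = bytesleft % SKETCH_USABLE_BYTES_IN_PAGE;

        seg_offset += fullpages * SKETCH_BLCKSZ + bytesleft + SKETCH_SHORT_PHD;
    }

    /* Multiplication by the runtime segment size. */
    return fullsegs * sketch_segment_size + seg_offset;
}
```

Call sketch_init() once before sketch_bytepos_to_recptr(). With the assumed sizes, byte position 0 lands just after the long header of the first page, and the first byte position of the second segment lands just after that segment's long header.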