Hi Xuneng, Imran, and everyone,

> I’m curious how this approach differs from those previous efforts, and
> why those attempts ultimately did not land.

There is directly relevant prior art that may be worth looking at. Koichi Suzuki presented parallel recovery at PGCon 2023 [1] and published a detailed design on the PostgreSQL wiki [2] with a working prototype on GitHub.

Koichi's approach is quite different from the current patch: instead of pipelining decode, he parallelizes redo itself by dispatching WAL records to block workers based on page identity. The key rule is that for a given page, WAL records are applied in written order, but different pages can be replayed in parallel by different workers. His design uses a dispatcher to route records to workers, with synchronization needed for multi-block WAL records. One thing I wondered is whether the dispatcher could be avoided entirely: if each child simply reads the whole WAL stream on its own and skips blocks that are not assigned to it, there would be no IPC and no need to coordinate multi-block records across workers.

The hard problem he ran into was Hot Standby visibility: when index and heap pages are replayed by different workers at different speeds, concurrent queries can see inconsistent state. The wiki itself notes the idea is to "use this when hot standby is disabled." As far as I know, this was never submitted as a patch to hackers.

> It also raises an implicit question: what makes the current approach
> more promising -- whether due to a simpler design or improved
> performance.

The two approaches target different bottlenecks. The current patch parallelizes WAL decoding, which keeps the redo path single-threaded and avoids the Hot Standby visibility problem entirely.

One thing I am curious about in the current patch: WAL records are already in a serialized format on disk. The producer decodes them and then re-serializes into a different custom format for shm_mq. What is the advantage of this second serialization format over simply passing the raw WAL bytes after CRC validation and letting the consumer decode directly?
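
To make the raw-bytes alternative concrete, here is a toy sketch in Python. It is not PostgreSQL code and makes several assumptions: the record layout (4-byte length, 4-byte CRC, payload) is invented and is not the real XLogRecord, zlib.crc32 stands in for the CRC-32C that WAL actually uses, and a thread plus queue.Queue stands in for a separate process plus shm_mq. The only point is the shape: the producer validates and forwards bytes unchanged, and the consumer does the decode.

```python
import queue
import struct
import threading
import zlib

# Hypothetical toy layout, NOT the real XLogRecord:
# 4-byte payload length, 4-byte CRC of the payload, then the payload.
HEADER = struct.Struct("<II")


def make_record(payload: bytes) -> bytes:
    """Serialize one toy record (zlib.crc32 is a stand-in for CRC-32C)."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def producer(stream: bytes, out: queue.Queue) -> None:
    """Validate each record's CRC and forward the raw bytes unchanged."""
    pos = 0
    while pos + HEADER.size <= len(stream):
        length, crc = HEADER.unpack_from(stream, pos)
        end = pos + HEADER.size + length
        raw = stream[pos:end]
        # A short read or CRC mismatch is treated as the end of valid WAL.
        if end > len(stream) or zlib.crc32(raw[HEADER.size:]) != crc:
            break
        out.put(raw)  # no second serialization format, just the bytes
        pos = end
    out.put(None)  # sentinel: no more records


def consumer(inq: queue.Queue, applied: list) -> None:
    """Decode each already-validated record and 'apply' its payload."""
    while (raw := inq.get()) is not None:
        length, _crc = HEADER.unpack_from(raw)
        applied.append(raw[HEADER.size:HEADER.size + length])


def replay(stream: bytes) -> list:
    """Run producer and consumer concurrently over one toy WAL stream."""
    q = queue.Queue(maxsize=64)  # bounded, loosely like a fixed-size ring
    applied = []
    t = threading.Thread(target=producer, args=(stream, q))
    t.start()
    consumer(q, applied)
    t.join()
    return applied
```

In this shape the consumer still pays the full decode cost, so the only work moved off the redo thread is reading and CRC checking.
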
Offloading CRC to a separate core could still improve throughput at the cost of higher total CPU usage, without needing the custom format.

Koichi's approach parallelizes redo (buffer I/O) itself, which attacks a larger cost (Jakub's flamegraphs show BufferAlloc -> GetVictimBuffer -> FlushBuffer dominating in both p0 and p1), but at the expense of much harder concurrency problems. Whether the decode pipelining ceiling is high enough, or whether the redo parallelization complexity is tractable, seems like the central design question for this area.

[1] https://www.pgcon.org/2023/schedule/session/392-parallel-recovery-in-postgresql/
[2] https://wiki.postgresql.org/wiki/Parallel_Recovery

Best regards,
Henson
