Hi Imran,

On Tue, Jun 23, 2026 at 9:27 PM Imran Zaheer <[email protected]> wrote:
>
> Hi
>
> I am attaching the new series of patches.
>
> What has changed?
>
> * Rebased
>
> * The patch set is now split into two new patches. This will make the
> code easier to understand and review.
>
> * The v4-0003 patch contains code mostly related to keeping the
> recovery states synced between the startup process and the pipeline
> process. Most of these changes were required to make the streaming
> replication work.
>
> * The v4-0002 patch now only contains the consumer code that handles
> receiving the decoded records from the shmem queue and moving the redo
> loop forward.
>
> * The v4-0004 contains some basic tests to see if the pipeline worker
> is functioning as expected. More testing was done by passing
> PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on" before running the
> recovery test suite.

+1 for splitting the patch set into smaller components to make the review process smoother.

> * Other than that, the cpu overhead during deserialization is
> optimized by skipping multiple copies of the decoded record and
> directly passing the pointer to the shmem queue. There is still some
> overhead visible during serialization that could be improved at the
> producer end.
>
> * Signal handling for the pipeline worker is improved so that
> promotion signals are sent to both the startup process and the
> producer worker by the postmaster.
>
>
> You will also find the new benchmarks attached [1] and the pdf report
> overview. A simple cpu profiling on the pipelined startup process
> shows that the cpu overhead during reading records has now been
> removed and offloaded to the producer worker.
>
> Before pipelining:
>
> Around 50% of the cpu time is spent on fetching the wal record. Note that
> in this workload pipeline is off so don't worry about the new func
> ReceiveRecord(), it's just a wrapper around ReadRecord().
>
>   Children      Self  Command   Shared O  Symbol
> -   98.85%     0.21%  postgres  postgres  [.] PerformWalRecovery
>    - 98.64% PerformWalRecovery
>       - 51.00% ReceiveRecord
>          - 50.78% ReadRecord
>             - 50.52% XLogPrefetcherReadRecord
>                - 49.61% XLogPrefetcherNextBlock
>                   + 25.33% XLogReadAhead
>                   + 22.32% PrefetchSharedBuffer
>                   + 0.76% smgropen
>       - 46.68% ApplyWalRecord
>          + 29.23% heap_redo
>          + 9.51% heap2_redo
>          + 4.74% btree_redo
>          + 1.11% xlog_redo
>          + 0.80% xact_redo
>
>
> After Pipelining:
>
> Here the only work needed to be done by the cpu is to get the decoded
> record from
> the queue. Other times (89.13%) cpu is worried about applying the wal record.
>
>   Children      Self  Command   Shared O  Symbol
> -   98.74%     0.37%  postgres  postgres  [.] PerformWalRecovery
>    - 98.37% PerformWalRecovery
>       - 89.13% ApplyWalRecord
>          + 56.89% heap_redo
>          + 18.28% heap2_redo
>          + 8.01% btree_redo
>          + 2.02% xlog_redo
>          + 1.15% xact_redo
>       - 7.80% ReceiveRecord
>          + 7.63% WalPipeline_ReceiveRecord
>
> If the recovery process is not I/O bound then we would be able to test
> this cpu optimization. Doing pgbench on a workload that is fully in
> memory shows around 30% performance gains. You can see more
> benchmarking details in the attached drive link [1]

The perf result looks promising!

> Some comments related to attached pdf and benchmarking, it is showing
> that we can get more performance advantage out of the pipeline when
> most of the workload is running in memory i.e. we have enough shared
> buffers configured.
>
> If you want to do some experiments, please be my guest; I would be
> happy to see more testing. You can share what performance advantage
> you are getting from this. You can also refer to the benchmarking
> script that I have been using [2].
>
>
> Looking forward to your review, comments, etc.

I haven't had a chance for a meaningful review yet, but expect to do so soon.

--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
