Hi Jakub, thanks for your valuable feedback.

> - on the flamegraphs BufferAlloc->GetVictimBuffer->FlushBuffer() is visible
> often in both scenarios p0/p1 . I recall from my old WAL IO prefetching stress
> tests experiments [1] that tuning bgwriter was playing some important role
> (most of the time bgwriter values are left on default). It's connected to
> the pipelining idea too for sure, and somehow we are not effective on it
> based on your and my results and some other field reports. Basically we
> have wrong defaults for bgwriter on standbys, and no-one is going to notice
> unless you measure it in-depth. We could also tweak the bgwriter in standby
> role to make the pipelining more effective that way too, but maybe I am 
> missing
> something? Maybe bgwriter on standby should work full-steam instead of being
> rate limited as long as there is work to be done (pressure) (?) It's not a
> critique of the patch, but maybe you guys have investigated that road too, as it is
> related to the pipelining concept?
>

Ok thanks, I will look into that.
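As a first experiment on the standby I might try something like the sketch below. The values are illustrative guesses, not tested recommendations; these are the stock bgwriter GUCs:

```ini
# Hypothetical standby tuning: let the bgwriter run closer to "full steam"
# while there is pressure. Values are placeholders, not recommendations.
bgwriter_delay = 10ms            # default 200ms: wake up more often
bgwriter_lru_maxpages = 1000     # default 100: allow bigger bursts per round
bgwriter_lru_multiplier = 10.0   # default 2.0: clean further ahead of demand
```

If that measurably reduces time spent in FlushBuffer() during recovery, it would support the idea that the defaults are wrong for the standby role.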

> - there's plenty of ValidXLogRecord->pg_comp_crc32c_sb8(), probably even too
> much, so I think either old CPU was used or something happened that SSE4.2
> was not available (pg_comp_crc32c_sse42()). I'm speculating, but probably
> you would get even better results from the patch (which are impressive anyway)
> by using pg_comp_crc32c_sse42() because it wouldn't be such a
> bottleneck anymore.
>

As mentioned in the benchmarks PDF, the CPU used was `Common KVM
(4) @ 2.593GHz`. I was using a virtual machine, which could be the
reason for this bottleneck. I will do some more research on it.

> - The flamegraphs themselves are for the whole PostgreSQL instance, right, or am I
> misunderstanding it? Probably in the long run it would be better to have just
> the PID of startup/recovering (but that's way harder to script for sure,
> due to the need for isolated PID: perf record -p <PID>)
>
Yes, the flame graphs were for the whole PostgreSQL instance. I will try to
improve the script so it profiles only the recovery (startup) process.

> - You need to rebase due to 1eb09ed63a8 from a couple of days ago:
>   CONFLICT (content): Merge conflict in src/backend/postmaster/bgworker.c
>

Noted, thanks.

> - naming: I don't have anything against wal_pipeline.c, it's just that
>   the suite of C files is named src/backend/access/transam/xlog*.c
>   (so perhaps xlogrecoverypipeline.c (?) to stay consistent)
>

I agree, I will change that in the next patch. Maybe `xlogpipeline.c`
would be enough.

> - I was getting lots of problems during linking (e.g. undefined reference to
>   `WalPipeline_IsActive') and it appears that the patch is missing
> meson support:
>
>     --- a/src/backend/access/transam/meson.build
>     +++ b/src/backend/access/transam/meson.build
>     @@ -25,6 +25,7 @@ backend_sources += files(
>     'xlogstats.c',
>     'xlogutils.c',
>     'xlogwait.c',
>     +'wal_pipeline.c'
>

Noted, thanks. This was an initial effort and I still haven't tested
it in a Windows environment.

>
> - nitpicking, but you need to add wal_pipeline to postgresql.conf.sample
>   otherwise 003_check_guc.pl test fails.
>

Got it.
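For the next revision, something along these lines (the hunk placement and comment wording are my guesses):

```diff
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@
+#wal_pipeline = off			# enable pipelined WAL replay during recovery
```

That should keep 003_check_guc.pl happy.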

> - good news is that test with PG_TEST_EXTRA="wal_consistency_checking"  passes
>
> - but bad news is that PG_TEST_EXTRA="wal_consistency_checking" with
>   PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on", which activates all of
> this, does
>   not pass multiple tests (possibly I'll look into those in the following days,
> unless you are faster)
>
> - at least here, fresh after pg_basebackup -R for building a standby and
> restarting with wal_pipeline=on,
>   I've got a bug:
>     2026-02-04 10:29:40.146 CET [335197] LOG:  [walpipeline] started.
>     2026-02-04 10:29:40.146 CET [335197] LOG:  redo starts at 0/02000028
>     2026-02-04 10:29:40.147 CET [335198] LOG:  invalid record length
> at 0/03000060: expected at least 24, got 0
>     2026-02-04 10:29:40.147 CET [335197] LOG:  consistent recovery
> state reached at 0/03000060
>     2026-02-04 10:29:40.147 CET [335197] LOG:  [walpipeline] consumer:
> received shutdown message from the producer
>     2026-02-04 10:29:40.147 CET [335191] LOG:  database system is
> ready to accept read-only connections
>     2026-02-04 10:29:40.157 CET [335198] LOG:  [walpipeline] producer:
> exiting: sent=5 received=5
>     2026-02-04 10:29:40.167 CET [335197] LOG:  WAL pipeline stopped
>     2026-02-04 10:29:40.167 CET [335197] LOG:  redo done at 0/03000028
> system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
>     2026-02-04 10:29:40.169 CET [335197] LOG:  selected new timeline ID: 2
>     2026-02-04 10:29:40.184 CET [335197] LOG:  archive recovery complete
>
> - even when starting from scratch with wal_pipeline = on it's similar:
>    2026-02-04 10:34:14.386 CET [335833] LOG:  database system was
> interrupted; last known up at 2026-02-04 10:34:02 CET
>    2026-02-04 10:34:14.453 CET [335833] LOG:  starting backup recovery
> with redo LSN 0/0B0000E0, checkpoint LSN 0/0B000138, on timeline ID 1
>    2026-02-04 10:34:14.453 CET [335833] LOG:  entering standby mode
>    2026-02-04 10:34:14.482 CET [335833] LOG:  [walpipeline] started.
>    2026-02-04 10:34:14.482 CET [335833] LOG:  redo starts at 0/0B0000E0
>    2026-02-04 10:34:14.484 CET [335833] LOG:  completed backup
> recovery with redo LSN 0/0B0000E0 and end LSN 0/0B0001E0
>    2026-02-04 10:34:14.484 CET [335833] LOG:  consistent recovery
> state reached at 0/0B0001E0
>    2026-02-04 10:34:14.484 CET [335833] LOG:  [walpipeline] consumer:
> received shutdown message from the producer
>    2026-02-04 10:34:14.484 CET [335827] LOG:  database system is ready
> to accept read-only connections
>    2026-02-04 10:34:14.493 CET [335834] LOG:  [walpipeline] producer:
> exiting: sent=4 received=4
>    2026-02-04 10:34:14.504 CET [335833] LOG:  WAL pipeline stopped
>    2026-02-04 10:34:14.504 CET [335833] LOG:  redo done at 0/0B0001E0
> system usage: CPU: user: 0.00 s, system: 0.03 s, elapsed: 0.04 s
>    2026-02-04 10:34:14.506 CET [335833] LOG:  selected new timeline ID: 2
>
>    it appears that "Cannot read more records, shut it down" ->
>   WalPipeline_SendShutdown() path is taken but I haven't pursued further.
>

Yes, my initial testing was done only with archive recovery. I am
still trying to add support for streaming replication.
Also, many tests from `make -C src/test/recovery/ check
PG_TEST_INITDB_EXTRA_OPTS="-c wal_pipeline=on"` are failing and need
to be fixed.

> Also some very quick review comments:
>
> +bool
> +WalPipeline_SendRecord(XLogReaderState *record)
> +{
> [..]
> +
> +    if (msglen > WAL_PIPELINE_MAX_MSG_SIZE)
> +    {
> +        elog(WARNING, "[walpipeline] producer: wal record at %X/%X
> too large (%zu bytes), skipping",
> +             LSN_FORMAT_ARGS(record->ReadRecPtr), msglen);
> +        pfree(buffer);
> +        return true;
> +    }
>
> When/why could this happen, and if it did happen, shouldn't this be
> a PANIC instead?
>

Noted, I will fix that.

> +/* Size of the shared memory queue (can be made configurable) */
> +#define WAL_PIPELINE_QUEUE_SIZE  (128 * 1024 * 1024)  /* 8 MB */
> +
> +/* Maximum size of a single message */
> +#define WAL_PIPELINE_MAX_MSG_SIZE  (2 * 1024 * 1024)  /* 1 MB */
>
> Maybe those should take into account (XLOG_)BLCKSZ too?
>

Yes, that's a good idea, I will do that.


Thanks again for the detailed review. I will try to address most of
these issues in the next patch.

Regards,
Imran Zaheer