Hi, Alexey! Thank you for spotting this problem, and thank you for working on it.
On Sun, Aug 31, 2025 at 2:47 AM Alexey Makhmutov <[email protected]> wrote:
> This is a continuation of the thread
> https://www.postgresql.org/message-id/flat/076eb7bd-52e6-4a51-ba00-c744d027b15c%40postgrespro.ru ,
> with focus only on the patch related to improving performance in case of
> large number of cascaded walsenders.
>
> We’ve faced an interesting situation on a standby environment with
> configured cascade replication and large number (~100) of configured
> walsenders. We’ve noticed a very high CPU consumption on such
> environment with the most time-consuming operation being signal delivery
> from startup recovery process to walsenders via WalSndWakeup invocations
> from ApplyWalRecord in xlogrecovery.c.
>
> The startup standby process notifies walsenders for downstream systems
> using ConditionVariableBroadcast (CV), so only processes waiting on this
> CV need to be contacted. However in case of high load we seems to be
> hitting here a bottleneck anyway. The current implementation tries to
> send notification after processing of each WAL record (i.e. during each
> invocation of ApplyWalRecord), so this implies high rate of WalSndWakeup
> invocations. At the same time, this also provides each walsender with
> very small chunk of data to process, so almost every process will be
> present in the CV wait list for the next iteration. As result, waiting
> list should be always fully packed in such case, which additionally
> reduces performance of WAL records processing by the standby instance.
>
> To reproduce such behavior we could use a simple environment with three
> servers: primary instance, attached physical standby and its downstream
> server with large number of logical replication subscriptions. Attached
> is the synthetic test case (test_scenario.zip) to reproduce this
> behavior: script ‘test_prepare.sh’ could be used to create required
> environment with test data and ‘test_execute.sh’ script executes
> ‘pgbench’ tool with simple updates against primary instance to trigger
> replication to other servers. With just about 6 clients I could observe
> high CPU consumption by the 'startup recovering process' (and it may be
> sufficient to completely saturate the CPU on a smaller machine). Please
> check the environment properties at the top of these scripts before
> running them, as they need to be updated in order to specify location
> for installed PG build, target location for database instances creation
> and used ports.
>
> After thinking about possible ways to improve such case, we've decided
> to implement batching for notification delivery. We try to slightly
> postpone sending notification until recovery has applied some number of
> messages. This reduces rate of CV notifications and also gives receivers
> more data to process, so they may not need to enter the CV wait state so
> often. Counting applied records is not difficult, but the tricky part
> here is to ensure that we do not postpone notifications for too long in
> case of low load. To reduce such delay we use a timer handler, which
> sets a timeout flag, which is checked in ProcessStartupProcInterrupts.
> This allow us to send signal on timeout if the startup process is
> waiting for the arrival of new WAL records (in ReadRecord). The
> WalSndWakeup will be invoked either after applying certain number of
> messages or after expiration of timeout since last notification. The
> notification however may be delayed while record is being applied
> (during redo handler invocation from ApplyWalRecord). This could
> increase delay for some corner cases with non-trivial WAL records like
> ‘drop database’, but this should be a rare case and walsender process
> have its own limit on the wait time, so the delay won’t be indefinite
> even in this case.
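For my own understanding, here is a rough sketch of the flow you describe.
This is only a sketch under my assumptions: the GUC variable declarations,
the pending flag, the helper names and the use of USER_TIMEOUT are my
guesses, not necessarily what the patch actually does.

#include "postgres.h"

#include <signal.h>

#include "replication/walsender.h"	/* WalSndWakeup() */
#include "utils/timeout.h"			/* RegisterTimeout(), enable_timeout_after() */

/* GUCs from the patch (names from your mail, declarations assumed here) */
int			cascade_replication_batch_size = 0;		/* 0 disables batching */
int			cascade_replication_batch_delay = 500;	/* milliseconds */

/* set by the timer handler, consumed by the startup process */
static volatile sig_atomic_t cascade_wakeup_timeout_pending = false;
static int	records_since_wakeup = 0;
static TimeoutId cascade_wakeup_timeout_id;

/* timer handler: just record that the delay has expired */
static void
CascadeWakeupTimeoutHandler(void)
{
	cascade_wakeup_timeout_pending = true;
}

/* one-time setup, presumably somewhere in StartupProcessMain() */
static void
CascadeWakeupTimeoutInit(void)
{
	cascade_wakeup_timeout_id =
		RegisterTimeout(USER_TIMEOUT, CascadeWakeupTimeoutHandler);
}

/* called from ApplyWalRecord() after each applied record */
static void
MaybeNotifyCascadedWalSenders(bool switchedTLI)
{
	records_since_wakeup++;

	/*
	 * Notify immediately on a timeline switch or when batching is disabled;
	 * otherwise wait until the configured number of records has been applied.
	 */
	if (cascade_replication_batch_size == 0 || switchedTLI ||
		records_since_wakeup >= cascade_replication_batch_size)
	{
		WalSndWakeup(switchedTLI, true);
		records_since_wakeup = 0;
		cascade_wakeup_timeout_pending = false;

		/* when batching, bound a quiet period by re-arming the timer */
		if (cascade_replication_batch_size > 0)
			enable_timeout_after(cascade_wakeup_timeout_id,
								 cascade_replication_batch_delay);
	}
}

/* checked from ProcessStartupProcInterrupts() while waiting for new WAL */
static void
HandleCascadeWakeupTimeout(void)
{
	if (cascade_wakeup_timeout_pending && records_since_wakeup > 0)
	{
		WalSndWakeup(false, true);
		records_since_wakeup = 0;
		cascade_wakeup_timeout_pending = false;
	}
}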
This approach makes sense to me. Do you think it might have corner cases?

I suggest the test scenario might include some delay between "UPDATE"
queries. Then we can see how changing this delay interacts with
cascade_replication_batch_delay.

/*
 * If time line has switched, then we do not want to delay the
 * notification, otherwise we will wait until we apply specified
 * number of records before notifying downstream logical
 * walsenders.
 */

This comment talks about logical walsenders, but the same applies to
physical walsenders, right?

> The patch introduces two GUCs to control the batching behavior. The
> first one controls size of batched messages
> ('cascade_replication_batch_size') and is set to 0 by default, so the
> functionality is effectively disabled. The second one controls timed
> delay during batching ('cascade_replication_batch_delay'), which is by
> default set to 500ms. The delay is used only if batching is enabled.

I see these two GUCs are both PGC_POSTMASTER. Could they be PGC_SIGHUP?
Also, I think there is a typo in the description of
cascade_replication_batch_size: it must say "0 disables". I also think
these GUCs should be in the sample file, possibly disabled by default,
because it only makes sense to set them up with a high number of cascaded
walsenders.
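To make the suggestion concrete, the guc_tables.c entries could look
roughly like this (just a sketch; the group, wording and limits here are
my own picks, not taken from the patch):

	{
		{"cascade_replication_batch_size", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Number of WAL records to apply before notifying cascading walsenders."),
			gettext_noop("0 disables batching of notifications.")
		},
		&cascade_replication_batch_size,
		0, 0, INT_MAX,
		NULL, NULL, NULL
	},
	{
		{"cascade_replication_batch_delay", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Maximum time to delay notification of cascading walsenders when batching is enabled."),
			NULL,
			GUC_UNIT_MS
		},
		&cascade_replication_batch_delay,
		500, 1, INT_MAX,
		NULL, NULL, NULL
	},

With PGC_SIGHUP the values could then be tuned on a running standby with a
simple reload.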
> With this patch applied we’ve noticed a significant reduction in CPU
> consumption while using the synthetic test program mentioned above. It
> would be great to hear any thoughts on these observations and fixing
> approaches, as well as possible pitfalls of proposed changes.

Great!

------
Regards,
Alexander Korotkov
Supabase