Hi, Alexey! Thank you for spotting this problem, and thank you for working on it.
On Sun, Aug 31, 2025 at 2:47 AM Alexey Makhmutov <[email protected]> wrote:
> This is a continuation of the thread
> https://www.postgresql.org/message-id/flat/076eb7bd-52e6-4a51-ba00-c744d027b15c%40postgrespro.ru ,
> with focus only on the patch related to improving performance in case of
> large number of cascaded walsenders.
>
> We’ve faced an interesting situation on a standby environment with
> configured cascade replication and large number (~100) of configured
> walsenders. We’ve noticed a very high CPU consumption on such
> environment with the most time-consuming operation being signal delivery
> from startup recovery process to walsenders via WalSndWakeup invocations
> from ApplyWalRecord in xlogrecovery.c.
>
> The startup standby process notifies walsenders for downstream systems
> using ConditionVariableBroadcast (CV), so only processes waiting on this
> CV need to be contacted. However in case of high load we seems to be
> hitting here a bottleneck anyway. The current implementation tries to
> send notification after processing of each WAL record (i.e. during each
> invocation of ApplyWalRecord), so this implies high rate of WalSndWakeup
> invocations. At the same time, this also provides each walsender with
> very small chunk of data to process, so almost every process will be
> present in the CV wait list for the next iteration. As result, waiting
> list should be always fully packed in such case, which additionally
> reduces performance of WAL records processing by the standby instance.
>
> To reproduce such behavior we could use a simple environment with three
> servers: primary instance, attached physical standby and its downstream
> server with large number of logical replication subscriptions. Attached
> is the synthetic test case (test_scenario.zip) to reproduce this
> behavior: script ‘test_prepare.sh’ could be used to create required
> environment with test data and ‘test_execute.sh’ script executes
> ‘pgbench’ tool with simple updates against primary instance to trigger
> replication to other servers. With just about 6 clients I could observe
> high CPU consumption by the 'startup recovering process' (and it may be
> sufficient to completely saturate the CPU on a smaller machine). Please
> check the environment properties at the top of these scripts before
> running them, as they need to be updated in order to specify location
> for installed PG build, target location for database instances creation
> and used ports.
>
> After thinking about possible ways to improve such case, we've decided
> to implement batching for notification delivery. We try to slightly
> postpone sending notification until recovery has applied some number of
> messages. This reduces rate of CV notifications and also gives receivers
> more data to process, so they may not need to enter the CV wait state so
> often. Counting applied records is not difficult, but the tricky part
> here is to ensure that we do not postpone notifications for too long in
> case of low load. To reduce such delay we use a timer handler, which
> sets a timeout flag, which is checked in ProcessStartupProcInterrupts.
> This allow us to send signal on timeout if the startup process is
> waiting for the arrival of new WAL records (in ReadRecord). The
> WalSndWakeup will be invoked either after applying certain number of
> messages or after expiration of timeout since last notification. The
> notification however may be delayed while record is being applied
> (during redo handler invocation from ApplyWalRecord). This could
> increase delay for some corner cases with non-trivial WAL records like
> ‘drop database’, but this should be a rare case and walsender process
> have its own limit on the wait time, so the delay won’t be indefinite
> even in this case.
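For my own understanding, here is a rough sketch of the flow you describe.
This is only a sketch under my assumptions: the GUC variable declarations,
the pending flag, the helper names and the use of USER_TIMEOUT are my
guesses, not necessarily what the patch actually does.

#include "postgres.h"

#include <signal.h>

#include "replication/walsender.h"	/* WalSndWakeup() */
#include "utils/timeout.h"			/* RegisterTimeout(), enable_timeout_after() */

/* GUCs from the patch (names from your mail, declarations assumed here) */
int			cascade_replication_batch_size = 0;		/* 0 disables batching */
int			cascade_replication_batch_delay = 500;	/* milliseconds */

/* set by the timer handler, consumed by the startup process */
static volatile sig_atomic_t cascade_wakeup_timeout_pending = false;
static int	records_since_wakeup = 0;
static TimeoutId cascade_wakeup_timeout_id;

/* timer handler: just record that the delay has expired */
static void
CascadeWakeupTimeoutHandler(void)
{
	cascade_wakeup_timeout_pending = true;
}

/* one-time setup, presumably somewhere in StartupProcessMain() */
static void
CascadeWakeupTimeoutInit(void)
{
	cascade_wakeup_timeout_id =
		RegisterTimeout(USER_TIMEOUT, CascadeWakeupTimeoutHandler);
}

/* called from ApplyWalRecord() after each applied record */
static void
MaybeNotifyCascadedWalSenders(bool switchedTLI)
{
	records_since_wakeup++;

	/*
	 * Notify immediately on a timeline switch or when batching is disabled;
	 * otherwise wait until the configured number of records has been applied.
	 */
	if (cascade_replication_batch_size == 0 || switchedTLI ||
		records_since_wakeup >= cascade_replication_batch_size)
	{
		WalSndWakeup(switchedTLI, true);
		records_since_wakeup = 0;
		cascade_wakeup_timeout_pending = false;

		/* when batching, bound a quiet period by re-arming the timer */
		if (cascade_replication_batch_size > 0)
			enable_timeout_after(cascade_wakeup_timeout_id,
								 cascade_replication_batch_delay);
	}
}

/* checked from ProcessStartupProcInterrupts() while waiting for new WAL */
static void
HandleCascadeWakeupTimeout(void)
{
	if (cascade_wakeup_timeout_pending && records_since_wakeup > 0)
	{
		WalSndWakeup(false, true);
		records_since_wakeup = 0;
		cascade_wakeup_timeout_pending = false;
	}
}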
This approach makes sense to me. Do you think it might have corner cases?

I suggest the test scenario might include some delay between "UPDATE"
queries. Then we can see how changing this delay interacts with
cascade_replication_batch_delay.

/*
 * If time line has switched, then we do not want to delay the
 * notification, otherwise we will wait until we apply specified
 * number of records before notifying downstream logical
 * walsenders.
 */

This comment talks about logical walsenders, but the same applies to
physical walsenders, right?

> The patch introduces two GUCs to control the batching behavior. The
> first one controls size of batched messages
> ('cascade_replication_batch_size') and is set to 0 by default, so the
> functionality is effectively disabled. The second one controls timed
> delay during batching ('cascade_replication_batch_delay'), which is by
> default set to 500ms. The delay is used only if batching is enabled.

I see these two GUCs are both PGC_POSTMASTER. Could they be PGC_SIGHUP?
Also, I think there is a typo in the description of
cascade_replication_batch_size: it must say "0 disables". I also think
these GUCs should be in the sample file, possibly disabled by default,
because it only makes sense to set them up with a high number of cascaded
walsenders.
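To make the suggestion concrete, the guc_tables.c entries could look
roughly like this (just a sketch; the group, wording and limits here are
my own picks, not taken from the patch):

	{
		{"cascade_replication_batch_size", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Number of WAL records to apply before notifying cascading walsenders."),
			gettext_noop("0 disables batching of notifications.")
		},
		&cascade_replication_batch_size,
		0, 0, INT_MAX,
		NULL, NULL, NULL
	},
	{
		{"cascade_replication_batch_delay", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Maximum time to delay notification of cascading walsenders when batching is enabled."),
			NULL,
			GUC_UNIT_MS
		},
		&cascade_replication_batch_delay,
		500, 1, INT_MAX,
		NULL, NULL, NULL
	},

With PGC_SIGHUP the values could then be tuned on a running standby with a
simple reload.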
> With this patch applied we’ve noticed a significant reduction in CPU
> consumption while using the synthetic test program mentioned above. It
> would be great to hear any thoughts on these observations and fixing
> approaches, as well as possible pitfalls of proposed changes.

Great!

------
Regards,
Alexander Korotkov
Supabase