On Wednesday, May 10, 2023 2:36 PM Bharath Rupireddy <bharath.rupireddyforpostg...@gmail.com> wrote: > > On Wed, May 10, 2023 at 12:33 AM Andres Freund <and...@anarazel.de> > wrote: > > > > Unfortunately I have found the following commit to have caused a > performance > > regression: > > > > commit e101dfac3a53c20bfbf1ca85d30a368c2954facf > > > > The problem is that, on a standby, after the change - as needed to for the > > approach to work - the call to WalSndWakeup() in ApplyWalRecord() > happens for > > every record, instead of only happening when the timeline is changed (or > WAL > > is flushed or ...). > > > > WalSndWakeup() iterates over all walsender slots, regardless of whether in > > use. For each of the walsender slots it acquires a spinlock. > > > > When replaying a lot of small-ish WAL records I found the startup process to > > spend the majority of the time in WalSndWakeup(). I've not measured it very > > precisely yet, but the overhead is significant (~35% slowdown), even with > > the > > default max_wal_senders. If that's increased substantially, it obviously > > gets > > worse. > > I played it with a simple primary -> standby1 -> standby2 setup. I ran > a pgbench script [1] on primary and counted the number of times > WalSndWakeup() gets called from ApplyWalRecord() and the number of > times spinlock is acquired/released in WalSndWakeup(). It's a whopping > 21 million times spinlock is acquired/released on the standby 1 and > standby 2 for just a < 5min of pgbench run on the primary: > > standby 1: > 2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO WalSndWakeup() in > ApplyWalRecord() was called 2176352 times > 2023-05-10 05:32:43.249 UTC [1595600] LOG: FOO spinlock > acquisition/release count in WalSndWakeup() is 21763530 > > standby 2: > 2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO WalSndWakeup() in > ApplyWalRecord() was called 2176352 times > 2023-05-10 05:32:43.249 UTC [1595625] LOG: FOO spinlock > acquisition/release count in WalSndWakeup() is 21763530 > > In this case, there is no timeline switch or no logical decoding on > the standby or such, but WalSndWakeup() gets called because the > standby can't make out if the slot is for logical or physical > replication unless spinlock is acquired. Before e101dfac3a, > WalSndWakeup() was getting called only when there was a timeline > switch. > > > The only saving grace is that this is not an issue on the primary. > > Yeah. > > > I don't think the approach of not having any sort of "registry" of whether > > anybody is waiting for the replay position to be updated is > > feasible. Iterating over all walsenders slots is just too expensive - > > WalSndWakeup() shows up even if I remove the spinlock (which we likely > could, > > especially when just checking if the the walsender is connected). > > Right. > > > My current guess is that mis-using a condition variable is the best bet. I > > think it should work to use ConditionVariablePrepareToSleep() before a > > WalSndWait(), and then ConditionVariableCancelSleep(). I.e. to never use > > ConditionVariableSleep(). The latch set from ConditionVariableBroadcast() > > would still cause the necessary wakeup. > > How about something like the attached? Recovery and subscription tests > don't complain with the patch.
Thanks for the patch. I noticed one place where the logic is different from before and just to confirm: if (AllowCascadeReplication()) - WalSndWakeup(switchedTLI, true); + ConditionVariableBroadcast(&WalSndCtl->cv); After the change, we wakeup physical walsender regardless of switchedTLI flag. Is this intentional ? if so, I think It would be better to update the comments above this. Best Regards, Hou zj