On Thu, Feb 29, 2024 at 12:34 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> On Monday, February 26, 2024 7:52 PM Amit Kapila <[email protected]>
> wrote:
> >
> > On Mon, Feb 26, 2024 at 7:49 AM Zhijie Hou (Fujitsu) <
> > [email protected]> wrote:
> > >
> > > Attach the V98 patch set which addressed above comments.
> > >
> >
> > Few comments:
> > =============
> > 1.
> >  WalSndWaitForWal(XLogRecPtr loc)
> >  {
> >      int         wakeEvents;
> > +    bool        wait_for_standby = false;
> > +    uint32      wait_event;
> > +    List       *standby_slots = NIL;
> >      static XLogRecPtr RecentFlushPtr = InvalidXLogRecPtr;
> >
> > +    if (MyReplicationSlot->data.failover && replication_active)
> > +        standby_slots = GetStandbySlotList(true);
> > +
> >      /*
> > -     * Fast path to avoid acquiring the spinlock in case we already know we
> > -     * have enough WAL available. This is particularly interesting if we're
> > -     * far behind.
> > +     * Check if all the standby servers have confirmed receipt of WAL up to
> > +     * RecentFlushPtr even when we already know we have enough WAL available.
> > +     *
> > +     * Note that we cannot directly return without checking the status of
> > +     * standby servers because the standby_slot_names may have changed, which
> > +     * means there could be new standby slots in the list that have not yet
> > +     * caught up to the RecentFlushPtr.
> >      */
> > -    if (RecentFlushPtr != InvalidXLogRecPtr &&
> > -        loc <= RecentFlushPtr)
> > -        return RecentFlushPtr;
> > +    if (!XLogRecPtrIsInvalid(RecentFlushPtr) && loc <= RecentFlushPtr)
> >      {
> > +        FilterStandbySlots(RecentFlushPtr, &standby_slots);
> >
> > I think even if the slot list is not changed, we will always process each slot
> > mentioned in standby_slot_names once. Can't we cache the previous list of
> > slots for we have already waited for? In that case, we won't even need to copy
> > the list via GetStandbySlotList() unless we need to wait.
> >
> > 2.
> >  WalSndWaitForWal(XLogRecPtr loc)
> >  {
> > +    /*
> > +     * Update the standby slots that have not yet caught up to the flushed
> > +     * position. It is good to wait up to RecentFlushPtr and then let it
> > +     * send the changes to logical subscribers one by one which are
> > +     * already covered in RecentFlushPtr without needing to wait on every
> > +     * change for standby confirmation.
> > +     */
> > +    if (wait_for_standby)
> > +        FilterStandbySlots(RecentFlushPtr, &standby_slots);
> > +
> >      /* Update our idea of the currently flushed position. */
> > -    if (!RecoveryInProgress())
> > +    else if (!RecoveryInProgress())
> >          RecentFlushPtr = GetFlushRecPtr(NULL);
> >      else
> >          RecentFlushPtr = GetXLogReplayRecPtr(NULL);
> > ...
> >      /*
> >       * If postmaster asked us to stop, don't wait anymore.
> >       *
> >       * It's important to do this check after the recomputation of
> >       * RecentFlushPtr, so we can send all remaining data before shutting
> >       * down.
> >       */
> >      if (got_STOPPING)
> >          break;
> >
> > I think because 'wait_for_standby' may not be set in the first or consecutive
> > cycles we may send the WAL to the logical subscriber before sending it to the
> > physical subscriber during shutdown.
>
> Here is the v101 patch set which addressed above comments.
>
> This version will cache the oldest standby slot's LSN each time we waited for
> them to catch up. The cached LSN is invalidated when we reload the GUC config.
> In the WalSndWaitForWal function, instead of traversing the entire standby
> list each time, we can check the cached LSN to quickly determine if the
> standbys have caught up. When a shutdown signal is received, we continue to
> wait for the standby slots to catch up. When waiting for the standbys to catch
> up after receiving the shutdown signal, an ERROR is reported if any slots are
> dropped, invalidated, or inactive. This measure is taken to prevent the
> walsender from waiting indefinitely.

Thanks for the patch.
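Just to restate my understanding of the cached-LSN fast path in code form, here
is a small standalone sketch. This is a toy model only, not the patch code: the
type, variable, and function names below are made up for illustration, and the
shutdown-time ERROR handling for dropped/invalidated/inactive slots is left out.

/*
 * Standalone toy model of the cached-LSN fast path described above.
 * Not the patch code; in the real patch this logic lives around
 * WalSndWaitForWal() and works on ReplicationSlot/XLogRecPtr.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t LSN;            /* stand-in for XLogRecPtr */
#define InvalidLSN ((LSN) 0)

/* confirmed-flush LSNs of the slots named in standby_slot_names (made up) */
static LSN standby_confirmed[] = {1200, 1500, 1350};
#define NUM_STANDBYS 3

/* cached oldest confirmed LSN among those slots; reset on GUC reload */
static LSN cached_oldest_standby_lsn = InvalidLSN;

static void
config_reload(void)
{
    /* standby_slot_names may have changed, so the cache is no longer valid */
    cached_oldest_standby_lsn = InvalidLSN;
}

static bool
standbys_have_caught_up(LSN loc)
{
    LSN     oldest;
    int     i;

    /* Fast path: a previously computed oldest LSN already covers 'loc'. */
    if (cached_oldest_standby_lsn != InvalidLSN &&
        loc <= cached_oldest_standby_lsn)
        return true;

    /* Slow path: walk the slot list and refresh the cache. */
    oldest = standby_confirmed[0];
    for (i = 1; i < NUM_STANDBYS; i++)
        if (standby_confirmed[i] < oldest)
            oldest = standby_confirmed[i];

    cached_oldest_standby_lsn = oldest;
    return loc <= oldest;
}

int
main(void)
{
    printf("caught up to 1300? %d\n", standbys_have_caught_up(1300)); /* slow path, result cached */
    printf("caught up to 1100? %d\n", standbys_have_caught_up(1100)); /* answered from the cache */
    config_reload();
    printf("caught up to 1100 after reload? %d\n", standbys_have_caught_up(1100)); /* recomputed */
    return 0;
}

The point of the cache, as I read it, is that the common case (loc already
covered by what every listed standby has confirmed) is answered without walking
the slot list; the list is only scanned again when the cache is insufficient or
has been invalidated by a config reload.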
I did some performance test runs on PATCH v101 with synchronous_commit turned
on, to check how much the logical replication changes affect transaction speed
on the primary compared to the HEAD code. In all configurations there is a
primary, a logical subscriber and a physical standby, and the logical
subscriber is listed in synchronous_standby_names. This means that transactions
on the primary are not committed until the logical subscriber has confirmed
receipt of them.

Machine details:
Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz, 800GB RAM

My additional configuration on each instance is:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

All tests are done using pgbench running for 15 minutes:
Creating tables: pgbench -p 6972 postgres -qis 2
Running benchmark: pgbench postgres -p 6972 -c 10 -j 3 -T 900 -P 5

HEAD code - primary with synchronous_commit=on, physical standby with
hot_standby_feedback=off:

RUN1 (TPS)    RUN2 (TPS)    AVERAGE (TPS)
8.226658      8.17815       8.202404

HEAD code - primary with synchronous_commit=on, physical standby with
hot_standby_feedback=on:

RUN1 (TPS)    RUN2 (TPS)    AVERAGE (TPS)
8.134901      8.229066      8.1819835     -- degradation from first config: -0.25%

PATCHED code (v101-0001) - primary with synchronous_commit=on, physical standby
with hot_standby_feedback=on, standby_slot_names not configured, logical
subscriber not failover enabled, physical standby not configured for sync:

RUN1 (TPS)    RUN2 (TPS)    AVERAGE (TPS)
8.18839       8.18839       8.18839       -- degradation from first config: -0.17%

PATCHED code (v98-0001) - synchronous_commit=on, hot_standby_feedback=on,
standby_slot_names configured to the physical standby, logical subscriber
failover enabled, physical standby configured for sync:

RUN1 (TPS)    RUN2 (TPS)    AVERAGE (TPS)
8.173062      8.068536      8.120799      -- degradation from first config: -0.99%

Overall, I do not see any significant performance degradation with the patch
and sync-slot enabled, with one logical subscriber and one physical standby.

Attaching the script for my final test configuration for reference.
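For readers who don't open the attachment, the settings that distinguish the
last (failover-enabled) scenario look roughly like the following. This is only
an illustration: 'sub1' and 'physical_slot' are placeholder names, not
necessarily the ones used in my script, and I have left out the patch's
slot-sync setting on the standby since its name is specific to the patch
version.

On the primary:
    synchronous_commit = on
    synchronous_standby_names = 'sub1'       # application_name of the logical subscriber
    standby_slot_names = 'physical_slot'     # slot used by the physical standby

On the physical standby:
    hot_standby_feedback = on
    primary_slot_name = 'physical_slot'

and the subscription on the logical subscriber is created with the failover
option enabled (failover = true).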
regards,
Ajin Cherian
Fujitsu Australia

<<attachment: patch_run_sync.zip>>
