On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
<[email protected]> wrote:
>
> On Fri, Apr 5, 2024 at 10:31 AM shveta malik <[email protected]> wrote:
> >
> > Please find v2. Changes are:
>
> Thanks for the patch. Here are some comments.
Thanks for reviewing.
>
> 1. Can we have a clear saying in the shmem variable who's syncing at
> the moment? Is it a slot sync worker or a backend via SQL function?
> Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
>
> typedef enum SlotSyncSource
> {
> SLOT_SYNC_NONE,
> SLOT_SYNC_WORKER,
> SLOT_SYNC_BACKEND,
> } SlotSyncSource;
>
> Then, the check in ShutDownSlotSync can be:
>
> + /*
> + * Return if neither the slot sync worker is running nor the function
> + * pg_sync_replication_slots() is executing.
> + */
> + if ((SlotSyncCtx->pid == InvalidPid) &&
> SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
> {
>
> 2.
> SyncReplicationSlots(WalReceiverConn *wrconn)
> {
> + /*
> + * Startup process signaled the slot sync to stop, so if meanwhile user
> + * has invoked slot sync SQL function, simply return.
> + */
> + SpinLockAcquire(&SlotSyncCtx->mutex);
> + if (SlotSyncCtx->stopSignaled)
> + {
> + ereport(LOG,
> + errmsg("skipping slot synchronization as slot sync
> shutdown is signaled during promotion"));
> +
>
> Unless I'm missing something, I think this can't detect if the backend
> via SQL function is already half-way through syncing in
> synchronize_one_slot. So, better move this check to (or also have it
> there) slot sync loop that calls synchronize_one_slot. To avoid
> spinlock acquisitions, we can perhaps do this check in when we acquire
> the spinlock for synced flag.
If the sync via SQL function is already half-way, then promotion
should wait for it to finish. I don't think it is a good idea to move
the check to synchronize_one_slot(). The sync-call should either not
start (if it noticed the promotion) or finish the sync and then let
promotion proceed. But I would like to know others' opinion on this.
>
> /* Search for the named slot */
> if ((slot = SearchNamedReplicationSlot(remote_slot->name, true)))
> {
> bool synced;
>
> SpinLockAcquire(&slot->mutex);
> synced = slot->data.synced;
> << get SlotSyncCtx->stopSignaled here >>
> SpinLockRelease(&slot->mutex);
>
> << do the slot sync skip check here if (stopSignaled) >>
>
> 3. Can we have a test or steps at least to check the consequences
> manually one can get if slot syncing via SQL function is happening
> during the promotion?
>
> IIUC, we need to ensure there is no backend acquiring it and
> performing sync while the slot sync worker is shutting down/standby
> promotion is occuring. Otherwise, some of the slots can get resynced
> and some are not while we are shutting down the slot sync worker as
> part of the standby promotion which might leave the slots in an
> inconsistent state.
I do not think that we can reach a state (exception is some error
scenario) where some of the slots are synced while the rest are not
during a *particular* sync-cycle only because promotion is going in
parallel. (And yes we need to fix the race-condition stated by Amit
up-thread for this statement to be true.)
thanks
Shveta