On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand
<bertranddrouvot...@gmail.com> wrote:
>
> Hi,
>
> On 10/24/23 7:44 AM, Ajin Cherian wrote:
> > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand
> > <bertranddrouvot...@gmail.com> wrote:
> >>
> >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
> >>           SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn);
> >>       }
> >>
> >> +   /* set failover in the slot, as requested */
> >> +   slot->data.failover = ctx->failover;
> >> +
> >>
> >> I think we can get rid of this change in CreateDecodingContext().
> >>
> > Yes, I too noticed this in my testing, however just removing this from
> > CreateDecodingContext will not allow us to change the slot's failover flag
> > using Alter subscription.
>
> Oh right.
>
> > I am thinking of moving this change to
> > StartLogicalReplication prior to calling CreateDecodingContext by
> > parsing the command options in StartReplicationCmd
> > without adding it to the LogicalDecodingContext.
> >
>
> Yeah, that looks like a good place to update "failover".
>
> Doing more testing and I have a couple of remarks about he current behavior.
>
> 1) Let's imagine that:
>
> - there is no standby
> - standby_slot_names is set to a valid slot on the primary (but due to the 
> above, not linked to any standby)
> - then a create subscription on a subscriber WITH (failover = true) would 
> start the
> synchronisation but never finish (means leaving a "synchronisation" slot like 
> "pg_32811_sync_24576_7293415241672430356"
> in place coming from ReplicationSlotNameForTablesync()).
>
> That's expected, but maybe we should emit a warning in 
> WalSndWaitForStandbyConfirmation() on the primary when there is
> a slot part of standby_slot_names which is not active/does not have an 
> active_pid attached to it?
>

Agreed, Will do that.

> 2) When we create a subscription, another slot is created during the 
> subscription synchronization, namely
> like "pg_16397_sync_16388_7293447291374081805" (coming from 
> ReplicationSlotNameForTablesync()).
>
> This extra slot appears to have failover also set to true.
>
> So, If the standby refresh the list of slot to sync when the subscription is 
> still synchronizing we'd see things like
> on the standby:
>
> LOG:  waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) 
> to pass local slot LSN (0/C0034840) and and catalog xmin (756)
> LOG:  wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog 
> xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756)
> LOG:  waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN 
> (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and 
> and catalog xmin (756)
> WARNING:  slot "pg_16397_sync_16388_7293447291374081805" disappeared from the 
> primary, aborting slot creation
>
> I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have 
> failover set to true. If there is a failover
> during the subscription creation, better to re-launch the subscription 
> instead?
>

'Failover' property of subscription is carried to the tablesync-slot
in recent versions only with the intent that if failover happens
during create-sub during table-sync of large tables, then users should
be able to start from that point onward on the new primary. But yes,
the above scenario is highly probable where-in no activity is
happening on primary and thus the table-sync slot is waiting for its
creation during sync on standby. So I agree, to simplify the stuff we
can skip table-sync slot syncing on standby and document this
behaviour.

thanks
Shveta


Reply via email to