Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:01 AM Zhijie Hou (Fujitsu) wrote: > > On Saturday, November 18, 2023 6:46 PM Amit Kapila > wrote: > > > > On Fri, Nov 17, 2023 at 5:18 PM Drouvot, Bertrand > > wrote: > > > > > > On 11/17/23 2:46 AM, Zhijie Hou (Fujitsu) wrote: > > > > On Tuesday, November 14, 2023 10:27 PM Drouvot, Bertrand > > wrote: > > > > > > > > I feel the WaitForWALToBecomeAvailable may not be the best place to > > > > shutdown slotsync worker and drop slots. There could be other > > > > reasons(other than > > > > promotion) as mentioned in comments in case XLOG_FROM_STREAM to > > > > reach the code there. I thought if the intention is to stop slotsync > > > > workers on promotion, maybe FinishWalRecovery() is a better place to > > > > do it as it's indicating the end of recovery and XLogShutdownWalRcv is > > > > also > > called in it. > > > > > > I can see that slotsync_drop_initiated_slots() has been moved in > > > FinishWalRecovery() in v35. That looks ok. > > > > > > > > > > More Review for v35-0002* > > Thanks for the comments. > > > > > 1. > > + ereport(WARNING, > > + errmsg("skipping slots synchronization as primary_slot_name " > > +"is not set.")); > > > > There is no need to use a full stop at the end for WARNING messages and as > > previously mentioned, let's not split message lines in such cases. There are > > other messages in the patch with similar problems, please fix those as well. > > Adjusted. > > > > > 2. > > +slotsync_checks() > > { > > ... > > ... > > + /* The hot_standby_feedback must be ON for slot-sync to work */ if > > + (!hot_standby_feedback) { ereport(WARNING, errmsg("skipping slots > > + synchronization as hot_standby_feedback " > > +"is off.")); > > > > This message has the same problem as mentioned in the previous comment. > > Additionally, I think either atop slotsync_checks or along with GUC check we > > should write comments as to why we expect these values to be set for slot > > sync > > to work. > > Added comments for these cases. 
> > > > > 3. > > + /* The worker is running already */ > > + if (SlotSyncWorker &&SlotSyncWorker->hdr.in_use > > + && SlotSyncWorker->hdr.proc) > > > > The spacing for both the &&'s has problems. You need a space after the first > > && and the second && should be in the prior line. > > Adjusted. > > > > > 4. > > + LauncherRereadConfig(&recheck_slotsync); > > + > > } > > > > An empty line after LauncherRereadConfig() is not required. > > > > 5. > > +static void > > +LauncherRereadConfig(bool *ss_recheck) > > +{ > > + char *conninfo = pstrdup(PrimaryConnInfo); > > + char *slotname = pstrdup(PrimarySlotName); > > + bool syncslot = enable_syncslot; > > + bool feedback = hot_standby_feedback; > > > > Can we change the variable name 'feedback' to 'standbyfeedback' to make it > > slightly more descriptive? > > Changed. > > > > > 6. The logic to recheck the slot_sync related parameters in > > LauncherMain() is not very clear. IIUC, if after reload config any > > parameter is > > changed, we just seem to be checking the validity of the changed parameter > > but not restarting the slot sync worker, is that correct? If so, what if > > dbname is > > changed, don't we need to restart the slot-sync worker and re-initialize the > > connection; similarly slotname change also needs some thoughts. Also, if > > all the > > parameters are valid we seem to be re-launching the slot-sync worker without > > first stopping it which doesn't seem correct, am I missing something in this > > logic? > > I think the slot sync worker will be stopped in LauncherRereadConfig() if a GUC > changed > and a new slot sync worker will be started in the next loop in LauncherMain(). Yes, LauncherRereadConfig() will stop the worker on any parameter change and will set recheck_slotsync. On finding this flag true, LauncherMain() will redo all the validations and restart the slot-sync worker if needed. Yes, we do need to stop and relaunch slotsync workers on dbname change as well. This is currently missing in LauncherRereadConfig(). 
Regarding slot name change, we are already handling it: we check PrimarySlotName in LauncherRereadConfig(). > > > > 7. > > @@ -524,6 +525,25 @@ CreateDecodingContext(XLogRecPtr start_lsn, > > errmsg("replication slot \"%s\" was not created in this database", > > NameStr(slot->data.name))); > > + in_recovery = RecoveryInProgress(); > > + > > + /* > > + * Do not allow consumption of a "synchronized" slot until the standby > > + * gets promoted. Also do not allow consumption of slots with > > + * sync_state as SYNCSLOT_STATE_INITIATED as they are not synced completely to be > > + * used. > > + */ > > + if ((in_recovery && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) || > > + slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("cannot use replication slot \"%s\" for logical decoding", > > + NameStr(slot->data.name)),
Re: Synchronizing slots from primary to standby
On Tue, Nov 14, 2023 at 7:56 PM Drouvot, Bertrand wrote: > > Hi, > > On 11/13/23 2:57 PM, Zhijie Hou (Fujitsu) wrote: > > On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand > > wrote: > >> Yeah good point, agree to just error out in all the cases then (if we > >> discard the > >> sync_ reserved wording proposal, which seems to be the case as probably not > >> worth the extra work). > > > > Thanks for the discussion! > > > > Here is the V33 patch set which includes the following changes: > > Thanks for working on it! > > > > > 1) Drop slots with state 'i' in promotion flow after we shut down > > WalReceiver. > > @@ -3557,10 +3558,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool > randAccess, > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > +* > +* Drop the slots for which sync is initiated but not yet > +* completed i.e. they are still waiting for the primary > +* server to catch up. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > + slotsync_drop_initiated_slots(); > return XLREAD_FAIL; > } > > I had a closer look and it seems this is not located at the right place. > > Indeed, it's added here: > > switch (currentSource) > { > case XLOG_FROM_ARCHIVE: > case XLOG_FROM_PG_WAL: > > While in our case we are in > > case XLOG_FROM_STREAM: > > So I think we should move slotsync_drop_initiated_slots() into the > XLOG_FROM_STREAM case. Maybe before shutting down the sync slot worker? > (the TODO item number 2 you mentioned up-thread) > > BTW in order to prevent any corner case, wouldn't it also be better to > > replace: > > + /* > +* Do not allow consumption of a "synchronized" slot until the standby > +* gets promoted. 
> +*/ > + if (RecoveryInProgress() && (slot->data.sync_state != > SYNCSLOT_STATE_NONE)) > > with something like: > > if ((RecoveryInProgress() && (slot->data.sync_state != SYNCSLOT_STATE_NONE)) > || slot->data.sync_state == SYNCSLOT_STATE_INITIATED) > > to ensure slots in 'i' case can never be used? > Yes, it makes sense. Will do it. > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Nov 13, 2023 at 11:02 AM Ajin Cherian wrote: > > On Thu, Nov 9, 2023 at 9:54 PM shveta malik wrote: > > > > PFA v32 patches which has below changes: > Testing with this patch, I see that if the failover enabled slot is > invalidated on the primary, then the corresponding synced slot is not > invalidated on the standby. Instead, I see that it continuously gets > the below error: > " WARNING: not synchronizing slot sub; synchronization would move it > backwards" > > In the code, I see that: > if (remote_slot->restart_lsn < MyReplicationSlot->data.restart_lsn) > { > ereport(WARNING, > errmsg("not synchronizing slot %s; synchronization > would move" >" it backwards", remote_slot->name)); > > ReplicationSlotRelease(); > CommitTransactionCommand(); > return; > } > > If the restart_lsn of the remote slot is behind, then the > local_slot_update() function is never called to set the invalidation > status on the local slot. And for invalidated slots, restart_lsn is > always NULL. > Okay. Thanks for testing Ajin. I think it needs a fix wherein we set the local-slot's invalidation status (provided it is not invalidated already) from the remote slot before this check itself. And if the slot is invalidated locally (either by itself) or by primary_slot being invalidated, then we should skip the sync. I will fix this in the next version. thanks Shveta
Re: Synchronizing slots from primary to standby
On Thu, Nov 9, 2023 at 8:56 AM Amit Kapila wrote: > > On Thu, Nov 9, 2023 at 8:11 AM Amit Kapila wrote: > > > > On Wed, Nov 8, 2023 at 8:09 PM Drouvot, Bertrand > > wrote: > > > > > > > Unrelated to above, if there is a user slot on standby with the same > > > > name which the slot-sync worker is trying to create, then shall it > > > > emit a warning and skip the sync of that slot or shall it throw an > > > > error? > > > > > > > > > > I'd vote for emit a warning and move on to the next slot if any. > > > > > > > But then it could take time for users to know the actual problem and > > they probably notice it after failover. OTOH, if we throw an error > > then probably they will come to know earlier because the slot sync > > mechanism would be stopped. Do you have reasons to prefer giving a > > WARNING and skipping creating such slots? I expect this WARNING to > > keep getting repeated in LOGs because the consecutive sync tries will > > again generate a WARNING. > > > > Apart from the above, I would like to discuss the slot sync work > distribution strategy of this patch. The current implementation as > explained in the commit message [1] works well if the slots belong to > multiple databases. It is clear from the data in emails [2][3][4] that > having more workers really helps if the slots belong to multiple > databases. But I think if all the slots belong to one or very few > databases then such a strategy won't be as good. Now, on one hand, we > get very good numbers for a particular workload with the strategy used > in the patch but OTOH it may not be adaptable to various different > kinds of workloads. So, I have a question whether we should try to > optimize this strategy for various kinds of workloads or for the first > version let's use a single-slot sync-worker and then we can enhance > the functionality in later patches either in PG17 itself or in PG18 or > later versions. I can work on separating the patch. 
We can first focus on single worker design and then we can work on multi-worker design either immediately (if needed) or we can target it in the second draft of the patch. I would like to know the thoughts of others on this. One thing to note is that a lot of the complexity of > the patch is attributed to the multi-worker strategy which may still > not be efficient, so there is an argument to go with a simpler > single-slot sync-worker strategy and then enhance it in future > versions as we learn more about various workloads. It will also help > to develop this feature incrementally instead of doing all the things > in one go and taking a much longer time than it should. > Agreed. With multi-workers, a lot of complexity (dsa, locks etc) have come into play. We can decide better on our workload distribution strategy among workers once we have more clarity on different types of workloads. > > [1] - "The replication launcher on the physical standby queries > primary to get the list of dbids for failover logical slots. Once it > gets the dbids, if dbids < max_slotsync_workers, it starts only that > many workers, and if dbids > max_slotsync_workers, it starts > max_slotsync_workers and divides the work equally among them. Each > worker is then responsible to keep on syncing the logical slots > belonging to the DBs assigned to it. > > Each slot-sync worker will have its own dbids list. Since the upper > limit of this dbid-count is not known, it needs to be handled using > dsa. We initially allocated memory to hold 100 dbids for each worker. > If this limit is exhausted, we reallocate this memory with size > incremented again by 100." 
> > [2] - > https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com > [3] - > https://www.postgresql.org/message-id/CAFPTHDZw2G3Pax0smymMjfPqdPcZhMWo36f9F%2BTwNTs0HFxK%2Bw%40mail.gmail.com > [4] - > https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com > > -- > With Regards, > Amit Kapila.
Re: Synchronizing slots from primary to standby
On Mon, Nov 13, 2023 at 6:19 AM Zhijie Hou (Fujitsu) wrote: > > On Thursday, November 9, 2023 6:54 PM shveta malik > wrote: > > > > > > PFA v32 patches which has below changes: > > Thanks for updating the patch. > > Here are few comments: > > > 2. > The check for wal_level < WAL_LEVEL_LOGICAL. > > It seems the existing codes disallow slot creation if wal_level is not > sufficient, I am thinking we might need similar check in slot sync worker. > Otherwise, the synced slot could not be used after standby promotion. > Yes, I agree. We should add this check. > Best Regards, > Hou zj
Re: Synchronizing slots from primary to standby
On Thu, Nov 9, 2023 at 9:15 PM Drouvot, Bertrand wrote: > > Hi, > > On 11/9/23 11:54 AM, shveta malik wrote: > > > > PFA v32 patches which has below changes: > > Thanks! > > > 7) Added warning for cases where a user-slot with the same name is > > already present which slot-sync worker is trying to create. Sync for > > such slots is skipped. > > I'm seeing assertion and segfault in this case due to ReplicationSlotRelease() > in synchronize_one_slot(). > > Adding this extra check prior to it: > > - ReplicationSlotRelease(); > + if (!(found && s->data.sync_state == SYNCSLOT_STATE_NONE)) > + ReplicationSlotRelease(); > > makes them disappear. > Oh, I see. Thanks for pointing it out. I will fix it in the next version. > > > > Open Question: > > 1) Currently I have put drop slot logic for slots with 'sync_state=i' > > in slot-sync worker. Do we need to put it somewhere in promotion-logic > > as well? > > Yeah I think so, because there is a time window when one could "use" the slot > after the promotion and before it is removed. Producing things like: > > " > 2023-11-09 15:16:50.294 UTC [2580462] LOG: dropped replication slot > "logical_slot2" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.295 UTC [2580462] LOG: dropped replication slot > "logical_slot3" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.297 UTC [2580462] LOG: dropped replication slot > "logical_slot4" of dbid 5 as it was not sync-ready > 2023-11-09 15:16:50.297 UTC [2580462] ERROR: replication slot > "logical_slot5" is active for PID 2594628 > " > > After the promotion one was able to use logical_slot5 and now we cannot drop it. Yes, I was suspicious about this small window which may allow others to use this slot, that is why I was thinking of putting it in the promotion flow and thus asked that question earlier. But the slot-sync worker may end up creating it again in case it has not exited. 
So we need to carefully decide at which places we need to put 'not-in recovery' checks in slot-sync workers. In the previous version, synchronize_one_slot() had that check and it was skipping sync if '!RecoveryInProgress'. But I have removed that check in v32, thinking that the slots which the worker has already fetched from the primary should all get synced, with the worker exiting after that instead of syncing half and leaving the rest. But now, on rethinking, was the previous behaviour correct, i.e. skipping sync from the point onward where we see we are no longer in standby mode, even though a few of the slots have already been synced in that sync-cycle? Thoughts? > > > Perhaps in WaitForWALToBecomeAvailable() where we call > > XLogShutdownWalRcv after checking 'CheckForStandbyTrigger'. Thoughts? > > > > You mean here? > > /* > * Check to see if promotion is requested. Note that we do > * this only after failure, so when you promote, we still > * finish replaying as much as we can from archive and > * pg_wal before failover. > */ > if (StandbyMode && CheckForStandbyTrigger()) > { > XLogShutdownWalRcv(); > return XLREAD_FAIL; > } > Yes, here. > If so, that sounds like a good place to me. Okay. Will add it. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Wed, Nov 8, 2023 at 3:19 PM Drouvot, Bertrand wrote: > > Hi, > > On 11/8/23 9:57 AM, Amit Kapila wrote: > > On Wed, Nov 8, 2023 at 12:32 PM Drouvot, Bertrand > > wrote: > >> > >> Hi, > >> > >> On 11/8/23 4:50 AM, Amit Kapila wrote: > >> > >>> I think if we want to follow > >>> this approach then we need to also monitor these slots for any change > >>> in the consecutive cycles and if we are able to sync them then > >>> accordingly we enable them for use after failover. > >> > >> What about adding a new field in ReplicationSlotPersistentData > >> indicating that we are waiting for "sync" and dropping such slots during > >> promotion and > >> /or if not in recovery? > >> > > > > This patch is already adding a 'synced' flag in > > ReplicationSlotPersistentData to distinguish synced slots so that we > > can disallow decoding on them on the standby and disallow dropping those. I > > suggest we change that field to have multiple states where one of the > > states would indicate that the initial sync of the slot is done. > > > > Yeah, agree. > I am working on this implementation. This sync-state is even needed for cascading standbys to know when to start syncing the slots from the first standby. It should start syncing only after the first standby has finished initialization of it (i.e. the wait for the primary is over) and not before that. Unrelated to above, if there is a user slot on standby with the same name which the slot-sync worker is trying to create, then shall it emit a warning and skip the sync of that slot or shall it throw an error? thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Nov 6, 2023 at 7:01 AM Zhijie Hou (Fujitsu) wrote: > > On Friday, November 3, 2023 7:32 PM Amit Kapila > > > > On Thu, Nov 2, 2023 at 2:35 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > Here is the new version patch set (V29) which addressed Peter > > > comments[1][2] and fixed one doc compile error. > > > > > > > Few comments: > > == > > 1. > > + > > +failover (boolean) > > + > > + > > + Specifies whether the replication slot associated with the > > subscription > > + is enabled to be synced to the physical standbys so that logical > > + replication can be resumed from the new primary after failover. > > + The default is true. > > > > Why do you think it is a good idea to keep the default value as true? > > I think the user needs to enable standby for syncing slots which is not a > > default > > feature, so by default, the failover property should also be false. AFAICS, > > it is > > false for create_slot SQL API as per the below change; so that way also > > keeping > > default true for a subscription doesn't make sense. > > @@ -479,6 +479,7 @@ CREATE OR REPLACE FUNCTION > > pg_create_logical_replication_slot( > > IN slot_name name, IN plugin name, > > IN temporary boolean DEFAULT false, > > IN twophase boolean DEFAULT false, > > +IN failover boolean DEFAULT false, > > OUT slot_name name, OUT lsn pg_lsn) > > > > BTW, the below change indicates that the code treats default as false; so, > > it > > seems to be a documentation error. > > I think the document is wrong and fixed it. > > > > > 2. > > - > > /* > > * Common option parsing function for CREATE and ALTER SUBSCRIPTION > > commands. > > * > > > > Spurious line removal. > > > > 3. 
> > + else if (opts.slot_name && failover_enabled) { > > + walrcv_alter_slot(wrconn, opts.slot_name, opts.failover); > > + ereport(NOTICE, (errmsg("altered replication slot \"%s\" on > > + publisher", opts.slot_name))); } > > > > I think we can add a comment to describe why it makes sense to enable the > > failover property of the slot in this case. Can we change the notice > > message to: > > "enabled failover for replication slot \"%s\" on publisher" > > Added. > > > > > 4. > > libpqrcv_create_slot(WalReceiverConn *conn, const char *slotname, > > - bool temporary, bool two_phase, CRSSnapshotAction snapshot_action, > > - XLogRecPtr *lsn) > > + bool temporary, bool two_phase, bool failover, CRSSnapshotAction > > + snapshot_action, XLogRecPtr *lsn) > > { > > PGresult *res; > > StringInfoData cmd; > > @@ -913,7 +917,14 @@ libpqrcv_create_slot(WalReceiverConn *conn, const > > char *slotname, > > else > > appendStringInfoChar(&cmd, ' '); > > } > > - > > + if (failover) > > + { > > + appendStringInfoString(&cmd, "FAILOVER"); if (use_new_options_syntax) > > + appendStringInfoString(&cmd, ", "); else appendStringInfoChar(&cmd, ' > > + '); } > > > > I don't see a corresponding change in repl_gram.y. I think the following > > part of > > the code needs to be changed: > > /* CREATE_REPLICATION_SLOT slot [TEMPORARY] LOGICAL plugin [options] */ > > | K_CREATE_REPLICATION_SLOT IDENT opt_temporary K_LOGICAL IDENT > > create_slot_options > > > > I think after 0266e98, we started to use the new syntax (see the > generic_option_list rule) and we can avoid changing the repl_gram.y when > adding > new options. The new failover can be detected when parsing the generic option > list (in parseCreateReplSlotOptions). > > > > > > 5. > > @@ -228,6 +230,28 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo > > fcinfo, bool confirm, bool bin > > NameStr(MyReplicationSlot->data.plugin), > > format_procedure(fcinfo->flinfo->fn_oid))); > > .. 
> > + if (XLogRecPtrIsInvalid(upto_lsn)) > > + wal_to_wait = end_of_wal; > > + else > > + wal_to_wait = Min(upto_lsn, end_of_wal); > > + > > + /* Initialize standby_slot_names_list */ SlotSyncInitConfig(); > > + > > + /* > > + * Wait for specified streaming replication standby servers (if any) > > + * to confirm receipt of WAL upto wal_to_wait. > > + */ > > + WalSndWaitForStandbyConfirmation(wal_to_wait); > > + > > + /* > > + * The memory context used to allocate standby_slot_names_list will be > > + * freed at the end of this call. So free and nullify the list in > > + * order to avoid usage of freed list in the next call to this > > + * function. > > + */ > > + SlotSyncFreeConfig(); > > > > What if there is an error in WalSndWaitForStandbyConfirmation() before > > calling > > SlotSyncFreeConfig()? I think the problem you are trying to avoid by > > freeing it > > here can occur. I think it is better to do this in a logical decoding > > context and > > free the list along with it as we are doing in commit c7256e6564(see PG15). > > I will analyze more about this case and update in next version. > > > Also, > > it is better to allocate this list somewhere in > > WalSndWaitForStandbyConfirmation(), probably in WalSndGetStandbySlots, > >
Re: Synchronizing slots from primary to standby
On Fri, Oct 27, 2023 at 8:43 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/27/23 11:56 AM, shveta malik wrote: > > On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand > > wrote: > >> > >> Hi, > >> > >> On 10/25/23 5:00 AM, shveta malik wrote: > >>> On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > >>> wrote: > >>>> > >>>> Hi, > >>>> > >>>> On 10/23/23 2:56 PM, shveta malik wrote: > >>>>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>>>> wrote: > >>>> > >>>>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before > >>>>>> checking if there > >>>>>> is new synced slot(s) to be created on the standby. Do we want to keep > >>>>>> this behavior > >>>>>> for V1? > >>>>>> > >>>>> > >>>>> I think for the slotsync workers case, we should reduce the naptime in > >>>>> the launcher to say 30sec and retain the default one of 3mins for > >>>>> subscription apply workers. Thoughts? > >>>>> > >>>> > >>>> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a > >>>> new > >>>> API on the standby that would refresh the list of sync slot at wish, > >>>> thoughts? > >>>> > >>> > >>> Do you mean API to refresh list of DBIDs rather than sync-slots? > >>> As per current design, launcher gets DBID lists for all the failover > >>> slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > >> > >> I mean an API to get a newly created slot on the primary being > >> created/synced on > >> the standby at wish. 
> >> > >> Also let's imagine this scenario: > >> > >> - create logical_slot1 on the primary (and don't start using it) > >> > >> Then on the standby we'll get things like: > >> > >> 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot > >> "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot > >> LSN (0/C0049530) and and catalog xmin (754) > >> > >> That's expected and due to the fact that ReplicationSlotReserveWal() does > >> set the slot > >> restart_lsn to a value < at the corresponding restart_lsn slot on the > >> primary. > >> > >> - create logical_slot2 on the primary (and start using it) > >> > >> Then logical_slot2 won't be created/synced on the standby until there is > >> activity on logical_slot1 on the primary > >> that would produce things like: > >> 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot > >> "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now > >> passed local slot LSN (0/C0049530) and catalog xmin (754) > > > > > > Slight correction to above. As soon as we start activity on > > logical_slot2, it will impact all the slots on primary, as the WALs > > are consumed by all the slots. So even if there is activity on > > logical_slot2, logical_slot1 creation on standby will be unblocked and > > it will then move to logical_slot2 creation. eg: > > > > --on standby: > > 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot > > "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local > > slot LSN (0/3C979A8) and and catalog xmin (756) > > > > on primary: > > newdb1=# select now(); > > now > > -- > > 2023-10-27 15:15:51.504835+05:30 > > (1 row) > > > > --activity on mysubnew1_3 > > newdb1=# insert into tab1_3 values(1); > > INSERT 0 1 > > newdb1=# select now(); > > now > > -- > > 2023-10-27 15:15:54.651406+05:30 > > > > > > --on standby, mysubnew1_1 is unblocked. 
> > 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot > > "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now > > passed local slot LSN (0/3C979A8) and catalog xmin (756) > > > > My Setup: > > mysubnew1_1 -->mypubnew1_1 -->tab1_1 > > mysubnew1_3 -->mypubnew1_3-->tab1_3 > > > > Agree with your test case, but in my case I was not using pub/sub. > > I was not clear, so when I said: > >
Re: Synchronizing slots from primary to standby
On Thu, Oct 26, 2023 at 5:43 PM Amit Kapila wrote: > > On Thu, Oct 26, 2023 at 5:38 PM Drouvot, Bertrand > wrote: > > > > On 10/26/23 10:40 AM, Amit Kapila wrote: > > > On Wed, Oct 25, 2023 at 8:49 PM Drouvot, Bertrand > > > wrote: > > >> > > > > > > Good point, I think we should enhance the WalSndWait() logic to > > > address this case. > > > > Agree. I think it would need to take care of the new CV and probably > > provide a way for the caller to detect it stopped waiting due to the socket > > (I don't think it can find out currently). > > > > > Additionally, I think we should ensure that > > > WalSndWaitForWal() shouldn't wait twice once for wal_flush and a > > > second time for wal to be replayed by physical standby. It should be > > > okay to just wait for Wal to be replayed by physical standby when > > > applicable, otherwise, just wait for Wal to flush as we are doing now. > > > Does that make sense? > > > > Yeah, I think so. What about moving WalSndWaitForStandbyConfirmation() > > outside of WalSndWaitForWal() and call one or the other in > > logical_read_xlog_page()? > > > > I think we need to somehow integrate the logic of both functions. Let > us see what the patch author has to say about this. Amit, this is attempted in v27. thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/25/23 5:00 AM, shveta malik wrote: > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > wrote: > >> > >> Hi, > >> > >> On 10/23/23 2:56 PM, shveta malik wrote: > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>> wrote: > >> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking > >>>> if there > >>>> is new synced slot(s) to be created on the standby. Do we want to keep > >>>> this behavior > >>>> for V1? > >>>> > >>> > >>> I think for the slotsync workers case, we should reduce the naptime in > >>> the launcher to say 30sec and retain the default one of 3mins for > >>> subscription apply workers. Thoughts? > >>> > >> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >> API on the standby that would refresh the list of sync slot at wish, > >> thoughts? > >> > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > As per current design, launcher gets DBID lists for all the failover > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > I mean an API to get a newly created slot on the primary being created/synced > on > the standby at wish. > > Also let's imagine this scenario: > > - create logical_slot1 on the primary (and don't start using it) > > Then on the standby we'll get things like: > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot > "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot > LSN (0/C0049530) and and catalog xmin (754) > > That's expected and due to the fact that ReplicationSlotReserveWal() does set > the slot > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. 
> > - create logical_slot2 on the primary (and start using it) > > Then logical_slot2 won't be created/synced on the standby until there is > activity on logical_slot1 on the primary > that would produce things like: > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot > "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed > local slot LSN (0/C0049530) and catalog xmin (754) Slight correction to above. As soon as we start activity on logical_slot2, it will impact all the slots on primary, as the WALs are consumed by all the slots. So even if there is activity on logical_slot2, logical_slot1 creation on standby will be unblocked and it will then move to logical_slot2 creation. eg: --on standby: 2023-10-27 15:15:46.069 IST [696884] LOG: waiting for remote slot "mysubnew1_1" LSN (0/3C97970) and catalog xmin (756) to pass local slot LSN (0/3C979A8) and and catalog xmin (756) on primary: newdb1=# select now(); now -- 2023-10-27 15:15:51.504835+05:30 (1 row) --activity on mysubnew1_3 newdb1=# insert into tab1_3 values(1); INSERT 0 1 newdb1=# select now(); now -- 2023-10-27 15:15:54.651406+05:30 --on standby, mysubnew1_1 is unblocked. 2023-10-27 15:15:56.223 IST [696884] LOG: wait over for remote slot "mysubnew1_1" as its LSN (0/3C97A18) and catalog xmin (757) has now passed local slot LSN (0/3C979A8) and catalog xmin (756) My Setup: mysubnew1_1 -->mypubnew1_1 -->tab1_1 mysubnew1_3 -->mypubnew1_3-->tab1_3 thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Oct 25, 2023 at 3:15 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/25/23 5:00 AM, shveta malik wrote: > > On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand > > wrote: > >> > >> Hi, > >> > >> On 10/23/23 2:56 PM, shveta malik wrote: > >>> On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > >>> wrote: > >> > >>>> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking > >>>> if there > >>>> is new synced slot(s) to be created on the standby. Do we want to keep > >>>> this behavior > >>>> for V1? > >>>> > >>> > >>> I think for the slotsync workers case, we should reduce the naptime in > >>> the launcher to say 30sec and retain the default one of 3mins for > >>> subscription apply workers. Thoughts? > >>> > >> > >> Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > >> API on the standby that would refresh the list of sync slot at wish, > >> thoughts? > >> > > > > Do you mean API to refresh list of DBIDs rather than sync-slots? > > As per current design, launcher gets DBID lists for all the failover > > slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. > > I mean an API to get a newly created slot on the primary being created/synced > on > the standby at wish. > > Also let's imagine this scenario: > > - create logical_slot1 on the primary (and don't start using it) > > Then on the standby we'll get things like: > > 2023-10-25 08:33:36.897 UTC [740298] LOG: waiting for remote slot > "logical_slot1" LSN (0/C00316A0) and catalog xmin (752) to pass local slot > LSN (0/C0049530) and and catalog xmin (754) > > That's expected and due to the fact that ReplicationSlotReserveWal() does set > the slot > restart_lsn to a value < at the corresponding restart_lsn slot on the primary. 
> > - create logical_slot2 on the primary (and start using it) > > Then logical_slot2 won't be created/synced on the standby until there is > activity on logical_slot1 on the primary > that would produce things like: > > 2023-10-25 08:41:35.508 UTC [740298] LOG: wait over for remote slot > "logical_slot1" as its LSN (0/C005FFD8) and catalog xmin (756) has now passed > local slot LSN (0/C0049530) and catalog xmin (754) > > With this new dedicated API, it will be: > > - clear that the API call is "hanging" until there is some activity on the > newly created slot > (currently there is "waiting for remote slot " message in the logfile as > mentioned above but > I'm not sure that's enough) > > - be possible to create/sync logical_slot2 in the example above without > waiting for activity > on logical_slot1. > > Maybe we should change our current algorithm during slot creation so that a > newly created inactive > slot on the primary does not block other newly created "active" slots on the > primary to be created > on the standby? Depending on how we implement that, the new API may not be > needed at all. > > Thoughts? > I discussed this with my colleague Hou-San and we think that one possibility could be to somehow accelerate the increment of restart_lsn on the primary. This can be achieved by connecting to the remote (primary) and executing pg_log_standby_snapshot() at reasonable intervals while waiting on the standby during slot creation. This may increase speed to a reasonable extent w/o having to wait for the user or bgwriter to do the same for us. The current logical decoding uses a similar approach to speed up slot creation; I refer to the usage of LogStandbySnapshot in SnapBuildWaitSnapshot() and ReplicationSlotReserveWal(). Thoughts? thanks Shveta
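The idea proposed here — periodically running pg_log_standby_snapshot() on the primary while the standby waits — can be modeled as a poll loop with a "nudge" callback. A hedged Python sketch: wait_for_catchup, nudge_primary, and the intervals are hypothetical names and values; in practice the nudge would be a SQL call such as SELECT pg_log_standby_snapshot() over a connection to the primary:

```python
import time

def wait_for_catchup(get_remote_lsn, local_lsn, nudge_primary,
                     nudge_interval=10.0, poll_interval=0.5):
    """Poll until the remote slot's LSN passes local_lsn; every
    nudge_interval seconds call nudge_primary() so that restart_lsn can
    advance without waiting for user activity or the bgwriter."""
    last_nudge = time.monotonic()
    while get_remote_lsn() < local_lsn:
        now = time.monotonic()
        if now - last_nudge >= nudge_interval:
            nudge_primary()
            last_nudge = now
        time.sleep(poll_interval)
```

This mirrors the pattern the mail points to in logical decoding, where LogStandbySnapshot() is used in SnapBuildWaitSnapshot() to speed up slot creation.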
Re: Synchronizing slots from primary to standby
On Tue, Oct 24, 2023 at 11:54 AM Drouvot, Bertrand wrote: > > Hi, > > On 10/23/23 2:56 PM, shveta malik wrote: > > On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand > > wrote: > > >> We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking > >> if there > >> is new synced slot(s) to be created on the standby. Do we want to keep > >> this behavior > >> for V1? > >> > > > > I think for the slotsync workers case, we should reduce the naptime in > > the launcher to say 30sec and retain the default one of 3mins for > > subscription apply workers. Thoughts? > > > > Another option could be to keep DEFAULT_NAPTIME_PER_CYCLE and create a new > API on the standby that would refresh the list of sync slot at wish, thoughts? > Do you mean API to refresh list of DBIDs rather than sync-slots? As per current design, launcher gets DBID lists for all the failover slots from the primary at intervals of DEFAULT_NAPTIME_PER_CYCLE. These dbids are then distributed among max slot-sync workers and then they fetch slots for the concerned DBIDs at regular intervals of 10ms (WORKER_DEFAULT_NAPTIME_MS) and create/update those locally. thanks Shveta
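The two-level polling described above (the launcher refreshes the DBID list every DEFAULT_NAPTIME_PER_CYCLE; each worker then polls its DBIDs every 10ms) implies the launcher partitions the DBIDs across at most max slot-sync workers. A round-robin sketch in Python — the patch's actual distribution strategy may differ:

```python
def distribute_dbids(dbids, max_workers):
    """Round-robin split of the failover-slot database list across
    slot-sync workers, so each worker polls a disjoint subset."""
    buckets = [[] for _ in range(min(max_workers, len(dbids)) or 1)]
    for i, dbid in enumerate(dbids):
        buckets[i % len(buckets)].append(dbid)
    return buckets
```

For example, five DBIDs over two workers yields two subsets of three and two databases, while two DBIDs over four workers only occupies two workers.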
Re: Synchronizing slots from primary to standby
On Tue, Oct 24, 2023 at 3:35 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/24/23 7:44 AM, Ajin Cherian wrote: > > On Mon, Oct 23, 2023 at 11:22 PM Drouvot, Bertrand > > wrote: > >> > >> @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > >> SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > >> } > >> > >> + /* set failover in the slot, as requested */ > >> + slot->data.failover = ctx->failover; > >> + > >> > >> I think we can get rid of this change in CreateDecodingContext(). > >> > > Yes, I too noticed this in my testing, however just removing this from > > CreateDecodingContext will not allow us to change the slot's failover flag > > using Alter subscription. > > Oh right. > > > I am thinking of moving this change to > > StartLogicalReplication prior to calling CreateDecodingContext by > > parsing the command options in StartReplicationCmd > > without adding it to the LogicalDecodingContext. > > > > Yeah, that looks like a good place to update "failover". > > Doing more testing and I have a couple of remarks about he current behavior. > > 1) Let's imagine that: > > - there is no standby > - standby_slot_names is set to a valid slot on the primary (but due to the > above, not linked to any standby) > - then a create subscription on a subscriber WITH (failover = true) would > start the > synchronisation but never finish (means leaving a "synchronisation" slot like > "pg_32811_sync_24576_7293415241672430356" > in place coming from ReplicationSlotNameForTablesync()). > > That's expected, but maybe we should emit a warning in > WalSndWaitForStandbyConfirmation() on the primary when there is > a slot part of standby_slot_names which is not active/does not have an > active_pid attached to it? > Agreed, Will do that. > 2) When we create a subscription, another slot is created during the > subscription synchronization, namely > like "pg_16397_sync_16388_7293447291374081805" (coming from > ReplicationSlotNameForTablesync()). 
> > This extra slot appears to have failover also set to true. > > So, If the standby refresh the list of slot to sync when the subscription is > still synchronizing we'd see things like > on the standby: > > LOG: waiting for remote slot "mysub" LSN (0/C0034808) and catalog xmin (756) > to pass local slot LSN (0/C0034840) and and catalog xmin (756) > LOG: wait over for remote slot "mysub" as its LSN (0/C00368B0) and catalog > xmin (756) has now passed local slot LSN (0/C0034840) and catalog xmin (756) > LOG: waiting for remote slot "pg_16397_sync_16388_7293447291374081805" LSN > (0/C0034808) and catalog xmin (756) to pass local slot LSN (0/C00368E8) and > and catalog xmin (756) > WARNING: slot "pg_16397_sync_16388_7293447291374081805" disappeared from the > primary, aborting slot creation > > I'm not sure this "pg_16397_sync_16388_7293447291374081805" should have > failover set to true. If there is a failover > during the subscription creation, better to re-launch the subscription > instead? > The 'failover' property of the subscription is carried to the tablesync slot in recent versions only with the intent that, if failover happens during create-sub while table-sync of large tables is in progress, users should be able to start from that point onward on the new primary. But yes, the above scenario is highly probable, wherein no activity is happening on the primary and thus the table-sync slot is waiting during its creation on the standby. So I agree, to simplify things we can skip table-sync slot syncing on the standby and document this behaviour. thanks Shveta
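As a side note on the slot names quoted in the logs above: they follow ReplicationSlotNameForTablesync()'s scheme, pg_<subscription oid>_sync_<relation oid>_<system identifier>. A Python mirror of that naming, for illustration only:

```python
def tablesync_slot_name(subid, relid, system_id):
    """Mirror of ReplicationSlotNameForTablesync()'s naming scheme:
    pg_<subscription oid>_sync_<relation oid>_<system identifier>."""
    return f"pg_{subid}_sync_{relid}_{system_id}"
```

This reproduces the "pg_16397_sync_16388_7293447291374081805" seen above (subscription OID 16397, relation OID 16388).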
Re: Synchronizing slots from primary to standby
On Mon, Oct 23, 2023 at 5:52 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/20/23 5:27 AM, shveta malik wrote: > > On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila wrote: > > > > PFA v25 patch set. The changes are: > > > > 1) 'enable_failover' is changed to 'failover' > > 2) Alter subscription changes to support 'failover' > > 3) Fixes a bug in patch001 wherein any change in standby_slot_names > > was not considered in the flow where logical walsenders wait for > > standby's confirmation. Now during the wait, if standby_slot_names is > > changed, wait is restarted using new standby_slot_names. > > 4) Addresses comments by Bertrand and Amit in [1],[2],[3] > > > > The changes are mostly in patch001 and a very few in patch002. > > > > Thank You Ajin for working on alter-subscription changes and adding > > more TAP-tests for 'failover' > > > > Thanks for updating the patch! > > Looking at 0001 and doing some experiment: > > Creating a logical slot with failover = true and then launching > pg_logical_slot_get_changes() or pg_recvlogical() on it results > to setting failover back to false. > > It occurs while creating the decoding context here: > > @@ -602,6 +602,9 @@ CreateDecodingContext(XLogRecPtr start_lsn, > SnapBuildSetTwoPhaseAt(ctx->snapshot_builder, start_lsn); > } > > + /* set failover in the slot, as requested */ > + slot->data.failover = ctx->failover; > + > > I think we can get rid of this change in CreateDecodingContext(). > Thanks for pointing it out. I will correct it in the next patch. > Looking at 0002: > > /* Enter main loop */ > for (;;) > { > int rc; > longwait_time = DEFAULT_NAPTIME_PER_CYCLE; > > CHECK_FOR_INTERRUPTS(); > > /* > * If it is Hot standby, then try to launch slot-sync workers else > * launch apply workers. 
> */ > if (RecoveryInProgress()) > { > /* Launch only if we have successfully made the connection */ > if (wrconn) > LaunchSlotSyncWorkers(&wait_time, wrconn); > } > > We are waiting for DEFAULT_NAPTIME_PER_CYCLE (3 minutes) before checking if > there > is new synced slot(s) to be created on the standby. Do we want to keep this > behavior > for V1? > I think for the slotsync workers case, we should reduce the naptime in the launcher to say 30sec and retain the default one of 3mins for subscription apply workers. Thoughts? > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
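The trade-off above boils down to choosing the launcher's wait time by recovery state. A minimal sketch; the 30-second figure and the launcher_wait_time helper are only the proposal in this mail, not committed PostgreSQL behavior:

```python
DEFAULT_NAPTIME_PER_CYCLE = 180_000   # ms: apply-worker launch cycle (3 min)
SLOTSYNC_NAPTIME_PER_CYCLE = 30_000   # ms: proposed shorter slot-sync cycle

def launcher_wait_time(in_recovery):
    """On a hot standby the launcher runs slot-sync workers, so nap for
    the shorter proposed interval; on a primary it launches apply
    workers on the default 3-minute cycle."""
    return SLOTSYNC_NAPTIME_PER_CYCLE if in_recovery else DEFAULT_NAPTIME_PER_CYCLE
```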
Re: Synchronizing slots from primary to standby
On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/13/23 10:35 AM, shveta malik wrote: > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik wrote: > >> > > > > PFA v24 patch set which has below changes: > > > > 1) 'enable_failover' displayed in pg_replication_slots. > > 2) Support for 'enable_failover' in > > pg_create_logical_replication_slot(). It is an optional argument with > > default value false. > > 3) Addressed pending comments (1-30) from Peter in [1]. > > 4) Fixed an issue in patch002 due to which even slots with > > enable_failover=false were getting synced. > > > > The changes for 1 and 2 are in patch001 while 3 and 4 are in patch0002 > > > > Thanks Ajin, for working on 1 and 3. > > Thanks for the hard work! > > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(NULL); > > does produce at compilation time: > > launcher.c:1916:40: warning: too many arguments in call to > 'slotsync_remote_connect' > wrconn = slotsync_remote_connect(NULL); > > Looking at 0001: > > commit message: > > "is added at the slot level which > will be persistent information" > > what about "which is persistent information" ? > > Code: > > + True if this logical slot is enabled to be synced to the physical > standbys > + so that logical replication is not blocked after failover. Always > false > + for physical slots. > > Not sure "not blocked" is the right wording. "can be resumed from the new > primary" maybe? > > +static void > +ProcessRepliesAndTimeOut(void) > +{ > + CHECK_FOR_INTERRUPTS(); > + > + /* Process any requests or signals received recently */ > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); > + SyncRepInitConfig(); > + SlotSyncInitConfig(); > + } > > Do we want to do this at each place ProcessRepliesAndTimeOut() is being > called? I mean before this change it was not done in ProcessPendingWrites(). > > + * Wait for physical standby to confirm receiving give lsn. > > typo? 
s/give/given/ > > > diff --git a/src/test/recovery/t/050_verify_slot_order.pl > b/src/test/recovery/t/050_verify_slot_order.pl > new file mode 100644 > index 00..25b3d5aac2 > --- /dev/null > +++ b/src/test/recovery/t/050_verify_slot_order.pl > @@ -0,0 +1,145 @@ > + > +# Copyright (c) 2023, PostgreSQL Global Development Group > + > > Regarding the TAP tests, should we also add some testing related to > enable_failover being set > in pg_create_logical_replication_slot() and pg_logical_slot_get_changes() > behavior too? > We have added some basic tests in v25. More detailed tests to be added in coming versions. > Please note that current comments are coming while > "quickly" going through 0001. > > I'm planning to have a closer look at 0001 and 0002 too. > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Wed, Oct 18, 2023 at 4:24 PM Amit Kapila wrote: > > On Tue, Oct 17, 2023 at 2:01 PM shveta malik wrote: > > > > On Tue, Oct 17, 2023 at 12:44 PM Peter Smith wrote: > > > > > > FYI - the latest patch failed to apply. > > > > > > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply > > > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch > > > error: patch failed: src/include/utils/guc_hooks.h:160 > > > error: src/include/utils/guc_hooks.h: patch does not apply > > > > Rebased v24. PFA. > > > > Few comments: > == > 1. > +List of physical replication slots that logical replication > with failover > +enabled waits for. > > /logical replication/logical replication slots > > 2. > If > +enable_syncslot is not enabled on the > +corresponding standbys, then it may result in indefinite waiting > +on the primary for physical replication slots configured in > +standby_slot_names > + > > Why the above leads to indefinite wait? I think we should just ignore > standby_slot_names and probably LOG a message in the server for the > same. > Sorry for confusion. This info was wrong, I have corrected it. > 3. > +++ b/src/backend/replication/logical/tablesync.c > @@ -1412,7 +1412,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos) > */ > walrcv_create_slot(LogRepWorkerWalRcvConn, > slotname, false /* permanent */ , false /* two_phase */ , > -CRS_USE_SNAPSHOT, origin_startpos); > +false /* enable_failover */ , CRS_USE_SNAPSHOT, > +origin_startpos); > > As per this code, we won't enable failover for tablesync slots. So, > what happens if we need to failover to new node after the tablesync > worker has reached SUBREL_STATE_FINISHEDCOPY or SUBREL_STATE_DATASYNC? > I think we won't be able to continue replication from failed over > node. 
If this theory is correct, we have two options (a) enable > failover for sync slots as well, if it is enabled for main slot; but > then after we drop the slot on primary once sync is complete, same > needs to be taken care at standby. (b) enable failover even for the > main slot after all tables are in ready state, something similar to > what we do for two_phase. I have adopted approach a) right now. The table sync slot is created with the subscription's failover option and will be dropped by the standby if it is dropped on the primary. > > 4. > + /* Verify syntax */ > + if (!validate_slot_names(newval, &elemlist)) > + return false; > + > + /* Now verify if these really exist and have correct type */ > + if (!validate_standby_slots(elemlist)) > > These two functions serve quite similar functionality which makes > their naming quite confusing. Can we directly move the functionality > of validate_slot_names() into validate_standby_slots()? > > 5. > +SlotSyncInitConfig(void) > +{ > + char *rawname; > + > + /* Free the old one */ > + list_free(standby_slot_names_list); > + standby_slot_names_list = NIL; > + > + if (strcmp(standby_slot_names, "") != 0) > + { > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', &standby_slot_names_list); > > How does this handle the case where '*' is specified for standby_slot_names? > I have removed '*'-related doc info in this patch and have introduced an error if '*' is given for this GUC. The reason being, I do not see a way to figure out the slot names of all physical standbys on a primary to make '*' work. We have info about all the physical slots created on the primary, but which ones are actually being used by standbys is not known. Thoughts? thanks Shveta
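The checks discussed in points 4 and 5, plus the new rejection of '*', can be folded into one validation pass as suggested. A Python sketch of the combined behavior — validate_standby_slot_names and the error wording are illustrative, not the patch's actual C code:

```python
def validate_standby_slot_names(raw, existing_physical_slots):
    """Sketch of the GUC check discussed above: '*' is rejected because
    the primary cannot tell which physical slots are really used by
    standbys; otherwise each comma-separated name must be an existing
    physical slot."""
    if raw.strip() == "*":
        raise ValueError('"*" is not accepted for standby_slot_names')
    names = [n.strip() for n in raw.split(",") if n.strip()]
    for name in names:
        if name not in existing_physical_slots:
            raise ValueError(f'replication slot "{name}" does not exist')
    return names
```

Merging the syntax split and the existence/type check into a single function also resolves the naming confusion between validate_slot_names() and validate_standby_slots().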
Re: Synchronizing slots from primary to standby
On Wed, Oct 18, 2023 at 10:20 AM Amit Kapila wrote: > > On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand > wrote: > > > > On 10/13/23 10:35 AM, shveta malik wrote: > > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik > > > wrote: > > >> > > > > Code: > > > > + True if this logical slot is enabled to be synced to the physical > > standbys > > + so that logical replication is not blocked after failover. Always > > false > > + for physical slots. > > > > Not sure "not blocked" is the right wording. "can be resumed from the new > > primary" maybe? > > > > Yeah, your proposed wording sounds better. Also, I think we should > document the impact of not doing so because I think the replication > can continue after failover but it may lead to data inconsistency. > > BTW, I noticed that the code for Create Subscription is updated but > not the corresponding docs. By looking at other parameters like > password_required, streaming, two_phase where true or false indicates > whether that option is enabled or not, I am thinking about whether > enable_failover is an appropriate name for this option. The other > option name that comes to mind is 'failover' where true indicates that > the corresponding subscription will be enabled for failover. What do > you think? +1. 'failover' seems more in sync with other options' names. > -- > With Regards, > Amit Kapila.
Re: Synchronizing slots from primary to standby
On Tue, Oct 17, 2023 at 9:06 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/13/23 10:35 AM, shveta malik wrote: > > On Thu, Oct 12, 2023 at 9:18 AM shveta malik wrote: > >> > > > > PFA v24 patch set which has below changes: > > > > 1) 'enable_failover' displayed in pg_replication_slots. > > 2) Support for 'enable_failover' in > > pg_create_logical_replication_slot(). It is an optional argument with > > default value false. > > 3) Addressed pending comments (1-30) from Peter in [1]. > > 4) Fixed an issue in patch002 due to which even slots with > > enable_failover=false were getting synced. > > > > The changes for 1 and 2 are in patch001 while 3 and 4 are in patch0002 > > > > Thanks Ajin, for working on 1 and 3. > > Thanks for the hard work! > Thanks for the feedback. I will try to address these in the next 1-2 versions. > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(NULL); > > does produce at compilation time: > > launcher.c:1916:40: warning: too many arguments in call to > 'slotsync_remote_connect' > wrconn = slotsync_remote_connect(NULL); > > Looking at 0001: > > commit message: > > "is added at the slot level which > will be persistent information" > > what about "which is persistent information" ? > > Code: > > + True if this logical slot is enabled to be synced to the physical > standbys > + so that logical replication is not blocked after failover. Always > false > + for physical slots. > > Not sure "not blocked" is the right wording. "can be resumed from the new > primary" maybe? > > +static void > +ProcessRepliesAndTimeOut(void) > +{ > + CHECK_FOR_INTERRUPTS(); > + > + /* Process any requests or signals received recently */ > + if (ConfigReloadPending) > + { > + ConfigReloadPending = false; > + ProcessConfigFile(PGC_SIGHUP); > + SyncRepInitConfig(); > + SlotSyncInitConfig(); > + } > > Do we want to do this at each place ProcessRepliesAndTimeOut() is being > called? I mean before this change it was not done in ProcessPendingWrites(). 
> Are you referring to ConfigReload stuff ? I see that even in ProcessPendingWrites(), we do it after WalSndWait(). Now only the order is changed, it is before WalSndWait() now. > + * Wait for physical standby to confirm receiving give lsn. > > typo? s/give/given/ > > > diff --git a/src/test/recovery/t/050_verify_slot_order.pl > b/src/test/recovery/t/050_verify_slot_order.pl > new file mode 100644 > index 00..25b3d5aac2 > --- /dev/null > +++ b/src/test/recovery/t/050_verify_slot_order.pl > @@ -0,0 +1,145 @@ > + > +# Copyright (c) 2023, PostgreSQL Global Development Group > + > > Regarding the TAP tests, should we also add some testing related to > enable_failover being set > in pg_create_logical_replication_slot() and pg_logical_slot_get_changes() > behavior too? > Sure, will do it. > Please note that current comments are coming while > "quickly" going through 0001. > > I'm planning to have a closer look at 0001 and 0002 too. > Yes, that will be really helpful. Thanks. thanks Shveta
Re: Synchronizing slots from primary to standby
On Tue, Oct 10, 2023 at 12:52 PM Peter Smith wrote: > > On Mon, Oct 9, 2023 at 9:34 PM shveta malik wrote: > > > > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith wrote: > > > > > > Here are some review comments for v20-0002. > > > > > > > Thanks Peter for the feedback. Comments from 31 till end are addressed > > in v22. First 30 comments will be addressed in the next version. > > > > Thanks for addressing my previous comments. > > I checked those and went through other changes in v22-0002 to give a > few more review comments below. > > I understand there are some design changes coming soon regarding the > use of GUCs so maybe a few of these comments will become redundant. > > == > doc/src/sgml/config.sgml > > 1. >A password needs to be provided too, if the sender demands password >authentication. It can be provided in the >primary_conninfo string, or in a separate > - ~/.pgpass file on the standby server (use > - replication as the database name). > - Do not specify a database name in the > - primary_conninfo string. > + ~/.pgpass file on the standby server. > + > + > + Specify a database name in primary_conninfo > string > + to allow synchronization of slots from the primary to standby. This > + dbname will only be used for slots synchronization purpose and will > + be irrelevant for streaming. > > > 1a. > "Specify a database name in...". Shouldn't that say "Specify dbname in..."? > > ~ > > 1b. > BEFORE > This dbname will only be used for slots synchronization purpose and > will be irrelevant for streaming. > > SUGGESTION > This will only be used for slot synchronization. It is ignored for streaming. > > == > doc/src/sgml/system-views.sgml > > 2. pg_replication_slots > > + > + > + synced_slot bool > + > + > + True if this logical slot is created on physical standby as part of > + slot-synchronization from primary server. Always false for > physical slots. 
> + > + > > /on physical standby/on the physical standby/ > > /from primary server/from the primary server/ > > == > src/backend/replication/logical/launcher.c > > 3. LaunchSlotSyncWorkers > > + /* > + * If we failed to launch this slotsync worker, return and try > + * launching rest of the workers in next sync cycle. But change > + * launcher's wait time to minimum of wal_retrieve_retry_interval and > + * default wait time to try next sync-cycle sooner. > + */ > > 3a. > Use consistent terms -- choose "sync cycle" or "sync-cycle" > > ~ > > 3b. > Is it correct to just say "rest of the workers"; won't it also try to > relaunch this same failed worker again? > > ~~~ > > 4. LauncherMain > > + /* > + * Stop the slot-sync workers if any of the related GUCs changed. > + * These will be relaunched using the new values during next > + * sync-cycle. Also revalidate the new configurations and > + * reconnect. > + */ > + if (SlotSyncConfigsChanged()) > + { > + slotsync_workers_stop(); > + > + if (wrconn) > + walrcv_disconnect(wrconn); > + > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(); > + } > > Was it overkill to disconnect/reconnect every time any of those GUCs > changed? Or is it enough to do that only if the > PrimaryConnInfoPreReload was changed? > The intent is to re-validate all the related GUCs and then decide if we want to carry on with the slot-sync task or leave it as is. > == > src/backend/replication/logical/logical.c > > 5. CreateDecodingContext > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. 
> + */ > + if (RecoveryInProgress() && slot->data.synced) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), > + errdetail("This slot is being synced from primary."), > + errhint("Specify another replication slot."))); > + > > /from primary/from the primary/ > > == > src/backend/replication/logical/slotsync.c > > 6. use_slot_in_query > > + /* > + * Return TRUE if either slot is not yet created on standby or if it > + * belongs to one of the dbs passed in dbids. > + */ > + if (!slot_found || relevant_db) > + return true; > + > + return false; > &g
Re: Synchronizing slots from primary to standby
On Tue, Oct 10, 2023 at 12:52 PM Peter Smith wrote: > > On Mon, Oct 9, 2023 at 9:34 PM shveta malik wrote: > > > > On Wed, Oct 4, 2023 at 8:53 AM Peter Smith wrote: > > > > > > Here are some review comments for v20-0002. > > > > > > > Thanks Peter for the feedback. Comments from 31 till end are addressed > > in v22. First 30 comments will be addressed in the next version. > > > > Thanks for addressing my previous comments. > > I checked those and went through other changes in v22-0002 to give a > few more review comments below. > Thank You for your feedback. I have addressed these in v23. > I understand there are some design changes coming soon regarding the > use of GUCs so maybe a few of these comments will become redundant. > > == > doc/src/sgml/config.sgml > > 1. >A password needs to be provided too, if the sender demands password >authentication. It can be provided in the >primary_conninfo string, or in a separate > - ~/.pgpass file on the standby server (use > - replication as the database name). > - Do not specify a database name in the > - primary_conninfo string. > + ~/.pgpass file on the standby server. > + > + > + Specify a database name in primary_conninfo > string > + to allow synchronization of slots from the primary to standby. This > + dbname will only be used for slots synchronization purpose and will > + be irrelevant for streaming. > > > 1a. > "Specify a database name in...". Shouldn't that say "Specify dbname in..."? > > ~ > > 1b. > BEFORE > This dbname will only be used for slots synchronization purpose and > will be irrelevant for streaming. > > SUGGESTION > This will only be used for slot synchronization. It is ignored for streaming. > > == > doc/src/sgml/system-views.sgml > > 2. pg_replication_slots > > + > + > + synced_slot bool > + > + > + True if this logical slot is created on physical standby as part of > + slot-synchronization from primary server. Always false for > physical slots. 
> + > + > > /on physical standby/on the physical standby/ > > /from primary server/from the primary server/ > > == > src/backend/replication/logical/launcher.c > > 3. LaunchSlotSyncWorkers > > + /* > + * If we failed to launch this slotsync worker, return and try > + * launching rest of the workers in next sync cycle. But change > + * launcher's wait time to minimum of wal_retrieve_retry_interval and > + * default wait time to try next sync-cycle sooner. > + */ > > 3a. > Use consistent terms -- choose "sync cycle" or "sync-cycle" > > ~ > > 3b. > Is it correct to just say "rest of the workers"; won't it also try to > relaunch this same failed worker again? > > ~~~ > > 4. LauncherMain > > + /* > + * Stop the slot-sync workers if any of the related GUCs changed. > + * These will be relaunched using the new values during next > + * sync-cycle. Also revalidate the new configurations and > + * reconnect. > + */ > + if (SlotSyncConfigsChanged()) > + { > + slotsync_workers_stop(); > + > + if (wrconn) > + walrcv_disconnect(wrconn); > + > + if (RecoveryInProgress()) > + wrconn = slotsync_remote_connect(); > + } > > Was it overkill to disconnect/reconnect every time any of those GUCs > changed? Or is it enough to do that only if the > PrimaryConnInfoPreReload was changed? > > == > src/backend/replication/logical/logical.c > > 5. CreateDecodingContext > > + /* > + * Do not allow consumption of a "synchronized" slot until the standby > + * gets promoted. > + */ > + if (RecoveryInProgress() && slot->data.synced) > + ereport(ERROR, > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("cannot use replication slot \"%s\" for logical decoding", > + NameStr(slot->data.name)), > + errdetail("This slot is being synced from primary."), > + errhint("Specify another replication slot."))); > + > > /from primary/from the primary/ > > == > src/backend/replication/logical/slotsync.c > > 6. 
use_slot_in_query > > + /* > + * Return TRUE if either slot is not yet created on standby or if it > + * belongs to one of the dbs passed in dbids. > + */ > + if (!slot_found || relevant_db) > + return true; > + > + return false; > > Same as single line: > > return (!slot_foun
Re: Synchronizing slots from primary to standby
On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shveta, > > Thank you for updating the patch! Thanks for the feedback Kuroda-san. I have addressed most of these in v22. Please find my comments inline. > > I found another ERROR due to the slot removal. Is this a real issue? > > 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked >and the primary crashed during the >initial sync. > 2. executed test_0925_v2.sh (You attached in [1]) > 3. secondary could not start the logical replication because the slot was not >created (log files were also attached). > > > Here is my analysis. The cause is that the slotsync worker aborts the slot > creation > on secondary server because the restart_lsn of secondary ahead the primary's > one. > IIUC it can be occurred when tablesync workers finishes initial copy before > walsenders stream changes. In this case, the relstate of the worker is set to > SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes > SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until > the relation is caught up. If some changes are come and the primary crashes at > that time, the syncslot worker will abort the slot creation. > > > Anyway, followings are my comments. > I have not checked detailed conventions yet. It should be done in later stage. > > > > > For 0001: > === > > WalSndWaitForStandbyConfirmation() > > ``` > + /* If postmaster asked us to stop, don't wait anymore */ > + if (got_STOPPING) > + break; > ``` > > I have considered again, and it may still have an issue: logical walsenders > may > break from the loop before physical walsenders send WALs. This may be occurred > because both physical and logical walsenders would get > PROCSIG_WALSND_INIT_STOPPING. > > I think a function like WalSndWaitStopping() must be needed, which waits until > physical walsenders become WALSNDSTATE_STOPPING or exit. Thought? 
> > WalSndWaitForStandbyConfirmation() > > ``` > + standby_slot_cpy = list_copy(standby_slot_names_list); > ``` > > I found that standby_slot_names_list and standby_slot_cpy would not be updated > even if the GUC was updated. Is this acceptable? Won't it be occurred after > you > refactor the patch? > What would be occurred when synchronize_slot_names is updated on secondary > while primary executes this? > > WalSndWaitForStandbyConfirmation() > > ``` > + > + goto retry; > ``` Yes, there could be a problem here. I will review it. Allow some more time for this. > > I checked other "goto retry;", but I could not find the pattern that the > return > clause does not exist after the goto (exception: void function). I also think > that current style seems a bit strange. How about using an outer loop like > While (list_length(standby_slot_cpy))? > > = > > slot.h > > ``` > +extern void WaitForStandbyLSN(XLogRecPtr wait_for_lsn); > ``` > > WaitForStandbyLSN() does not exist. > Done. > > > For 0002: > = > > General > > The patch requires that primary_conninfo must contain the dbname, but it > conflicts with documentation. It says: > > ``` > ...Do not specify a database name in the primary_conninfo string. > ``` > > I confirmed [^a] it is harmless that primary_conninfo has dbname, but at least > the description must be fixed. > Done. > General > > I found that primary server output huge amount of logs when the > log_min_duration_messages = 0. > This ie because slotsync worker sends an SQL per 10ms, in > wait_for_primary_slot_catchup(). > Is there any good way to suppress it? Or, should we be patient? > I will review it to see if we can do anything here. > = > > ``` > +{ oid => '6312', descr => 'what caused the replication slot to become > invalid', > ``` > > How did you determine the oid? IIRC, developping features should use oids in > the range 8000-. See src/include/catalog/unused_oids. > Corrected it. > = > > LogicalRepCtxStruct > > ``` > /* Background workers. 
*/ > + SlotSyncWorker *ss_workers; /* slot-sync workers */ > LogicalRepWorker workers[FLEXIBLE_ARRAY_MEMBER]; > ``` > > It's OK for now, but can we combine them into an array? IIUC there is no > possibility to exist both of processes and they have same component, so it may > be able to be same. It can reduce an attribute but may lead some > difficulties to read. I feel combining them would add more confusion; they should be kept separate. > > WaitForReplicationWorkerAttach() and logicalrep_worker_stop_internal() > > I could not find cases that has "LWLock *" as an argument (exception: > functions in lwlock.c). > Is it sufficient to check RecoveryInProgress() instead of specifying as > arguments? > I feel it should be argument-based; if not lock-based, then perhaps a different argument. Let us say it is in the process of starting a worker, the start failed, and now it wants to do cleanup. It should
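The outer-loop refactoring suggested earlier in this message (replacing the `goto retry;` at the bottom of WalSndWaitForStandbyConfirmation() with a loop over the remaining standby slot list) could be sketched roughly as below. This is a hypothetical stand-in that uses a plain array and a counter instead of PostgreSQL's List, purely to show the control-flow shape, not the actual patch code:

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 3

/* Pretend confirmation: standby slot i confirms after this many polls. */
static int rounds_needed[NSLOTS] = {1, 3, 2};

static int
wait_for_standbys(void)
{
	bool		pending[NSLOTS] = {true, true, true};
	int			remaining = NSLOTS;
	int			rounds = 0;

	/*
	 * The outer while-loop replaces the "goto retry" pattern: keep
	 * polling until no standby slot remains unconfirmed.
	 */
	while (remaining > 0)
	{
		rounds++;
		for (int i = 0; i < NSLOTS; i++)
		{
			if (pending[i] && rounds >= rounds_needed[i])
			{
				pending[i] = false;	/* confirmed; drop from the "list" */
				remaining--;
			}
		}
		/* the real code would sleep/wait for a condition here */
	}
	return rounds;
}
```

The loop condition plays the role of the suggested `while (list_length(standby_slot_cpy))`, making the exit condition explicit instead of relying on a backward goto.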
Re: Synchronizing slots from primary to standby
On Mon, Oct 9, 2023 at 3:24 AM Peter Smith wrote: > > On Fri, Oct 6, 2023 at 7:37 PM Alvaro Herrera wrote: > > > > On 2023-Sep-27, Peter Smith wrote: > > > > > 3. get_local_synced_slot_names > > > > > > + for (int i = 0; i < max_replication_slots; i++) > > > + { > > > + ReplicationSlot *s = >replication_slots[i]; > > > + > > > + /* Check if it is logical synchronized slot */ > > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > > + { > > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > > + { > > > > > > Loop variables are not declared in the common PG code way. > > > > Note that since we added C99 as a mandatory requirement for compilers in > > commit d9dd406fe281, we've been using declarations in loop initializers > > (see 143290efd079). We have almost 500 occurrences of this already. > > Older code, obviously, does not use them, but that's no reason not to > > introduce them in new code. I think they make the code a bit leaner, so > > I suggest to use these liberally. > > > > I also prefer the C99 style, but I had misunderstood there was still a > convention to keep using the old style for code consistency (e.g. many > new patches I see still seem to use the old style). > > Thanks for confirming that C99 loop variables are fine for any new code. > > @Shveta/Ajin - please ignore/revert all my old review comments about this > point. > Sure, reverted all such changes in v22. Now we have declarations in loop initializers. thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Oct 4, 2023 at 8:53 AM Peter Smith wrote: > > Here are some review comments for v20-0002. > Thanks Peter for the feedback. Comments from 31 till end are addressed in v22. First 30 comments will be addressed in the next version. > == > 1. GENERAL - errmsg/elog messages > > There are a a lot of minor problems and/or quirks across all the > message texts. Here is a summary of some I found: > > ERROR > errmsg("could not receive list of slots from the primary server: %s", > errmsg("invalid response from primary server"), > errmsg("invalid connection string syntax: %s", > errmsg("replication slot-sync worker slot %d is empty, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("replication slot-sync worker slot %d is already used by > another worker, cannot attach", > errmsg("could not connect to the primary server: %s", > > errmsg("operation not permitted on replication slots on standby which > are synchronized from primary"))); > /primary/the primary/ > > errmsg("could not fetch invalidation cuase for slot \"%s\" from primary: %s", > /cuase/cause/ > /primary/the primary/ > > errmsg("slot \"%s\" disapeared from the primary", > /disapeared/disappeared/ > > errmsg("could not fetch slot info from the primary: %s", > errmsg("could not connect to the primary server: %s", err))); > errmsg("could not map dynamic shared memory segment for slot-sync worker"))); > > errmsg("physical replication slot %s found in synchronize_slot_names", > slot name not quoted? > --- > > WARNING > errmsg("out of background worker slots"), > > errmsg("Replication slot-sync worker failed to attach to worker-pool slot %d", > case? > > errmsg("Removed database %d from replication slot-sync worker %d; > dbcount now: %d", > case? > > errmsg("Skipping slots synchronization as primary_slot_name is not set.")); > case? > > errmsg("Skipping slots synchronization as hot_standby_feedback is off.")); > case? 
> > errmsg("Skipping slots synchronization as dbname is not specified in > primary_conninfo.")); > case? > > errmsg("slot-sync wait for slot %s interrupted by promotion, slot > creation aborted", > > errmsg("could not fetch slot info for slot \"%s\" from primary: %s", > /primary/the primary/ > > errmsg("slot \"%s\" disappeared from the primary, aborting slot creation", > errmsg("slot \"%s\" invalidated on primary, aborting slot creation", > > errmsg("slot-sync for slot %s interrupted by promotion, sync not possible", > slot name not quoted? > > errmsg("skipping sync of slot \"%s\" as the received slot-sync lsn > %X/%X is ahead of the standby position %X/%X", > > errmsg("not synchronizing slot %s; synchronization would move it backward", > slot name not quoted? > /backward/backwards/ > > --- > > LOG > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: > %d", > errmsg("Added database %d to replication slot-sync worker %d; dbcount now: > %d", > errmsg("Stopping replication slot-sync worker %d", > errmsg("waiting for remote slot \"%s\" LSN (%u/%X) and catalog xmin > (%u) to pass local slot LSN (%u/%X) and and catalog xmin (%u)", > > errmsg("wait over for remote slot \"%s\" as its LSN (%X/%X)and catalog > xmin (%u) has now passed local slot LSN (%X/%X) and catalog xmin > (%u)", > missing spaces? > > elog(LOG, "Dropped replication slot \"%s\" ", > extra space? > why this one is elog but others are not? > > elog(LOG, "Replication slot-sync worker %d is shutting down on > receiving SIGINT", MySlotSyncWorker->slot); > case? > why this one is elog but others are not? > > elog(LOG, "Replication slot-sync worker %d started", worker_slot); > case? > why this one is elog but others are not? > > > DEBUG1 > errmsg("allocated dsa for slot-sync worker for dbcount: %d" > worker number not given? > should be elog? > > errmsg_internal("logical replication launcher started") > should be elog? 
> > > > DEBUG2 > elog(DEBUG2, "slot-sync worker%d's query:%s \n", > missing space after 'worker' > extra space before \n > > == > .../libpqwalreceiver/libpqwalreceiver.c > > 2. libpqrcv_get_dbname_from_conninfo > > +/* > + * Get database name from primary conninfo. > + * > + * If dbanme is not found in connInfo, return NULL value. > + * The caller should take care of handling NULL value. > + */ > +static char * > +libpqrcv_get_dbname_from_conninfo(const char *connInfo) > > 2a. > /dbanme/dbname/ > > ~ > > 2b. > "The caller should take care of handling NULL value." > > IMO this is not very useful; it's like saying "caller must handle > function return values". > > ~~~ > > 3. > + for (opt = opts; opt->keyword != NULL; ++opt) > + { > + /* Ignore connection options that are not present. */ > + if (opt->val == NULL) > + continue; > + > + if (strcmp(opt->keyword, "dbname") == 0 && opt->val[0] != '\0') > + { > + dbname = pstrdup(opt->val); > + } > + } > > 3a. > If there are multiple "dbname" in the conninfo then it will be the > LAST one that is returned. > >
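The last-wins behavior noted in review comment 3a above can be illustrated with a small self-contained sketch. The struct and function names here are hypothetical stand-ins for the PQconninfoOption array that the quoted loop iterates over; the point is that, because the loop has no break, a later "dbname" overwrites an earlier one:

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical stand-in for PQconninfoOption entries. */
typedef struct
{
	const char *keyword;
	const char *val;
} ConnOpt;

static const char *
dbname_from_opts(const ConnOpt *opts, int n)
{
	const char *dbname = NULL;

	for (int i = 0; i < n; i++)
	{
		/* Ignore connection options that are not present. */
		if (opts[i].val == NULL)
			continue;

		if (strcmp(opts[i].keyword, "dbname") == 0 && opts[i].val[0] != '\0')
			dbname = opts[i].val;	/* no break: the LAST match is kept */
	}
	return dbname;
}
```

Given `dbname=first ... dbname=second`, this returns "second", consistent with libpq's general rule that a later occurrence of a keyword in a connection string overrides an earlier one.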
Re: Synchronizing slots from primary to standby
On Fri, Oct 6, 2023 at 2:07 PM Alvaro Herrera wrote: > > On 2023-Sep-27, Peter Smith wrote: > > > 3. get_local_synced_slot_names > > > > + for (int i = 0; i < max_replication_slots; i++) > > + { > > + ReplicationSlot *s = >replication_slots[i]; > > + > > + /* Check if it is logical synchronized slot */ > > + if (s->in_use && SlotIsLogical(s) && s->data.synced) > > + { > > + for (int j = 0; j < MySlotSyncWorker->dbcount; j++) > > + { > > > > Loop variables are not declared in the common PG code way. > > Note that since we added C99 as a mandatory requirement for compilers in > commit d9dd406fe281, we've been using declarations in loop initializers > (see 143290efd079). We have almost 500 occurrences of this already. > Older code, obviously, does not use them, but that's no reason not to > introduce them in new code. I think they make the code a bit leaner, so > I suggest to use these liberally. > Okay, we will. Thanks for letting us know. thanks Shveta
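For reference, the loop-variable convention discussed above can be illustrated with a small hypothetical example. Pre-C99 PostgreSQL code declared the index at the top of the enclosing block; since C99 became a mandatory compiler requirement, new code may declare it in the for-initializer, limiting its scope to the loop:

```c
/*
 * C99-style loop-variable declaration, as now preferred for new
 * PostgreSQL code. The function itself is a made-up example.
 */
static int
sum_slots(const int *slots, int n)
{
	int			sum = 0;

	for (int i = 0; i < n; i++)	/* declared in the initializer */
		sum += slots[i];

	return sum;
}
```

Keeping `i` scoped to the loop avoids accidental reuse of a stale index later in the function, which is the main readability benefit being referred to.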
Re: Synchronizing slots from primary to standby
On Wed, Oct 4, 2023 at 12:08 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/4/23 7:00 AM, shveta malik wrote: > > On Wed, Oct 4, 2023 at 9:56 AM shveta malik wrote: > > > The most simplistic approach would be: > > > > 1) maintain standby_slot_names GUC on primary > > 2) maintain synchronize_slot_names GUC on physical standby alone. > > > > On primary, let all logical-walsenders wait on physical-standbys > > configured in standby_slot_names GUC. This will work and will avoid > > all the complexity involved in designs discussed above. But this > > simplistic approach comes with disadvantages like below: > > > > 1) Even if the associated slot of logical-walsender is not part of > > synchronize_slot_names of any of the physical-standbys, it is still > > waiting for all the configured standbys to finish. > > That's right. Currently (with walsender waiting an arbitrary amount of time) > that sounds like a -1. But if we're going with a new CV approach (like > proposed > in [1], that might not be so terrible). Though I don't feel comfortable with > waiting for no reasons (even if this is for a short amount of time possible). > Agreed. Not a good idea to block each logical walsender. > > 2) If associated slot of logical walsender is part of > > synchronize_slot_names of standby1, it is still waiting on standby2,3 > > etc to finish i.e. waiting on rest of the standbys configured in > > standby_slot_names which have not even marked that logical slot in > > their synchronize_slot_names. > > > > Same thoughts as above for 1) > > > So we need to weigh our options here. > > > > With the simplistic approach, if a standby goes down that would impact non > related > walsenders on the primary until the standby's associated physical slot is > removed > from standby_slot_names and I don't feel comfortable wit this behavior. > > So, I'm +1 for the ReplicationSlotPersistentData approach proposed by Amit. yes, +1 for ReplicationSlotPersistentData approach. 
Will start detailed analysis on that approach now. thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Oct 4, 2023 at 5:00 PM Amit Kapila wrote: > > On Wed, Oct 4, 2023 at 11:55 AM Drouvot, Bertrand > wrote: > > > > On 10/4/23 6:26 AM, shveta malik wrote: > > > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila > > > wrote: > > >> > > >> > > >> How about an alternate scheme where we define sync_slot_names on > > >> standby but then store the physical_slot_name in the corresponding > > >> logical slot (ReplicationSlotPersistentData) to be synced? So, the > > >> standby will send the list of 'sync_slot_names' and the primary will > > >> add the physical standby's slot_name in each of the corresponding > > >> sync_slot. Now, if we do this then even after restart, we should be > > >> able to know for which physical slot each logical slot needs to wait. > > >> We can even provide an SQL API to reset the value of > > >> standby_slot_names in logical slots as a way to unblock decoding in > > >> case of emergency (for example, corresponding when physical standby > > >> never comes up). > > >> > > > > > > > > > Looks like a better approach to me. It solves most of the pain points > > > like: > > > 1) Avoids the need of multiple GUCs > > > 2) Primary and standby need not to worry to be in sync if we maintain > > > sync-slot-names GUC on both > > As per my understanding of this approach, we don't want > 'sync-slot-names' to be set on the primary. Do you have a different > understanding? > Same understanding. We do not need it to be set on primary by user. It will be GUC on standby and standby will convey it to primary. > > > 3) User still gets the flexibility to remove a standby from wait-lost > > > of primary's logical-walsenders' using reset SQL API. > > > > > > > Fully agree. > > > > > Now some initial thoughts: > > > 1) Since each logical slot could be needed to be synched by multiple > > > physical-standbys, so in ReplicationSlotPersistentData, we need to > > > hold a list of standby's name. 
So this brings us to question as in how > > > much shall we allocate initially in shared-memory? Shall it be for > > > max_replication_slots (worst case scenario) in each > > > ReplicationSlotPersistentData to hold physical-standby names? > > > > > > > Yeah, and even if we do the opposite means add the 'to-sync' > > logical replication slot in the ReplicationSlotPersistentData of the > > physical > > slot(s) the questions still remain (as a physical standby could want to > > sync multiples slots) > > > > I think we don't need to allocate the entire max_replication_slots > array in ReplicationSlotPersistentData. We should design something > like the variable amount of data to be written on disk should be > represented similar to what we do with variable TransactionIds in > SnapBuildOnDisk. Now, we also need to store the list of standby's > in-memory either shared or local memory of walsender. I think storing > it in shared-memory say in ReplicationSlot has the advantage that we > can easily set that via physical walsender and it may be easier to > maintain both for manually created logical slots and logical slots > associated with logical walsenders. But still this needs some thoughts > as to what is the best way to store this information. > Thanks for the idea, I will review this. thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Oct 2, 2023 at 4:29 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shveta, > > Thank you for updating the patch! > > I found another ERROR due to the slot removal. Is this a real issue? > > 1. applied add_sleep.txt, which emulated the case the tablesync worker stucked >and the primary crashed during the >initial sync. > 2. executed test_0925_v2.sh (You attached in [1]) > 3. secondary could not start the logical replication because the slot was not >created (log files were also attached). > > > Here is my analysis. The cause is that the slotsync worker aborts the slot > creation > on secondary server because the restart_lsn of secondary ahead the primary's > one. > IIUC it can be occurred when tablesync workers finishes initial copy before > walsenders stream changes. In this case, the relstate of the worker is set to > SUBREL_STATE_CATCHUP and the apply worker waits till the relation becomes > SUBREL_STATE_SYNCDONE. From here the slot on primary will not be updated until > the relation is caught up. If some changes are come and the primary crashes at > that time, the syncslot worker will abort the slot creation. > Kuroda-san, we need to let slot creation on the standby finish before we can expect it to support logical replication on failover. In the current case, as you stated, the slot creation itself is aborted and thus it cannot support logical replication later. We are currently exploring whether the slot-sync workers can internally advance remote_lsn on the primary, to speed up slot creation on the standby in cases where it is stuck because the primary's restart_lsn lags behind the standby's restart_lsn. Until then, the way to proceed for testing such cases is to run a workload on the primary so that slot creation can complete. thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Oct 4, 2023 at 9:56 AM shveta malik wrote: > > On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila wrote: > > > > On Tue, Oct 3, 2023 at 9:27 PM shveta malik wrote: > > > > > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand > > > wrote: > > > > > > > > Hi, > > > > > > > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > > > > wrote: > > > > >> > > > > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > > > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > > > >>> wrote: > > > > >>>> > > > > >>> > > > > >>>> - probably open corner cases like: what if a standby is down? > > > > >>>> would that mean > > > > >>>> that synchronize_slot_names not being send to the primary would > > > > >>>> allow the decoding > > > > >>>> on the primary to go ahead? > > > > >>>> > > > > >>> > > > > >>> Good question. BTW, irrespective of whether we have > > > > >>> 'standby_slot_names' parameters or not, how should we behave if > > > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified > > > > >>> on > > > > >>> standby then in such a situation primary won't be even aware that > > > > >>> some > > > > >>> of the logical walsenders need to wait. > > > > >> > > > > >> Exactly, that's why I was thinking keeping standby_slot_names to > > > > >> address > > > > >> this scenario. In such a case one could simply decide to keep or > > > > >> remove > > > > >> the associated physical replication slot from standby_slot_names. > > > > >> Keep would > > > > >> mean "wait" and removing would mean allow to decode on the primary. > > > > >> > > > > >>> OTOH, one can say that users > > > > >>> should configure 'synchronize_slot_names' on both primary and > > > > >>> standby > > > > >>> but note that this value could be different for different standby's, > > > > >>> so we can't configure it on primary. > > > > >>> > > > > >> > > > > >> Yeah, I think that's a good use case for standby_slot_names, what do > > > > >> you think? 
> > > > >> > > > > > > > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > > > > primary doesn't know the value of 'synchronize_slot_names' once the > > > > > standby is down and or the primary is restarted. So, how will we know > > > > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > > > > > > > > > > Yeah right, I also think we'd need: > > > > > > > > - synchronize_slot_names on both primary and standby > > > > > > > > But now we would need to take care of different standby having > > > > different values ( > > > > as you said up-thread) > > > > > > > > Thinking out loud: What about a single GUC on the primary (not > > > > standby_slot_names nor > > > > synchronize_slot_names) but say logical_slots_wait_for_standby that > > > > could be a list of say > > > > "logical_slot_name:physical_slot". > > > > > > > > I think this GUC would help us define each walsender behavior (should > > > > the standby(s) > > > > be up or down): > > > > > > > > > > It may help in defining the walsender's behaviour better for sure. But > > > the problem I see once we start defining sync-slot-names on primary > > > (in any form whether as independent GUC or as above mapping GUC) is > > > that it needs to be then in sync with standbys, as each standby for > > > sure needs to maintain its own sync-slot-names GUC to make it aware of > > > what all it needs to sync. > > > > Yes, I also think so. Also, defining such a GUC where user wants to > > sync all the slots which would normally be the case would be a night > > mare for the users. > > > > > > > > This br
Re: Synchronizing slots from primary to standby
On Wed, Oct 4, 2023 at 5:36 AM Amit Kapila wrote: > > On Tue, Oct 3, 2023 at 9:27 PM shveta malik wrote: > > > > On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand > > wrote: > > > > > > Hi, > > > > > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > > > wrote: > > > >> > > > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > > > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > > > >>> wrote: > > > >>>> > > > >>> > > > >>>> - probably open corner cases like: what if a standby is down? would > > > >>>> that mean > > > >>>> that synchronize_slot_names not being send to the primary would > > > >>>> allow the decoding > > > >>>> on the primary to go ahead? > > > >>>> > > > >>> > > > >>> Good question. BTW, irrespective of whether we have > > > >>> 'standby_slot_names' parameters or not, how should we behave if > > > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > > > >>> standby then in such a situation primary won't be even aware that some > > > >>> of the logical walsenders need to wait. > > > >> > > > >> Exactly, that's why I was thinking keeping standby_slot_names to > > > >> address > > > >> this scenario. In such a case one could simply decide to keep or remove > > > >> the associated physical replication slot from standby_slot_names. Keep > > > >> would > > > >> mean "wait" and removing would mean allow to decode on the primary. > > > >> > > > >>> OTOH, one can say that users > > > >>> should configure 'synchronize_slot_names' on both primary and standby > > > >>> but note that this value could be different for different standby's, > > > >>> so we can't configure it on primary. > > > >>> > > > >> > > > >> Yeah, I think that's a good use case for standby_slot_names, what do > > > >> you think? 
> > > >> > > > > > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > > > primary doesn't know the value of 'synchronize_slot_names' once the > > > > standby is down and or the primary is restarted. So, how will we know > > > > which logical WAL senders needs to wait for 'standby_slot_names'? > > > > > > > > > > Yeah right, I also think we'd need: > > > > > > - synchronize_slot_names on both primary and standby > > > > > > But now we would need to take care of different standby having different > > > values ( > > > as you said up-thread) > > > > > > Thinking out loud: What about a single GUC on the primary (not > > > standby_slot_names nor > > > synchronize_slot_names) but say logical_slots_wait_for_standby that could > > > be a list of say > > > "logical_slot_name:physical_slot". > > > > > > I think this GUC would help us define each walsender behavior (should the > > > standby(s) > > > be up or down): > > > > > > > It may help in defining the walsender's behaviour better for sure. But > > the problem I see once we start defining sync-slot-names on primary > > (in any form whether as independent GUC or as above mapping GUC) is > > that it needs to be then in sync with standbys, as each standby for > > sure needs to maintain its own sync-slot-names GUC to make it aware of > > what all it needs to sync. > > Yes, I also think so. Also, defining such a GUC where user wants to > sync all the slots which would normally be the case would be a night > mare for the users. > > > > > This brings us to the original question of > > how do we actually keep these configurations in sync between primary > > and standby if we plan to maintain it on both? > > > > > > > - don't wait if its associated logical_slot is not listed in this GUC > > > - or wait based on its associated "list" of mapped physical slots (would > > > probably > > > have to deal with the min restart_lsn for all the corresponding mapped > > > ones). 
> > > > > > I don't think we can avoid having to define at least one GUC on the > >
Re: Synchronizing slots from primary to standby
On Tue, Oct 3, 2023 at 7:56 PM Drouvot, Bertrand wrote: > > Hi, > > On 10/3/23 12:54 PM, Amit Kapila wrote: > > On Mon, Oct 2, 2023 at 11:39 AM Drouvot, Bertrand > > wrote: > >> > >> On 9/29/23 1:33 PM, Amit Kapila wrote: > >>> On Thu, Sep 28, 2023 at 6:31 PM Drouvot, Bertrand > >>> wrote: > > >>> > - probably open corner cases like: what if a standby is down? would that > mean > that synchronize_slot_names not being send to the primary would allow > the decoding > on the primary to go ahead? > > >>> > >>> Good question. BTW, irrespective of whether we have > >>> 'standby_slot_names' parameters or not, how should we behave if > >>> standby is down? Say, if 'synchronize_slot_names' is only specified on > >>> standby then in such a situation primary won't be even aware that some > >>> of the logical walsenders need to wait. > >> > >> Exactly, that's why I was thinking keeping standby_slot_names to address > >> this scenario. In such a case one could simply decide to keep or remove > >> the associated physical replication slot from standby_slot_names. Keep > >> would > >> mean "wait" and removing would mean allow to decode on the primary. > >> > >>> OTOH, one can say that users > >>> should configure 'synchronize_slot_names' on both primary and standby > >>> but note that this value could be different for different standby's, > >>> so we can't configure it on primary. > >>> > >> > >> Yeah, I think that's a good use case for standby_slot_names, what do you > >> think? > >> > > > > But, even if we keep 'standby_slot_names' for this purpose, the > > primary doesn't know the value of 'synchronize_slot_names' once the > > standby is down and or the primary is restarted. So, how will we know > > which logical WAL senders needs to wait for 'standby_slot_names'? 
> > > > Yeah right, I also think we'd need: > > - synchronize_slot_names on both primary and standby > > But now we would need to take care of different standby having different > values ( > as you said up-thread) > > Thinking out loud: What about a single GUC on the primary (not > standby_slot_names nor > synchronize_slot_names) but say logical_slots_wait_for_standby that could be > a list of say > "logical_slot_name:physical_slot". > > I think this GUC would help us define each walsender behavior (should the > standby(s) > be up or down): > It may help in defining the walsender's behaviour better for sure. But the problem I see once we start defining sync-slot-names on primary (in any form whether as independent GUC or as above mapping GUC) is that it needs to be then in sync with standbys, as each standby for sure needs to maintain its own sync-slot-names GUC to make it aware of what all it needs to sync. This brings us to the original question of how do we actually keep these configurations in sync between primary and standby if we plan to maintain it on both? > - don't wait if its associated logical_slot is not listed in this GUC > - or wait based on its associated "list" of mapped physical slots (would > probably > have to deal with the min restart_lsn for all the corresponding mapped ones). > > I don't think we can avoid having to define at least one GUC on the primary > (at least to > handle the case of standby(s) being down). > > Thoughts? > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com
Re: Synchronizing slots from primary to standby
On Mon, Sep 25, 2023 at 7:46 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Ajin, Shveta, > > Thank you for rebasing the patch set! Here are new comments for v19_2-0001. > Thank You Kuroda-san for the feedback. Most of these are addressed in v20. Please find my response inline. > 01. WalSndWaitForStandbyNeeded() > > ``` > if (SlotIsPhysical(MyReplicationSlot)) > return false; > ``` > > Is there a possibility that physical walsenders call this function? > IIUC following is a stacktrace for the function, so the only logical > walsenders use it. > If so, it should be Assert() instead of an if statement. > > logical_read_xlog_page() > WalSndWaitForWal() > WalSndWaitForStandbyNeeded() It will only be called from logical-walsenders. Modified as you suggested. > > 02. WalSndWaitForStandbyNeeded() > > Can we set shouldwait in SlotSyncInitConfig()? synchronize_slot_names_list is > searched whenever the function is called, but it is not changed automatically. > If the slotname is compared with the list in the SlotSyncInitConfig(), the > liner search can be reduced. standby_slot_names and synchronize_slot_names will be removed in the next version as per discussion in [1] and thus SlotSyncInitConfig() will not be needed. It will be replaced by new functionality. So I am currently leaving it as is. > > 03. WalSndWaitForStandbyConfirmation() > > We should add ProcessRepliesIfAny() during the loop, otherwise the walsender > overlooks the death of an apply worker. > Done. > 04. WalSndWaitForStandbyConfirmation() > > Not sure, but do we have to return early if walsenders got > PROCSIG_WALSND_INIT_STOPPING > signal? I thought that if physical walsenders get stuck, logical walsenders > wait > forever. At that time we cannot stop the primary server even if "pg_ctl stop" > is executed. > yes, right. I have added CHECK_FOR_INTERRUPTS() and 'got_STOPPING' handling now which I think should suffice to process PROCSIG_WALSND_INIT_STOPPING. > 05. 
SlotSyncInitConfig() > > Why don't we free the memory for rawname, old standby_slot_names_list, and > synchronize_slot_names_list? > They seem to be overwritten. > Skipped for the time being due to reasons stated in pt 2. > 06. SlotSyncInitConfig() > > Both physical and logical walsenders call the func, but physical one do not > use > lists, right? If so, can we add a quick exit for physical walsenders? > Or, we should carefully remove where physical calls it. > > 07. StartReplication() > > I think we do not have to call SlotSyncInitConfig(). > Alternative approach is written in above. > I have removed it from StartReplication() > 08. the other > > Also, I found the unexpected behavior after both 0001 and 0002 were applied. > Was it normal or not? > > 1. constructed below setup > (ensured that logical slot existed on secondary) > 2. stopped the primary > 3. promoted the secondary server > 4. disabled a subscription once > 5. changed the connection string for subscriber > 6. Inserted data to new primary > 7. enabled the subscription again > 8. got an ERROR: replication slot "sub" does not exist > > I expected that the logical replication would be restarted, but it could not. > Was it real issue or my fault? The error would appear in secondary.log. > > ``` > Setup: > primary--->secondary >| >| > subscriber > ``` I have attached the new test-script (v2), can you please try that on the v20 set of patches. We should let the slot creation complete first on standby and then try promotion. I have added a few extra lines in v2 of your script for the same. In the test-case, primary's restart-lsn was lagging behind new-slot's restart_lsn on standby and thus standby was waiting for primary to catch-up. Meanwhile standby got promoted and thus slot creation got aborted. That is the reason you were not able to get the logical replication working on the new primary. v20 has improved handling and better logging for such a case. Please try the attached test-script on v20. 
[1]: https://www.postgresql.org/message-id/CAJpy0uA%2Bt3XP2M0qtEmrOG1gSwHghjHPno5AtwTXM-94-%2Bc6JQ%40mail.gmail.com

thanks
Shveta

#!/bin/bash

port_primary=5431
port_secondary=5432
port_subscriber=5433

echo '=='
echo '=Clean up='
echo '=='

pg_ctl stop -D data_primary
pg_ctl stop -D data_secondary
pg_ctl stop -D data_subscriber
rm -rf data_* *log

echo '==='
echo '=Set up primary server='
echo '==='

initdb -D data_primary -U postgres

cat << EOF >> data_primary/postgresql.conf
wal_level = logical
port = $port_primary
standby_slot_names = 'physical'
synchronize_slot_names = 'sub'
log_replication_commands = 'on'
EOF

cat << EOF >> data_primary/pg_hba.conf
host all,replication all 0.0.0.0/0 trust
EOF

pg_ctl -D data_primary start -w -l primary.log

psql -U postgres -p $port_primary -c "CREATE TABLE tbl (id int primary key);"
psql -U postgres -p $port_primary -c "INSERT INTO tbl VALUES (1)"
psql -U postgres -p $port_primary -c
Re: Synchronizing slots from primary to standby
On Fri, Sep 22, 2023 at 3:48 PM Amit Kapila wrote: > > On Thu, Sep 21, 2023 at 9:16 AM shveta malik wrote: > > > > On Tue, Sep 19, 2023 at 10:29 AM shveta malik > > wrote: > > > > Currently in patch001, synchronize_slot_names is a GUC on both primary > > and physical standby. This GUC tells which all logical slots need to > > be synced on physical standbys from the primary. Ideally it should be > > a GUC on physical standby alone and each physical standby should be > > able to communicate the value to the primary (considering the value > > may vary for different physical replicas of the same primary). The > > primary on the other hand should be able to take UNION of these values > > and let the logical walsenders (belonging to the slots in UNION > > synchronize_slots_names) wait for physical standbys for confirmation > > before sending those changes to logical subscribers. The intent is > > logical subscribers should never be ahead of physical standbys. > > > > Before getting into the details of 'synchronize_slot_names', I would > like to know whether we really need the second GUC > 'standby_slot_names'. Can't we simply allow all the logical wal > senders corresponding to 'synchronize_slot_names' to wait for just the > physical standby(s) (physical slot corresponding to such physical > standby) that have sent ' synchronize_slot_names'list? We should have > one physical standby slot corresponding to one physical standby. > Yes. With the new approach (to be implemented next), where each physical standby sends its synchronize_slot_names to the primary, the standby_slot_names GUC should no longer be needed on the primary. The physical standbys that send such requests automatically become the ones the primary waits on for confirmation. thanks Shveta
Re: Synchronizing slots from primary to standby
On Fri, Sep 22, 2023 at 6:01 PM Drouvot, Bertrand wrote: > > Hi, > > Thanks for all the work that has been done on this feature, and sorry > to have been quiet on it for so long. Thanks for looking into this. > > On 9/18/23 12:22 PM, shveta malik wrote: > > On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) > > wrote: > >> Right, but I wanted to know why it is needed. One motivation seemed to > >> know the > >> WAL location of physical standby, but I thought that struct WalSnd.apply > >> could > >> be also used. Is it bad to assume that the physical walsender always > >> exists? > >> > > > > We do not plan to target this case where physical slot is not created > > between primary and physical-standby in the first draft. In such a > > case, slot-synchronization will be skipped for the time being. We can > > extend this functionality (if needed) later. > > > > I do think it's needed to extend this functionality. Having a physical slot > created sounds like a (too?) strong requirement as: > > - It has not been a requirement for logical decoding on standby, so it could > sound weird > to require it for sync slots (while it's not allowed to logical decode from > sync slots) > > - One could want to limit the WAL space used on the primary > > It seems that the "skipping sync as primary_slot_name not set." warning > message is emitted > every 10ms, which seems too verbose to me. > You are right, the warning message is way too frequent. I will optimize it in the next version. thanks Shveta
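One common way to tame such a repetitive warning (a sketch of the general technique only; the actual fix shipped in later patch versions may differ) is to remember when the message was last emitted and suppress it until a minimum interval has elapsed:

```python
# Sketch of a throttled warning. The names and the 60-second interval
# are hypothetical choices for illustration, not the patch's values.
import time

WARN_INTERVAL_SEC = 60.0
_last_warned = float("-inf")  # so the very first call always fires

def warn_once_per_interval(msg, now=None):
    """Emit msg at most once per WARN_INTERVAL_SEC.
    Returns True if the warning was emitted, False if suppressed."""
    global _last_warned
    now = time.monotonic() if now is None else now
    if now - _last_warned < WARN_INTERVAL_SEC:
        return False
    _last_warned = now
    print("WARNING:", msg)
    return True

# First call fires; a retry 10s later is suppressed; 61s later it fires again.
fired_first = warn_once_per_interval("skipping sync as primary_slot_name not set", now=0.0)
fired_retry = warn_once_per_interval("skipping sync as primary_slot_name not set", now=10.0)
fired_later = warn_once_per_interval("skipping sync as primary_slot_name not set", now=61.0)
```

The same idea applies whether the check runs every 10ms in a worker nap loop or per WAL record: the log volume becomes bounded by the interval rather than by the loop frequency.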
Re: Synchronizing slots from primary to standby
On Tue, Sep 19, 2023 at 10:29 AM shveta malik wrote: > > On Fri, Sep 15, 2023 at 2:22 PM Peter Smith wrote: > > > > Hi. Here are some review comments for v17-0002. > > > > Thanks Peter for the feedback. I have addressed most of these in v18 > except 2. Please find my comments for the ones not addressed. > > > This is a WIP and a long way from complete, but I wanted to send what > > I have so far (while it is still current with your latest posted > > patches). > > > > == > > > > > 34. ListSlotDatabaseOIDs - sorting/logic > > > > Maybe explain better the reason for having the qsort and other logic. > > > > TBH, I was not sure of the necessity for the names lists and the > > sorting and bsearch logic. AFAICT these are all *only* used to check > > for uniqueness and existence of the slot name. So I was wondering if a > > hashmap keyed by the slot name might be more appropriate and also > > simpler than all this list sorting/searching. > > > > Pending. I will revisit this soon and will let you know more on this. > IMO, it was done to optimize the search as slot_names list could be > pretty huge if max_replication_slots is set to high value. > > > ~~ > > > > 35. ListSlotDatabaseOIDs > > > > + for (int slotno = 0; slotno < max_replication_slots; slotno++) > > + { > > > > loop variable declaration > > > > > > == > > src/backend/storage/lmgr/lwlock.c > > OK > > > > == > > src/backend/storage/lmgr/lwlocknames.txt > > OK > > > > == > > .../utils/activity/wait_event_names.txt > > TODO > > > > == > > src/backend/utils/misc/guc_tables.c > > OK > > > > == > > src/backend/utils/misc/postgresql.conf.sample > > > > 36. > > # primary to streaming replication standby server > > +#max_slotsync_workers = 2 # max number of slot synchronization > > workers on a standby > > > > IMO it is better to say "maximum" instead of "max" in the comment. > > > > (make sure the GUC description text is identical) > > > > == > > src/include/catalog/pg_proc.dat > > > > 37. 
> > +{ oid => '6312', descr => 'get invalidate cause of a replication slot', > > + proname => 'pg_get_invalidation_cause', provolatile => 's', > > proisstrict => 't', > > + prorettype => 'int2', proargtypes => 'name', > > + prosrc => 'pg_get_invalidation_cause' }, > > > > 37a. > > SUGGESTION (descr) > > what caused the replication slot to become invalid > > > > ~ > > > > 37b > > 'pg_get_invalidation_cause' seemed like a poor function name because > > it doesn't have any context -- not even the word "slot" in it. > > > > == > > src/include/commands/subscriptioncmds.h > > OK > > > > == > > src/include/nodes/replnodes.h > > OK > > > > == > > src/include/postmaster/bgworker_internals.h > > > > 38. > > #define MAX_PARALLEL_WORKER_LIMIT 1024 > > +#define MAX_SLOT_SYNC_WORKER_LIMIT 50 > > > > Consider SLOTSYNC instead of SLOT_SYNC for consistency with other > > names of this worker. > > > > == > > OK > > > > == > > src/include/replication/logicalworker.h > > > > 39. > > extern void ApplyWorkerMain(Datum main_arg); > > extern void ParallelApplyWorkerMain(Datum main_arg); > > extern void TablesyncWorkerMain(Datum main_arg); > > +extern void ReplSlotSyncMain(Datum main_arg); > > > > The name is not consistent with others nearby. At least it should > > include the word "Worker" like everything else does. > > > > == > > src/include/replication/slot.h > > OK > > > > == > > src/include/replication/walreceiver.h > > > > 40. > > +/* > > + * Slot's DBid related data > > + */ > > +typedef struct WalRcvRepSlotDbData > > +{ > > + Oid database; /* Slot's DBid received from remote */ > > + TimestampTz last_sync_time; /* The last time we tried to launch sync > > + * worker for above Dbid */ > > +} WalRcvRepSlotDbData; > > + > > > > > > Is that comment about field 'last_sync_time' correct? I thought this > > field is the last time the slot was synced -- not the last time the > > worker was launched. > > Sorry for confusion. Comment is correct, the name is misleading. 
I > have changed the name in v18. > > > > >
Re: Synchronizing slots from primary to standby
On Fri, Sep 15, 2023 at 2:22 PM Peter Smith wrote: > > Hi. Here are some review comments for v17-0002. > Thanks Peter for the feedback. I have addressed most of these in v18 except 2. Please find my comments for the ones not addressed. > This is a WIP and a long way from complete, but I wanted to send what > I have so far (while it is still current with your latest posted > patches). > > == > > 34. ListSlotDatabaseOIDs - sorting/logic > > Maybe explain better the reason for having the qsort and other logic. > > TBH, I was not sure of the necessity for the names lists and the > sorting and bsearch logic. AFAICT these are all *only* used to check > for uniqueness and existence of the slot name. So I was wondering if a > hashmap keyed by the slot name might be more appropriate and also > simpler than all this list sorting/searching. > Pending. I will revisit this soon and will let you know more on this. IMO, it was done to optimize the search as slot_names list could be pretty huge if max_replication_slots is set to high value. > ~~ > > 35. ListSlotDatabaseOIDs > > + for (int slotno = 0; slotno < max_replication_slots; slotno++) > + { > > loop variable declaration > > > == > src/backend/storage/lmgr/lwlock.c > OK > > == > src/backend/storage/lmgr/lwlocknames.txt > OK > > == > .../utils/activity/wait_event_names.txt > TODO > > == > src/backend/utils/misc/guc_tables.c > OK > > == > src/backend/utils/misc/postgresql.conf.sample > > 36. > # primary to streaming replication standby server > +#max_slotsync_workers = 2 # max number of slot synchronization > workers on a standby > > IMO it is better to say "maximum" instead of "max" in the comment. > > (make sure the GUC description text is identical) > > == > src/include/catalog/pg_proc.dat > > 37. 
> +{ oid => '6312', descr => 'get invalidate cause of a replication slot', > + proname => 'pg_get_invalidation_cause', provolatile => 's', > proisstrict => 't', > + prorettype => 'int2', proargtypes => 'name', > + prosrc => 'pg_get_invalidation_cause' }, > > 37a. > SUGGESTION (descr) > what caused the replication slot to become invalid > > ~ > > 37b > 'pg_get_invalidation_cause' seemed like a poor function name because > it doesn't have any context -- not even the word "slot" in it. > > == > src/include/commands/subscriptioncmds.h > OK > > == > src/include/nodes/replnodes.h > OK > > == > src/include/postmaster/bgworker_internals.h > > 38. > #define MAX_PARALLEL_WORKER_LIMIT 1024 > +#define MAX_SLOT_SYNC_WORKER_LIMIT 50 > > Consider SLOTSYNC instead of SLOT_SYNC for consistency with other > names of this worker. > > == > OK > > == > src/include/replication/logicalworker.h > > 39. > extern void ApplyWorkerMain(Datum main_arg); > extern void ParallelApplyWorkerMain(Datum main_arg); > extern void TablesyncWorkerMain(Datum main_arg); > +extern void ReplSlotSyncMain(Datum main_arg); > > The name is not consistent with others nearby. At least it should > include the word "Worker" like everything else does. > > == > src/include/replication/slot.h > OK > > == > src/include/replication/walreceiver.h > > 40. > +/* > + * Slot's DBid related data > + */ > +typedef struct WalRcvRepSlotDbData > +{ > + Oid database; /* Slot's DBid received from remote */ > + TimestampTz last_sync_time; /* The last time we tried to launch sync > + * worker for above Dbid */ > +} WalRcvRepSlotDbData; > + > > > Is that comment about field 'last_sync_time' correct? I thought this > field is the last time the slot was synced -- not the last time the > worker was launched. Sorry for confusion. Comment is correct, the name is misleading. I have changed the name in v18. > > == > src/include/replication/worker_internal.h > > 41. > - /* User to use for connection (will be same as owner of subscription). 
*/ > + /* User to use for connection (will be same as owner of subscription > + * in case of LogicalRep worker). */ > Oid userid; > +} WorkerHeader; > > 41a. > > This is not the normal style for a multi-line comment. > > ~ > > 41b. > I wondered if the name "WorkerHeader" is just a bit *too* generic and > might cause future trouble because of the vague name. > I agree. Can you please suggest a better name for it? thanks Shveta
Re: Synchronizing slots from primary to standby
On Wed, Sep 13, 2023 at 4:48 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shveta, > > Sorry for the late response. > > > Thanks Kuroda-san for the feedback. > > > > > > 01. General > > > > > > I think the documentation can be added, not only GUCs. How about adding > > examples > > > for combinations of physical and logical replications? You can say that > > > both of > > > physical primary can be publisher and slots on primary/standby are > > synchronized. > > > > > > > I did not fully understand this. Can you please state a clear example. > > We are only synchronizing logical replication slots in this draft and > > that too on physical standby from primary. So the last statement is > > not completely true. > > I expected to add a new subsection in "Log-Shipping Standby Servers". I think > we > can add info like the following: > > * logical replication publisher can also be replicated > * For that, a physical replication slot must be defined on the primary > * Then we can set up standby_slot_names (on primary) and synchronize_slot_names > (on both servers). > * slots are synchronized automatically Sure. I am trying to find the right place in this section to add this info. I will try to address this in coming versions. > > > > 02. General > > > > > > standby_slot_names ensures that physical standby is always ahead of the > > > subscriber, > > but I > > > think it may not be sufficient. There is a possibility that the primary > > > server does > > > not have any physical slots. So it expects a slot to be present. > > > In this case the physical standby may be behind the > > > subscriber and the system may be confused when the failover occurs. > > > > Currently there is a check in the slot-sync worker which mandates that > > there is a physical slot present between primary and standby for this > > feature to proceed. So that confusion state will not arise.
> > + /* WalRcvData is not set or primary_slot_name is not set yet */ > > + if (!WalRcv || WalRcv->slotname[0] == '\0') > > + return naptime; > > Right, but I wanted to know why it is needed. One motivation seemed to know > the > WAL location of physical standby, but I thought that struct WalSnd.apply could > be also used. Is it bad to assume that the physical walsender always exists? > We do not plan to target this case where physical slot is not created between primary and physical-standby in the first draft. In such a case, slot-synchronization will be skipped for the time being. We can extend this functionality (if needed) later. > > >Can't we specify the name of standby via application_name or something? > > > > So do you mean that in absence of a physical slot (if we plan to > > support that), we let primary know about standby(slots-synchronization > > client) through application_name? > > Yes, it is what I considered. > > > > > > > 03. General > > > > > > In this architecture, the syncslot worker is launched per db and they > > > independently connects to primary, right? > > > > Not completely true. Each slotsync worker is responsible for managing > > N dbs. Here 'N' = 'Number of distinct dbs for slots in > > synchronize_slot_names'/ 'number of max_slotsync_workers configured' > > for cases where dbcount exceeds workers configured. > > And if dbcount < max_slotsync_workers, then we launch only that many > > workers equal to dbcount and each worker manages a single db. Each > > worker independently connects to primary. Currently it makes a > > connection multiple times, I am optimizing it to make connection only > > once and then after each SIGHUP assuming 'primary_conninfo' may > > change. This change will be in the next version. 
> > > > > > >I'm not sure it is efficient, but I > > > come up with another architecture - only a worker (syncslot > > > receiver)connects > > > to the primary and other workers (syncslot worker) receives infos from it > > > and > > > updates. This can reduce the number of connections so that it may slightly > > > improve the latency of network. How do you think? > > > > > > > I feel it may help in reducing network latency, but not sure if it > > could be more efficient in keeping the lsns in sync. I feel it may > > introduce lag due to the fact that only one worker is getting all the > > info from primary and the actual synchronizing workers are waiting on > > that worker. This lag may be more when the number of slots are huge. > > We have run some performance tests on the design implemented > > currently, please have a look at emails around [1] and [2]. > > Thank you for teaching! Yeah, I agreed that another point might be a > bottleneck. > It could be recalled in future, but currently we do not have to consider... > > > > 04. ReplSlotSyncMain > > > > > > Does the worker have to connect to the specific database? > > > > > > > > > ``` > > > /* Connect to our database. */ > > > > > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > > > > MySlotSyncWorker->userid, > > > > > 0); > > > ``` > > > > Since we
Re: Synchronizing slots from primary to standby
On Wed, Sep 6, 2023 at 2:18 PM shveta malik wrote: > > On Fri, Sep 1, 2023 at 1:59 PM Peter Smith wrote: > > > > Hi Shveta. Here are some comments for patch v14-0002 > > > > The patch is large, so my code review is a WIP... more later next week... > > > > Thanks Peter for the feedback. I have tried to address most of these > in v15. Please find my response inline for the ones which I have not > addressed. > > > > > 26. > > +typedef struct SlotSyncWorker > > +{ > > + /* Time at which this worker was launched. */ > > + TimestampTz launch_time; > > + > > + /* Indicates if this slot is used or free. */ > > + bool in_use; > > + > > + /* The slot in worker pool to which it is attached */ > > + int slot; > > + > > + /* Increased every time the slot is taken by new worker. */ > > + uint16 generation; > > + > > + /* Pointer to proc array. NULL if not running. */ > > + PGPROC*proc; > > + > > + /* User to use for connection (will be same as owner of subscription). */ > > + Oid userid; > > + > > + /* Database id to connect to. */ > > + Oid dbid; > > + > > + /* Count of Database ids it manages */ > > + uint32 dbcount; > > + > > + /* DSA for dbids */ > > + dsa_area *dbids_dsa; > > + > > + /* dsa_pointer for database ids it manages */ > > + dsa_pointer dbids_dp; > > + > > + /* Mutex to access dbids in dsa */ > > + slock_t mutex; > > + > > + /* Info about slot being monitored for worker's naptime purpose */ > > + SlotSyncWorkerWatchSlot monitor; > > +} SlotSyncWorker; > > > > There seems an awful lot about this struct which is common with > > 'LogicalRepWorker' struct. > > > > It seems a shame not to make use of the commonality instead of all the > > cut/paste here. > > > > E.g. 
Can it be rearranged so all these common fields are shared: > > - launch_time > > - in_use > > - slot > > - generation > > - proc > > - userid > > - dbid > > > > Sure, I had this in mind along with previous comments where it was > suggested to merge similar functions like > WaitForReplicationWorkerAttach, WaitForSlotSyncWorkerAttach etc. That > merging could only be possible if we try to merge the common part of > these structures. This is WIP, will be addressed in the next version. > This has been addressed in version-17 now. thanks Shveta
Re: Synchronizing slots from primary to standby
On Fri, Sep 8, 2023 at 4:40 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Shveta, > > I resumed checking the thread. Here are my high-level comments. > Sorry if these have already been discussed. Thanks Kuroda-san for the feedback. > > 01. General > > I think the documentation can be added, not only GUCs. How about adding > examples > for combinations of physical and logical replications? You can say that both > of > physical primary can be publisher and slots on primary/standby are > synchronized. > I did not fully understand this. Can you please state a clear example. We are only synchronizing logical replication slots in this draft and that too on physical standby from primary. So the last statement is not completely true. > 02. General > > standby_slot_names ensures that physical standby is always ahead of the subscriber, > but I > think it may not be sufficient. There is a possibility that the primary server > does > not have any physical slots. So it expects a slot to be present. > In this case the physical standby may be behind the > subscriber and the system may be confused when the failover occurs. Currently there is a check in the slot-sync worker which mandates that there is a physical slot present between primary and standby for this feature to proceed. So that confusion state will not arise. + /* WalRcvData is not set or primary_slot_name is not set yet */ + if (!WalRcv || WalRcv->slotname[0] == '\0') + return naptime; > Can't we specify the name of standby via application_name or something? So do you mean that in the absence of a physical slot (if we plan to support that), we let the primary know about the standby (slots-synchronization client) through application_name? I am not sure about this. Will think more on this. I would like to know others' opinion on this as well. > > 03. General > > In this architecture, the syncslot worker is launched per db and they > independently connect to primary, right? Not completely true. Each slotsync worker is responsible for managing N dbs.
Here 'N' = 'Number of distinct dbs for slots in synchronize_slot_names'/ 'number of max_slotsync_workers configured' for cases where dbcount exceeds workers configured. And if dbcount < max_slotsync_workers, then we launch only that many workers equal to dbcount and each worker manages a single db. Each worker independently connects to primary. Currently it makes a connection multiple times, I am optimizing it to make connection only once and then after each SIGHUP assuming 'primary_conninfo' may change. This change will be in the next version. >I'm not sure it is efficient, but I > come up with another architecture - only a worker (syncslot receiver)connects > to the primary and other workers (syncslot worker) receives infos from it and > updates. This can reduce the number of connections so that it may slightly > improve the latency of network. How do you think? > I feel it may help in reducing network latency, but not sure if it could be more efficient in keeping the lsns in sync. I feel it may introduce lag due to the fact that only one worker is getting all the info from primary and the actual synchronizing workers are waiting on that worker. This lag may be more when the number of slots are huge. We have run some performance tests on the design implemented currently, please have a look at emails around [1] and [2]. > 04. General > > test file recovery/t/051_slot_sync.pl is missing. > yes, it was removed. Please see point3 at [3] > 04. ReplSlotSyncMain > > Does the worker have to connect to the specific database? > > > ``` > /* Connect to our database. */ > BackgroundWorkerInitializeConnectionByOid(MySlotSyncWorker->dbid, > > MySlotSyncWorker->userid, > > 0); > ``` Since we are using libpq public interface 'walrcv_exec=libpqrcv_exec' to connect to primary, this needs database connection. It errors out in the absence of 'MyDatabaseId'. Do you think db-connection can have some downsides? > > 05. SlotSyncInitSlotNamesLst() > > "Lst" should be "List". 
> Okay, I will change this in the next version. == [1]: https://www.postgresql.org/message-id/CAJpy0uD2F43avuXy_yQv7Wa3kpUwioY_Xn955xdmd6vX0ME6%3Dg%40mail.gmail.com [2]: https://www.postgresql.org/message-id/CAJpy0uD%3DDevMxTwFVsk_%3DxHqYNH8heptwgW6AimQ9fbRmx4ioQ%40mail.gmail.com [3]: https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com thanks Shveta
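The db-to-worker distribution described in the answer to comment 03 above can be sketched as follows (hypothetical helper shown in Python for brevity; the patch implements this in C, and the exact assignment policy there may differ):

```python
# Illustrative sketch of distributing the distinct databases of slots in
# synchronize_slot_names across slot-sync workers: with dbcount <=
# max_slotsync_workers each worker manages one database; otherwise the
# databases are spread so each worker manages roughly
# dbcount / max_slotsync_workers of them.
def assign_dbs_to_workers(dbids, max_slotsync_workers):
    nworkers = min(len(dbids), max_slotsync_workers)
    assignment = [[] for _ in range(nworkers)]
    for i, dbid in enumerate(dbids):
        assignment[i % nworkers].append(dbid)  # round-robin spread
    return assignment

# 5 databases, 2 workers: worker 0 manages 3 dbs, worker 1 manages 2.
five_two = assign_dbs_to_workers([16384, 16385, 16386, 16387, 16388], 2)

# 2 databases, 4 workers: only 2 workers launched, one db each.
two_four = assign_dbs_to_workers([16384, 16385], 4)
```

Each launched worker then opens its own connection to the primary and syncs the slots belonging to its assigned databases.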
Re: Synchronizing slots from primary to standby
On Fri, Aug 25, 2023 at 2:15 PM Alvaro Herrera wrote: > > Wait a minute ... > > From bac0fbef8b203c530e5117b0b7cfee13cfab78b9 Mon Sep 17 00:00:00 2001 > From: Bharath Rupireddy > Date: Sat, 22 Jul 2023 10:17:48 + > Subject: [PATCH v13 1/2] Allow logical walsenders to wait for physical > standbys > > @@ -2498,6 +2500,13 @@ ReorderBufferProcessTXN(ReorderBuffer *rb, > ReorderBufferTXN *txn, > } > else > { > + /* > +* Before we send out the last set of changes to > logical decoding > +* output plugin, wait for specified streaming > replication standby > +* servers (if any) to confirm receipt of WAL upto > commit_lsn. > +*/ > + WaitForStandbyLSN(commit_lsn); > > OK, so we call this new function frequently enough -- once per > transaction, if I read this correctly? So ... > > +void > +WaitForStandbyLSN(XLogRecPtr wait_for_lsn) > +{ > ... > > + /* "*" means all logical walsenders should wait for physical > standbys. */ > + if (strcmp(synchronize_slot_names, "*") != 0) > + { > + boolshouldwait = false; > + > + rawname = pstrdup(synchronize_slot_names); > + SplitIdentifierString(rawname, ',', ); > + > + foreach (l, elemlist) > + { > + char *name = lfirst(l); > + if (strcmp(name, > NameStr(MyReplicationSlot->data.name)) == 0) > + { > + shouldwait = true; > + break; > + } > + } > + > + pfree(rawname); > + rawname = NULL; > + list_free(elemlist); > + elemlist = NIL; > + > + if (!shouldwait) > + return; > + } > + > + rawname = pstrdup(standby_slot_names); > + SplitIdentifierString(rawname, ',', ); > > ... do we really want to be doing the GUC string parsing every time > through it? This sounds like it could be a bottleneck, or at least slow > things down. Maybe we should think about caching this somehow. > Yes, these parsed lists are now cached. Please see v15 (https://www.postgresql.org/message-id/CAJpy0uAuzbzvcjpnzFTiWuDBctnH-SDZC6AZabPX65x9GWBrjQ%40mail.gmail.com) thanks Shveta
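The caching Alvaro suggests boils down to: parse the comma-separated GUC value once, keep the parsed list, and re-parse only when the value actually changes (e.g. after a SIGHUP reload) rather than once per transaction. A minimal sketch of that pattern (Python with hypothetical names; the real fix in v15 is in C):

```python
# Illustrative cache for a list-valued GUC such as standby_slot_names.
# The raw string is split only when it differs from the last value seen,
# so the per-transaction hot path pays a string comparison, not a parse.
class GucListCache:
    def __init__(self):
        self._raw = None
        self._parsed = ()

    def get(self, raw_value):
        if raw_value != self._raw:  # re-parse only when the GUC changed
            self._raw = raw_value
            self._parsed = tuple(
                name.strip() for name in raw_value.split(",") if name.strip()
            )
        return self._parsed

cache = GucListCache()
first = cache.get("pslot1, pslot2")
again = cache.get("pslot1, pslot2")  # cache hit: same object, no re-parse
```

In the server the invalidation would more naturally hang off the GUC assign hook instead of comparing strings on every call, but the effect is the same: WaitForStandbyLSN no longer re-splits the GUC on every commit.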
Re: Synchronizing slots from primary to standby
On Fri, Sep 1, 2023 at 1:59 PM Peter Smith wrote: > > Hi Shveta. Here are some comments for patch v14-0002 > > The patch is large, so my code review is a WIP... more later next week... > Thanks Peter for the feedback. I have tried to address most of these in v15. Please find my response inline for the ones which I have not addressed. > == > GENERAL > > 1. Patch size > > The patch is 2700 lines. Is it possible to break this up into smaller > self-contained parts to make the reviews more manageable? > Currently, patches are created based on work done on primary and standby. Patch 001 for primary-side implementation and 002 for standby side. Let me think more on this and see if the changes can be segregated further. > > 26. > +typedef struct SlotSyncWorker > +{ > + /* Time at which this worker was launched. */ > + TimestampTz launch_time; > + > + /* Indicates if this slot is used or free. */ > + bool in_use; > + > + /* The slot in worker pool to which it is attached */ > + int slot; > + > + /* Increased every time the slot is taken by new worker. */ > + uint16 generation; > + > + /* Pointer to proc array. NULL if not running. */ > + PGPROC*proc; > + > + /* User to use for connection (will be same as owner of subscription). */ > + Oid userid; > + > + /* Database id to connect to. */ > + Oid dbid; > + > + /* Count of Database ids it manages */ > + uint32 dbcount; > + > + /* DSA for dbids */ > + dsa_area *dbids_dsa; > + > + /* dsa_pointer for database ids it manages */ > + dsa_pointer dbids_dp; > + > + /* Mutex to access dbids in dsa */ > + slock_t mutex; > + > + /* Info about slot being monitored for worker's naptime purpose */ > + SlotSyncWorkerWatchSlot monitor; > +} SlotSyncWorker; > > There seems an awful lot about this struct which is common with > 'LogicalRepWorker' struct. > > It seems a shame not to make use of the commonality instead of all the > cut/paste here. > > E.g. 
Can it be rearranged so all these common fields are shared: > - launch_time > - in_use > - slot > - generation > - proc > - userid > - dbid > > == > src/include/storage/lwlock.h Sure, I had this in mind along with previous comments where it was suggested to merge similar functions like WaitForReplicationWorkerAttach, WaitForSlotSyncWorkerAttach etc. That merging could only be possible if we try to merge the common part of these structures. This is WIP, will be addressed in the next version. > > 27. > LWTRANCHE_LAUNCHER_HASH, > - LWTRANCHE_FIRST_USER_DEFINED > + LWTRANCHE_FIRST_USER_DEFINED, > + LWTRANCHE_SLOT_SYNC_DSA > } BuiltinTrancheIds; > > Isn't 'LWTRANCHE_FIRST_USER_DEFINED' supposed to be the last enum? > > == > src/test/recovery/meson.build > > == > src/test/recovery/t/051_slot_sync.pl > I have currently removed this file from the patch. Please see my comments (pt 3) here: https://mail.google.com/mail/u/0/?ik=52d5778aba=om=msg-a:r-2984462571505788980 thanks Shveta > 28. > + > +# Copyright (c) 2021, PostgreSQL Global Development Group > + > +use strict; > > Wrong copyright date > > ~~~ > > 29. > +my $node_primary = PostgreSQL::Test::Cluster->new('primary'); > +my $node_phys_standby = PostgreSQL::Test::Cluster->new('phys_standby'); > +my $node_subscriber = PostgreSQL::Test::Cluster->new('subscriber'); > > 29a. > Can't all the subroutines be up-front? Then this can move to be with > the other node initialisation code that comes next. > > ~ > > 29b. > Add a comment something like # Setup nodes > > ~~~ > > 30. > +# Check conflicting status in pg_replication_slots. 
> +sub check_slots_conflicting_status > +{ > + my $res = $node_phys_standby->safe_psql( > + 'postgres', qq( > + select bool_and(conflicting) from pg_replication_slots;)); > + > + is($res, 't', > + "Logical slot is reported as conflicting"); > +} > > Doesn't bool_and() mean returns false if only some but not all slots > are conflicting - is that intentional?> Or is this sub-routine only > expecting to test one slot, in which case maybe the SQL should include > also the 'slot_name'? > > ~~~ > > 31. > +$node_primary->start; > +$node_primary->psql('postgres', q{SELECT > pg_create_physical_replication_slot('pslot1');}); > + > +$node_primary->backup('backup'); > + > +$node_phys_standby->init_from_backup($node_primary, 'backup', > has_streaming => 1); > +$node_phys_standby->append_conf('postgresql.conf', q{ > +synchronize_slot_names = '*' > +primary_slot_name = 'pslot1' > +hot_standby_feedback = off > +}); > +$node_phys_standby->start; > + > +$node_primary->safe_psql('postgres', "CREATE TABLE t1 (a int PRIMARY KEY)"); > +$node_primary->safe_psql('postgres', "INSERT INTO t1 VALUES (1), (2), (3)"); > > The comments seem mostly to describe details about what are the > expectations at each test step. > > IMO there also needs to be a larger "overview" comment to describe > more generally *what* this is testing, and *how* it is testing it. > e.g. it is hard to understand the test without being already familiar > with the patch. > > -- >
Re: Synchronizing slots from primary to standby
On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar wrote: > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik wrote: > > > I have reviewed the v12-0002 patch and I have some comments. I see the > latest version posted sometime back and if any of this comment is > already fixed in this version then feel free to ignore that. > > In general, code still needs a lot more comments to make it readable > and in some places, code formatting is also not as per PG standard so > that needs to be improved. > There are some other specific comments as listed below > Please see the latest patch-set (v14). Did some code-formatting, used pg_indent as well. Added more comments. Let me know specifically if some more comments or formatting is needed. > 1. > @@ -925,7 +936,7 @@ ApplyLauncherRegister(void) > memset(, 0, sizeof(bgw)); > bgw.bgw_flags = BGWORKER_SHMEM_ACCESS | > BGWORKER_BACKEND_DATABASE_CONNECTION; > - bgw.bgw_start_time = BgWorkerStart_RecoveryFinished; > + bgw.bgw_start_time = BgWorkerStart_ConsistentState; > > What is the reason for this change, can you add some comments? Sure, done. > > 2. > ApplyLauncherShmemInit(void) > { > bool found; > + bool foundSlotSync; > > Is there any specific reason to launch the sync worker from the > logical launcher instead of making this independent? > I mean in the future if we plan to sync physical slots as well then it > wouldn't be an expandable design. > > 3. > + /* > + * Remember the old dbids before we stop and cleanup this worker > + * as these will be needed in order to relaunch the worker. > + */ > + copied_dbcnt = worker->dbcount; > + copied_dbids = (Oid *)palloc0(worker->dbcount * sizeof(Oid)); > + > + for (i = 0; i < worker->dbcount; i++) > + copied_dbids[i] = worker->dbids[i]; > > probably we can just memcpy the memory containing the dbids. Done. > > 4. > + /* > + * If worker is being reused, and there is vacancy in dbids array, > + * just update dbids array and dbcount and we are done. 
> + * But if dbids array is exhausted, stop the worker, reallocate > + * dbids in dsm, relaunch the worker with same set of dbs as earlier > + * plus the new db. > + */ > > Why do we need to relaunch the worker, can't we just use dsa_pointer > to expand the shared memory whenever required? > Done. > 5. > > +static bool > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > +uint16 generation, > +BackgroundWorkerHandle *handle) > > this function is an exact duplicate of WaitForReplicationWorkerAttach > except for the LWlock, why don't we use the same function by passing > the LWLock as a parameter > > 6. > +/* > + * Attach Slot-sync worker to worker-slot assigned by launcher. > + */ > +void > +slotsync_worker_attach(int slot) > > this is also very similar to the logicalrep_worker_attach function. > > Please check other similar functions and reuse them wherever possible > Will revisit these as stated in [1]. [1]: https://www.postgresql.org/message-id/CAJpy0uDeWjJj%2BU8nn%2BHbnGWkfY%2Bn-Bbw_kuHqgphETJ1Lucy%2BQ%40mail.gmail.com > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
Re: Synchronizing slots from primary to standby
On Fri, Aug 25, 2023 at 11:09 AM shveta malik wrote: > > On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar wrote: > > > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik wrote: > > > > > I have reviewed the v12-0002 patch and I have some comments. I see the > > latest version posted sometime back and if any of this comment is > > already fixed in this version then feel free to ignore that. > > > > Thanks for the feedback. Please find my comments on a few. I will work on > rest. > > > > 2. > > ApplyLauncherShmemInit(void) > > { > > bool found; > > + bool foundSlotSync; > > > > Is there any specific reason to launch the sync worker from the > > logical launcher instead of making this independent? > > I mean in the future if we plan to sync physical slots as well then it > > wouldn't be an expandable design. > > When we started working on this, it was reusing logical-apply worker > infra, so I separated it from logical-apply worker but let it be > managed by a replication launcher considering that only logical slots > needed to be synced. I think this needs more thought and I would like > to know from others as well before concluding anything here. > > > > 5. > > > > +static bool > > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > > +uint16 generation, > > +BackgroundWorkerHandle *handle) > > > > this function is an exact duplicate of WaitForReplicationWorkerAttach > > except for the LWlock, why don't we use the same function by passing > > the LWLock as a parameter > > > > Here workers (first argument) are different. We can always pass > LWLock, but since workers are different, in order to merge the common > functionality, we need to have some common worker structure between > the two workers (apply and sync-slot) and pass that to functions which > need to be merged (similar to NodeTag used in Insert/CreateStmt etc). > But changing LogicalRepWorker() would mean changing > applyworker/table-sync worker/parallel-apply-worker files. 
Since there > are only two such functions which you pointed out (attach and > wait_for_attach), I preferred to keep the functions as is until we > conclude on where slot-sync worker functionality actually fits in. I > can revisit these comments then. Or if you see any better way to do > it, kindly let me know. > > 6. > > +/* > > + * Attach Slot-sync worker to worker-slot assigned by launcher. > > + */ > > +void > > +slotsync_worker_attach(int slot) > > > > this is also very similar to the logicalrep_worker_attach function. > > > > Please check other similar functions and reuse them wherever possible > > > > Also, why is this function not registering the cleanup function on > > shmem_exit? > > > > It is doing it in ReplSlotSyncMain() since we have dsm-seg there. > Please see this: > > /* Primary initialization is complete. Now, attach to our slot. */ > slotsync_worker_attach(worker_slot); > before_shmem_exit(slotsync_worker_detach, PointerGetDatum(seg)); > PFA new patch-set which attempts to fix these: a) Synced slots on standby are not consumable i.e. pg_logical_slot_get/peek_changes will give an error on these, while they will work on user-created slots. b) User-created slots on standby will not be dropped by slot-sync workers anymore. Earlier the slot-sync worker was dropping all the slots which were not part of synchronize_slot_names. c) Now DSA is being used for dbids to facilitate memory extension if required without needing to restart the worker. Earlier dsm was used alone, which needed a restart of the worker in case the memory allocated needed to be extended. Changes are in patch 0002. Next in pipeline: 1. Handling of corner scenarios which I explained in: https://www.postgresql.org/message-id/CAJpy0uC%2B2agRtF3H%3Dn-hW5JkoPfaZkjPXJr%3D%3Dy3_PRE04dQvhw%40mail.gmail.com 2. Revisiting comments (older ones in this thread and latest given) for patch 1.
thanks Shveta v14-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch Description: Binary data v14-0002-Add-logical-slot-sync-capability-to-physical-sta.patch Description: Binary data
Re: Synchronizing slots from primary to standby
On Wed, Aug 23, 2023 at 4:21 PM Dilip Kumar wrote: > > On Wed, Aug 23, 2023 at 3:38 PM shveta malik wrote: > > > I have reviewed the v12-0002 patch and I have some comments. I see the > latest version posted sometime back and if any of this comment is > already fixed in this version then feel free to ignore that. > Thanks for the feedback. Please find my comments on a few. I will work on rest. > 2. > ApplyLauncherShmemInit(void) > { > bool found; > + bool foundSlotSync; > > Is there any specific reason to launch the sync worker from the > logical launcher instead of making this independent? > I mean in the future if we plan to sync physical slots as well then it > wouldn't be an expandable design. When we started working on this, it was reusing logical-apply worker infra, so I separated it from logical-apply worker but let it be managed by a replication launcher considering that only logical slots needed to be synced. I think this needs more thought and I would like to know from others as well before concluding anything here. > 5. > > +static bool > +WaitForSlotSyncWorkerAttach(SlotSyncWorker *worker, > +uint16 generation, > +BackgroundWorkerHandle *handle) > > this function is an exact duplicate of WaitForReplicationWorkerAttach > except for the LWlock, why don't we use the same function by passing > the LWLock as a parameter > Here workers (first argument) are different. We can always pass LWLock, but since workers are different, in order to merge the common functionality, we need to have some common worker structure between the two workers (apply and sync-slot) and pass that to functions which need to be merged (similar to NodeTag used in Insert/CreateStmt etc). But changing LogicalRepWorker() would mean changing applyworker/table-sync worker/parallel-apply-worker files. 
Since there are only two such functions which you pointed out (attach and wait_for_attach), I preferred to keep the functions as is until we conclude on where slot-sync worker functionality actually fits in. I can revisit these comments then. Or if you see any better way to do it, kindly let me know. > 6. > +/* > + * Attach Slot-sync worker to worker-slot assigned by launcher. > + */ > +void > +slotsync_worker_attach(int slot) > > this is also very similar to the logicalrep_worker_attach function. > > Please check other similar functions and reuse them wherever possible > > Also, why is this function not registering the cleanup function on > shmem_exit? > It is doing it in ReplSlotSyncMain() since we have dsm-seg there. Please see this: /* Primary initialization is complete. Now, attach to our slot. */ slotsync_worker_attach(worker_slot); before_shmem_exit(slotsync_worker_detach, PointerGetDatum(seg)); thanks Shveta
Re: Synchronizing slots from primary to standby
On Thu, Aug 17, 2023 at 4:09 PM shveta malik wrote: > > On Thu, Aug 17, 2023 at 11:55 AM shveta malik wrote: > > > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > > wrote: > > > > > > Hi, > > > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > > slots as compared to primary slots with different number of slot-sync > > > > workers configured. > > > > > > > > > > Thanks! > > > > > > > 3 DBs were created, each with 30 tables and each table having one > > > > logical-pub/sub configured. So this made a total of 90 logical > > > > replication slots to be synced. Then the workload was run for aprox 10 > > > > mins. During this workload, at regular intervals, primary and standby > > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > > each capture, the intent was to know how much is each standby's slot > > > > lagging behind corresponding primary's slot by taking the distance > > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > > the average (integer value) of this distance over the span of 10 min > > > > workload > > > > > > Thanks for the explanations, make sense to me. > > > > > > > and this is what we got: > > > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > > > This shows that more workers have better chances to keep logical > > > > replication slots in sync for this case. > > > > > > > > > > Agree. > > > > > > > Another statistics if it interests you is, we ran a frequency test as > > > > well (this by changing code, unit test sort of) to figure out the > > > > 'total number of times synchronization done' with different number of > > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > > logical replication slots. 
With 'max_slot_sync_workers' set at 1, 2 > > > > and 3; total number of times synchronization done was 15874, 20205 and > > > > 23414 respectively. Note: this is not on the same machine where we > > > > captured lsn-gap data, it is on a little less efficient machine but > > > > gives almost the same picture > > > > > > > > Next we are planning to capture this data for a lesser number of slots > > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > > over single workers in such cases could be less, but let's have the > > > > data to verify that. > > > > > > > > > > Thanks a lot for those numbers and for the testing! > > > > > > Do you think it would make sense to also get the number of using > > > the pg_failover_slots module? (and compare the pg_failover_slots numbers > > > with the > > > "one worker" case here). Idea is to check if the patch does introduce > > > some overhead as compare to pg_failover_slots. > > > > > > > Yes, definitely. We will work on that and share the numbers soon. > > > > We are working on these tests. Meanwhile attaching the patches which > attempt to implement below functionalities: > > 1) Remove extra slots on standby if those no longer exist on primary > or are no longer part of synchronize_slot_names. > 2) Make synchronize_slot_names user-modifiable. And due to change in > 'synchronize_slot_names', if dbids list is reduced, then take care of > removal of extra/old db-ids (if any) from workers db-list. > > Thanks Ajin for working on 1. Both the above changes are in > patch-0002. There is a test failure in the recovery module due to > these new changes, I am looking into it and will fix it in the next > version. > > Improvements in pipeline: > a) standby slots should not be consumable. > b) optimization of query which standby sends to primary. Currently it > has dbid filter and slot-name filter. 
Slot-name filter can be > optimized to have only those slots which belong to DBs assigned to the > worker rather than having all 'synchronize_slot_names'. > c) analyze if the naptime of the slot-sync worker can be auto-tuned. > If there is no activity going on (i.e. slots are not advancing on > primary) then increase naptime of slot-sync worker on standby and > decrease it again when activity starts.
Re: Synchronizing slots from primary to standby
On Thu, Aug 17, 2023 at 11:55 AM shveta malik wrote: > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > wrote: > > > > Hi, > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > slots as compared to primary slots with different number of slot-sync > > > workers configured. > > > > > > > Thanks! > > > > > 3 DBs were created, each with 30 tables and each table having one > > > logical-pub/sub configured. So this made a total of 90 logical > > > replication slots to be synced. Then the workload was run for aprox 10 > > > mins. During this workload, at regular intervals, primary and standby > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > each capture, the intent was to know how much is each standby's slot > > > lagging behind corresponding primary's slot by taking the distance > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > the average (integer value) of this distance over the span of 10 min > > > workload > > > > Thanks for the explanations, make sense to me. > > > > > and this is what we got: > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > This shows that more workers have better chances to keep logical > > > replication slots in sync for this case. > > > > > > > Agree. > > > > > Another statistics if it interests you is, we ran a frequency test as > > > well (this by changing code, unit test sort of) to figure out the > > > 'total number of times synchronization done' with different number of > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > and 3; total number of times synchronization done was 15874, 20205 and > > > 23414 respectively. 
Note: this is not on the same machine where we > > > captured lsn-gap data, it is on a little less efficient machine but > > > gives almost the same picture > > > > > > Next we are planning to capture this data for a lesser number of slots > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > over single workers in such cases could be less, but let's have the > > > data to verify that. > > > > > > > Thanks a lot for those numbers and for the testing! > > > > Do you think it would make sense to also get the numbers using > > the pg_failover_slots module? (and compare the pg_failover_slots numbers > > with the > > "one worker" case here). Idea is to check if the patch does introduce > > some overhead as compared to pg_failover_slots. > > > > Yes, definitely. We will work on that and share the numbers soon. > Here are the numbers for the pg_failover_slots extension. Thank you Ajin for performing all the tests and providing the data offline.
-- pg_failover_slots extension:
40 slots: default nap (60 sec): 12742133.96; 10ms nap: 19984.34
90 slots: default nap (60 sec): 10063342.72; 10ms nap: 34483.82
-- slot-sync-workers case (default 10ms nap for each test):
40 slots: 1 worker: 20566.09; 3 workers: 7885.80
90 slots: 1 worker: 36706.84; 3 workers: 10236.63
Observations: 1) The worker=1 case is slightly behind in our case as compared to pg_failover_slots (for the same naptime of 10ms). This is due to the support for the multi-worker design where locks and dsm come into play. I will review this case for optimization. 2) The multi-worker case seems way better in all tests. A few points we observed while performing the tests on pg_failover_slots: 1) It has a naptime of 60sec which is on the higher side and thus we see a huge lag in slots being synchronized. Please see the default-nap readings above. The default data of the extension is not comparable to our default case, and thus for apples-to-apples comparisons we changed the naptime to 10ms for pg_failover_slots.
2) It takes a lot of time while creating-slots. Every slot creation needs workload to be run on primary i.e. if after say 4th slot creation, there is no activity going on primary, it waits and does not proceed to create rest of the slots and thus we had to ma
Re: Adding a LogicalRepWorker type field
On Fri, Aug 18, 2023 at 8:50 AM Amit Kapila wrote: > > On Mon, Aug 14, 2023 at 12:08 PM Peter Smith wrote: > > > > The main patch for adding the worker type enum has been pushed [1]. > > > > Here is the remaining (rebased) patch for changing some previous > > cascading if/else to switch on the LogicalRepWorkerType enum instead. > > > > I see this as being useful if we plan to add more worker types. Does > anyone else see this remaining patch as an improvement? > I feel it does give a tad bit more clarity for cases where we have an 'else' part with no clear comments or relevant keywords. As an example, in function 'should_apply_changes_for_rel', we have: else return (rel->state == SUBREL_STATE_READY || (rel->state == SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn)); It is difficult to figure out which worker this is if I do not know the concept completely; 'case WORKERTYPE_APPLY' makes it clearer for the reader. thanks Shveta
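The readability point above can be illustrated outside the server code. Below is a rough Python sketch (illustrative names and simplified branch conditions, not the actual PostgreSQL logic; in the C patch this is a `switch` on the worker-type enum) of why an explicitly labelled branch per worker type reads better than a bare trailing `else`:

```python
from enum import Enum, auto

# Illustrative enum mirroring the LogicalRepWorkerType idea.
class WorkerType(Enum):
    APPLY = auto()
    PARALLEL_APPLY = auto()
    TABLESYNC = auto()

SUBREL_STATE_READY = "r"
SUBREL_STATE_SYNCDONE = "s"

def should_apply_changes_for_rel(worker_type, rel_state, rel_statelsn,
                                 remote_final_lsn):
    # Each branch is labelled with the worker it serves, instead of
    # leaving the reader to deduce which worker falls into 'else'.
    if worker_type is WorkerType.TABLESYNC:
        # Simplified stand-in condition for the tablesync worker.
        return rel_state in (SUBREL_STATE_READY, SUBREL_STATE_SYNCDONE)
    elif worker_type is WorkerType.APPLY:
        # The branch that was previously an unlabelled 'else' in the C code.
        return rel_state == SUBREL_STATE_READY or (
            rel_state == SUBREL_STATE_SYNCDONE
            and rel_statelsn <= remote_final_lsn)
    raise ValueError(f"unexpected worker type: {worker_type}")
```

The final `raise` plays the role of the unreachable-default case a `switch` would carry, so a newly added worker type fails loudly instead of silently taking the apply-worker path.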
Re: Synchronizing slots from primary to standby
On Thu, Aug 17, 2023 at 11:55 AM shveta malik wrote: > > On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand > wrote: > > > > Hi, > > > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > > slots as compared to primary slots with different number of slot-sync > > > workers configured. > > > > > > > Thanks! > > > > > 3 DBs were created, each with 30 tables and each table having one > > > logical-pub/sub configured. So this made a total of 90 logical > > > replication slots to be synced. Then the workload was run for aprox 10 > > > mins. During this workload, at regular intervals, primary and standby > > > slots' lsns were captured (from pg_replication_slots) and compared. At > > > each capture, the intent was to know how much is each standby's slot > > > lagging behind corresponding primary's slot by taking the distance > > > between confirmed_flush_lsn of primary and standby slot. Then we took > > > the average (integer value) of this distance over the span of 10 min > > > workload > > > > Thanks for the explanations, make sense to me. > > > > > and this is what we got: > > > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > > > This shows that more workers have better chances to keep logical > > > replication slots in sync for this case. > > > > > > > Agree. > > > > > Another statistics if it interests you is, we ran a frequency test as > > > well (this by changing code, unit test sort of) to figure out the > > > 'total number of times synchronization done' with different number of > > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > > and 3; total number of times synchronization done was 15874, 20205 and > > > 23414 respectively. 
Note: this is not on the same machine where we > > > captured lsn-gap data, it is on a little less efficient machine but > > > gives almost the same picture > > > > > > Next we are planning to capture this data for a lesser number of slots > > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > > over single workers in such cases could be less, but let's have the > > > data to verify that. > > > > > > > Thanks a lot for those numbers and for the testing! > > > > Do you think it would make sense to also get the number of using > > the pg_failover_slots module? (and compare the pg_failover_slots numbers > > with the > > "one worker" case here). Idea is to check if the patch does introduce > > some overhead as compare to pg_failover_slots. > > > > Yes, definitely. We will work on that and share the numbers soon. > We are working on these tests. Meanwhile attaching the patches which attempt to implement below functionalities: 1) Remove extra slots on standby if those no longer exist on primary or are no longer part of synchronize_slot_names. 2) Make synchronize_slot_names user-modifiable. And due to change in 'synchronize_slot_names', if dbids list is reduced, then take care of removal of extra/old db-ids (if any) from workers db-list. Thanks Ajin for working on 1. Both the above changes are in patch-0002. There is a test failure in the recovery module due to these new changes, I am looking into it and will fix it in the next version. Improvements in pipeline: a) standby slots should not be consumable. b) optimization of query which standby sends to primary. Currently it has dbid filter and slot-name filter. Slot-name filter can be optimized to have only those slots which belong to DBs assigned to the worker rather than having all 'synchronize_slot_names'. c) analyze if the naptime of the slot-sync worker can be auto-tuned. If there is no activity going on (i.e. 
slots are not advancing on primary) then increase naptime of slot-sync worker on standby and decrease it again when activity starts. thanks Shveta v12-0001-Allow-logical-walsenders-to-wait-for-physical-st.patch Description: Binary data v12-0002-Add-logical-slot-sync-capability-to-physical-sta.patch Description: Binary data
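Item 1 above (removing slots on the standby that either no longer exist on the primary or were dropped from synchronize_slot_names) amounts to a set difference. A minimal Python sketch of the rule, with illustrative names (the actual patch implements this inside the slot-sync worker):

```python
def slots_to_drop(standby_synced_slots, primary_slots, synchronize_slot_names):
    """Return synced slots on the standby that should be removed: the slot
    either no longer exists on the primary, or is no longer listed in
    synchronize_slot_names. Only slots created by the sync workers are
    candidates; user-created slots on the standby must never be touched,
    so they are not part of the first argument."""
    keep = set(primary_slots) & set(synchronize_slot_names)
    return sorted(set(standby_synced_slots) - keep)
```

For example, a slot that was synced earlier but has since been removed from synchronize_slot_names ends up in the drop list even though it still exists on the primary.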
Re: Synchronizing slots from primary to standby
On Thu, Aug 17, 2023 at 11:44 AM Drouvot, Bertrand wrote: > > Hi, > > On 8/14/23 11:52 AM, shveta malik wrote: > > > > > We (myself and Ajin) performed the tests to compute the lag in standby > > slots as compared to primary slots with different number of slot-sync > > workers configured. > > > > Thanks! > > > 3 DBs were created, each with 30 tables and each table having one > > logical-pub/sub configured. So this made a total of 90 logical > > replication slots to be synced. Then the workload was run for aprox 10 > > mins. During this workload, at regular intervals, primary and standby > > slots' lsns were captured (from pg_replication_slots) and compared. At > > each capture, the intent was to know how much is each standby's slot > > lagging behind corresponding primary's slot by taking the distance > > between confirmed_flush_lsn of primary and standby slot. Then we took > > the average (integer value) of this distance over the span of 10 min > > workload > > Thanks for the explanations, make sense to me. > > > and this is what we got: > > > > With max_slot_sync_workers=1, average-lag = 42290.3563 > > With max_slot_sync_workers=2, average-lag = 24585.1421 > > With max_slot_sync_workers=3, average-lag = 14964.9215 > > > > This shows that more workers have better chances to keep logical > > replication slots in sync for this case. > > > > Agree. > > > Another statistics if it interests you is, we ran a frequency test as > > well (this by changing code, unit test sort of) to figure out the > > 'total number of times synchronization done' with different number of > > sync-slots workers configured. Same 3 DBs setup with each DB having 30 > > logical replication slots. With 'max_slot_sync_workers' set at 1, 2 > > and 3; total number of times synchronization done was 15874, 20205 and > > 23414 respectively. 
Note: this is not on the same machine where we > > captured lsn-gap data, it is on a little less efficient machine but > > gives almost the same picture > > > > Next we are planning to capture this data for a lesser number of slots > > like 10,30,50 etc. It may happen that the benefit of multi-workers > > over single workers in such cases could be less, but let's have the > > data to verify that. > > > > Thanks a lot for those numbers and for the testing! > > Do you think it would make sense to also get the numbers using > the pg_failover_slots module? (and compare the pg_failover_slots numbers with > the > "one worker" case here). Idea is to check if the patch does introduce > some overhead as compared to pg_failover_slots. > Yes, definitely. We will work on that and share the numbers soon. thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Aug 14, 2023 at 3:22 PM shveta malik wrote: > > On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand > > wrote: > > > > Hi, > > > > On 8/8/23 7:01 AM, shveta malik wrote: > > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > > > wrote: > > >> > > >> Hi, > > >> > > >> On 8/4/23 1:32 PM, shveta malik wrote: > > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > > >>> wrote: > > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > >> > > > > > > Agreed. That is why in v10,v11 patches, we have different infra for > > > sync-slot worker i.e. it is not relying on "logical replication > > > background worker" anymore. > > > > yeah saw that, looks like the right way to go to me. > > > > >> Maybe we should start some tests/benchmark with only one sync worker to > > >> get numbers > > >> and start from there? > > > > > > Yes, we can do that performance testing to figure out the difference > > > between the two modes. I will try to get some statistics on this. > > > > > > > Great, thanks! > > > > We (myself and Ajin) performed the tests to compute the lag in standby > slots as compared to primary slots with different number of slot-sync > workers configured. > > 3 DBs were created, each with 30 tables and each table having one > logical-pub/sub configured. So this made a total of 90 logical > replication slots to be synced. Then the workload was run for aprox 10 > mins. During this workload, at regular intervals, primary and standby > slots' lsns were captured (from pg_replication_slots) and compared. At > each capture, the intent was to know how much is each standby's slot > lagging behind corresponding primary's slot by taking the distance > between confirmed_flush_lsn of primary and standby slot. Then we took > the average (integer value) of this distance over the span of 10 min > workload and this is what we got: > I have attached the scripts for schema-setup, running workload and capturing lag. Please go through Readme for details. thanks Shveta
Re: Synchronizing slots from primary to standby
On Tue, Aug 8, 2023 at 11:11 AM Drouvot, Bertrand wrote: > > Hi, > > On 8/8/23 7:01 AM, shveta malik wrote: > > On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand > > wrote: > >> > >> Hi, > >> > >> On 8/4/23 1:32 PM, shveta malik wrote: > >>> On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > >>> wrote: > >>>> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > >> > > > > Agreed. That is why in v10,v11 patches, we have different infra for > > sync-slot worker i.e. it is not relying on "logical replication > > background worker" anymore. > > yeah saw that, looks like the right way to go to me. > > >> Maybe we should start some tests/benchmark with only one sync worker to > >> get numbers > >> and start from there? > > > > Yes, we can do that performance testing to figure out the difference > > between the two modes. I will try to get some statistics on this. > > > > Great, thanks! > We (myself and Ajin) performed the tests to compute the lag in standby slots as compared to primary slots with different number of slot-sync workers configured. 3 DBs were created, each with 30 tables and each table having one logical-pub/sub configured. So this made a total of 90 logical replication slots to be synced. Then the workload was run for aprox 10 mins. During this workload, at regular intervals, primary and standby slots' lsns were captured (from pg_replication_slots) and compared. At each capture, the intent was to know how much is each standby's slot lagging behind corresponding primary's slot by taking the distance between confirmed_flush_lsn of primary and standby slot. Then we took the average (integer value) of this distance over the span of 10 min workload and this is what we got: With max_slot_sync_workers=1, average-lag = 42290.3563 With max_slot_sync_workers=2, average-lag = 24585.1421 With max_slot_sync_workers=3, average-lag = 14964.9215 This shows that more workers have better chances to keep logical replication slots in sync for this case. 
Another statistic, if it interests you: we ran a frequency test as well (by changing code, a unit-test sort of thing) to figure out the 'total number of times synchronization was done' with different numbers of sync-slot workers configured. Same 3 DBs setup with each DB having 30 logical replication slots. With 'max_slot_sync_workers' set at 1, 2 and 3, the total number of times synchronization was done came out to 15874, 20205 and 23414 respectively. Note: this is not on the same machine where we captured lsn-gap data; it is on a little less efficient machine but gives almost the same picture. Next we are planning to capture this data for a lesser number of slots like 10, 30, 50 etc. It may happen that the benefit of multi-workers over a single worker in such cases could be less, but let's have the data to verify that. Thanks Ajin for jointly working on this. thanks Shveta
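For reference, the lag figure quoted above is the byte distance between two confirmed_flush_lsn values. On the server this can be computed directly with pg_wal_lsn_diff(); the sketch below (illustrative Python, not from the test scripts) shows the same arithmetic on the textual pg_lsn form 'hi/lo' and the integer averaging used for the numbers:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn value like '16/B374D848' to a byte position:
    the part before the slash is the high 32 bits, the part after is
    the low 32 bits, both in hex."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def average_lag(samples):
    """samples: list of (primary_confirmed_flush, standby_confirmed_flush)
    pairs captured at regular intervals; returns the integer average of
    the per-capture byte distance."""
    lags = [lsn_to_int(p) - lsn_to_int(s) for p, s in samples]
    return sum(lags) // len(lags)
```

So an average-lag of 42290 means the standby's confirmed_flush_lsn trailed the primary's by about 42 kB of WAL on average across the captures.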
Re: Synchronizing slots from primary to standby
On Mon, Aug 7, 2023 at 3:17 PM Drouvot, Bertrand wrote: > > Hi, > > On 8/4/23 1:32 PM, shveta malik wrote: > > On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand > > wrote: > >> On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > >> Sorry to be late, but I gave a second thought and I wonder if we really > >> need this design. > >> (i.e start a logical replication background worker on the standby to sync > >> the slots). > >> > >> Wouldn't that be simpler to "just" update the sync slots "metadata" > >> as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned > >> by Peter > >> up-thread) is doing? > >> (making use of LogicalConfirmReceivedLocation(), > >> LogicalIncreaseXminForSlot() > >> and LogicalIncreaseRestartDecodingForSlot(), If I read > >> synchronize_one_slot() correctly). > >> > > > > Agreed. It would be simpler to just update the metadata. I think you > > have not got chance to review the latest posted patch ('v10-0003') > > yet, it does the same. > > Thanks for the feedback! Yeah, I did not look at v10 in details and was > looking at the email thread only. > > Indeed, I now see that 0003 does update the metadata in local_slot_advance(), > that's great! > > > > > But I do not quite get it as in how can we do it w/o starting a > > background worker? > > Yeah, agree that we still need background workers. > What I meant was to avoid to use "logical replication background worker" > (aka through logicalrep_worker_launch()) to sync the slots. > Agreed. That is why in v10,v11 patches, we have different infra for sync-slot worker i.e. it is not relying on "logical replication background worker" anymore. > > The question here is how many background workers we > > need to have. Will one be sufficient or do we need one per db (as done > > earlier by the original patches in this thread) or are we good with > > dividing work among some limited number of workers? 
> > > > I feel syncing all slots in one worker may increase the lag between > > subsequent syncs for a particular slot and if the number of slots are > > huge, the chances of losing the slot-data is more in case of failure. > > Starting one worker per db also might not be that efficient as it will > > increase load on the system (both in terms of background worker and > > network traffic) especially for a case where the number of dbs are > > more. Thus starting max 'n' number of workers where 'n' is decided by > > GUC and dividing the work/DBs among these looks a better option to me. > > Please see the discussion in and around the email at [1] > > > > [1]: > > https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com > > Thanks for the link! If I read the email thread correctly, this discussion > was before V10 (which is the first version making use of > LogicalConfirmReceivedLocation(), > LogicalIncreaseXminForSlot(), LogicalIncreaseRestartDecodingForSlot()) means > before the metadata sync only has been implemented. > > While I agree that the approach to split the sync load among workers with the > new > max_slot_sync_workers GUC seems reasonable without the sync only feature (pre > V10), > I'm not sure that with the metadata sync only in place the extra complexity > to manage multiple > sync workers is needed. > > Maybe we should start some tests/benchmark with only one sync worker to get > numbers > and start from there? Yes, we can do that performance testing to figure out the difference between the two modes. I will try to get some statistics on this. thanks Shveta
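The division scheme proposed above — one worker per database while the number of databases stays below the GUC limit, otherwise split the databases among at most 'n' workers — can be sketched as follows. The name max_slot_sync_workers mirrors the proposed GUC; the round-robin assignment itself is illustrative, not taken from the patch:

```python
def assign_dbs_to_workers(dbids, max_slot_sync_workers):
    """If there are fewer databases than workers, start one worker per
    db; otherwise divide the databases round-robin among the maximum
    number of workers."""
    n = min(len(dbids), max_slot_sync_workers)
    workers = [[] for _ in range(n)]
    for i, dbid in enumerate(dbids):
        workers[i % n].append(dbid)
    return workers
```

With 3 databases and max_slot_sync_workers=5 this launches 3 workers (one per db); with 5 databases and a limit of 2 it launches 2 workers handling 3 and 2 databases respectively.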
Re: Synchronizing slots from primary to standby
On Tue, Aug 1, 2023 at 4:52 PM shveta malik wrote: > > On Thu, Jul 27, 2023 at 12:13 PM shveta malik wrote: > > > > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila > > wrote: > > > > > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik > > > wrote: > > > > > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila > > > > wrote: > > > > > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > > > wrote: > > > > > > > > > > > > Is having one (or a few more - not > > > > > > necessarily one for each logical slot) worker for all logical slots > > > > > > enough? > > > > > > > > > > > > > > > > I guess for a large number of slots there is a possibility of a large > > > > > gap in syncing the slots which probably means we need to retain > > > > > corresponding WAL for a much longer time on the primary. If we can > > > > > prove that the gap won't be large enough to matter then this would > > > > > probably be worth considering; otherwise, I think we should find a way to > > > > > scale the number of workers to avoid the large gap. > > > > > > > > > > > > > How about this: > > > > > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > > > doing currently). > > > > > > > > 2) Maintain statistics on activity against each primary's database on > > > > standby by any means. Could be by maintaining 'last_synced_time' and > > > > 'last_activity_seen_time'. The last_synced_time is updated every time > > > > we sync/recheck slots for that particular database. The > > > > 'last_activity_seen_time' changes only if we get any slot on that > > > > database where actually confirmed_flush or say restart_lsn has changed > > > > from what was maintained already. > > > > > > > > 3) If at any moment, we find that 'last_synced_time' - > > > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > > > not active currently. 
Add it to the list of inactive DBs > > > > > > > I think we should also increase the next_sync_time if in current sync, > > > there is no update. > > +1 > > > > > > 4) Launcher on the other hand is always checking if it needs to spawn > > > > any other extra worker for any new DB. It will additionally check if > > > > number of inactive databases (maintained on standby) has gone higher > > > > (> some threshold), then it brings down the workers for those and > > > > starts a common worker which takes care of all such inactive databases > > > > (or merge all in 1), while workers for active databases remain as such > > > > (i.e. one per db). Each worker maintains the list of DBs which it is > > > > responsible for. > > > > > > > > 5) If in the list of these inactive databases, we again find any > > > > active database using the above logic, then the launcher will spawn a > > > > separate worker for that. > > > > > > > > > > I wonder if we anyway need some sort of design like this because we > > > shouldn't allow spawning as many workers as the number of databases. > > > There has to be some existing or new GUC like max_sync_slot_workers > > > which decides the number of workers. > > > > > > > Currently it does not have any such GUC for sync-slot workers. It > > mainly uses the logical-rep-worker framework for the sync-slot worker > > part and thus it relies on 'max_logical_replication_workers' GUC. Also > > it errors out if 'max_replication_slots' is set to zero. I think it is > > not the correct way of doing things for sync-slot. We can have a new > > GUC (max_sync_slot_workers) as you suggested and if the number of > > databases < max_sync_slot_workers, then we can start 1 worker per > > dbid, else divide the work equally among the max sync-workers > > possible. And for inactive database cases, we can increase the > > next_sync_time rather than starting a special worker to handle all the > > inactive databases. Thoughts? 
> > > > Attaching the PoC patch (0003) where attempts to implement the basic > infrastructure for the suggested design. Rebased the existing patches > (0001 and 0002) as well. > > This patch adds a new GUC max_slot_sync_workers; th
Re: Synchronizing slots from primary to standby
On Fri, Aug 4, 2023 at 2:44 PM Drouvot, Bertrand wrote: > > Hi, > > On 7/28/23 4:39 PM, Bharath Rupireddy wrote: > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila wrote: > >> > >>> 2. All candidate standbys will start one slot sync worker per logical > >>> slot which might not be scalable. > >> > >> Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > >> is using one worker per database (for all slots corresponding to a > >> database). > > > > Right. It's based on one worker for each database. > > > >>> Is having one (or a few more - not > >>> necessarily one for each logical slot) worker for all logical slots > >>> enough? > >> > >> I guess for a large number of slots the is a possibility of a large > >> gap in syncing the slots which probably means we need to retain > >> corresponding WAL for a much longer time on the primary. If we can > >> prove that the gap won't be large enough to matter then this would be > >> probably worth considering otherwise, I think we should find a way to > >> scale the number of workers to avoid the large gap. > > > > I think the gap is largely determined by the time taken to advance > > each slot and the amount of WAL that each logical slot moves ahead on > > primary. > > Sorry to be late, but I gave a second thought and I wonder if we really need > this design. > (i.e start a logical replication background worker on the standby to sync the > slots). > > Wouldn't that be simpler to "just" update the sync slots "metadata" > as the https://github.com/EnterpriseDB/pg_failover_slots module (mentioned by > Peter > up-thread) is doing? > (making use of LogicalConfirmReceivedLocation(), LogicalIncreaseXminForSlot() > and LogicalIncreaseRestartDecodingForSlot(), If I read synchronize_one_slot() > correctly). > Agreed. It would be simpler to just update the metadata. I think you have not got chance to review the latest posted patch ('v10-0003') yet, it does the same. 
But I do not quite get how we can do it without starting a background worker. Even the failover-slots extension starts one background worker. The question here is how many background workers we need to have. Will one be sufficient, or do we need one per db (as done earlier by the original patches in this thread), or are we good with dividing the work among some limited number of workers? I feel syncing all slots in one worker may increase the lag between subsequent syncs for a particular slot, and if the number of slots is huge, the chances of losing slot data on failure are higher. Starting one worker per db also might not be that efficient as it will increase load on the system (both in terms of background workers and network traffic), especially for a case where the number of DBs is large. Thus starting at most 'n' workers, where 'n' is decided by a GUC, and dividing the work/DBs among them looks like the better option to me. Please see the discussion in and around the email at [1] [1]: https://www.postgresql.org/message-id/CAJpy0uCT%2BnpL4eUvCWiV_MBEri9ixcUgJVDdsBCJSqLd0oD1fQ%40mail.gmail.com > > I've measured the time it takes for > > pg_logical_replication_slot_advance with different amounts of WAL on my > > system. It took 2595ms/5091ms/31238ms to advance the slot by > > 3.7GB/7.3GB/13GB respectively. To put things into perspective here, > > imagine there are 3 logical slots to sync for a single slot sync > > worker and each of them are in need of advancing the slot by > > 3.7GB/7.3GB/13GB of WAL. The slot sync worker gets to slot 1 again > > after 2595ms+5091ms+31238ms (~40sec), gets to slot 2 again after > > advance time of slot 1 with amount of WAL that the slot has moved > > ahead on primary during 40sec, gets to slot 3 again after advance time > > of slot 1 and slot 2 with amount of WAL that the slot has moved ahead > > on primary and so on.
If WAL generation on the primary is pretty fast, > > and if the logical slot moves pretty fast on the primary, the time it > > takes for a single sync worker to sync a slot can increase. > > That would be way "faster" and we would probably not need to > worry that much about the number of "sync" workers (if it/they "just" has/have > to sync slot's "metadata") as proposed above. > Agreed, we need not worry about the delay due to pg_logical_replication_slot_advance if we are only going to update a few simple things using the function calls as mentioned above. thanks Shveta
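The back-of-envelope arithmetic in the quoted measurement — the revisit delay for a slot grows with the sum of all slots' advance times in one pass — can be checked mechanically. A minimal sketch:

```python
def revisit_delay_ms(advance_times_ms):
    """Time before a single sync worker gets back to the first slot:
    one full pass over every slot's advance time. With per-slot
    pg_logical_replication_slot_advance costs of 2595/5091/31238 ms
    (the quoted 3.7GB/7.3GB/13GB measurements), this is ~39 seconds,
    matching the "~40sec" estimate in the thread."""
    return sum(advance_times_ms)

delay = revisit_delay_ms([2595, 5091, 31238])
```

This is exactly why syncing only slot metadata (instead of replaying WAL to advance each slot) removes the scaling concern: the per-slot term shrinks from tens of seconds to roughly the cost of a catalog update.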
Re: Synchronizing slots from primary to standby
On Thu, Aug 3, 2023 at 12:28 AM Bharath Rupireddy wrote: > > On Tue, Aug 1, 2023 at 5:01 PM shveta malik wrote: > > > > > The work division amongst the sync workers can > > > be simple, the logical replication launcher builds a shared memory > > > structure based on number of slots to sync and starts the sync workers > > > dynamically, and each sync worker picks {dboid, slot name, conninfo} > > > from the shared memory, syncs it and proceeds with other slots. > > > > Do you mean the logical replication launcher builds a shared memory > > structure based > > on the number of 'dbs' to sync as I understood from your initial comment? > > Yes. I haven't looked at the 0003 patch posted upthread. However, the > standby must do the following at a minimum: > > - Make GUCs synchronize_slot_names and max_slot_sync_workers of > PGC_POSTMASTER type needing postmaster restart when changed as they > affect the number of slot sync workers. I agree that max_slot_sync_workers should be allowed to change only during startup, but I strongly feel that synchronize_slot_names should be runtime modifiable. We should give that flexibility to the user. > - LR (logical replication) launcher connects to primary to fetch the > logical slots specified in synchronize_slot_names. This is a one-time > task. If synchronize_slot_names='*', we need to fetch slot info at regular intervals even if it is not runtime modifiable. For the runtime-modifiable case, it obviously needs to be refetched at regular intervals. > - LR launcher prepares a dynamic shared memory (created via > dsm_create) with some state like locks for IPC and an array of > {slot_name, dboid_associated_with_slot, is_sync_in_progress} - maximum > number of elements in the array is the number of slots specified in > synchronize_slot_names. This is a one-time task. Yes, we need dynamic shared memory, but it is not a one-time allocation. If it were a one-time allocation, there would have been no need for DSM; plain shared memory allocation would have been enough.
It is not a one-time allocation in any of the designs. If it is a slot-based design, slots may keep varying for the '*' case, and if it is a DB-based design, the number of DBs may go beyond the initial memory allocated and we may need reallocation and relaunch of workers, and thus the need for DSM. > - LR launcher decides the *best* number of slot sync workers - (based > on some perf numbers) it can just launch, say, one worker per 2 or 4 > or 8 etc. slots. > - Each slot sync worker then picks up a slot from the DSM, connects to > primary using primary conn info, syncs it, and moves to another slot. > The design based on slots, i.e. the launcher dividing the slots among the available workers, could prove beneficial over DB-based division for a case where the number of slots per DB varies largely and we end up assigning all DBs with fewer slots to one worker while all heavily loaded DBs go to another. But other than this, I see a lot of pain points: 1) Since we are going to do slot-based syncing, query construction will be complex. We will have a query with a long 'where' clause: where slots in (slot1, slot2, slots...). 2) The number of pings to the primary will be higher as we are pinging it per slot instead of per DB. So the information which we could have fetched collectively in one query (if it were DB-based) is now split into multiple queries, assuming that there could be cases where slots belonging to the same DBs end up getting split among different workers. 3) If the number of slots < max number of workers, how are we going to assign the workers? One slot per worker, or all in one worker? If it is one slot per worker, it will again not be that efficient as it will result in more network traffic. This needs more thought and a design that varies case by case. > Not having the capability of on-demand stop/launch of slot sync > workers makes the above design simple IMO. > We anyway need to relaunch workers when the DSM is reallocated in case DBs (or, say, slots) exceed some initial allocation limit.
thanks Shveta
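Pain point 2 above — per-slot pings where per-DB batching would do — could be mitigated even under slot-based assignment by having each worker group its assigned slots by database and issue one query per database. A minimal sketch (hypothetical helper, not from the patch):

```python
from collections import defaultdict

def group_slots_by_db(assigned_slots):
    """Group a worker's assigned (slot_name, dboid) pairs by database,
    so the worker can fetch all of one database's slot info in a single
    round trip to the primary instead of one query per slot."""
    by_db = defaultdict(list)
    for slot_name, dboid in assigned_slots:
        by_db[dboid].append(slot_name)
    return dict(by_db)
```

For example, a worker assigned `[("s1", 16384), ("s2", 16384), ("s3", 16385)]` would need only two round trips (one per database) rather than three.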
Re: Synchronizing slots from primary to standby
On Fri, Jul 28, 2023 at 8:54 PM Bharath Rupireddy wrote: > > On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila wrote: > > > > I wonder if we anyway some sort of design like this because we > > shouldn't allow to spawn as many workers as the number of databases. > > There has to be some existing or new GUC like max_sync_slot_workers > > which decided the number of workers. > > It seems reasonable to not have one slot sync worker for each > database. IMV, the slot sync workers must be generic and independently > manageable - generic in the sense that given a database and primary > conninfo, each worker must sync all the slots related to the given > database, independently mangeable in the sense that separate GUC for > number of sync workers, launchable directly by logical replication > launcher dynamically. yes agreed. The patch v10-0003 attempts to do the same. > The work division amongst the sync workers can > be simple, the logical replication launcher builds a shared memory > structure based on number of slots to sync and starts the sync workers > dynamically, and each sync worker picks {dboid, slot name, conninfo} > from the shared memory, syncs it and proceeds with other slots. > Do you mean the logical replication launcher builds a shared memory structure based on the number of 'dbs' to sync as I understood from your initial comment? thanks Shveta
Re: Synchronizing slots from primary to standby
On Thu, Jul 27, 2023 at 10:55 AM Amit Kapila wrote: > > On Wed, Jul 26, 2023 at 10:31 AM shveta malik wrote: > > > > On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila wrote: > > > > > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > > > wrote: > > > > > > > > Is having one (or a few more - not > > > > necessarily one for each logical slot) worker for all logical slots > > > > enough? > > > > > > > > > > I guess for a large number of slots the is a possibility of a large > > > gap in syncing the slots which probably means we need to retain > > > corresponding WAL for a much longer time on the primary. If we can > > > prove that the gap won't be large enough to matter then this would be > > > probably worth considering otherwise, I think we should find a way to > > > scale the number of workers to avoid the large gap. > > > > > > > How about this: > > > > 1) On standby, spawn 1 worker per database in the start (as it is > > doing currently). > > > > 2) Maintain statistics on activity against each primary's database on > > standby by any means. Could be by maintaining 'last_synced_time' and > > 'last_activity_seen time'. The last_synced_time is updated every time > > we sync/recheck slots for that particular database. The > > 'last_activity_seen_time' changes only if we get any slot on that > > database where actually confirmed_flush or say restart_lsn has changed > > from what was maintained already. > > > > 3) If at any moment, we find that 'last_synced_time' - > > 'last_activity_seen' goes beyond a threshold, that means that DB is > > not active currently. Add it to list of inactive DB > > > > I think we should also increase the next_sync_time if in current sync, > there is no update. +1 > > > 4) Launcher on the other hand is always checking if it needs to spawn > > any other extra worker for any new DB. 
It will additionally check if > > number of inactive databases (maintained on standby) has gone higher > > (> some threshold), then it brings down the workers for those and > > starts a common worker which takes care of all such inactive databases > > (or merge all in 1), while workers for active databases remain as such > > (i.e. one per db). Each worker maintains the list of DBs which it is > > responsible for. > > > > 5) If in the list of these inactive databases, we again find any > > active database using the above logic, then the launcher will spawn a > > separate worker for that. > > > > I wonder if we anyway some sort of design like this because we > shouldn't allow to spawn as many workers as the number of databases. > There has to be some existing or new GUC like max_sync_slot_workers > which decided the number of workers. > Currently it does not have any such GUC for sync-slot workers. It mainly uses the logical-rep-worker framework for the sync-slot worker part and thus it relies on 'max_logical_replication_workers' GUC. Also it errors out if 'max_replication_slots' is set to zero. I think it is not the correct way of doing things for sync-slot. We can have a new GUC (max_sync_slot_workers) as you suggested and if the number of databases < max_sync_slot_workers, then we can start 1 worker per dbid, else divide the work equally among the max sync-workers possible. And for inactive database cases, we can increase the next_sync_time rather than starting a special worker to handle all the inactive databases. Thoughts? thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy wrote: > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik wrote: > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > may definitely help to understand the modules better but shall we take > > a step back and try to reevaluate the design first before moving to > > other tasks? > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > that I have with me. It can't be a perfect patch set unless the design > is finalized. > Thanks for the patch and summarizing all the issues here. I was going through the patch and found that now we need to maintain 'synchronize_slot_names' on both primary and standby unlike the old way where it was maintained only on standby. I am aware of the problem in earlier implementation where each logical walsender/slot needed to wait for all standbys to catch-up before sending changes to logical subscribers even though that particular slot is not even needed to be synced by any of the standbys. Now it is more restrictive. But now, is this 'synchronize_slot_names' per standby? If there are multiple standbys each having different 'synchronize_slot_names' requirements, then how primary is going to keep track of that? Please let me know if that scenario can never arise where standbys can have different 'synchronize_slot_names'. thanks Shveta
Re: Synchronizing slots from primary to standby
On Mon, Jul 24, 2023 at 9:00 AM Amit Kapila wrote: > > On Mon, Jul 24, 2023 at 8:03 AM Bharath Rupireddy > wrote: > > > > On Fri, Jul 21, 2023 at 5:16 PM shveta malik wrote: > > > > > > Thanks Bharat for letting us know. It is okay to split the patch, it > > > may definitely help to understand the modules better but shall we take > > > a step back and try to reevaluate the design first before moving to > > > other tasks? > > > > Agree that design comes first. FWIW, I'm attaching the v9 patch set > > that I have with me. It can't be a perfect patch set unless the design > > is finalized. > > > > > I analyzed more on the issues stated in [1] for replacing LIST_SLOTS > > > with SELECT query. On rethinking, it might not be a good idea to > > > replace this cmd with SELECT in Launcher code-path > > > > I think there are open fundamental design aspects, before optimizing > > LIST_SLOTS, see below. I'm sure we can come back to this later. > > > > > Secondly, I was thinking if the design proposed in the patch is the > > > best one. No doubt, it is the most simplistic design and thus may > > > .. Any feedback is appreciated. > > > > Here are my thoughts about this feature: > > > > Current design: > > > > 1. On primary, never allow walsenders associated with logical > > replication slots to go ahead of physical standbys that are candidates > > for future primary after failover. This enables subscribers to connect > > to new primary after failover. > > 2. On all candidate standbys, periodically sync logical slots from > > primary (creating the slots if necessary) with one slot sync worker > > per logical slot. > > > > Important considerations: > > > > 1. Does this design guarantee the row versions required by subscribers > > aren't removed on candidate standbys as raised here - > > https://www.postgresql.org/message-id/20220218222319.yozkbhren7vkjbi5%40alap3.anarazel.de? > > > > It seems safe with logical decoding on standbys feature. 
Also, a > > test-case from upthread is already in patch sets (in v9 too) > > https://www.postgresql.org/message-id/CAAaqYe9FdKODa1a9n%3Dqj%2Bw3NiB9gkwvhRHhcJNginuYYRCnLrg%40mail.gmail.com. > > However, we need to verify the use cases extensively. > > > > Agreed. > > > 2. All candidate standbys will start one slot sync worker per logical > > slot which might not be scalable. > > > > Yeah, that doesn't sound like a good idea but IIRC, the proposed patch > is using one worker per database (for all slots corresponding to a > database). > > > Is having one (or a few more - not > > necessarily one for each logical slot) worker for all logical slots > > enough? > > > > I guess for a large number of slots the is a possibility of a large > gap in syncing the slots which probably means we need to retain > corresponding WAL for a much longer time on the primary. If we can > prove that the gap won't be large enough to matter then this would be > probably worth considering otherwise, I think we should find a way to > scale the number of workers to avoid the large gap. > How about this: 1) On standby, spawn 1 worker per database in the start (as it is doing currently). 2) Maintain statistics on activity against each primary's database on standby by any means. Could be by maintaining 'last_synced_time' and 'last_activity_seen time'. The last_synced_time is updated every time we sync/recheck slots for that particular database. The 'last_activity_seen_time' changes only if we get any slot on that database where actually confirmed_flush or say restart_lsn has changed from what was maintained already. 3) If at any moment, we find that 'last_synced_time' - 'last_activity_seen' goes beyond a threshold, that means that DB is not active currently. Add it to list of inactive DB 4) Launcher on the other hand is always checking if it needs to spawn any other extra worker for any new DB. 
It will additionally check whether the number of inactive databases (maintained on standby) has gone above some threshold; if so, it brings down the workers for those and starts a common worker which takes care of all such inactive databases (or merges them all into one), while workers for active databases remain as such (i.e. one per db). Each worker maintains the list of DBs which it is responsible for. 5) If in the list of these inactive databases, we again find any active database using the above logic, then the launcher will spawn a separate worker for that. Pros: Fewer workers on standby as per the load on primary. Less poking of the primary by the standby, i.e. the standby will send queries to get slot info for all inactive DBs in one run instead of each worker sending such queries separately. Cons: We might see spawning and freeing of workers more frequently. Please let me know your thoughts on this. thanks Shveta
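The inactivity check in steps (2) and (3) of the design above can be sketched as follows (hypothetical helper names; timestamps simplified to plain numbers):

```python
def is_db_inactive(last_synced_time, last_activity_seen_time, threshold):
    """A database is considered inactive when we have kept resyncing it
    for longer than `threshold` without observing any change in its
    slots (no movement of confirmed_flush or restart_lsn), i.e. when
    last_synced_time - last_activity_seen_time exceeds the threshold."""
    return (last_synced_time - last_activity_seen_time) > threshold
```

The launcher would run this per database on each cycle, demoting databases that cross the threshold to the shared "inactive" worker and promoting them back (step 5) as soon as activity is seen again.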
Re: Synchronizing slots from primary to standby
On Fri, Jul 21, 2023 at 11:36 AM Bharath Rupireddy wrote: > > On Thu, Jul 20, 2023 at 5:05 PM shveta malik wrote: > > > > On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila wrote: > > > > > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > > > wrote: > > > > > > > > > 3. As mentioned in the initial email, I think it would be better to > > > replace LIST_SLOTS command with a SELECT query. > > > > > > > I had a look at this thread. I am interested to work on this and can > > spend some time addressing the comments given here. > > Thanks for your interest. Coincidentally, I started to split the patch > into 2 recently - 0001 making the specified logical wal senders wait > for specified standbys to ack, 0002 synchronize logical slots. I think > I'll have these patches ready by early next week. For 0002, I'll > consider your latest changes having LIST_SLOTS removed. > Thanks Bharat for letting us know. It is okay to split the patch, it may definitely help to understand the modules better but shall we take a step back and try to reevaluate the design first before moving to other tasks? I analyzed more on the issues stated in [1] for replacing LIST_SLOTS with SELECT query. On rethinking, it might not be a good idea to replace this cmd with SELECT in Launcher code-path, because we do not have any database-connection in launcher and 'select' query needs one and thus we need to supply dbname to it. We may take the primary-dbname info in a new GUC from users, but I feel retaining LIST cmd is a better idea over adding a new GUC. But I do not see a reason why we should get complete replication-slots info in LIST command. The logic in Launcher is to get distinct database-ids info out of all the slots and start worker per database-id. Since we are only interested in database-id, I think we should change LIST_SLOTS to something like LIST_DBID_FOR_LOGICAL_SLOTS. 
This new command may get only unique database-ids for all the logical-slots (or the ones mentioned in synchronize_slot_names) from primary. By doing so, we can avoid huge network traffic in cases where the number of replication slots is quite high, considering that max_replication_slots can go up to MAX_BACKENDS:2^18-1. So I plan to make this change where we retain the LIST cmd over a SELECT query but make this cmd's output restricted to only database-ids. Thoughts? Secondly, I was thinking whether the design proposed in the patch is the best one. No doubt, it is the most simplistic design and thus may prove very efficient for scenarios where we have a reasonable number of workers starting and each one actively busy in slot-synchronisation, handling almost equivalent load. But since we are starting one worker per database id, it may not be the most efficient for cases where not all the databases are actively being used. We may have some workers (started for databases not in use) just waking up and sending queries to primary and then going back to sleep, generating network traffic in the process, while others may be heavily loaded dealing with large numbers of active slots for a heavily loaded database. I feel the design should be adaptable to load conditions, i.e. if there are more actively used slots, more workers should be spawned to handle them, and when the work is less, the number of spawned workers should come down. I have not thought it through thoroughly yet and am also not sure whether it will actually come out as a better one, but it or more such designs should be considered before we start fixing bugs in this patch. Kindly let me know if there are already discussions around it which I might have missed. Any feedback is appreciated. [1] : https://www.postgresql.org/message-id/CAJpy0uCMNz3XERP-Vzp-7rHFztJgc6d%2BxsmUVCqsxWPkZvQz0Q%40mail.gmail.com thanks Shveta
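The reduction a LIST_DBID_FOR_LOGICAL_SLOTS-style command would perform can be sketched as follows (hypothetical helper; the real command would run on the primary against its slot catalog):

```python
def distinct_dbids(logical_slots, synchronize_slot_names=None):
    """Reduce a list of (slot_name, dboid) pairs to the unique database
    OIDs the launcher actually needs — optionally restricted to the
    slots named in synchronize_slot_names — so only database ids, not
    full per-slot info, cross the network."""
    dbids = []
    for slot_name, dboid in logical_slots:
        if synchronize_slot_names and slot_name not in synchronize_slot_names:
            continue
        if dboid not in dbids:  # preserve first-seen order, dedupe
            dbids.append(dboid)
    return dbids
```

With many slots per database, the result set shrinks from O(number of slots) rows to O(number of databases), which is the traffic saving argued for above.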
Re: Synchronizing slots from primary to standby
On Fri, Jun 16, 2023 at 3:26 PM Amit Kapila wrote: > > On Mon, Apr 17, 2023 at 7:37 PM Drouvot, Bertrand > wrote: > > > 3. As mentioned in the initial email, I think it would be better to > replace LIST_SLOTS command with a SELECT query. > I had a look at this thread. I am interested to work on this and can spend some time addressing the comments given here. I tried to replace LIST_SLOTS command with a SELECT query. Attached rebased patch and PoC patch for LIST_SLOTS removal. For LIST_SLOTs cmd removal, below are the points where more analysis is needed. 1) I could not use the exposed libpqwalreceiver's functions walrcv_exec/libpqrcv_exec in LogicalRepLauncher to run select query instead of LIST_SLOTS cmd. This is because libpqrcv_exec() needs database connection but since in LogicalReplauncher, we do not have any (MyDatabseId is not set), so the API gives an error. Thus to make it work for the time-being, I used 'libpqrcv_PQexec' which is not dependent upon database connection. But since it is not exposed "yet" to other layers, I temporarily added the new code to libpqwalreceiver.c itself. In fact I reused the existing function wrapper libpqrcv_list_slots and changed the functionality to get info using select query rather than list_slots. 2) While using connect API walrcv_connect/libpqrcv_connect(), we need to tell it whether it is for logical or physical replication. In the existing patch, where we were using LIST_SLOTS cmd, we have this connection made with logical=false. But now since we need to run select query to get the same info, using connection with logical=false gives error on primary while executing select query. "ERROR: cannot execute SQL commands in WAL sender for physical replication". And thus in ApplyLauncherStartSlotSync(), I have changed connect API to use logical=true for the time being. I noticed that in the existing patch, it was using logical=false in ApplyLauncherStartSlotSync() while logical=true in synchronize_slots(). 
Possibly due to the same fact that logical=false connection will not allow synchronize_slots() to run select query on primary while it worked for ApplyLauncherStartSlotSync() as it was running list_slots cmd instead of select query. I am exploring further on these points to figure out which one is the correct way to deal with these. Meanwhile posting this WIP patch for early feedback. I will try addressing other comments as well in next versions. thanks Shveta v1-0001-Remove-list_slots-command.patch Description: Binary data v7-0001-Synchronize-logical-replication-slots-from-primar.patch Description: Binary data
Re: Support logical replication of DDLs
On Wed, Jun 21, 2023 at 6:38 PM Jelte Fennema wrote: > > (to be clear I only skimmed the end of this thread and did not look at > all the previous messages) > > I took a quick look at the first patch (about deparsing table ddl) and > it seems like this would also be very useful for a SHOW CREATE TABLE, > like command. Which was suggested in this thread: > https://www.postgresql.org/message-id/flat/CAFEN2wxsDSSuOvrU03CE33ZphVLqtyh9viPp6huODCDx2UQkYA%40mail.gmail.com > > On that thread I sent a patch with Citus its CREATE TABLE deparsing as > starting point. It looks like this thread went quite a different route > with some JSON intermediary representation. Still it might be useful > to look at the patch with Citus its logic for some inspiration/copying > things. I re-attached that patch here for ease of finding it. Thank You for attaching the patch for our ease. We rely on JSONB because of the flexibility it provides. It can easily be parsed/processed/transformed arbitrarily by the subscriber using generic rules. It should be trivial to use a JSON tool to change schema A to schema B in any arbitrary DDL command, and produce another working DDL command without having to know how to write that command specifically. It helps in splitting commands as well. As an example, we may need to split commands like "ALTER TABLE foo ADD COLUMN bar double precision DEFAULT random();" so that random() has consistent values on publisher and subscriber. It would be convenient to break commands via the deparsing approach rather than via plain strings. That being said, a SHOW CREATE TABLE-like command can be implemented from the DDL deparse code using the steps below: 1) Deparsing to the JSONB format using the deparsing API ddl_deparse_to_json. 2) Expanding it back to a DDL command using the expansion API ddl_deparse_expand_command. But these APIs rely on getting information from parse-tree.
This is because we need to construct complete DDL string and info like "IF NOT EXISTS", "CONCURRENTLY" etc can not be obtained from pg_catalog. Even if we think of getting rid of parsetree, it may hit the performance, as it is more efficient for us to get info from parse-tree instead of doing catalog-access for everything. We will try to review your patch to see if there is anything which we can adopt without losing performance and flexibility. Meanwhile if you have any suggestions on our patch which can make your work simpler, please do let us know. We can review that. thanks Shveta
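The JSONB flexibility argued for above — changing schema A to schema B in an arbitrary DDL command without parsing DDL text — can be illustrated with a toy transformation. The representation here is a deliberately simplified, hypothetical stand-in for the patch's actual JSONB format:

```python
def rename_schema(ddl_json, old, new):
    """Walk a (simplified) JSON DDL representation and rewrite every
    schema reference from `old` to `new`. No DDL parsing is needed:
    a generic tree walk rewriting the "schemaname" key is enough, and
    the input is left untouched (new dicts/lists are built)."""
    if isinstance(ddl_json, dict):
        return {
            k: (new if k == "schemaname" and v == old
                else rename_schema(v, old, new))
            for k, v in ddl_json.items()
        }
    if isinstance(ddl_json, list):
        return [rename_schema(v, old, new) for v in ddl_json]
    return ddl_json

# Toy command object; key names are illustrative, not the patch's schema.
cmd = {"fmt": "CREATE TABLE %{identity}D (%{columns:, }s)",
       "identity": {"schemaname": "a", "relname": "foo"},
       "columns": [{"name": "bar", "type": "int4"}]}
```

Expanding the rewritten object back through something like ddl_deparse_expand_command would then yield a working command against schema "b" with no string surgery on the subscriber side.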
Re: Support logical replication of DDLs
On Mon, Jun 12, 2023 at 7:17 AM Wei Wang (Fujitsu) wrote: > > On Thur, Jun 8, 2023 20:02 PM shveta malik wrote: > > Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and > > Hou-san for contributing in (c). > > > > The new changes are in patch 0001, 0002, 0005 and 0008. > > Thanks for updating the patch set. > > Here are some comments: > === > For 0002 patch. > 1. The typo atop the function EventTriggerTableInitWrite. > ``` > +/* > + * Fire table_init_rewrite triggers. > + */ > +void > +EventTriggerTableInitWrite(Node *real_create, ObjectAddress address) > ``` > s/table_init_rewrite/table_init_write > > ~~~ > > 2. The new process for "SCT_CreateTableAs" in the function > pg_event_trigger_ddl_commands. > With the event trigger created in > test_ddl_deparse_regress/sql/test_ddl_deparse.sql, when creating the table > that > already exists with `CreateTableAs` command, an error is raised like below: > ``` > postgres=# CREATE TABLE IF NOT EXISTS as_select1 AS SELECT * FROM pg_class > WHERE relkind = 'r'; > postgres=# CREATE TABLE IF NOT EXISTS as_select1 AS SELECT * FROM pg_class > WHERE relkind = 'r'; > NOTICE: relation "as_select1" already exists, skipping > ERROR: unrecognized object class: 0 > CONTEXT: PL/pgSQL function test_ddl_deparse() line 6 at FOR over SELECT rows > ``` > It seems that we could check cmd->d.ctas.real_create in the function > pg_event_trigger_ddl_commands and return NULL in this case. > > === > For 0004 patch. > 3. The command tags that are not supported for deparsing in the tests. > ``` > FOR r IN SELECT * FROM pg_event_trigger_ddl_commands() > -- Some TABLE commands generate sequence-related commands, > also deparse them. 
> WHERE command_tag in ('ALTER FOREIGN TABLE', 'ALTER TABLE', > 'CREATE FOREIGN > TABLE', 'CREATE TABLE', > 'CREATE TABLE AS', > 'DROP FOREIGN TABLE', > 'DROP TABLE', > 'ALTER SEQUENCE', > 'CREATE SEQUENCE', > 'DROP SEQUENCE') > ``` > Since foreign table is not supported yet in the current patch set, it seems > that > we need to remove "FOREIGN TABLE" related command tag. If so, I think the > following three files need to be modified: > - test_ddl_deparse_regress/sql/test_ddl_deparse.sql > - test_ddl_deparse_regress/t/001_compare_dumped_results.pl > - test_ddl_deparse_regress/t/002_regress_tests.pl > > ~~~ > > 4. The different test items between meson and Makefile. > It seems that we should keep the same SQL files and the same order of SQL > files > in test_ddl_deparse_regress/meson.build and test_ddl_deparse_regress/Makefile. > > === > For 0004 && 0008 patches. > 5. The test cases in the test file > test_ddl_deparse_regress/t/001_compare_dumped_results.pl. > ``` > # load test cases from the regression tests > -my @regress_tests = split /\s+/, $ENV{REGRESS}; > +#my @regress_tests = split /\s+/, $ENV{REGRESS}; > +my @regress_tests = ("create_type", "create_schema", "create_rule", > "create_index"); > ``` > I think @regress_tests should include all SQL files, instead of just four. > BTW, > the old way (using `split /\s+/, $ENV{REGRESS}`) doesn't work in meson. > Wang-san, Thank You for your feedback. In the latest version, we have pulled out CTAS and the test_ddl_deparse_regress module. I will revisit your comments once we plan to put these modules back. thanks Shveta
Re: Support logical replication of DDLs
As per Amit's suggestion, I reviewed two more formats that could be used for DDL WAL-logging; analysis below.

NodeToString: I do not think it is a good idea to use NodeToString in DDL replication, for the reasons below:
1) It contains too much internal, unneeded information.
2) It is too large to be logged in WAL. Optimizing this would be a mammoth task because a) we need to filter out information, as not all of it will be useful to the subscriber; b) it is not a simple key-based search and replace/remove, and modifying strings is always difficult.
3) It is not compatible across major versions. If we want to use Node information in any format, we need to ensure that the output can be read in a different major version of PostgreSQL.

Sql-ddl-to-json-schema given in [1]: This was suggested by Sawada-san in one of the discussions. Since it is JSON, it essentially has to be key:value pairs, like the current implementation. The difference is that no "format string" is maintained with each JSON object, so the knowledge of how to construct the DDL (i.e. the exact DDL syntax) must live on the receiver side. In our case, we maintain a "fmt" string with which the receiver can reconstruct DDL statements without knowing PostgreSQL's DDL syntax. The "fmt" string tells us the order of elements/keys and also the representation of values (string, identifier, etc.), while the JSON produced by sql-ddl-to-json-schema looks more generic; the receiver can construct a DDL statement in any form. That would be more useful, for example, when the receiver is not a PostgreSQL server, and thus it does not seem a suitable choice for the logical replication use-case at hand.

[1]: https://www.npmjs.com/package/sql-ddl-to-json-schema

thanks
Shveta
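The "fmt"-string mechanism described above can be illustrated with a small toy expander. To be clear, this is only a sketch, not the patch's deparser: the placeholder kinds (%{...}I for identifiers, %{...}D for qualified names, %{...}s for plain strings) and the quoting rules here are simplified assumptions for illustration.

```python
import json
import re

def expand_fmt(node):
    """Expand a deparse node by substituting each %{key}X placeholder in
    its "fmt" string with the expansion of node[key]."""
    def sub(m):
        key, kind = m.group(1), m.group(2)
        val = node[key]
        if isinstance(val, dict):
            if val.get("present") is False:
                return ""            # absent clause: expand to nothing
            return expand_fmt(val)   # nested object carries its own fmt
        if kind == "I":              # identifier: quote it
            return '"%s"' % val
        return str(val)              # plain string value
    return re.sub(r"%\{(\w+)\}(\w)", sub, node["fmt"]).strip()

blob = json.loads("""
{"fmt": "CREATE TABLE %{identity}D (%{columns}s)",
 "identity": {"fmt": "%{schema}I.%{name}I", "schema": "public", "name": "t1"},
 "columns": "a integer"}
""")
print(expand_fmt(blob))   # CREATE TABLE "public"."t1" (a integer)
```

The point of the comparison: with a "fmt" string in every object, the receiver only walks the blob mechanically; without it (as in sql-ddl-to-json-schema output), the receiver must itself know how each DDL clause is spelled.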
Re: Support logical replication of DDLs
On Thu, Jun 8, 2023 at 5:31 PM shveta malik wrote: > > Please find new set of patches addressing below: > a) Issue mentioned by Wang-san in [1], > b) Comments from Peter given in [2] > c) Comments from Amit given in the last 2 emails. > > [1]: > https://www.postgresql.org/message-id/OS3PR01MB62750D43D4F7F075B33BD2609E52A%40OS3PR01MB6275.jpnprd01.prod.outlook.com > [2]: > https://www.postgresql.org/message-id/CAHut%2BPv9vPbUQc0fzrKmDkKOsS_bj-hup_E%2BsLHNEX%2B6F%2BSY5Q%40mail.gmail.com > > Thank You Vignesh for handling (a), Ajin for handling (b), Shi-san and > Hou-san for contributing in (c). > > The new changes are in patch 0001, 0002, 0005 and 0008. >

The patch 0005 has an issue in 'doc/src/sgml/logical-replication.sgml' which makes the documentation fail to compile. It will be fixed in the next version. Please review the rest of the changes meanwhile. Sorry for the inconvenience.

thanks
Shveta
Re: Support logical replication of DDLs
On Tue, Jun 6, 2023 at 11:31 AM Wei Wang (Fujitsu) wrote: > > On Thur, June 1, 2023 at 23:42 vignesh C wrote: > > On Wed, 31 May 2023 at 14:32, Wei Wang (Fujitsu) > > wrote: > > > ~~~ > > > > > > 2. Deparsed results of the partition table. > > > When I run the following SQLs: > > > ``` > > > create table parent (a int primary key) partition by range (a); > > > create table child partition of parent default; > > > ``` > > > > > > I got the following two deparsed results: > > > ``` > > > CREATE TABLE public.parent (a pg_catalog.int4 STORAGE PLAIN , > > CONSTRAINT parent_pkey PRIMARY KEY (a)) PARTITION BY RANGE (a) > > > CREATE TABLE public.child PARTITION OF public.parent (CONSTRAINT > > child_pkey PRIMARY KEY (a)) DEFAULT > > > ``` > > > > > > When I run these two deparsed results on another instance, I got the > > > following > > error: > > > ``` > > > postgres=# CREATE TABLE public.parent (a pg_catalog.int4 STORAGE PLAIN > > > , > > CONSTRAINT parent_pkey PRIMARY KEY (a)) PARTITION BY RANGE (a); > > > CREATE TABLE > > > postgres=# CREATE TABLE public.child PARTITION OF public.parent > > (CONSTRAINT child_pkey PRIMARY KEY (a)) DEFAULT; > > > ERROR: multiple primary keys for table "child" are not allowed > > > ``` > > > > > > I think that we could skip deparsing the primary key related constraint > > > for > > > partition (child) table in the function obtainConstraints for this case. > > > > Not applicable for 0008 patch > > I think this issue still exists after applying the 0008 patch. Is this error > the > result we expected? > If no, I think we could try to address this issue in the function > deparse_Constraints_ToJsonb in 0008 patch like the attachment. What do you > think? BTW, we also need to skip the parentheses in the above case if you > think > this approach is OK. > Thank You Wang-san for the patch, we have included the fix after slight modification in the latest patch-set (*2023_06_08.patch). thanks Shveta
Re: Support logical replication of DDLs
On Tue, Jun 6, 2023 at 4:26 PM Amit Kapila wrote: > > On Mon, Jun 5, 2023 at 3:00 PM shveta malik wrote: > > > > Few assorted comments: Hi Amit, thanks for the feedback. Addressed these in recent patch posted (*2023_06_08.patch) > === > 1. I see the following text in 0005 patch: "It supports most of ALTER > TABLE command except some commands(DDL related to PARTITIONED TABLE > ...) that are recently introduced but are not yet supported by the > current ddl_deparser, we will support that later.". Is this still > correct? > Removed this from the commit message. > 2. I think the commit message of 0008 > (0008-ObjTree-Removal-for-multiple-commands-2023_05_22) should be > expanded to explain why ObjTree is not required and or how you have > accomplished the same with jsonb functions. > Done. > 3. 0005* patch has the following changes in docs: > + > + DDL Replication Support by Command Tag > + > + > + > + > + > + > +Command Tag > +For Replication > +Notes > + > + > + > + > +ALTER AGGREGATE > +- > + > + > + > +ALTER CAST > +- > + > ... > ... > > If the patch's scope is to only support replication of table DDLs, why > such other commands are mentioned? > Removed the other commands and made some adjustments. > 4. Can you write some comments about the deparsing format in one of > the file headers or in README? > Added atop ddljson.c as this file takes care of expansion based on fmt-string added. > 5. > + > +The table_init_write event occurs just after > the creation of > +table while execution of CREATE TABLE AS and > +SELECT INTO commands. > > Patch 0001 has multiple references to table_init_write trigger but it > is introduced in the 0002 patch, so those changes either belong to > 0002 or one of the later patches. I think that may reduce most of the > changes in event-trigger.sgml. > Moved it to 0002 patch. > 6. 
> + if (relpersist == RELPERSISTENCE_PERMANENT) > + { > + ddl_deparse_context context; > + char*json_string; > + > + context.verbose_mode = false; > + context.func_volatile = PROVOLATILE_IMMUTABLE; > > Can we write some comments as to why PROVOLATILE_IMMUTABLE is chosen here? > Added some comments and modified the variable name to make it more appropriate. > 7. > diff --git > a/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig > b/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig > new file mode 100644 > index 00..58cf7cdf33 > --- /dev/null > +++ > b/src/test/modules/test_ddl_deparse_regress/t/001_compare_dumped_results.pl.orig > > Will this file require for the 0008 patch? Or is this just a leftover? > Sorry, added by mistake. Removed it now. thanks Shveta
Re: Support logical replication of DDLs
On Mon, May 29, 2023 at 6:16 PM vignesh C wrote: > > > > > I found few comments while making some changes to the patch: > > 1) Now that objtree is removed, these comments should be modified: > > * Deparse object tree is created by using: > > * a) new_objtree("know contents") where the complete tree content is known > > or > > * the initial tree content is known. > > * b) new_objtree("") for the syntax where the object tree will be derived > > * based on some conditional checks. > > * c) new_objtree_VA where the complete tree can be derived using some fixed > > * content or by using the initial tree content along with some variable > > * arguments. > > * > > Modified > > > 2) isgrant can be removed as it is not used anymore: > > +/* > > + * Return the given object type as a string. > > + * > > + * If isgrant is true, then this function is called while deparsing GRANT > > + * statement and some object names are replaced. > > + */ > > +const char * > > +stringify_objtype(ObjectType objtype, bool isgrant) > > +{ > > + switch (objtype) > > + { > > + case OBJECT_TABLE: > > + return "TABLE"; > > + default: > > + elog(ERROR, "unsupported object type %d", objtype); > > + } > > + > > + return "???"; /* keep compiler quiet */ > > +} > > Modified > > > 3) This statement is not being handled currently, it should be implemented: > > "alter table all in tablespace tbs1 set tablespace" > > Modified > With the above fix, are the below commented tests in alter_table.sql supposed to work? If so, shall these be uncommented? -- ALTER TABLE ALL IN TABLESPACE pg_default SET TABLESPACE pg_default; -- ALTER TABLE ALL IN TABLESPACE pg_default OWNED BY ddl_testing_role SET TABLESPACE pg_default; thanks Shveta
pg_get_indexdef() modification to use TxnSnapshot
I have attempted to convert pg_get_indexdef() to use systable_beginscan() based on a transaction snapshot rather than using SearchSysCache(). The latter does not retain any old versions and thus provides only the latest info as per committed transactions, which could result in errors in some scenarios. One such case is mentioned atop pg_dump.c; the patch is an attempt to fix that pg_dump issue. Any feedback is welcome.

There is a long list of pg_get_* functions which use SearchSysCache() and may thus expose similar issues. I can try reviewing the possibility of converting all of them. Thoughts?

thanks
Shveta

v1-0001-pg_get_indexdef-modification-to-use-TxnSnapshot.patch
Description: Binary data
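To illustrate why the snapshot matters, here is a toy MVCC model; it is purely illustrative and bears no resemblance to the real syscache/snapshot machinery. A syscache-style lookup always returns the newest committed row, while a snapshot-based scan returns the version visible as of the snapshot, which is what a consistent pg_dump needs when DDL runs concurrently.

```python
# Toy catalog: each entry keeps a list of (xmin, value) versions,
# newest last. "xmin" stands in for the creating transaction id.
catalog = {"idx1": [(100, "CREATE INDEX idx1 ON t (a)")]}

def read_latest(oid):
    """SearchSysCache-style read: always the newest committed version."""
    return catalog[oid][-1][1]

def read_as_of(oid, snapshot_xmax):
    """Snapshot-based read: newest version committed before the snapshot."""
    return [v for xmin, v in catalog[oid] if xmin < snapshot_xmax][-1]

snapshot_xmax = 150          # pg_dump takes its snapshot here
# A concurrent session redefines the index after the snapshot was taken.
catalog["idx1"].append((200, "CREATE INDEX idx1 ON t (a, b)"))

print(read_latest("idx1"))                # the concurrent definition
print(read_as_of("idx1", snapshot_xmax))  # the definition the dump expects
```

With the syscache-style read, pg_dump would emit an index definition that did not exist when its snapshot was taken, which is the class of inconsistency the patch targets.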
Re: Support logical replication of DDLs
On Mon, May 8, 2023 at 3:58 PM shveta malik wrote: > > On Tue, May 2, 2023 at 8:30 AM shveta malik wrote: > > > > On Fri, Apr 28, 2023 at 5:11 PM Amit Kapila wrote: > > > > > > Now, I think we can try to eliminate this entire ObjTree machinery and > > > directly from the JSON blob during deparsing. We have previously also > > > discussed this in an email chain at [1]. I think now the functionality > > > of JSONB has also been improved and we should investigate whether it > > > is feasible to directly use JSONB APIs to form the required blob. > > > > +1. > > I will investigate this and will share my findings. > > > > > Please find the PoC patch for create-table after object-tree removal. >

I missed mentioning that it is a combined effort by Vignesh and myself, so please let us know your feedback.

thanks
Shveta
Re: Support logical replication of DDLs
On Fri, Apr 28, 2023 at 5:11 PM Amit Kapila wrote: > > On Tue, Apr 25, 2023 at 9:28 AM Zhijie Hou (Fujitsu) > wrote: > > > > I have a few high-level comments on the deparsing approach used in the > patch. As per my understanding, we first build an ObjTree from the DDL > command, then convert the ObjTree to Jsonb which is then converted to > a JSON string. Now, in the consecutive patch, via publication event > triggers, we get the JSON string via the conversions mentioned, WAL > log it, which then walsender will send to the subscriber, which will > convert the JSON string back to the DDL command and execute it. > > Now, I think we can try to eliminate this entire ObjTree machinery and > directly from the JSON blob during deparsing. We have previously also > discussed this in an email chain at [1]. I think now the functionality > of JSONB has also been improved and we should investigate whether it > is feasible to directly use JSONB APIs to form the required blob. +1. I will investigate this and will share my findings. thanks Shveta
Re: Support logical replication of DDLs
On Thu, Apr 20, 2023 at 3:40 PM Amit Kapila wrote: > > On Thu, Apr 20, 2023 at 9:11 AM shveta malik wrote: > > > > > > Few comments for ddl_deparse.c in patch dated April17: > > > > > 6) There are plenty of places where we use 'append_not_present' > > without using 'append_null_object'. > > Do we need to have 'append_null_object' along with > > 'append_not_present' at these places? > > > > What is the difference if we add a null object or not before not_present? >

Sorry, I missed this question earlier. There is not much difference execution-wise: the '"present": false' attribute is sufficient to indicate that the expansion of that element is not needed when we convert the JSON back to a DDL command. The 'append_null_object' call is only needed for 'verbose' mode; it makes the format string complete so the user can understand it better. Example:

--without append_null_object, we get:
"collation": {"fmt": "COLLATE", "present": false}

--with append_null_object(tmp_obj, "%{name}D"), we get:
"collation": {"fmt": "COLLATE %{name}D", "name": null, "present": false}

So fmt: "COLLATE %{name}D" is more complete (even though not needed when the COLLATE clause is absent) and thus aligns with what we expect from verbose mode.

thanks
Shveta
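To make the execution-equivalence concrete, here is a toy expansion routine (illustrative only, not the patch's actual ddl_json.c code) showing that both JSON shapes, with and without the null object, expand to the same DDL text, since the '"present": false' marker alone suppresses the clause.

```python
import re

def expand(node):
    """Toy expansion: substitute %{key}X placeholders; a sub-object with
    "present": false expands to nothing, whether or not it also carries
    the extra null fields added by append_null_object."""
    def sub(m):
        val = node.get(m.group(1))
        if isinstance(val, dict):
            return "" if val.get("present") is False else expand(val)
        return "" if val is None else str(val)
    return re.sub(r"%\{(\w+)\}\w", sub, node["fmt"]).strip()

# Without append_null_object: fmt stops at the bare keyword.
terse = {"fmt": "COLLATE", "present": False}
# With append_null_object: fmt is complete, "name" is explicitly null.
verbose = {"fmt": "COLLATE %{name}D", "name": None, "present": False}

col = {"fmt": "%{name}I %{type}s %{collation}s",
       "name": "a", "type": "text", "collation": terse}
print(expand(col))        # a text
col["collation"] = verbose
print(expand(col))        # a text  (identical result)
```

Both shapes yield the same reconstructed clause; the fuller fmt string only helps a human reading the verbose-mode output.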
Re: Support logical replication of DDLs
On Thu, Apr 20, 2023 at 2:28 PM shveta malik wrote: > > On Thu, Apr 20, 2023 at 9:11 AM shveta malik wrote: >> >> On Mon, Apr 17, 2023 at 5:32 PM Zhijie Hou (Fujitsu) >> wrote: >> > >> > Attach the new version patch set which include the following changes: >> Few comments for ddl_deparse.c in patch dated April17: >> > Few comments for ddl_json.c in the patch dated April17: >

Few more comments, mainly for event_trigger.c in the patch dated April17:

1) EventTriggerCommonSetup()

+ pub_funcoid = LookupFuncName(pubfuncname, 0, NULL, true);

This is the code where we have special handling for ddl-triggers. Here we are passing 'missing_ok' as true, so shouldn't we check pub_funcoid against 'InvalidOid'?

2) EventTriggerTableInitWriteEnd()

+ if (stmt->objtype == OBJECT_TABLE)
+ {
+     parent = currentEventTriggerState->currentCommand->parent;
+     pfree(currentEventTriggerState->currentCommand);
+     currentEventTriggerState->currentCommand = parent;
+ }
+ else
+ {
+     MemoryContext oldcxt;
+     oldcxt = MemoryContextSwitchTo(currentEventTriggerState->cxt);
+     currentEventTriggerState->currentCommand->d.simple.address = address;
+     currentEventTriggerState->commandList =
+         lappend(currentEventTriggerState->commandList,
+                 currentEventTriggerState->currentCommand);
+     MemoryContextSwitchTo(oldcxt);
+ }
+}

It will be good to add some comments in the 'else' part. It is not very clear what exactly we are doing here and for which scenario.

3) EventTriggerTableInitWrite()

+ if (!currentEventTriggerState)
+     return;
+
+ /* Do nothing if no command was collected. */
+ if (!currentEventTriggerState->currentCommand)
+     return;

Is it possible that when we reach here no command is collected yet? I think this can happen only for the case when commandCollectionInhibited is true.
If so, the above can be modified to:

if (!currentEventTriggerState ||
    currentEventTriggerState->commandCollectionInhibited)
    return;

Assert(currentEventTriggerState->currentCommand != NULL);

This will make the behaviour and checks consistent across similar functions in this file.

4) EventTriggerTableInitWriteEnd()

Here as well, we can have the same assert as below:

Assert(currentEventTriggerState->currentCommand != NULL);

'currentEventTriggerState' and 'commandCollectionInhibited' checks are already present.

5) logical_replication.sgml:

+ 'This is especially useful if you have lots schema changes over time that need replication.'

lots schema --> lots of schema

thanks
Shveta
Re: Support logical replication of DDLs
On Thu, Apr 20, 2023 at 9:11 AM shveta malik wrote: > On Mon, Apr 17, 2023 at 5:32 PM Zhijie Hou (Fujitsu) > wrote: > > > > Attach the new version patch set which include the following changes: > > > > Few comments for ddl_deparse.c in patch dated April17: >

Few comments for ddl_json.c in the patch dated April17:

1) expand_jsonval_string()
I think we need to assert if jsonval is neither jbvString nor jbvBinary.

2) expand_jsonval_strlit()
Same here: assert if it is not jbvString (like we do in expand_jsonval_number, expand_jsonval_identifier, etc.).

3) expand_jsonb_array()
arrayelem is allocated as below, but not freed:
initStringInfo()

4) expand_jsonb_array()
We initialize the iterator as below, which internally does a palloc:
it = JsonbIteratorInit(container);
Shall this be freed at the end? I see this function used in other files; at a few places the result is freed, while at other places it is not.

5) deparse_ddl_json_to_string(char *json_str, char **owner)
str = palloc(value->val.string.len + 1);
We do the palloc here and return the allocated memory to the caller as 'owner'. The caller sets this 'owner' using SetConfigOption(), which internally allocates new memory and copies 'owner' into it. So the memory allocated in deparse_ddl_json_to_string() is never freed. A better way would be for the caller to pass this allocated memory to deparse_ddl_json_to_string() and free it when done. Thoughts?

6) expand_fmt_recursive():
value = findJsonbValueFromContainer(container, JB_FOBJECT, );
Should this 'value' be freed at the end, like we do at all other places in this file?

thanks
Shveta
Re: Support logical replication of DDLs
On Mon, Apr 17, 2023 at 5:32 PM Zhijie Hou (Fujitsu) wrote: > > On Monday, April 10, 2023 7:20 PM Amit Kapila wrote: > > > > On Fri, Apr 7, 2023 at 8:52 AM houzj.f...@fujitsu.com > > wrote: > > > > > > Sorry, there was a miss when rebasing the patch which could cause the > > > CFbot to fail and here is the correct patch set. > > > > > > > I see the following note in the patch: "Note: For ATTACH/DETACH PARTITION, > > we haven't added extra logic on the subscriber to handle the case where the > > table on the publisher is a PARTITIONED TABLE while the target table on the > > subscriber side is a NORMAL table. We will research this more and improve it > > later." and wonder what should we do about this. I can think of the > > following > > possibilities: (a) Convert a non-partitioned table to a partitioned one and > > then > > attach the partition; (b) Add the partition as a separate new table; (c) > > give an > > error that table types mismatch. For Detach partition, I don't see much > > possibility than giving an error that no such partition exists or something > > like > > that. Even for the Attach operation, I prefer (c) as the other options > > don't seem > > logical to me and may add more complexity to this work. > > > > Thoughts? > > I also think option (c) makes sense and is same as the latest patch's > behavior. > > Attach the new version patch set which include the following changes: > Few comments for ddl_deparse.c in patch dated April17: 1) append_format_string() I think we need to have 'Assert(sub_fmt)' here like we have it in all other similar functions (append_bool_object, append_object_object, ...) 2) append_object_to_format_string() here we have code piece : if (sub_fmt == NULL || tree->fmtinfo == NULL) return sub_fmt; but sub_fmt will never be null when we reach this function as all its callers assert for null sub_fmt. So that means when tree->fmtinfo is null, we end up returning sub_fmt as it is, instead of extracting object name from that. 
Is this intended?

3) We can remove extra spaces after the full stop in the comment below:

/*
 * Deparse a ColumnDef node within a typed table creation. This is simpler
 * than the regular case, because we don't have to emit the type declaration,
 * collation, or default. Here we only return something if the column is being
 * declared NOT NULL.
...
deparse_ColumnDef_typed()

4) These functions are not being used; do we need to retain them in this patch?
deparse_Type_Storage()
deparse_Type_Receive()
deparse_Type_Send()
deparse_Type_Typmod_In()
deparse_Type_Typmod_Out()
deparse_Type_Analyze()
deparse_Type_Subscript()

5) deparse_AlterRelation()
We have the below variable initialized to false in the beginning:
bool istype = false;
And then we have much conditional code using it, e.g.: istype ? "ATTRIBUTE" : "COLUMN". We are not changing 'istype' anywhere and it is hard-coded in the beginning. That means there are parts of code in this function which will never be hit (written for the 'istype=true' case), so why do we need this variable and the conditional code around it?

6) There are plenty of places where we use 'append_not_present' without using 'append_null_object'. Do we need to have 'append_null_object' along with 'append_not_present' at these places?

7) deparse_utility_command():
Rather than inject --> Rather than injecting

thanks
Shveta
Re: Support logical replication of DDLs
On Tue, Apr 4, 2023 at 8:43 AM houzj.f...@fujitsu.com wrote: > Attach the new version patch set which did the following changes: > Hello, I tried below: pubnew=# ALTER PUBLICATION mypub2 SET (ddl = 'table'); ALTER PUBLICATION pubnew=# \dRp+ Publication mypub2 Owner | All tables | All DDLs | Table DDLs | ++--++- shveta | t | f | f (1 row) I still see 'Table DDLs' as false and ddl replication did not work for this case. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com wrote: > > > On Thu, Feb 2, 2023 11:48 AM shveta malik wrote: > > > > > > So to fix this, I think either we update origin and slot entries in > > the system catalog after the creation has passed or we clean-up the > > system catalog in case of failure. What do you think? > > > > I think the first way seems better. Yes, I agree. > > I reproduced the problem I reported before with latest patch (v7-0001, > v10-0002), and looked into this problem. It is caused by a similar reason. > Here > is some analysis for the problem I reported [1].#6. > > First, a tablesync worker (worker-1) started for "tbl1", its originname is > "pg_16398_1". And it exited because of unique constraint. In > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated when > updating table state to DATASYNC, and the origin is created when updating > table > state to FINISHEDCOPY. So when it exited with state DATASYNC , the origin is > not > created but the originname has been updated in pg_subscription_rel. > > Then a tablesync worker (worker-2) started for "tbl2", its originname is > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to sync > table "tbl1". In LogicalRepSyncTableStart(), it got the originname of "tbl1" - > "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(), and tried to > drop > the origin (although it is not actually created before). After that, it called > replorigin_by_name to get the originid whose name is "pg_16398_1" and the > result > is InvalidOid. Origin won't be created in this case because the sync worker > has > created a replication slot (when it synced tbl2), so the originid was still > invalid and it caused an assertion failure when calling replorigin_advance(). > > It seems we don't need to drop previous origin in worker-2 because the > previous > origin was not created in worker-1. 
I think one way to fix it is to not update > originname of pg_subscription_rel when setting state to DATASYNC, and only do > that when setting state to FINISHEDCOPY. If so, the originname in > pg_subscription_rel will be set at the same time the origin is created. +1. Update of system-catalog needs to be done carefully and only when origin is created. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Feb 2, 2023 at 5:01 PM shveta malik wrote: > > Reviewing further >

Hi Melih,

int64 rep_slot_id;
int64 lastusedid;
int64 sublastusedid

--Should all of the above be unsigned integers?

thanks
Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Feb 3, 2023 at 11:50 AM shveta malik wrote: > > On Thu, Feb 2, 2023 at 5:01 PM shveta malik wrote: > > > > 2e) > When it goes to reuse flow (i.e. before walrcv_slot_snapshot), if > needed we can dump newly obtained origin_startpos also: > > ereport(DEBUG2, > (errmsg("[subid:%d] LogicalRepSyncWorker_%ld reusing slot %s and origin %s", > MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname, > originname))); > One addition, I think it will be good to add relid as well in above so that we can get info as in we are reusing old slot for which relid. Once we have all the above in log-file, it makes it very easy to diagnose reuse-sync worker related problems just by looking at the logfile. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Feb 2, 2023 at 5:01 PM shveta malik wrote: > > Reviewing further > Few more comments for v10-0002 and v7-0001: 1) + * need_full_snapshot + * if true, create a snapshot able to read all tables, + * otherwise do not create any snapshot. + * CreateDecodingContext(..,CreateDecodingContext,..) --Is the comment correct? Shall we have same comment here as that of 'CreateDecodingContext' * need_full_snapshot -- if true, must obtain a snapshot able to read all * tables; if false, one that can read only catalogs is acceptable. This function is not going to create a snapshot anyways. It is just a pre-step and then the caller needs to call 'SnapBuild' functions to build a snapshot. Here need_full_snapshot decides whether we need all tables or only catalog tables changes only and thus the comment change is needed. == 2) Can we please add more logging: 2a) when lastusedId is incremented and updated in pg_* table ereport(DEBUG2, (errmsg("[subid:%d] Incremented lastusedid to:%ld",MySubscription->oid, MySubscription->lastusedid))); Comments for LogicalRepSyncTableStart(): 2b ) After every UpdateSubscriptionRel: ereport(DEBUG2, (errmsg("[subid:%d] LogicalRepSyncWorker_%ld updated origin to %s and slot to %s for relid %d", MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, originname, slotname, MyLogicalRepWorker->relid))); 2c ) After walrcv_create_slot: ereport(DEBUG2, (errmsg("[subid:%d] LogicalRepSyncWorker_%ld created slot %s", MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname))); 2d) After replorigin_create: ereport(DEBUG2, (errmsg("[subid:%d] LogicalRepSyncWorker_%ld created origin %s [id: %d]", MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, originname, originid))); 2e) When it goes to reuse flow (i.e. 
before walrcv_slot_snapshot), if needed we can dump newly obtained origin_startpos also: ereport(DEBUG2, (errmsg("[subid:%d] LogicalRepSyncWorker_%ld reusing slot %s and origin %s", MyLogicalRepWorker->subid, MyLogicalRepWorker->rep_slot_id, slotname, originname))); 2f) Also in existing comment: + (errmsg("logical replication table synchronization worker for subscription \"%s\" has moved to sync table \"%s\".", + MySubscription->name, get_rel_name(MyLogicalRepWorker->relid; we can add relid also along with relname. relid is the one stored in pg_subscription_rel and thus it becomes easy to map. Also we can change 'logical replication table synchronization worker' to 'LogicalRepSyncWorker_%ld'. Same change needed in other similar log lines where we say that worker started and finished. Please feel free to change the above log lines as you find appropriate. I have given just a sample sort of thing. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Feb 1, 2023 at 5:37 PM Melih Mutlu wrote: > > Hi, > Please see attached patches. > > Thanks, > -- > Melih Mutlu > Microsoft Hi Melih, Few suggestions on v10-0002-Reuse patch 1) for (int64 i = 1; i <= lastusedid; i++) { charoriginname_to_drop[NAMEDATALEN] = {0}; snprintf(originname_to_drop, sizeof(originname_to_drop), "pg_%u_%lld", subid, (long long) i); ... } --Is it better to use the function 'ReplicationOriginNameForLogicalRep' here instead of sprintf, just to be consistent everywhere to construct origin-name? 2) pa_launch_parallel_worker: launched = logicalrep_worker_launch(MyLogicalRepWorker->dbid, MySubscription->oid, MySubscription->name, MyLogicalRepWorker->userid, InvalidOid, dsm_segment_handle(winfo->dsm_seg), 0); --Can we please define 'InvalidRepSlotId' macro and pass it here as the last arg to make it more readable. #define InvalidRepSlotId 0 Same in ApplyLauncherMain where we are passing 0 as last arg to logicalrep_worker_launch. 3) We are safe to drop the replication trackin origin after this --typo: trackin -->tracking 4) process_syncing_tables_for_sync: if (MyLogicalRepWorker->slot_name && strcmp(syncslotname, MyLogicalRepWorker->slot_name) != 0) { .. ReplicationOriginNameForLogicalRep(MyLogicalRepWorker->subid, MyLogicalRepWorker->relid, originname, sizeof(originname)); /* Drop replication origin */ replorigin_drop_by_name(originname, true, false); } --Are we passing missing_ok as true (second arg) intentionally here in replorigin_drop_by_name? Once we fix the issue reported in my earlier email (ASSERT), do you think it makes sense to pass missing_ok as false here? 5) process_syncing_tables_for_sync: foreach(lc, rstates) { rstate = (SubscriptionRelState *) palloc(sizeof(SubscriptionRelState)); memcpy(rstate, lfirst(lc), sizeof(SubscriptionRelState)); /* * Pick the table for the next run if it is not already picked up * by another worker. 
*/ LWLockAcquire(LogicalRepWorkerLock, LW_SHARED); if (rstate->state != SUBREL_STATE_SYNCDONE && !logicalrep_worker_find(MySubscription->oid, rstate->relid, false)) { . } LWLockRelease(LogicalRepWorkerLock); } --Do we need to palloc for each relation separately? Shall we do it once outside the loop and reuse it? Also pfree is not done for rstate here. 6) LogicalRepSyncTableStart: 1385 slotname = (char *) palloc(NAMEDATALEN); 1413 prev_slotname = (char *) palloc(NAMEDATALEN); 1481 slotname = prev_slotname; 1502 pfree(prev_slotname); 1512 UpdateSubscriptionRel(MyLogicalRepWorker->subid, 1513 MyLogicalRepWorker->relid, 1514 MyLogicalRepWorker->relstate, 1515 MyLogicalRepWorker->relstate_lsn, 1516 slotname, 1517 originname); Can you please review the above flow (I have given line# along with), I think it could be problematic. We alloced prev_slotname, assigned it to slotname, freed prev_slotname and used slotname after freeing the prev_slotname. Also slotname is allocated some memory too, that is not freed. Reviewing further JFYI, I get below while applying patch: shveta@shveta-vm:~/repos/postgres1/postgres$ git am ~/Desktop/shared/reuse/v10-0002-Reuse-Logical-Replication-Background-worker.patch Applying: Reuse Logical Replication Background worker .git/rebase-apply/patch:142: trailing whitespace. values[Anum_pg_subscription_rel_srrelslotname - 1] = .git/rebase-apply/patch:692: indent with spaces. errmsg("could not receive list of slots associated with the subscription %u, error: %s", .git/rebase-apply/patch:1055: trailing whitespace. /* .git/rebase-apply/patch:1057: trailing whitespace. * relations. .git/rebase-apply/patch:1059: trailing whitespace. * and origin. Then stop the worker gracefully. warning: 5 lines add whitespace errors. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Feb 2, 2023 at 9:18 AM shveta malik wrote: > > > Hi Melih, > I think I am able to identify the root cause. It is not memory > corruption, but the way origin-names are stored in system-catalog > mapped to a particular relation-id before even those are created. >

Apart from the problem mentioned in my earlier email, I think there is one more issue here, seen with the same assert-causing testcase. The 'lastusedid' stored in the system catalog kept on increasing without the slot and origin even getting created. Two workers worked well with max_replication_slots=2, and then, since all slots were consumed, the 3rd one could not create any slot and exited, but it still incremented lastusedid. Then another worker came, incremented lastusedId in the system catalog, failed to create a slot, and exited, and so on. This makes lastUsedId increase continuously until you kill the subscriber or free a previously used slot. If you keep the subscriber running long enough, lastUsedId will go beyond its limit.

Shouldn't lastUsedId be incremented only after making sure that the worker has created a slot and origin corresponding to that particular rep_slot_id (derived from lastUsedId)? Please let me know if my understanding is not correct.

thanks
Shveta
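The ordering problem described here, persisting the counter before the resource it names exists, can be sketched as follows. This is a toy model, not the patch's code: moving the increment after slot/origin creation succeeds avoids burning ids on workers that fail.

```python
class Catalog:
    """Stand-in for the persisted per-subscription lastusedid."""
    def __init__(self):
        self.last_used_id = 0

def create_slot(free_slots):
    """Stand-in for slot/origin creation, which can fail."""
    if free_slots <= 0:
        raise RuntimeError("all replication slots are in use")

def start_worker_buggy(cat, free_slots):
    cat.last_used_id += 1      # persisted before the slot exists...
    create_slot(free_slots)    # ...so a failure here leaks the increment

def start_worker_fixed(cat, free_slots):
    create_slot(free_slots)    # create the resource first
    cat.last_used_id += 1      # persist the counter only on success

for starter in (start_worker_buggy, start_worker_fixed):
    cat = Catalog()
    for _ in range(5):         # five workers arrive; no slots are free
        try:
            starter(cat, free_slots=0)
        except RuntimeError:
            pass
    print(starter.__name__, cat.last_used_id)
# start_worker_buggy 5   -- counter burned with nothing created
# start_worker_fixed 0   -- no leak on failure
```

With the buggy ordering the counter grows unboundedly while the subscriber keeps retrying, which is exactly the lastUsedId runaway described above.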
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu wrote: > > Hi Shveta, > > shveta malik , 1 Şub 2023 Çar, 15:01 tarihinde şunu > yazdı: >> >> On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu wrote: >> 2) I found a crash in the previous patch (v9), but have not tested it >> on the latest yet. Crash happens when all the replication slots are >> consumed and we are trying to create new. I tweaked the settings like >> below so that it can be reproduced easily: >> max_sync_workers_per_subscription=3 >> max_replication_slots = 2 >> and then ran the test case shared by you. I think there is some memory >> corruption happening. (I did test in debug mode, have not tried in >> release mode). I tried to put some traces to identify the root-cause. >> I observed that worker_1 keeps on moving from 1 table to another table >> correctly, but at some point, it gets corrupted i.e. origin-name >> obtained for it is wrong and it tries to advance that and since that >> origin does not exist, it asserts and then something else crashes. >> From log: (new trace lines added by me are prefixed by shveta, also >> tweaked code to have my comment 1 fixed to have clarity on worker-id). >> >> form below traces, it is clear that worker_1 was moving from one >> relation to another, always getting correct origin 'pg_16688_1', but >> at the end it got 'pg_16688_49' which does not exist. Second part of >> trace shows who updated 'pg_16688_49', it was done by worker_49 which >> even did not get chance to create this origin due to max_rep_slot >> reached. > > > Thanks for investigating this error. I think it's the same error as the one > Shi reported earlier. [1] > I couldn't reproduce it yet but will apply your tweaks and try again. > Looking into this. > > [1] > https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com > Hi Melih, I think I am able to identify the root cause. 
It is not memory corruption, but the way origin-names are stored in the system catalog, mapped to a particular relation-id, before those are even created. After adding a few more logs:

[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 constructed originname :pg_16684_49, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 updated-origin in system catalog:pg_16684_49, slot:pg_16684_sync_49_7195149685251088378, relid:16540
[4227] LOG: shveta- LogicalRepSyncTableStart- worker_49 get-origin-id2 originid:0, originname:pg_16684_49
[4227] ERROR: could not create replication slot "pg_16684_sync_49_7195149685251088378": ERROR: all replication slots are in use
HINT: Free one or increase max_replication_slots.
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 constructed originname :pg_16684_49, relid:16540
[4428] LOG: could not drop replication slot "pg_16684_sync_49_7195149685251088378" on publisher: ERROR: replication slot "pg_16684_sync_49_7195149685251088378" does not exist
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 drop-origin originname:pg_16684_49
[4428] LOG: shveta- LogicalRepSyncTableStart- worker_148 updated-origin:pg_16684_49, slot:pg_16684_sync_148_7195149685251088378, relid:16540

So from the above, worker_49 came and picked up relid:16540, constructed the origin-name and slot-name, and updated them in the system catalog. It then tried to actually create that slot and origin, but since the max-slot count was reached, it failed and exited without doing any cleanup of the system catalog for that relid. Then worker_148 came and also picked up table 16540, since it was not completed/started by the previous worker, but this time it found the origin and slot entries present in the system catalog for this relid, so it picked up the same names and started processing. Since those do not actually exist, it asserted while advancing the origin.
The assert is only reproduced when an already running worker (say worker_1) that has 'created=true' set gets to sync a relid for which a previously failed worker updated the origin-name without creating it. In such a case worker_1 (with created=true) will try to reuse the origin, attempt to advance it, and end up asserting. That is why you might not always see the assertion: it requires a worker whose 'created' flag is true to pick up the origin left behind by the failed worker. So to fix this, I think either we update the origin and slot entries in the system catalog only after creation has succeeded, or we clean up the system catalog in case of failure. What do you think?

thanks
Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Feb 1, 2023 at 5:42 PM Melih Mutlu wrote: > > > Thanks for investigating this error. I think it's the same error as the one > Shi reported earlier. [1] > I couldn't reproduce it yet but will apply your tweaks and try again. > Looking into this. > > [1] > https://www.postgresql.org/message-id/OSZPR01MB631013C833C98E826B3CFCB9FDC69%40OSZPR01MB6310.jpnprd01.prod.outlook.com I tried Shi-san's testcase earlier but I too could not reproduce it, so I assumed that it is fixed in one of your patches already and thus thought that the current issue is a new one. thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Wed, Feb 1, 2023 at 5:05 PM Melih Mutlu wrote:
>
> Hi,
>
> Please see attached patches for the below changes.
>
> Thanks for reviewing,
> --
> Melih Mutlu
> Microsoft

Hello Melih, thank you for making the changes. I have a few more comments:

1)
src/backend/replication/logical/worker.c: (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has started",
src/backend/replication/logical/worker.c: (errmsg("logical replication table synchronization worker for subscription \"%s\" has moved to sync table \"%s\".",
src/backend/replication/logical/tablesync.c: (errmsg("logical replication table synchronization worker for subscription \"%s\", table \"%s\" has finished",

In the above, can we have rep_slot_id in the trace message as well, else it is not clear which worker moved to the next relation. We may have:
logical replication table synchronization worker_%d for subscription \"%s\" has moved to sync table, rep_slot_id,
Overall we need to improve the tracing. I will give my suggestions on this later (in detail).

2)
I found a crash in the previous patch (v9), but have not tested it on the latest yet. The crash happens when all the replication slots are consumed and we are trying to create a new one. I tweaked the settings like below so that it can be reproduced easily:
max_sync_workers_per_subscription = 3
max_replication_slots = 2
and then ran the test case shared by you. I think there is some memory corruption happening. (I did test in debug mode; I have not tried release mode.) I tried to put some traces to identify the root cause. I observed that worker_1 keeps moving from one table to another correctly, but at some point it gets corrupted, i.e. the origin-name obtained for it is wrong, and since that origin does not exist, it asserts when trying to advance it and then something else crashes.

From log: (new trace lines added by me are prefixed by shveta; I also tweaked the code per my comment 1 to have clarity on worker-id)
From the below traces, it is clear that worker_1 was moving from one relation to another, always getting the correct origin 'pg_16688_1', but at the end it got 'pg_16688_49', which does not exist. The second part of the trace shows who updated 'pg_16688_49': it was worker_49, which did not even get a chance to create this origin because max_replication_slots was reached.

==
..
2023-02-01 16:01:38.041 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub", table "table_93" has finished
2023-02-01 16:01:38.047 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub" has moved to sync table "table_98".
2023-02-01 16:01:38.055 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 get-origin-id2 originid:2, originname:pg_16688_1
2023-02-01 16:01:38.055 IST [9243] LOG: shveta- LogicalRepSyncTableStart- Worker_1 reusing slot:pg_16688_sync_1_7195132648087016333, originid:2, originname:pg_16688_1
2023-02-01 16:01:38.094 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 updated-origin2 originname:pg_16688_1
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub", table "table_98" has finished
2023-02-01 16:01:38.096 IST [9243] LOG: logical replication table synchronization worker_1 for subscription "mysub" has moved to sync table "table_60".
2023-02-01 16:01:38.099 IST [9243] LOG: shveta- LogicalRepSyncTableStart- worker_1 get-origin originid:0, originname:pg_16688_49
2023-02-01 16:01:38.099 IST [9243] LOG: could not drop replication slot "pg_16688_sync_49_7195132648087016333" on publisher: ERROR: replication slot "pg_16688_sync_49_7195132648087016333" does not exist
2023-02-01 16:01:38.103 IST [9243] LOG: shveta- LogicalRepSyncTableStart- Worker_1 reusing slot:pg_16688_sync_1_7195132648087016333, originid:0, originname:pg_16688_49
TRAP: failed Assert("node != InvalidRepOriginId"), File: "origin.c", Line: 892, PID: 9243
postgres: logical replication worker for subscription 16688 sync 16384 (ExceptionalCondition+0xbb)[0x56019194d3b7]
postgres: logical replication worker for subscription 16688 sync 16384 (replorigin_advance+0x6d)[0x5601916b53d4]
postgres: logical replication worker for subscription 16688 sync 16384 (LogicalRepSyncTableStart+0xbb4)[0x5601916cb648]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x5d25e2)[0x5601916d35e2]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x5d282c)[0x5601916d382c]
postgres: logical replication worker for subscription 16688 sync 16384 (ApplyWorkerMain+0x17b)[0x5601916d4078]
postgres: logical replication worker for subscription 16688 sync 16384 (StartBackgroundWorker+0x248)[0x56019167f943]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x589ad3)[0x56019168aad3]
postgres: logical replication worker for subscription 16688 sync 16384 (+0x589ee3)[0x56019168aee3]
postgres: logical
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Tue, Jan 31, 2023 at 3:57 PM wangw.f...@fujitsu.com wrote: > > On Mon, Jan 23, 2023 21:00 PM Melih Mutlu wrote: > > Hi, > > > > Thanks for your review. > > Attached updated versions of the patches. > > Thanks for updating the patch set. > > > > 5. New member "created_slot" in structure LogicalRepWorker > > > + /* > > > +* Indicates if the sync worker created a replication slot or it > > > reuses an > > > +* existing one created by another worker. > > > +*/ > > > + boolcreated_slot; > > > > Yes, I think it makes sense. Thanks for the detailed explanation. > I think I misunderstood the second half of the comment. I previously thought > it > meant that it was also true when reusing an existing slot. > I agree with Wang-san that the comment is confusing, I too misunderstood it initially during my first run of the code. Maybe it can be improved. 'Indicates if the sync worker created a replication slot for itself; set to false if sync worker reuses an existing one created by another worker' thanks Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Fri, Jan 27, 2023 at 3:41 PM shveta malik wrote:
>
> I am reviewing further...
> thanks
> Shveta

A few more comments:

v4-0001:
1)
REPLICATION_SLOT_SNAPSHOT
--Do we need a 'CREATE' prefix with it, i.e. CREATE_REPLICATION_SNAPSHOT (or some other brief one with CREATE?). 'REPLICATION_SLOT_SNAPSHOT' does not look like a command/action and is thus confusing.

2)
is used in the currenct transaction. This command is currently only supported for logical replication.
slots.
--typo: currenct-->current
--slots can be moved to previous line

3)
/*
 * Signal that we don't need the timeout mechanism. We're just creating
 * the replication slot and don't yet accept feedback messages or send
 * keepalives. As we possibly need to wait for further WAL the walsender
 * would otherwise possibly be killed too soon.
 */
We're just creating the replication slot --> We're just creating the replication snapshot

4)
I see an XactReadOnly check in CreateReplicationSlot; do we need the same in ReplicationSlotSnapshot() as well?

===
v9-0002:
5)
/* We are safe to drop the replication trackin origin after this
--typo: tracking

6)
slot->data.catalog_xmin = xmin_horizon;
slot->effective_xmin = xmin_horizon;
SpinLockRelease(&slot->mutex);
xmin_horizon = GetOldestSafeDecodingTransactionId(!need_full_snapshot);
ReplicationSlotsComputeRequiredXmin(true);

--do we need to set xmin_horizon in the slot after the 'GetOldestSafeDecodingTransactionId' call? Otherwise it will be set to an invalid id in the slot. Is that intentional? I see that we do set the correct xmin_horizon in builder->initial_xmin_horizon, but the slot is carrying an invalid one.

thanks
Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Thu, Jan 26, 2023 at 7:53 PM Melih Mutlu wrote:
>
> If a relation is currently being synced by a tablesync worker and uses a
> replication slot/origin for that operation, then srrelslotname and
> srreloriginname fields will have values.
> When a relation is done with its replication slot/origin, their info gets
> removed from related catalog row, so that slot/origin can be reused for
> another table or dropped if not needed anymore.
> In your case, all relations are in READY state so it's expected that
> srrelslotname and srreloriginname are empty. READY relations do not need a
> replication slot/origin anymore.
>
> Tables are probably synced so quickly that you're missing the moments when a
> tablesync worker copies a relation and stores its rep. slot/origin in the
> catalog.
> If initial sync is long enough, then you should be able to see the columns
> get updated. I follow [1] to make it longer and test if the patch really
> updates the catalog.

Thank you for the details. It is clear now.

> Rebased and resolved conflicts. Please check the new version

Please find my suggestions on v9:

1.
--Can we please add a few more points to the Summary to make it clearer:
a) something telling that reusability of workers is for tables under one subscription and not across multiple subscriptions.
b) Since we are reusing both workers and slots, can we add:
--when do we actually end up spawning a new worker
--when do we actually end up creating a new slot in a worker rather than reusing an existing one
--if we reuse existing slots, what happens to the snapshot?

2.
+ The last used ID for tablesync workers. This ID is used to
+ create replication slots. The last used ID needs to be stored
+ to make logical replication can safely proceed after any interruption.
+ If sublastusedid is 0, then no table has been synced yet.
--typo: "to make logical replication can safely proceed" ==> "to make logical replication safely proceed"
--Also, does this sound better:
The last used ID for tablesync workers. It acts as a unique identifier for replication slots which are created by table-sync workers. The last used ID needs to be persisted...

3.
is_first_run;
move_to_next_rel;
--Do you think one variable is enough here, as we do not get any extra info by using 2 variables? Can we have 1 which is more generic, like 'ready_to_reuse'? Otherwise, please let me know if we must use 2.

4.
/* missin_ok = true, since the origin might be already dropped. */
--typo: missing_ok

5.
GetReplicationSlotNamesBySubId:
errmsg("not tuple returned."));
Can we have a better error msg:
ereport(ERROR, errmsg("could not receive list of slots associated with subscription %d, error: %s", subid, res->err));

6.
static void
clean_sync_worker(void)
--can we please add an introductory comment for this function.

7.
/*
 * Pick the table for the next run if there is not another worker
 * already picked that table.
 */
--suggestion: Pick the table for the next run if it is not already picked up by another worker.

8.
process_syncing_tables_for_sync()
/* Cleanup before next run or ending the worker. */
--can we please improve this comment: if there is no more work left for this worker, stop the worker gracefully, else do clean-up and make it ready for the next relation/run.

9.
LogicalRepSyncTableStart:
 * Read previous slot name from the catalog, if exists.
 */
prev_slotname = (char *) palloc0(NAMEDATALEN);
Do we need to free this at the end?

10.
if (strlen(prev_slotname) == 0)
{
    elog(ERROR, "Replication slot could not be found for relation %u",
         MyLogicalRepWorker->relid);
}
Shall we mention subid also in the error msg?

I am reviewing further...

thanks
Shveta
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
On Mon, Jan 23, 2023 at 6:30 PM Melih Mutlu wrote:
>
> Hi,
>
> Thanks for your review.
> Attached updated versions of the patches.
>

Hello, I am still in the process of reviewing the patch; before that, I tried to run the below test:

--publisher
create table tab1(id int, name varchar);
create table tab2(id int primary key, name varchar);
create table tab3(id int primary key, name varchar);
Insert into tab1 values(10, 'a');
Insert into tab1 values(20, 'b');
Insert into tab1 values(30, 'c');
Insert into tab2 values(10, 'a');
Insert into tab2 values(20, 'b');
Insert into tab2 values(30, 'c');
Insert into tab3 values(10, 'a');
Insert into tab3 values(20, 'b');
Insert into tab3 values(30, 'c');
create publication mypub for table tab1, tab2, tab3;

--subscriber
create table tab1(id int, name varchar);
create table tab2(id int primary key, name varchar);
create table tab3(id int primary key, name varchar);
create subscription mysub connection 'dbname=postgres host=localhost user=shveta port=5432' publication mypub;

--I see the initial data copied, but the new catalog columns srrelslotname and srreloriginname are not updated:

postgres=# select sublastusedid from pg_subscription;
 sublastusedid
---------------
             2

postgres=# select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn  | srrelslotname | srreloriginname
---------+---------+------------+-----------+---------------+-----------------
   16409 |   16384 | r          | 0/15219E0 |               |
   16409 |   16389 | r          | 0/15219E0 |               |
   16409 |   16396 | r          | 0/15219E0 |               |

When are these supposed to be updated? I thought the slotname created would be updated here. Am I missing something? Also, the v8 patch does not apply on HEAD, giving merge conflicts.

thanks
Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Tue, Jan 24, 2023 at 5:49 PM Takamichi Osumi (Fujitsu) wrote:
>
> Attached the patch v20 that has incorporated all comments so far.
> Kindly have a look at the attached patch.
>
> Best Regards,
> Takamichi Osumi

Thank you for the patch. My previous comments are addressed. I tested it and it looks good. Logging is also fine now.

Just one comment: in the summary, we see:
"If the subscription sets min_apply_delay parameter, the logical replication worker will delay the transaction commit for min_apply_delay milliseconds."
Is it better to write "delay the transaction apply" instead of "delay the transaction commit", just to be consistent, as we do not actually delay the commit for regular transactions?

thanks
Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Fri, Jan 20, 2023 at 2:23 PM shveta malik wrote:
>
> On Fri, Jan 20, 2023 at 1:08 PM Peter Smith wrote:
> > a) the message should say that this is the *remaining* time left to wait.
> >
> > b) it might be convenient to know from the log what was the original
> > min_apply_delay value in the 1st place.
> >
> > For example, the logs might look something like this:
> >
> > DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 159972 ms
> > DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 142828 ms
> > DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 129994 ms
> > DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 110001 ms
> > ...
>
> +1
> This will also help when min_apply_delay is set to a new value in
> between the current wait. Lets say, I started with min_apply_delay=5
> min, when the worker was half way through this, I changed
> min_apply_delay to 3 min or say 10min, I see the impact of that change
> i.e. new wait-time is adjusted, but log becomes confusing. So, please
> keep this scenario as well in mind while improving logging.

When we send feedback during the apply-delay, after every wal_receiver_status_interval, the log comes as:

2023-01-19 17:12:56.000 IST [404795] DEBUG: sending feedback (force 1) to recv 0/1570840, write 0/1570840, flush 0/1570840

Shall we have some info here to indicate that it is sent while waiting for the apply_delay, to distinguish it from other such send-feedback logs? It will make the apply_delay flow clearer in the logs.

thanks
Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Fri, Jan 20, 2023 at 1:08 PM Peter Smith wrote:
> a) the message should say that this is the *remaining* time left to wait.
>
> b) it might be convenient to know from the log what was the original
> min_apply_delay value in the 1st place.
>
> For example, the logs might look something like this:
>
> DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 159972 ms
> DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 142828 ms
> DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 129994 ms
> DEBUG: time-delayed replication for txid 1234, min_apply_delay = 16 ms. Remaining wait time: 110001 ms
> ...

+1
This will also help when min_apply_delay is set to a new value in between the current wait. Let's say I started with min_apply_delay=5 min; when the worker was halfway through this, I changed min_apply_delay to 3 min (or say 10 min). I see the impact of that change, i.e. the new wait-time is adjusted, but the log becomes confusing. So, please keep this scenario in mind as well while improving the logging.

thanks
Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Thu, Jan 19, 2023 at 12:42 PM Takamichi Osumi (Fujitsu) wrote: > > On Wednesday, January 18, 2023 4:06 PM Peter Smith > wrote: > > Here are my review comments for the latest patch v16-0001. (excluding the > > test code) > Hi, thank you for your review ! > > > == > > > > General > > > > 1. > > > > Since the value of min_apply_delay cannot be < 0, I was thinking probably > > it > > should have been declared everywhere in this patch as a > > uint64 instead of an int64, right? > No, we won't be able to adopt this idea. > > It seems that we are not able to use uint for catalog type. > So, can't applying it to the pg_subscription.h definitions > and then similarly Int64GetDatum to store catalog variables > and the argument variable of Int64GetDatum. > > Plus, there is a possibility that type Interval becomes negative value, > then we are not able to change the int64 variable to get > the return value of interval2ms(). > > > == > > > > Commit message > > > > 2. > > > > If the subscription sets min_apply_delay parameter, the logical replication > > worker will delay the transaction commit for min_apply_delay milliseconds. > > > > ~ > > > > IMO there should be another sentence before this just to say that a new > > parameter is being added: > > > > e.g. > > This patch implements a new subscription parameter called > > 'min_apply_delay'. > Added. > > > > == > > > > doc/src/sgml/config.sgml > > > > 3. > > > > + > > + For time-delayed logical replication, the apply worker sends a > > Standby > > + Status Update message to the corresponding publisher per the > > indicated > > + time of this parameter. Therefore, if this parameter is longer than > > + wal_sender_timeout on the publisher, then the > > + walsender doesn't get any update message during the delay and > > repeatedly > > + terminates due to the timeout errors. Hence, make sure this > > parameter > > is > > + shorter than the wal_sender_timeout of the > > publisher. 
> > + If this parameter is set to zero with time-delayed replication, the
> > + apply worker doesn't send any feedback messages during the
> > + min_apply_delay.
> > +
> >
> > This paragraph seemed confusing. I think it needs to be reworded to change all
> > of the "this parameter" references because there are at least 3 different
> > parameters mentioned in this paragraph. e.g. maybe just change them to
> > explicitly name the parameter you are talking about.
> >
> > I also think it needs to mention the 'min_apply_delay' subscription parameter
> > up-front and then refer to it appropriately.
> >
> > The end result might be something like I wrote below (this is just my guess -
> > probably you can word it better).
> >
> > SUGGESTION
> > For time-delayed logical replication (i.e. when the subscription is created with
> > parameter min_apply_delay > 0), the apply worker sends a Standby Status
> > Update message to the publisher with a period of wal_receiver_status_interval.
> > Make sure to set wal_receiver_status_interval less than the
> > wal_sender_timeout on the publisher, otherwise, the walsender will repeatedly
> > terminate due to the timeout errors. If wal_receiver_status_interval is set to zero,
> > the apply worker doesn't send any feedback messages during the subscriber's
> > min_apply_delay period.
> Applied. Also, I added one reference for the min_apply_delay parameter
> at the end of this description.
>
> > ==
> >
> > doc/src/sgml/ref/create_subscription.sgml
> >
> > 4.
> >
> > +
> > + By default, the subscriber applies changes as soon as possible. As
> > + with the physical replication feature
> > + (), it can be useful to
> > + have a time-delayed logical replica. This parameter lets the user to
> > + delay the application of changes by a specified amount of time. If this
> > + value is specified without units, it is taken as milliseconds. The
> > + default is zero(no delay).
> > +
> >
> > 4a.
> > As with the physical replication feature (recovery_min_apply_delay), it can > > be > > useful to have a time-delayed logical replica. > > > > IMO not sure that the above sentence is necessary. It seems only to be > > saying > > that this parameter can be useful. Why do we need to say that? > Removed the sentence. > > > > ~ > > > > 4b. > > "This parameter lets the user to delay" -> "This parameter lets the user > > delay" > > OR > > "This parameter lets the user to delay" -> "This parameter allows the user > > to > > delay" > Fixed. > > > > ~ > > > > 4c. > > "If this value is specified without units" -> "If the value is specified > > without > > units" > Fixed. > > > ~ > > > > 4d. > > "zero(no delay)." -> "zero (no delay)." > Fixed. > > > > > > > 5. > > > > + > > + The delay occurs only on WAL records for transaction begins and
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Jan 19, 2023 at 3:44 PM shveta malik wrote: > > On Thu, Jan 19, 2023 at 11:11 AM Amit Kapila wrote: > > > > On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila > > wrote: > > > > > > On Fri, Jan 13, 2023 at 11:50 AM Peter Smith > > > wrote: > > > > > > > > Here are some review comments for patch v79-0002. > > > > > > > > > > So, this is about the latest > > > v84-0001-Stop-extra-worker-if-GUC-was-changed. > > > > > > > > > > > I feel this patch just adds more complexity for almost no gain: > > > > - reducing the 'max_apply_workers_per_suibscription' seems not very > > > > common in the first place. > > > > - even when the GUC is reduced, at that point in time all the workers > > > > might be in use so there may be nothing that can be immediately done. > > > > - IIUC the excess workers (for a reduced GUC) are going to get freed > > > > naturally anyway over time as more transactions are completed so the > > > > pool size will reduce accordingly. > > > > > > > > > > I am still not sure if it is worth pursuing this patch because of the > > > above reasons. I don't think it would be difficult to add this even at > > > a later point in time if we really see a use case for this. > > > Sawada-San, IIRC, you raised this point. What do you think? > > > > > > The other point I am wondering is whether we can have a different way > > > to test partial serialization apart from introducing another developer > > > GUC (stream_serialize_threshold). One possibility could be that we can > > > have a subscription option (parallel_send_timeout or something like > > > that) with some default value (current_timeout used in the patch) > > > which will be used only when streaming = parallel. Users may want to > > > wait for more time before serialization starts depending on the > > > workload (say when resource usage is high on a subscriber-side > > > machine, or there are concurrent long-running transactions that can > > > block parallel apply for a bit longer time). 
I know with this as well > > > it may not be straightforward to test the functionality because we > > > can't be sure how many changes would be required for a timeout to > > > occur. This is just for brainstorming other options to test the > > > partial serialization functionality. > > > > > > > Apart from the above, we can also have a subscription option to > > specify parallel_shm_queue_size (queue_size used to determine the > > queue between the leader and parallel worker) and that can be used for > > this purpose. Basically, configuring it to a smaller value can help in > > reducing the test time but still, it will not eliminate the need for > > dependency on timing we have to wait before switching to partial > > serialize mode. I think this can be used in production as well to tune > > the performance depending on workload. > > > > Yet another way is to use the existing parameter logical_decode_mode > > [1]. If the value of logical_decoding_mode is 'immediate', then we can > > immediately switch to partial serialize mode. This will eliminate the > > dependency on timing. The one argument against using this is that it > > won't be as clear as a separate parameter like > > 'stream_serialize_threshold' proposed by the patch but OTOH we already > > have a few parameters that serve a different purpose when used on the > > subscriber. For example, 'max_replication_slots' is used to define the > > maximum number of replication slots on the publisher and the maximum > > number of origins on subscribers. Similarly, > > wal_retrieve_retry_interval' is used for different purposes on > > subscriber and standby nodes. > > > > [1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html > > > > -- > > With Regards, > > Amit Kapila. > > Hi Amit, > > On rethinking the complete model, what I feel is that the name > logical_decoding_mode is not something which defines modes of logical > decoding. We, I think, picked it based on logical_decoding_work_mem. 
> As per current implementation, the parameter 'logical_decoding_mode' > tells what happens when work-memory used by logical decoding reaches > its limit. So it is in-fact 'logicalrep_workmem_vacate_mode' or Minor correction in what I said earlier: As per current implementation, the parameter 'logical_decoding_mode' more or less tells how to deal with workmem i.e. to keep it
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Jan 19, 2023 at 11:11 AM Amit Kapila wrote: > > On Wed, Jan 18, 2023 at 12:09 PM Amit Kapila wrote: > > > > On Fri, Jan 13, 2023 at 11:50 AM Peter Smith wrote: > > > > > > Here are some review comments for patch v79-0002. > > > > > > > So, this is about the latest v84-0001-Stop-extra-worker-if-GUC-was-changed. > > > > > > > > I feel this patch just adds more complexity for almost no gain: > > > - reducing the 'max_apply_workers_per_suibscription' seems not very > > > common in the first place. > > > - even when the GUC is reduced, at that point in time all the workers > > > might be in use so there may be nothing that can be immediately done. > > > - IIUC the excess workers (for a reduced GUC) are going to get freed > > > naturally anyway over time as more transactions are completed so the > > > pool size will reduce accordingly. > > > > > > > I am still not sure if it is worth pursuing this patch because of the > > above reasons. I don't think it would be difficult to add this even at > > a later point in time if we really see a use case for this. > > Sawada-San, IIRC, you raised this point. What do you think? > > > > The other point I am wondering is whether we can have a different way > > to test partial serialization apart from introducing another developer > > GUC (stream_serialize_threshold). One possibility could be that we can > > have a subscription option (parallel_send_timeout or something like > > that) with some default value (current_timeout used in the patch) > > which will be used only when streaming = parallel. Users may want to > > wait for more time before serialization starts depending on the > > workload (say when resource usage is high on a subscriber-side > > machine, or there are concurrent long-running transactions that can > > block parallel apply for a bit longer time). 
I know with this as well > > it may not be straightforward to test the functionality because we > > can't be sure how many changes would be required for a timeout to > > occur. This is just for brainstorming other options to test the > > partial serialization functionality. > > > > Apart from the above, we can also have a subscription option to > specify parallel_shm_queue_size (queue_size used to determine the > queue between the leader and parallel worker) and that can be used for > this purpose. Basically, configuring it to a smaller value can help in > reducing the test time but still, it will not eliminate the need for > dependency on timing we have to wait before switching to partial > serialize mode. I think this can be used in production as well to tune > the performance depending on workload. > > Yet another way is to use the existing parameter logical_decoding_mode > [1]. If the value of logical_decoding_mode is 'immediate', then we can > immediately switch to partial serialize mode. This will eliminate the > dependency on timing. The one argument against using this is that it > won't be as clear as a separate parameter like > 'stream_serialize_threshold' proposed by the patch but OTOH we already > have a few parameters that serve a different purpose when used on the > subscriber. For example, 'max_replication_slots' is used to define the > maximum number of replication slots on the publisher and the maximum > number of origins on subscribers. Similarly, > 'wal_retrieve_retry_interval' is used for different purposes on > subscriber and standby nodes. > > [1] - https://www.postgresql.org/docs/devel/runtime-config-developer.html > > -- > With Regards, > Amit Kapila. Hi Amit, On rethinking the complete model, what I feel is that the name logical_decoding_mode is not something which defines modes of logical decoding. We, I think, picked it based on logical_decoding_work_mem.
As per the current implementation, the parameter 'logical_decoding_mode' tells what happens when the work-memory used by logical decoding reaches its limit. So it is in fact 'logicalrep_workmem_vacate_mode' or 'logicalrep_trans_eviction_mode'. If it is set to immediate, that means vacate the work-memory immediately or evict transactions immediately. And 'buffered' means the reverse (i.e. keep on buffering transactions until we reach a limit). Now coming to subscribers, we can reuse the same parameter. On the subscriber as well, the shared-memory queue could be considered as its workmem and thus the name 'logicalrep_workmem_vacate_mode' might look more relevant. On publisher: logicalrep_workmem_vacate_mode=immediate, buffered. On subscriber: logicalrep_workmem_vacate_mode=stream_serialize (or if we want to keep immediate here too, that will also be fine). Thoughts? And I am assuming it is possible to change the GUC name before the coming release. If not, please let me know, we can brainstorm other ideas. thanks Shveta
Re: Question about initial logical decoding snapshot
Hello, I was curious as to why we need 3rd running_xact and wanted to learn more about it, so I have made a few changes to come up with a patch which builds the snapshot in 2 running_xacts. The motive is to run the tests to see the failures/issues with this approach to understand the need of reading 3rd running_xact to build a consistent snapshot. On this patch, I have got one test-failure which is test_decoding/twophase_snapshot. Approach: When we start building a snapshot, on the occurrence of first running_xact, move the state from START to BUILDING and wait for all in-progress transactions to finish. On the second running_xact where we find oldestRunningXid >= 1st xl_running_xact's nextXid, move to CONSISTENT state. So, it means all the transactions started before BUILDING state are now finished and all the new transactions that are currently in progress are the ones that are started after BUILDING state and thus have enough info to be decoded. Failure analysis for twophase_snapshot test: After the patch application, test-case fails because slot is created sooner and 'PREPARE TRANSACTION test1' is available as result of first 'pg_logical_slot_get_changes' itself. Intent of this testcase is to see how two-phase txn is handled when snapshot-build completes in 3 stages (BUILDING-->FULL-->CONSISTENT). Originally, the PREPARED txn is started between FULL and CONSISTENT stage and thus as per the current code logic, 'DecodePrepare' will skip it. Please see code in DecodePrepare: /* We can't start streaming unless a consistent state is reached. */ if (SnapBuildCurrentState(builder) < SNAPBUILD_CONSISTENT) { ReorderBufferSkipPrepare(ctx->reorder, xid); return; } So first 'pg_logical_slot_get_changes' will not show these changes. Once we do 'commit prepared' after CONSISTENT state is reached, it will be available for next 'pg_logical_slot_get_changes' to consume. 
On the other hand, after the current patch, since we reach the consistent state sooner, with the same test-case the PREPARED transaction now ends up starting after the CONSISTENT state and thus will be available to be consumed by the first 'pg_logical_slot_get_changes' itself. This makes the testcase fail. Please note that in the patch, I have maintained 'WAIT for all running transactions to end' even after reaching CONSISTENT state. I have tried running tests even after removing that WAIT after CONSISTENT, with that, we get one more test failure which is test_decoding/ondisk_startup. The reason for failure here is the same as in the previous case i.e., since we reach CONSISTENT state earlier, slot-creation finishes faster and thus we see a slight change in the result for this test. ('step s1init completed' seen earlier in log file). Both the failing tests here are written in such a way that they align with the 3-phase snapshot build process. Otherwise, I do not see any logical issues yet with this approach based on the test-cases available so far. So, I still have not gotten clarity on why we need the 3rd running_xact here. In code, I see a comment in SnapBuildFindSnapshot() which says "c) ...But for older running transactions no viable snapshot exists yet, so CONSISTENT will only be reached once all of those have finished." This comment refers to txns started between BUILDING and FULL state. I do not understand it fully. I am not sure what tests I need to run on the patch to reproduce this issue where we do not have a viable snapshot when we go by two running_xacts only. Any thoughts/comments are most welcome. Attached the patch for review. Thanks Shveta On Fri, Dec 30, 2022 at 11:57 PM Chong Wang wrote: > > Hi hackers. > > I'm studying the source code about creation of initial logical decoding > snapshot. What confused me is why we must process 3 xl_running_xacts > before we get to the consistent state. I think we only need 2 > xl_running_xacts.
> > I think we can get to consistent state when we meet the 2nd xl_running_xact > with its oldestRunningXid > 1st xl_running_xact's nextXid, this means the > active transactions in 1st xl_running_xact all had committed, and we have all > the logs of transactions who will commit afterwards, so there is consistent > state in this time point and we can export a snapshot. > > I had read the discussion in [0] and the comment of commit '955a684', but I > haven't got a detailed explanation about why we need 4 stages during creation > of initial logical decoding snapshot but not 3 stages. > > My recent job is relevant to logical decoding so I want to figure this > problem out, I'm very grateful if you can answer me, thanks. > > > > [0] > https://www.postgresql.org/message-id/flat/f37e975c-908f-858e-707f-058d3b1eb214%402ndquadrant.com > > > > -- > > Best regards > > Chong Wang > > Greenplum DataFlow team v1-0001-SnapBuild-in-2-stages.patch Description: Binary data
Re: Perform streaming logical transactions by background workers and parallel apply
On Tue, Jan 17, 2023 at 9:07 AM houzj.f...@fujitsu.com wrote: > > On Tuesday, January 17, 2023 11:32 AM Peter Smith > wrote: > > > > On Tue, Jan 17, 2023 at 1:21 PM houzj.f...@fujitsu.com > > wrote: > > > > > > On Tuesday, January 17, 2023 5:43 AM Peter Smith > > wrote: > > > > > > > > On Mon, Jan 16, 2023 at 5:41 PM Amit Kapila > > > > > > > > wrote: > > > > > > > > > > On Mon, Jan 16, 2023 at 10:24 AM Peter Smith > > > > > > > > > wrote: > > > > > > > > > > > > 2. > > > > > > > > > > > > /* > > > > > > + * Return the pid of the leader apply worker if the given pid > > > > > > +is the pid of a > > > > > > + * parallel apply worker, otherwise return InvalidPid. > > > > > > + */ > > > > > > +pid_t > > > > > > +GetLeaderApplyWorkerPid(pid_t pid) { int leader_pid = > > > > > > +InvalidPid; int i; > > > > > > + > > > > > > + LWLockAcquire(LogicalRepWorkerLock, LW_SHARED); > > > > > > + > > > > > > + for (i = 0; i < max_logical_replication_workers; i++) { > > > > > > + LogicalRepWorker *w = >workers[i]; > > > > > > + > > > > > > + if (isParallelApplyWorker(w) && w->proc && pid == > > > > > > + w->proc->pid) { leader_pid = w->leader_pid; break; } } > > > > > > + > > > > > > + LWLockRelease(LogicalRepWorkerLock); > > > > > > + > > > > > > + return leader_pid; > > > > > > +} > > > > > > > > > > > > 2a. > > > > > > IIUC the IsParallelApplyWorker macro does nothing except check > > > > > > that the leader_pid is not InvalidPid anyway, so AFAIK this > > > > > > algorithm does not benefit from using this macro because we will > > > > > > want to return InvalidPid anyway if the given pid matches. > > > > > > > > > > > > So the inner condition can just say: > > > > > > > > > > > > if (w->proc && w->proc->pid == pid) { leader_pid = > > > > > > w->leader_pid; break; } > > > > > > > > > > > > > > > > Yeah, this should also work but I feel the current one is explicit > > > > > and more clear. > > > > > > > > OK. 
> > > > > > > > But, I have one last comment about this function -- I saw there are > > > > already other functions that iterate max_logical_replication_workers > > > > like this looking for things: > > > > - logicalrep_worker_find > > > > - logicalrep_workers_find > > > > - logicalrep_worker_launch > > > > - logicalrep_sync_worker_count > > > > > > > > So I felt this new function (currently called > > > > GetLeaderApplyWorkerPid) ought to be named similarly to those ones. > > > > e.g. call it something like "logicalrep_worker_find_pa_leader_pid". > > > > > > > > > > I am not sure we can use the name, because currently all the API name > > > in launcher that used by other module(not related to subscription) are > > > like AxxBxx style(see the functions in logicallauncher.h). > > > logicalrep_worker_xxx style functions are currently only declared in > > > worker_internal.h. > > > > > > > OK. I didn't know there was another header convention that you were > > following. > > In that case, it is fine to leave the name as-is. > > Thanks for confirming! > > Attach the new version 0001 patch which addressed all other comments. > > Best regards, > Hou zj Hello Hou-san, 1. Do we need to extend test-cases to review the leader_pid column in pg_stats tables? 2. Do we need to follow the naming convention for 'GetLeaderApplyWorkerPid' like other functions in the same file which start with 'logicalrep_'? thanks Shveta
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Jan 12, 2023 at 4:37 PM Amit Kapila wrote: > > > But then do you suggest that tomorrow if we allow parallel sync > workers then we have a separate column leader_sync_pid? I think that > doesn't sound like a good idea and moreover one can refer to docs for > clarification. > > -- okay, leader_pid is fine I think. thanks Shveta
Re: Perform streaming logical transactions by background workers and parallel apply
On Thu, Jan 12, 2023 at 10:34 AM Amit Kapila wrote: > > On Thu, Jan 12, 2023 at 9:54 AM Peter Smith wrote: > > > > > > doc/src/sgml/monitoring.sgml > > > > 5. pg_stat_subscription > > > > @@ -3198,11 +3198,22 @@ SELECT pid, wait_event_type, wait_event FROM > > pg_stat_activity WHERE wait_event i > > > > > > > > + apply_leader_pid integer > > + > > + > > + Process ID of the leader apply worker, if this process is a apply > > + parallel worker. NULL if this process is a leader apply worker or a > > + synchronization worker. > > + > > + > > + > > + > > + > > relid oid > > > > > > OID of the relation that the worker is synchronizing; null for the > > - main apply worker > > + main apply worker and the parallel apply worker > > > > > > > > 5a. > > > > (Same as general comment #1 about terminology) > > > > "apply_leader_pid" --> "leader_apply_pid" > > > > How about naming this as just leader_pid? I think it could be helpful > in the future if we decide to parallelize initial sync (aka parallel > copy) because then we could use this for the leader PID of parallel > sync workers as well. > > -- I still prefer leader_apply_pid. leader_pid does not tell which 'operation' it belongs to. 'apply' gives the clarity that it is apply related process. The terms used in patch look very confusing. I had to read a few lines multiple times to understand it. 1. Summary says 'main_worker_pid' to be added but I do not see 'main_worker_pid' added in pg_stat_subscription, instead I see 'apply_leader_pid'. Am I missing something? Also, as stated above 'leader_apply_pid' makes more sense. it is better to correct it everywhere (apply leader-->leader apply). Once that is done, it can be reviewed again. thanks Shveta
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Jan 11, 2023 at 6:16 PM Hayato Kuroda (Fujitsu) wrote: > > > 2. > > I think users can set ' wal_receiver_status_interval ' to 0 or more > > than 'wal_sender_timeout'. But is this a frequent use-case scenario or > > do we see DBAs setting these in such a way by mistake? If so, then I > > think, it is better to give Warning message in such a case when a user > > tries to create or alter a subscription with a large 'min_apply_delay' > > (>= 'wal_sender_timeout') , rather than leaving it to the user's > > understanding that WalSender may repeatedly timeout in such a case. > > Parse_subscription_options and AlterSubscription can be modified to > > log a warning. Any thoughts? > > Yes, DBAs may set wal_receiver_status_interval to more than > wal_sender_timeout by > mistake. > > But to handle the scenario we must compare between min_apply_delay *on > subscriber* > and wal_sender_timeout *on publisher*. Both values are not transferred to > opposite > sides, so the WARNING cannot be raised. I considered that such a mechanism > seemed > to be complex. The discussion around [1] may be useful. > > [1]: > https://www.postgresql.org/message-id/CAA4eK1Lq%2Bh8qo%2BrqGU-E%2BhwJKAHYocV54y4pvou4rLysCgYD-g%40mail.gmail.com > okay, I see. So even when 'wal_receiver_status_interval' is set to 0, no log/warning is needed when the user tries to set min_apply_delay>0? Are we good with doc alone? One trivial correction in config.sgml: + terminates due to the timeout errors. Hence, make sure this parameter + shorter than the wal_sender_timeout of the publisher. Hence, make sure this parameter is shorter... thanks Shveta