On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclus...@gmail.com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup()
> called by a backend executing CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(300000L); added
> just below
>     if (!worker_in_use)
>         return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47
> csys =  9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when launcher waits for start of another
> worker (logical replication worker for subscription "admin_sub"), launched
> just before that command.
Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was getting
missed in this case. The wakeup call can be sent during subscription
creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as doing so
will end up delaying the start of the apply worker until the 180-second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch leaves the
latch set and lets ApplyLauncherMain reset it after checking whether any new
or missing worker needs to be started.

Regards,
Vignesh
From f04db050a583c9c01eb77766f830a0cf77b0a6c7 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v5 2/2] Apply worker will get started after 180 seconds by the
 launcher in case the apply worker exits immediately after startup.

The apply worker was getting started only after the launcher's 180-second
timeout in case the apply worker exited immediately after startup. This was
happening because the launcher process's latch was getting reset after the
apply worker was started, which made the launcher wait for the next
180-second timeout before starting the apply worker again. Fixed this issue
by not resetting the latch, as this latch is also set for subscription
modifications and apply worker exit. We should check whether any new worker
needs to be started and reset the latch in ApplyLauncherMain.
---
 src/backend/replication/logical/launcher.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..a754f2c757 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -191,7 +191,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -226,16 +225,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach. But we don't expect to have to wait long.
+		 * Since this latch is also used for subscription creation/modification
+		 * and apply worker process exit signal handling, the latch must not be
+		 * reset here. We should check if the new worker needs to be started or
+		 * not and reset the latch in ApplyLauncherMain.
 		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+						 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
-- 
2.34.1
From 385c96a71396ea8efa86d6136bbf0bfe5282f1d1 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v5 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if a relation cache invalidation occurred while the
list of non-ready tables was being prepared. This happened because the
table state was being set to valid at the end of preparing the non-ready
table list, irrespective of any invalidations that occurred while the list
was being prepared. Fixed it by changing the boolean variable to a
tri-state enum and by setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..9e24fb608c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 
 static bool FetchTableStates(bool *started_tx);
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 	has_subrels = (table_states_not_ready != NIL) ||
 		HasSubscriptionRelations(MySubscription->oid);
 
-	table_states_valid = true;
+	/*
+	 * If the subscription relation cache has been invalidated since we
+	 * entered this routine, we still use and return the relations we just
+	 * finished constructing, to avoid infinite loops, but we leave the
+	 * table states marked as stale so that we'll rebuild it again on next
+	 * access. Otherwise, we mark the table states as valid.
+	 */
+	if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+		table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1