On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclus...@gmail.com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
> called by a backend executing a CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(300000L); added
> just below
>              if (!worker_in_use)
>                  return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 csys =  9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed while the launcher waits for the start of
> another worker (the logical replication worker for subscription
> "admin_sub"), launched just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
can delay the start of the apply worker until the 180 second timeout
(DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset
the latch there; instead, ApplyLauncherMain resets it after checking
whether any new or missing worker needs to be started.
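
For reference, here is a heavily simplified sketch of the launcher's
main loop that the patch relies on (paraphrased from ApplyLauncherMain
with the bodies elided, not the exact upstream code). The latch is
reset only at this level, after the subscription list has been
re-scanned, so a SetLatch() arriving while
WaitForReplicationWorkerAttach() is waiting is acted on in the next
iteration instead of being lost:

/*
 * Simplified sketch of the launcher loop (bodies elided); see
 * ApplyLauncherMain in launcher.c for the real code.
 */
for (;;)
{
	int			rc;
	long		wait_time = DEFAULT_NAPTIME_PER_CYCLE;

	/* Re-scan subscriptions and start any missing apply workers. */
	/* ... get_subscription_list(), logicalrep_worker_launch(), ... */

	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
				   wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);

	if (rc & WL_LATCH_SET)
	{
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}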

Regards,
Vignesh
From f04db050a583c9c01eb77766f830a0cf77b0a6c7 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v5 2/2] Fix the launcher waiting 180 seconds before restarting
 an apply worker that exits immediately after startup.

The apply worker was only getting started after the launcher's 180 second
timeout (DEFAULT_NAPTIME_PER_CYCLE) when the apply worker exited immediately
after startup. This happened because the launcher's latch was reset after
the apply worker was started, which made the launcher wait for the next
180 second timeout before starting the apply worker again. Fix this by not
resetting the latch in WaitForReplicationWorkerAttach, as this latch is
also set for subscription modifications and apply worker exit; instead,
ApplyLauncherMain checks whether a new worker needs to be started and
resets the latch.
---
 src/backend/replication/logical/launcher.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..a754f2c757 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -191,7 +191,6 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 							   BackgroundWorkerHandle *handle)
 {
 	BgwHandleStatus status;
-	int			rc;
 
 	for (;;)
 	{
@@ -226,16 +225,14 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 		/*
 		 * We need timeout because we generally don't get notified via latch
 		 * about the worker attach.  But we don't expect to have to wait long.
+		 * Since this latch is also set on subscription creation/modification
+		 * and when the apply worker exits, it must not be reset here. Instead,
+		 * ApplyLauncherMain checks whether a new worker needs to be started
+		 * and resets the latch there.
 		 */
-		rc = WaitLatch(MyLatch,
-					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-					   10L, WAIT_EVENT_BGWORKER_STARTUP);
-
-		if (rc & WL_LATCH_SET)
-		{
-			ResetLatch(MyLatch);
-			CHECK_FOR_INTERRUPTS();
-		}
+		(void) WaitLatch(MyLatch,
+						 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+						 10L, WAIT_EVENT_BGWORKER_STARTUP);
 	}
 }
 
-- 
2.34.1

From 385c96a71396ea8efa86d6136bbf0bfe5282f1d1 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v5 1/2] Fix table sync being missed for newly added tables
 because a subscription relation cache invalidation was not handled in
 certain concurrent scenarios.

Table sync was missed if a subscription relation cache invalidation
arrived while the list of not-ready tables was being prepared. This
happened because the table state was marked valid at the end of the
not-ready table list preparation, irrespective of any invalidations
that occurred while the list was being prepared. Fix this by changing
the boolean flag to a tri-state enum and marking the table state valid
only if no invalidation occurred while the list was being built.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..9e24fb608c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid infinite loops, but we leave the
+		 * table states marked as stale so that we'll rebuild them on the next
+		 * access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1
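
To make the race addressed by the 0001 patch easier to follow, here is a
minimal standalone sketch of the stale-flag pattern that the tri-state
enum implements (hypothetical names, not the actual tablesync.c code):

typedef enum
{
	STATE_NEEDS_REBUILD,		/* cache known stale, must rebuild */
	STATE_REBUILD_STARTED,		/* rebuild in progress */
	STATE_VALID					/* cache is up to date */
} CacheState;

static CacheState cache_state = STATE_NEEDS_REBUILD;

/* Invalidation callback: may fire at any point during rebuild_cache(). */
static void
invalidate_cache(void)
{
	cache_state = STATE_NEEDS_REBUILD;
}

static void
rebuild_cache(void)
{
	if (cache_state == STATE_VALID)
		return;

	cache_state = STATE_REBUILD_STARTED;

	/*
	 * ... rebuild the cached list here; an invalidation arriving at this
	 * point puts cache_state back to STATE_NEEDS_REBUILD ...
	 */

	/*
	 * Mark the cache valid only if no invalidation arrived while we were
	 * rebuilding; otherwise leave it stale so the next call rebuilds it,
	 * while the caller still uses the list we just built.
	 */
	if (cache_state == STATE_REBUILD_STARTED)
		cache_state = STATE_VALID;
}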
