On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclus...@gmail.com> wrote:
>
> Hello Vignesh and Hou-san,
>
> 01.02.2024 07:59, vignesh C wrote:
> > Here is an updated patch which changes the boolean variable to a
> > tri-state enum and set the table state to valid only if no invalidations
> > have occurred while the list is being prepared.
> >
>
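
(For illustration, the tri-state scheme described above can be sketched as
below; the names are placeholders, not the identifiers used in the attached
v4-0001 patch, and rebuild_list() stands in for the real list construction.)

    typedef enum { NEEDS_REBUILD, REBUILD_STARTED, VALID } CacheState;

    static CacheState state = NEEDS_REBUILD;

    /* Stand-in for rebuilding the not-ready table list. */
    static void rebuild_list(void) { }

    /* Invalidation callback: may fire at any time, including mid-rebuild. */
    static void invalidation_cb(void)
    {
        state = NEEDS_REBUILD;
    }

    static void fetch_states(void)
    {
        if (state != VALID)
        {
            state = REBUILD_STARTED;
            rebuild_list();         /* an invalidation may arrive here... */

            /* ...in which case we skip this and rebuild on the next call */
            if (state == REBUILD_STARTED)
                state = VALID;
        }
    }
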
> While testing the v3 patch, I observed another anomaly with the 014_binary
> test. When I run the test in parallel, I see that sometimes one of the test
> instances runs much longer than others. For example:
> 49      All tests successful.
> 49      Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  0.30 csys =  0.77 CPU)
> 49      Result: PASS
> 12      All tests successful.
> 12      Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  0.40 csys =  0.89 CPU)
> 12      Result: PASS
>
> As far as I can see, this happens due to another race condition, this time
> in launcher.c.
> For such a three-minute case I see in _subscriber.log:
> 2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
> 2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does not exist
> 2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
> ...
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
>              a int PRIMARY KEY
>          );
> 2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for subscription 16398"
> 2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with exit code 1
> ...
> 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
> ...
> 2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
> (there is no interesting activity between 14:33:13 and 14:36:13)
>
> So the logical replication apply worker exited because CREATE TABLE had not
> been executed on the subscriber yet, and a new replication worker started
> only because a timeout occurred in WaitLatch(..., wait_time, ...) inside
> ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE
> (180 sec)).
>
> But in a normal (fast) case, the same WaitLatch exits due to MyLatch being
> set as a result of:
> logical replication apply worker| logicalrep_worker_onexit() ->
>    ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
> launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)).
>
> In a bad case, I see that SetLatch() is called as well, but then the latch
> gets reset by the following code in WaitForReplicationWorkerAttach():
>          rc = WaitLatch(MyLatch,
>                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                         10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>          if (rc & WL_LATCH_SET)
>          {
>              ResetLatch(MyLatch);
>              CHECK_FOR_INTERRUPTS();
>          }
>
> With pg_usleep(300000); added just before ResetLatch and
>   $node_subscriber->safe_psql(
>       'postgres', qq(
> +SELECT pg_sleep(0.5);
>       CREATE TABLE public.test_mismatching_types (
> in 014_binary.pl, I can see the anomaly without running tests in parallel,
> just when running that test in a loop:
> for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
> ...
> iteration 2
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
> Result: PASS
> iteration 3
> # +++ tap check in src/test/subscription +++
> t/014_binary.pl .. ok
> All tests successful.
> Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
> Result: PASS
> ...
>
> In other words, the abnormal test execution takes place due to the
> following events:
> 1. logicalrep worker launcher launches replication worker and waits for it
>   to attach:
>   ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
> 2. logicalrep worker exits due to some error (logical replication target
>   relation does not exist, in our case) and sends a signal to set the latch
>   for the launcher
> 3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
>   it inside WaitForReplicationWorkerAttach()
> 4. launcher gets back to ApplyLauncherMain() where it waits for the latch
>   (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).
>
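
The four steps above can be restated as a simplified sketch (fragments using
the latch API calls from the quoted code, not the actual launcher source;
launcher_latch stands in for the launcher's process latch, which the SIGUSR1
handler sets):

    /* Step 2: the exiting worker wakes the launcher via SIGUSR1, whose
     * handler effectively does: */
    SetLatch(launcher_latch);

    /* Step 3: the launcher, still inside WaitForReplicationWorkerAttach(),
     * observes and consumes that very wakeup: */
    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
    {
        ResetLatch(MyLatch);        /* the worker's wakeup is lost here */
        CHECK_FOR_INTERRUPTS();
    }

    /* Step 4: back in ApplyLauncherMain(), the latch is no longer set, so
     * this wait returns only after DEFAULT_NAPTIME_PER_CYCLE elapses: */
    rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
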
> Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
> test 027_nosuperuser executes for 3+ minutes on each run for me.

Thanks for the steps to reproduce the issue; I was able to reproduce it in
my environment with the steps provided. The attached patch has a proposed
fix where the latch will not be reset in case the apply worker exits
immediately after starting.
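
In essence, the change makes WaitForReplicationWorkerAttach() return before
resetting the latch when the worker slot is no longer in use (a simplified
distillation of the attached v4-0002 patch, not additional code):

    if (rc & WL_LATCH_SET)
    {
        bool        worker_in_use;

        LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
        worker_in_use = worker->in_use;
        LWLockRelease(LogicalRepWorkerLock);

        /* If the worker already exited, return without resetting the latch
         * so the pending wakeup reaches ApplyLauncherMain(). */
        if (!worker_in_use)
            return false;

        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
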

Regards,
Vignesh
From 87b290a6209f078b3029e27b0b61d3644026dae3 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Thu, 1 Feb 2024 09:46:40 +0530
Subject: [PATCH v4 1/2] Table sync missed for newly added tables because
 subscription relation cache invalidation was not handled in certain
 concurrent scenarios.

Table sync was missed if a table sync invalidation occurred while the
list of non-ready tables was being prepared. This happened because the
table state was set to valid at the end of the non-ready table list
preparation, irrespective of any invalidations that occurred while the
list was being prepared. Fix it by changing the boolean variable to a
tri-state enum and setting the table state to valid only if no
invalidations occurred while the list was being prepared.
---
 src/backend/replication/logical/tablesync.c | 25 +++++++++++++++++----
 src/tools/pgindent/typedefs.list            |  1 +
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index ee06629088..9e24fb608c 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -123,7 +123,14 @@
 #include "utils/syscache.h"
 #include "utils/usercontext.h"
 
-static bool table_states_valid = false;
+typedef enum
+{
+	SYNC_TABLE_STATE_NEEDS_REBUILD,
+	SYNC_TABLE_STATE_REBUILD_STARTED,
+	SYNC_TABLE_STATE_VALID,
+} SyncingTablesState;
+
+static SyncingTablesState table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 static List *table_states_not_ready = NIL;
 static bool FetchTableStates(bool *started_tx);
 
@@ -273,7 +280,7 @@ wait_for_worker_state_change(char expected_state)
 void
 invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
 {
-	table_states_valid = false;
+	table_states_valid = SYNC_TABLE_STATE_NEEDS_REBUILD;
 }
 
 /*
@@ -1568,13 +1575,15 @@ FetchTableStates(bool *started_tx)
 
 	*started_tx = false;
 
-	if (!table_states_valid)
+	if (table_states_valid != SYNC_TABLE_STATE_VALID)
 	{
 		MemoryContext oldctx;
 		List	   *rstates;
 		ListCell   *lc;
 		SubscriptionRelState *rstate;
 
+		table_states_valid = SYNC_TABLE_STATE_REBUILD_STARTED;
+
 		/* Clean the old lists. */
 		list_free_deep(table_states_not_ready);
 		table_states_not_ready = NIL;
@@ -1608,7 +1617,15 @@ FetchTableStates(bool *started_tx)
 		has_subrels = (table_states_not_ready != NIL) ||
 			HasSubscriptionRelations(MySubscription->oid);
 
-		table_states_valid = true;
+		/*
+		 * If the subscription relation cache has been invalidated since we
+		 * entered this routine, we still use and return the relations we just
+		 * finished constructing, to avoid an infinite loop, but we leave the
+		 * table states marked as stale so that the list will be rebuilt again
+		 * on the next access. Otherwise, we mark the table states as valid.
+		 */
+		if (table_states_valid == SYNC_TABLE_STATE_REBUILD_STARTED)
+			table_states_valid = SYNC_TABLE_STATE_VALID;
 	}
 
 	return has_subrels;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 91433d439b..5a40f549f9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2703,6 +2703,7 @@ SupportRequestSelectivity
 SupportRequestSimplify
 SupportRequestWFuncMonotonic
 Syn
+SyncingTablesState
 SyncOps
 SyncRepConfigData
 SyncRepStandbyData
-- 
2.34.1

From 817dd4089d7c9c3e4481d0718a0a269f7f0f30c5 Mon Sep 17 00:00:00 2001
From: Vignesh C <vignes...@gmail.com>
Date: Mon, 5 Feb 2024 14:55:48 +0530
Subject: [PATCH v4 2/2] Fix the launcher taking 180 seconds to restart the
 apply worker when the apply worker exits immediately after startup.

The apply worker was restarted only after the launcher's 180-second
timeout expired when the apply worker exited immediately after startup.
This happened because the launcher process's latch was reset after the
apply worker was started, which made the launcher wait for the next
180-second timeout before restarting the apply worker. Fix this by not
resetting the latch in case the apply worker has already exited.
---
 src/backend/replication/logical/launcher.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 122db0bb13..2c750e2cd6 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -233,6 +233,19 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
 
 		if (rc & WL_LATCH_SET)
 		{
+			bool worker_in_use;
+
+			LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
+			worker_in_use = worker->in_use;
+			LWLockRelease(LogicalRepWorkerLock);
+
+			/*
+			 * The latch must not be reset if the apply worker exited
+			 * immediately after starting.
+			 */
+			if (!worker_in_use)
+				return false;
+
 			ResetLatch(MyLatch);
 			CHECK_FOR_INTERRUPTS();
 		}
-- 
2.34.1
