Hi,

> I mean to commit the open transaction at the below place in
> wait_for_relation_state_change()
>
> wait_for_relation_state_change()
> {
> ...
> -- commit the xact
> WaitLatch();
> ...
> }
>
> Then start after the wait is over. This is just to test whether it
> improves the difference in regression test timing.

I tried the above approach and observed that its performance is nearly
the same as that of the previous approach.

For Linux VM:
Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                   | Test (sec)   | and Sub (sec)     | and Sub (sec)
---------------------------------------------------------------------------
old patch          | 107.4545     | 6.911             | 77.918
alternate approach | 108.3985     | 6.9835            | 78.111

For Performance Machine:
Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                   | Test (sec)   | and Sub (sec)     | and Sub (sec)
---------------------------------------------------------------------------
old patch          | 115.922      | 6.7305            | 81.1525
alternate approach | 115.8215     | 6.7685            | 81.2335

I have attached the patch for this approach as 'alternate_approach.patch'.
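
The core of the change is the following (a simplified excerpt of the attached
patch; the state checks inside wait_for_relation_state_change() are elided,
and started_tx is now passed in from the caller as a pointer):

    /* inside the wait loop of wait_for_relation_state_change() */
    for (;;)
    {
        /* ... check the relation state and the sync worker, as before ... */

        /*
         * If we have a transaction, commit it to release any locks we
         * hold, then start a new one so we can examine catalog state.
         */
        if (*started_tx)
        {
            CommitTransactionCommand();
            pgstat_report_stat(false);
            StartTransactionCommand();
        }

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
        ResetLatch(MyLatch);
    }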

Since the performance is about the same, I think the previous approach
is better, because the alternate approach calls CommitTransactionCommand()
and StartTransactionCommand() inside a 'for' loop.
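
For comparison, the previous (v2) approach keeps
wait_for_relation_state_change() unchanged and does the commit/restart only
once, in process_syncing_tables_for_apply(), before entering the wait (again
a simplified excerpt):

    /* before waiting for the sync worker to reach SYNCDONE */
    if (started_tx)
    {
        /* Commit to release our locks, then start a fresh transaction. */
        CommitTransactionCommand();
        pgstat_report_stat(false);
        StartTransactionCommand();
    }
    else
    {
        StartTransactionCommand();
        started_tx = true;
    }

    /*
     * Enter busy loop and wait for synchronization worker to reach
     * expected state (or die trying).
     */
    wait_for_relation_state_change(rstate->relid,
                                   SUBREL_STATE_SYNCDONE);

With the 1000 ms WaitLatch() timeout, the alternate approach commits and
starts a new transaction roughly once per second while it waits, whereas the
v2 approach pays that cost only once per table.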

I have also fixed the comment in the previous approach and attached it here as
'v2-0001-Deadlock-when-apply-worker-tablesync-worker-and-c.patch'.

Thanks and Regards


Shlok Kyal
From 11072d138d900227b963b54d1a3626cf256db721 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 24 Nov 2023 16:37:14 +0530
Subject: [PATCH v2] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

The apply worker holds a lock on the table pg_subscription_rel and waits for
notification from the tablesync workers that the relation is synced, which
happens through updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop, with a
sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row
because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are
stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The tablesync
workers can't proceed because their lock request is stuck behind the
AccessExclusiveLock request. And the apply worker is waiting for a status update
from the tablesync workers. And the deadlock is undetected, because the apply
worker is not waiting on a lock, but sleeping on a latch.

We resolve the issue by committing the transaction, thereby releasing the lock,
before the apply worker goes into the wait state.
---
 src/backend/replication/logical/tablesync.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..b71234d998 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -542,15 +542,26 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					LWLockRelease(LogicalRepWorkerLock);
 
 					/*
-					 * Enter busy loop and wait for synchronization worker to
-					 * reach expected state (or die trying).
+					 * If we have a transaction, we must commit it to release
+					 * any locks we have. Then start a new transaction so we 
+					 * can examine catalog state.
 					 */
-					if (!started_tx)
+					if (started_tx)
+					{
+						CommitTransactionCommand();
+						pgstat_report_stat(false);
+						StartTransactionCommand();
+					}
+					else
 					{
 						StartTransactionCommand();
 						started_tx = true;
 					}
 
+					/*
+					 * Enter busy loop and wait for synchronization worker to
+					 * reach expected state (or die trying).
+					 */
 					wait_for_relation_state_change(rstate->relid,
 												   SUBREL_STATE_SYNCDONE);
 				}
-- 
2.34.1

From ce261399d22a7fd9ee7ccdd3b9d1cc7f4f72ce4b Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 7 Dec 2023 10:55:26 +0530
Subject: [PATCH] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

The apply worker holds a lock on the table pg_subscription_rel and waits for
notification from the tablesync workers that the relation is synced, which
happens through updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop, with a
sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row
because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are
stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The tablesync
workers can't proceed because their lock request is stuck behind the
AccessExclusiveLock request. And the apply worker is waiting for a status update
from the tablesync workers. And the deadlock is undetected, because the apply
worker is not waiting on a lock, but sleeping on a latch.

We resolve the issue by committing the transaction, thereby releasing the lock,
before the apply worker goes into the wait state.
---
 src/backend/replication/logical/tablesync.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..fb0ac84450 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -174,7 +174,7 @@ finish_sync_worker(void)
  * CATCHUP state to SYNCDONE.
  */
 static bool
-wait_for_relation_state_change(Oid relid, char expected_state)
+wait_for_relation_state_change(Oid relid, char expected_state, bool *started_tx)
 {
 	char		state;
 
@@ -203,6 +203,16 @@ wait_for_relation_state_change(Oid relid, char expected_state)
 		if (!worker)
 			break;
 
+		/*
+		 * If we have a transaction, we must commit it to release any locks we
+		 * have. Then start a new transaction so we can examine catalog state.
+		 */
+		if (*started_tx)
+		{
+			CommitTransactionCommand();
+			pgstat_report_stat(false);
+			StartTransactionCommand();
+		}
 		(void) WaitLatch(MyLatch,
 						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 						 1000L, WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
@@ -552,7 +562,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 					}
 
 					wait_for_relation_state_change(rstate->relid,
-												   SUBREL_STATE_SYNCDONE);
+												   SUBREL_STATE_SYNCDONE,
+												   &started_tx);
 				}
 				else
 					LWLockRelease(LogicalRepWorkerLock);
-- 
2.34.1
