Hi,

> I mean to commit the open transaction at the below place in
> wait_for_relation_state_change()
>
> wait_for_relation_state_change()
> {
> ...
> -- commit the xact
> WaitLatch();
> ...
> }
>
> Then start after the wait is over. This is just to test whether it
> improves the difference in regression test timing.
I tried the above approach and observed that the performance of this
approach is nearly the same as that of the previous approach.

For Linux VM:

Summary    | Subscription | 100 tables in pub | 1000 tables in pub
           | Test (sec)   | and Sub (sec)     | and Sub (sec)
------------------------------------------------------------------------------
old patch  |   107.4545   |      6.911        |      77.918
alternate  |   108.3985   |      6.9835       |      78.111
approach   |              |                   |

For Performance Machine:

Summary    | Subscription | 100 tables in pub | 1000 tables in pub
           | Test (sec)   | and Sub (sec)     | and Sub (sec)
------------------------------------------------------------------------------
old patch  |   115.922    |      6.7305       |      81.1525
alternate  |   115.8215   |      6.7685       |      81.2335
approach   |              |                   |

I have attached the patch for this approach as 'alternate_approach.patch'.

Since the performance is the same, I think the previous approach is
better, because the alternate approach calls CommitTransactionCommand()
and StartTransactionCommand() inside a for loop.

I also fixed the comment in the previous approach and attached it here as
'v2-0001-Deadlock-when-apply-worker-tablesync-worker-and-c.patch'.

Thanks and Regards,
Shlok Kyal
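As a quick sanity check on the claim that the two approaches perform nearly the same, the relative differences in the measurements above can be computed (this is just an illustration using the numbers from the tables, not part of either patch):

```python
# Relative difference between the two approaches for each measurement
# reported above. The numbers are copied verbatim from the tables.

timings = {
    # test name: (old patch, alternate approach), in seconds
    "Linux VM / Subscription Test":     (107.4545, 108.3985),
    "Linux VM / 100 tables":            (6.911,    6.9835),
    "Linux VM / 1000 tables":           (77.918,   78.111),
    "Perf machine / Subscription Test": (115.922,  115.8215),
    "Perf machine / 100 tables":        (6.7305,   6.7685),
    "Perf machine / 1000 tables":       (81.1525,  81.2335),
}

for name, (old, alt) in timings.items():
    diff_pct = (alt - old) / old * 100
    print(f"{name}: {diff_pct:+.2f}%")
    # Every difference is well under 2%, i.e. within run-to-run noise.
    assert abs(diff_pct) < 1.5
```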
From 11072d138d900227b963b54d1a3626cf256db721 Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Fri, 24 Nov 2023 16:37:14 +0530
Subject: [PATCH v2] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

The apply worker holds a lock on the table pg_subscription_rel and waits
for notification from the tablesync workers that the relation is synced,
which happens through updating the pg_subscription_rel row. That wait
happens in wait_for_relation_state_change, which simply checks the row in
a loop, with a sleep by WaitLatch().

Unfortunately, the tablesync workers can't update the row because the
client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in and is waiting for an AccessExclusiveLock. So the tablesync
workers are stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The
tablesync workers can't proceed because their lock request is stuck
behind the AccessExclusiveLock request. And the apply worker is waiting
for a status update from the tablesync workers. The deadlock is
undetected, because the apply worker is not waiting on a lock, but
sleeping on a latch.

Resolve the issue by committing the transaction, thereby releasing the
lock, before the apply worker goes into its wait state.
---
 src/backend/replication/logical/tablesync.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..b71234d998 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -542,15 +542,26 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 				LWLockRelease(LogicalRepWorkerLock);
 
 				/*
-				 * Enter busy loop and wait for synchronization worker to
-				 * reach expected state (or die trying).
+				 * If we have a transaction, we must commit it to release
+				 * any locks we have.  Then start a new transaction so we
+				 * can examine catalog state.
 				 */
-				if (!started_tx)
+				if (started_tx)
+				{
+					CommitTransactionCommand();
+					pgstat_report_stat(false);
+					StartTransactionCommand();
+				}
+				else
 				{
 					StartTransactionCommand();
 					started_tx = true;
 				}
+
+				/*
+				 * Enter busy loop and wait for synchronization worker to
+				 * reach expected state (or die trying).
+				 */
 				wait_for_relation_state_change(rstate->relid,
 											   SUBREL_STATE_SYNCDONE);
 			}
-- 
2.34.1
From ce261399d22a7fd9ee7ccdd3b9d1cc7f4f72ce4b Mon Sep 17 00:00:00 2001
From: Shlok Kyal <shlok.kyal.oss@gmail.com>
Date: Thu, 7 Dec 2023 10:55:26 +0530
Subject: [PATCH] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

The apply worker holds a lock on the table pg_subscription_rel and waits
for notification from the tablesync workers that the relation is synced,
which happens through updating the pg_subscription_rel row. That wait
happens in wait_for_relation_state_change, which simply checks the row in
a loop, with a sleep by WaitLatch().

Unfortunately, the tablesync workers can't update the row because the
client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in and is waiting for an AccessExclusiveLock. So the tablesync
workers are stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The
tablesync workers can't proceed because their lock request is stuck
behind the AccessExclusiveLock request. And the apply worker is waiting
for a status update from the tablesync workers. The deadlock is
undetected, because the apply worker is not waiting on a lock, but
sleeping on a latch.

Resolve the issue by committing the transaction, thereby releasing the
lock, before the apply worker goes into its wait state.
---
 src/backend/replication/logical/tablesync.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..fb0ac84450 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -174,7 +174,7 @@ finish_sync_worker(void)
  * CATCHUP state to SYNCDONE.
  */
 static bool
-wait_for_relation_state_change(Oid relid, char expected_state)
+wait_for_relation_state_change(Oid relid, char expected_state, bool *started_tx)
 {
 	char		state;
 
@@ -203,6 +203,16 @@ wait_for_relation_state_change(Oid relid, char expected_state)
 		if (!worker)
 			break;
 
+		/*
+		 * If we have a transaction, we must commit it to release any locks we
+		 * have.  Then start a new transaction so we can examine catalog state.
+		 */
+		if (*started_tx)
+		{
+			CommitTransactionCommand();
+			pgstat_report_stat(false);
+			StartTransactionCommand();
+		}
 		(void) WaitLatch(MyLatch,
 						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
 						 1000L,
 						 WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE);
@@ -552,7 +562,8 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 				}
 
 				wait_for_relation_state_change(rstate->relid,
-											   SUBREL_STATE_SYNCDONE);
+											   SUBREL_STATE_SYNCDONE,
+											   &started_tx);
 			}
 			else
 				LWLockRelease(LogicalRepWorkerLock);
-- 
2.34.1