On 11/22/23 11:38, Amit Kapila wrote: > On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra > <tomas.von...@enterprisedb.com> wrote: >> >> On 11/21/23 14:16, Amit Kapila wrote: >>> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra >>> <tomas.von...@enterprisedb.com> wrote: >>>> >>> >>> It seems there is some inconsistency in what you have written for >>> client backends/tablesync worker vs. apply worker. The above text >>> seems to be saying that the client backend and table sync worker are >>> waiting on a "subscription row in pg_subscription" and the apply >>> worker is operating on "pg_subscription_rel". So, if that is true then >>> they shouldn't get stuck. >>> >>> I think here client backend and tablesync worker seems to be blocked >>> for a lock on pg_subscription_rel. >>> >> >> Not really, they are all locking the subscription. All the locks are on >> classid=6100, which is pg_subscription: >> >> test=# select 6100::regclass; >> regclass >> ----------------- >> pg_subscription >> (1 row) >> >> The thing is, the tablesync workers call UpdateSubscriptionRelState, >> which locks the pg_subscription catalog at the very beginning: >> >> LockSharedObject(SubscriptionRelationId, ...); >> >> So that's the issue. I haven't explored why it's done this way, and >> there's no comment explaining locking the subscriptions is needed ... >> > > I think it prevents concurrent drop of rel during the REFRESH operation. >
Yes. Or maybe some other concurrent DDL on the relations included in the subscription. >>>> The tablesync workers can't proceed because their lock request is stuck >>>> behind the AccessExclusiveLock request. >>>> >>>> And the apply worker can't proceed, because it's waiting for status >>>> update from the tablesync workers. >>>> >>> >>> This part is not clear to me because >>> wait_for_relation_state_change()->GetSubscriptionRelState() seems to >>> be releasing the lock while closing the relation. Am, I missing >>> something? >>> >> >> I think you're missing the fact that GetSubscriptionRelState() acquires >> and releases the lock on pg_subscription_rel, but that's not the lock >> causing the issue. The problem is the lock on the pg_subscription row. >> > > Okay. IIUC, what's going on here is that the apply worker acquires > AccessShareLock on pg_subscription to update rel state for one of the > tables say tbl-1, and then for another table say tbl-2, it started > waiting for a state change via wait_for_relation_state_change(). I > think here the fix is to commit the transaction before we go for a > wait. I guess we need something along the lines of what is proposed in > [1] though we have solved the problem in that thread in some other > way.. > Possibly. I haven't checked if the commit might have some unexpected consequences, but I can confirm I can no longer reproduce the deadlock with the patch applied. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company