On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > On 11/21/23 14:16, Amit Kapila wrote: > > On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra > > <tomas.von...@enterprisedb.com> wrote: > >> > > > > It seems there is some inconsistency in what you have written for > > client backends/tablesync worker vs. apply worker. The above text > > seems to be saying that the client backend and table sync worker are > > waiting on a "subscription row in pg_subscription" and the apply > > worker is operating on "pg_subscription_rel". So, if that is true then > > they shouldn't get stuck. > > > > I think here client backend and tablesync worker seems to be blocked > > for a lock on pg_subscription_rel. > > > > Not really, they are all locking the subscription. All the locks are on > classid=6100, which is pg_subscription: > > test=# select 6100::regclass; > regclass > ----------------- > pg_subscription > (1 row) > > The thing is, the tablesync workers call UpdateSubscriptionRelState, > which locks the pg_subscription catalog at the very beginning: > > LockSharedObject(SubscriptionRelationId, ...); > > So that's the issue. I haven't explored why it's done this way, and > there's no comment explaining locking the subscriptions is needed ... >
I think it prevents concurrent drop of rel during the REFRESH operation. > >> The tablesync workers can't proceed because their lock request is stuck > >> behind the AccessExclusiveLock request. > >> > >> And the apply worker can't proceed, because it's waiting for status > >> update from the tablesync workers. > >> > > > > This part is not clear to me because > > wait_for_relation_state_change()->GetSubscriptionRelState() seems to > > be releasing the lock while closing the relation. Am, I missing > > something? > > > > I think you're missing the fact that GetSubscriptionRelState() acquires > and releases the lock on pg_subscription_rel, but that's not the lock > causing the issue. The problem is the lock on the pg_subscription row. > Okay. IIUC, what's going on here is that the apply worker acquires AccessShareLock on pg_subscription to update rel state for one of the tables say tbl-1, and then for another table say tbl-2, it started waiting for a state change via wait_for_relation_state_change(). I think here the fix is to commit the transaction before we go for a wait. I guess we need something along the lines of what is proposed in [1] though we have solved the problem in that thread in some other way.. [1] - https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us -- With Regards, Amit Kapila.