On 11/22/23 11:38, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>>
>> On 11/21/23 14:16, Amit Kapila wrote:
>>> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
>>> <tomas.von...@enterprisedb.com> wrote:
>>>>
>>>
>>> It seems there is some inconsistency in what you have written for
>>> client backends/tablesync worker vs. apply worker. The above text
>>> seems to be saying that the client backend and table sync worker are
>>> waiting on a "subscription row in pg_subscription" and the apply
>>> worker is operating on "pg_subscription_rel". So, if that is true then
>>> they shouldn't get stuck.
>>>
>>> I think here client backend and tablesync worker seems to be blocked
>>> for a lock on pg_subscription_rel.
>>>
>>
>> Not really, they are all locking the subscription. All the locks are on
>> classid=6100, which is pg_subscription:
>>
>>   test=# select 6100::regclass;
>>       regclass
>>   -----------------
>>    pg_subscription
>>   (1 row)
>>
>> The thing is, the tablesync workers call UpdateSubscriptionRelState,
>> which locks the pg_subscription catalog at the very beginning:
>>
>>    LockSharedObject(SubscriptionRelationId, ...);
>>
>> So that's the issue. I haven't explored why it's done this way, and
>> there's no comment explaining locking the subscriptions is needed ...
>>
> 
> I think it prevents concurrent drop of rel during the REFRESH operation.
> 

Yes. Or maybe some other concurrent DDL on the relations included in the
subscription.

>>>> The tablesync workers can't proceed because their lock request is stuck
>>>> behind the AccessExclusiveLock request.
>>>>
>>>> And the apply worker can't proceed, because it's waiting for status
>>>> update from the tablesync workers.
>>>>
>>>
>>> This part is not clear to me because
>>> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
>>> be releasing the lock while closing the relation. Am, I missing
>>> something?
>>>
>>
>> I think you're missing the fact that GetSubscriptionRelState() acquires
>> and releases the lock on pg_subscription_rel, but that's not the lock
>> causing the issue. The problem is the lock on the pg_subscription row.
>>
> 
> Okay. IIUC, what's going on here is that the apply worker acquires
> AccessShareLock on pg_subscription to update rel state for one of the
> tables say tbl-1, and then for another table say tbl-2, it started
> waiting for a state change via wait_for_relation_state_change(). I
> think here the fix is to commit the transaction before we go for a
> wait. I guess we need something along the lines of what is proposed in
> [1] though we have solved the problem in that thread in some other
> way..
> 

Possibly. I haven't checked if the commit might have some unexpected
consequences, but I can confirm I can no longer reproduce the deadlock
with the patch applied.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to