Hi, On 2023-02-01 11:23:57 +0530, Amit Kapila wrote: > On Tue, Jan 31, 2023 at 6:08 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > > > Attached updated patches. > > > > Thanks, Andres, others, do you see a better way to fix this problem? I > have reproduced it manually and the steps are shared at [1] and > Sawada-San also reproduced it, see [2]. > > [1] - > https://www.postgresql.org/message-id/CAA4eK1KDFeh%3DZbvSWPx%3Dir2QOXBxJbH0K8YqifDtG3xJENLR%2Bw%40mail.gmail.com > [2] - > https://www.postgresql.org/message-id/CAD21AoDKJBB6p4X-%2B057Vz44Xyc-zDFbWJ%2Bg9FL6qAF5PC2iFg%40mail.gmail.com
Hm. It's worrysome to now hold ProcArrayLock exclusively while iterating over the slots. ReplicationSlotsComputeRequiredXmin() can be called at a non-neglegible frequency. Callers like CreateInitDecodingContext(), that pass already_locked=true worry me a lot less, because obviously that's not a very frequent operation. This is particularly not great because we need to acquire ReplicationSlotControlLock while already holding ProcArrayLock. But clearly there's a pretty large hole in the lock protection right now. I'm a bit confused about why we (Robert and I, or just I) thought it's ok to do it this way. I wonder if we could instead invert the locks, and hold ReplicationSlotControlLock until after ProcArraySetReplicationSlotXmin(), and acquire ProcArrayLock just for ProcArraySetReplicationSlotXmin(). That'd mean that already_locked = true callers have to do a bit more work (we have to be sure the locks are always acquired in the same order, or we end up in unresolved deadlock land), but I think we can live with that. This would still allow concurrent invocations of ReplicationSlotsComputeRequiredXmin() come up with slightly different values, but that's possible with the proposed patch as well, as effective_xmin is updated without any of the other locks. But I don't see a problem with that. Greetings, Andres Freund