On Sat, Jun 27, 2026 at 2:11 PM Xuneng Zhou <[email protected]> wrote: > > On Fri, Jun 26, 2026 at 5:31 PM Vitaly Davydov <[email protected]> > wrote: > > > > Hi Fujii-san, Xuneng Zhou, > > > > > 1) It looks that this patch is doing some refactoring around > > > LockBufferForCleanup. The part being touched seems flaky itself on > > > HEAD, should we fix the issue in the buffer side first? [1] > > > > Thank you for mentioning it. It seems, there is a conflict with > > the another patch due to the introduction of RegisterPinCountWaiter > > function, that can not be resolved automatically. I'm ok to wait for > > the conflicting fix to be released. It seems, our patch should be > > reconsidered after the conflicting patch release. > > OK. > > > > 2) buf_state &= ~BM_PIN_COUNT_WAITER in stable versions seems not > > > necessary to me as LockBufferForCleanup will handle the clearing for > > > us. > > > > It seems it was my mistake when I resolved conflicts when cherry-picking > > the changes from other branches. I removed this line in my recent v 6.1 > > attached patches. Thank you for catching it. > > Thanks. > > Here are some additional comments after further review: > > 3) should we let RegisterPinCountWaiter do less? > > Currently, the function does a lot like: > RegisterPinCountWaiter() > { > check other waiter; > maybe unlock; > maybe elog(ERROR); > register; > unlock header; > } > > We need to handle the unlock & elog for two > callers(BufferIsReadyForCleanup and LockBufferForCleanup) with > different needs, which could be error-prone sometimes. > > + Assert((buf_state & BM_PIN_COUNT_WAITER) == 0 || > + bufHdr->wait_backend_pgprocno == MyProcNumber); > + > + if ((buf_state & BM_PIN_COUNT_WAITER) != 0 && > + bufHdr->wait_backend_pgprocno != MyProcNumber) > + { > + UnlockBufHdr(bufHdr); > + elog(ERROR, "multiple processes attempting to wait for pincount 1"); > + } > + > > For example, unlocking the header lock only is sufficient for > BufferIsReadyForCleanup but not for LockBufferForCleanup as we hold > the extra content lock. For LockBufferForCleanup, the check seems > duplicated as we already did it before invoking this function. So to > make this work, we may need to distinguish these two callers. > > Another potential issue of doing safe checks here is the order of > assert and error out: > In assert build, if the condition is true, then the header lock and > content lock would not be released properly. So we may need to adjust > the order. > > This leads me to wonder -- would it be better to let this help > function to do less, like just for registration and let the callers to > handle others. > > 4) the already-expired standby-delay path may not wake new pin holders > > If ltime has already passed, the patch sends > RECOVERY_CONFLICT_BUFFERPIN once before entering the loop, then waits > inside the new loop. What if another backend pins the buffer after > that broadcast? > > Before the patch, this seems fine because we return to > BufferIsReadyForCleanup and enter the function > ResolveRecoveryConflictWithBufferPin to do the re-check of ltime and > send the conflict signal once more. > > With the new inner loop, BufferIsReadyForCleanup() can re-register and > continue waiting, but no timeout is active and no new cancelling > request is sent. So recovery can continue waiting for a buffer pin > even though max_standby_*_delay has already expired. > > To fix this, should we add an in-loop check: if ltime != 0 && > GetCurrentTimestamp() >= ltime in here: > > + if (got_standby_delay_timeout) > + { > + SendRecoveryConflictWithBufferPin(RECOVERY_CONFLICT_BUFFERPIN); > + break; > + } > > 5) The name of BufferIsReadyForCleanup seems broader than the actual protocol. > > The function name seems not align with the intended use of it as the > comment suggested: > > * > * This is only for the hot-standby path in LockBufferForCleanup(), via > * ResolveRecoveryConflictWithBufferPin(), after ProcWaitForSignal() returns. > * The caller must already be registered as the shared buffer's > * BM_PIN_COUNT_WAITER. > > Would it be better to rename it to something like > RecheckCleanupLockPinWait or RecheckBufferPinWaiter?
I took another look at the patch. Here is an another thought: 6) We might want to be more elaborative on the commit message >From personal experience, the message seems a bit hard to follow if one does not read the related discussion. I like your analysis of the problem at the start of this thread [1]. Would it be sensible to summarize and reuse some part of it for the message? The motivation of this patch is to fix deadlock detector activation, but the wait-loop change has a slightly broader benefit. It also improves the standby timeout path indirectly by avoiding needless timeout teardown and re-entry on unrelated wakeups, and by keeping any already-fired timeout state until the loop handles it. Should we mention this as an additional benefit, or even frame the message broader like: Fix the buffer-pin recovery-conflict wait loop. [1] https://www.postgresql.org/message-id/44c24dcf-5710-410f-b1b6-d10b315f3d51%40postgrespro.ru -- Regards, Xuneng Zhou HighGo Software Co., Ltd.
