Hi,

On 2026-06-22 16:14:08 +0800, Xuneng Zhou wrote:
> This looks like a real lost-wakeup race in LockBufferForCleanup().

I don't think any wakeup is lost. The problem is that we're waiting for a
wakeup when it is not guaranteed that one will arrive.


> The relevant sequence in failed test 048 is:
> 
> - Session B opens a cursor and fetches one heap tuple, leaving a heap
> buffer pinned.
> - Session A starts VACUUM (FREEZE).
> - VACUUM reaches that page and waits in LockBufferForCleanup().
> - Session B later advances/closes the cursor, releasing the pin.
> - VACUUM is expected to wake up and finish.
> 
> In the failure, the tap test had already passed its three sql
> assertions, but then it timed out waiting for VACUUM completion:
> 
> ok 3 - Cursor query returned 7 from second fetch
> poll_query_until timed out:
>   SELECT vacuum_count > 0 ...
> last actual query output: f
> Looks like your test exited with 29 just after 3.
> 
> The diagnostic log shows the actual race:
> 
> LOG: !!!LockBufferForCleanup| wakeup lost
> CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

> LOG: !!!LockBufferForCleanup]| before ProcWaitForSignal
> CONTEXT: while scanning block 90 of relation "public.vac_horizon_floor_table"

I don't think this is meaningful output, as written upthread.


> There was no later "after ProcWaitForSignal" before the shutdown,
> which implies that VACUUM published itself as a waiter, entered
> ProcWaitForSignal(), and not been woken up later.

> The direct regression appears to be 5310fac6e0f. It allows this interleaving:
> 
> W: LockBufferForCleanup() holds buffer header lock
> W: observes refcount > 1
> P: releases the last competing pin with atomic fetch_sub
> P: old state does not contain BM_PIN_COUNT_WAITER, so no wakeup
> W: publishes BM_PIN_COUNT_WAITER
> W: sleeps in ProcWaitForSignal()
> 
> At this point the condition W wanted is already true: refcount is 1,
> meaning only W's own pin remains. So W could sleep indefinitely as no
> future unpin to wake it.

However, this indeed seems to be the problem.


> We can fix this with the state returned by UnlockBufHdrExt() when
> publishing BM_PIN_COUNT_WAITER. If the wait refcount is 1, do not
> enter the wait path. Instead, fall through to the existing waiter-bit
> cleanup and retry the loop to acquire the cleanup lock normally. The
> reproducer test passed after applying the patch.

Why are we retrying, when we already achieved everything we wanted? Seems we
should just unset BM_PIN_COUNT_WAITER / wait_backend_pgprocno.  I think the
easiest way to do this would be to not unlock the buffer header lock, but just
atomically-or-in BM_PIN_COUNT_WAITER and then recheck if we already are done,
while still holding the header lock.

Greetings,

Andres Freund


Reply via email to