On Tue, May 19, 2026 at 11:39 PM Joel Jacobson <[email protected]> wrote: > > Hi hackers, > > I had another pass over the async.c rework committed in 282b1cd, > and found a race that can cause a notification committed after > the listener registered its queue position to be missed entirely. > > This can happen in the small time window between PreCommit_Notify(), > where the first LISTEN registers the backend and records its queue > position, and AtCommit_Notify(), where the staged listen action is > made active in the shared channel map by setting listening = true. > > If a concurrent NOTIFY commits in that window, SignalBackends() can > see the staged listener entry with listening = false and conclude that > the backend is not interested in the channel. With direct advancement, > that can move the backend's queue pointer past the notification instead > of waking it. > > This is distinct from the documented LISTEN startup race in > listen.sgml. The documented race can produce false positives: after > LISTEN returns, an application may receive a notification for work > already observed by its initial database scan. That is harmless. This > race is a false negative: a notification can be missed entirely. > > The fix is just to treat staged LISTEN entries as possible listeners > when deciding whom to wake: > > ```diff > - if (!listeners[j].listening) > - continue; /* ignore not-yet-committed listeners */ > ``` > > The attached patches split the report into tests and fix: > > 0001 Test missed LISTEN startup notification > 0002 Test LISTEN startup notification for already-seen work > 0003 Fix LISTEN startup race with direct advancement > > /Joel
Hi, Thank you for the fix! I tried the test and managed to reproduce it. I think what is definitely wrong here is that we can do direct advancement over messages that the listener is interested in, when it has set the position already. So +1 for the fix. I think the fix is correct and it's IIUC really harmless as the worst thing that can happen is an unnecessary signal. One point - looks like the 0003 contains the same test as 0001. Also while I was trying to understand the issue, another bad schedule came to my mind: it's basically the same scenario as you reproduced, but now we have a message in between the listener's position and notification that the listener is interested in. In this case the notifier can't do direct advancement, so the message is not lost, but the notifier still doesn't signal about a new message, so the listener doesn't know that there is something to read until something triggers it (like another message). PFA the reproducer (there is a schedule diagram in the head of the file). It depends on 0001. Your patch fixes it too. Thank you! Best regards, Arseniy Mukhin
untriggered-notification.patch.nocfbot
Description: Binary data
