Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Bruce Momjian
On Wed, Nov  8, 2023 at 02:52:16PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> >>> It suddenly strikes me to worry that we have an XID wraparound hazard
> >>> for entries in the notify queue.
> 
> > Is this still an open issue?  Should it be a TODO item?
> 
> I don't think anyone's done anything about it, so yeah.
> 
> Realistically, if you've got NOTIFY messages that are going unread
> for long enough to risk XID wraparound, your app is broken.  So
> maybe it'd be sufficient to discard messages that are old enough
> to approach the wrap horizon.  But still that's code that doesn't
> exist.

Thanks, TODO added.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.

Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Tom Lane
Bruce Momjian  writes:
> On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
>>> It suddenly strikes me to worry that we have an XID wraparound hazard
>>> for entries in the notify queue.

> Is this still an open issue?  Should it be a TODO item?

I don't think anyone's done anything about it, so yeah.

Realistically, if you've got NOTIFY messages that are going unread
for long enough to risk XID wraparound, your app is broken.  So
maybe it'd be sufficient to discard messages that are old enough
to approach the wrap horizon.  But still that's code that doesn't
exist.

regards, tom lane

Re: XID-wraparound hazards in LISTEN/NOTIFY

2023-11-08 Thread Bruce Momjian
On Sat, Nov 23, 2019 at 12:10:56PM -0500, Tom Lane wrote:
> Mark Dilger  writes:
> > On 11/23/19 8:34 AM, Tom Lane wrote:
> >> It suddenly strikes me to worry that we have an XID wraparound hazard
> >> for entries in the notify queue.
> 
> > Is it worth checking for this condition in autovacuum?
> 
> Dunno, maybe.  It's a different avenue to consider, at least.
> 
> > There shouldn't be too much reason to back-patch any of this, since
> > the change in 51004c717 only applies to v13 and onward.  Or do you
> > see the risk you described as "pretty minimal" as still being large
> > enough to outweigh the risk of anything we might back-patch?
> 
> There may not be a risk large enough to worry about before 51004c717,
> assuming that we discount cases like a single session staying
> idle-in-transaction for long enough for the XID counter to wrap
> (which'd cause problems for more than just LISTEN/NOTIFY).  I haven't
> analyzed this carefully enough to be sure.  We'd have to consider
> that, as well as the complexity of whatever fix we choose for HEAD,
> while deciding if we need a back-patch.

Is this still an open issue?  Should it be a TODO item?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.

Re: XID-wraparound hazards in LISTEN/NOTIFY

2019-11-23 Thread Tom Lane
Mark Dilger  writes:
> On 11/23/19 8:34 AM, Tom Lane wrote:
>> It suddenly strikes me to worry that we have an XID wraparound hazard
>> for entries in the notify queue.

> Is it worth checking for this condition in autovacuum?

Dunno, maybe.  It's a different avenue to consider, at least.

> There shouldn't be too much reason to back-patch any of this, since
> the change in 51004c717 only applies to v13 and onward.  Or do you
> see the risk you described as "pretty minimal" as still being large
> enough to outweigh the risk of anything we might back-patch?

There may not be a risk large enough to worry about before 51004c717,
assuming that we discount cases like a single session staying
idle-in-transaction for long enough for the XID counter to wrap
(which'd cause problems for more than just LISTEN/NOTIFY).  I haven't
analyzed this carefully enough to be sure.  We'd have to consider
that, as well as the complexity of whatever fix we choose for HEAD,
while deciding if we need a back-patch.

regards, tom lane

Re: XID-wraparound hazards in LISTEN/NOTIFY

2019-11-23 Thread Mark Dilger

On 11/23/19 8:34 AM, Tom Lane wrote:

> In connection with a different issue, I wrote:
>
>> * The point of calling asyncQueueReadAllNotifications in
>> Exec_ListenPreCommit is to advance over already-committed queue entries
>> before we start sending any events to the client.  Without this, a
>> newly-listening client could be sent some very stale events.  (Note
>> that 51004c717 changed this from "somewhat stale" to "very stale".)
>
> It suddenly strikes me to worry that we have an XID wraparound hazard
> for entries in the notify queue.  The odds of seeing a live problem
> with that before 51004c717 were pretty minimal, but now that we don't
> aggressively advance the queue tail, I think it's a very real risk for
> low-notify-traffic installations.  In the worst case, imagine
>
> * Somebody sends one NOTIFY, maybe just as a test.
>
> * Nothing happens for a couple of weeks, during which the XID counter
> advances by 2 billion or so.
>
> * Newly-listening sessions will now think that that old event is
> "in the future", hence fail to advance over it, resulting in denial
> of service for new notify traffic.  There is no recourse short of
> a server restart or waiting another couple of weeks for wraparound.


Is it worth checking for this condition in autovacuum?  Even for
installations with autovacuum disabled, would the anti-wraparound
vacuums happen frequently enough to also advance the tail if modified
to test for this condition?


> I thought about fixing this by tracking the queue's oldest XID in
> the shared queue info, and forcing a tail advance when that got
> too old --- but if nobody actively uses any listen or notify
> features for a while, no such code is going to execute, so the
> above scenario could happen anyway.
>
> The only bulletproof fix I can think of offhand is to widen the
> queue entries to 64-bit XIDs, which is a tad annoying from a
> space consumption standpoint.  Possibly we could compromise by
> storing the high-order bits just once per SLRU page (and then
> forcing an advance to a new page when those bits change).

> A somewhat less bulletproof fix is to detect far-in-the-future queue
> entries and discard them.  That would prevent the freezeup scenario,
> but there'd be a residual hazard of transmitting ancient
> notifications to clients (if a queue entry survived for 4G
> transactions, not just 2G).  But maybe that's OK, given the rather
> tiny probabilities involved, and the low guarantees around notify
> reliability in general.  It'd be a much more back-patchable answer
> than a queue format change, too.


There shouldn't be too much reason to back-patch any of this, since
the change in 51004c717 only applies to v13 and onward.  Or do you
see the risk you described as "pretty minimal" as still being large
enough to outweigh the risk of anything we might back-patch?

--
Mark Dilger