At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota....@gmail.com>
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> >
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
>
> Ugh.  If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me.  I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required.  It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
>
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> >
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
>
> What happens if the continuation record spans multiple segment
> file boundaries in this case?  We would go back to the beginning of
> the segment where the record spanning multiple segments began,
> right?  (I may recall this part of contrecords incorrectly, feel free
> to correct me if necessary.)
After reading this, I came up with a possibility that walreceiver
recovers more quickly than the calling interval of
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to WaitForWAL...() and then somehow recovers the connection before the
next call, the function does not notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
since the last call to WaitForWAL...().

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
>
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that a record spanning multiple pages is not fully
> > received and flushed.
>
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue has happened, but if my assumption above is correct, the test
doesn't manifest the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code.  The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree on that point.
> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN.  Now the
> level of response also depends on the invasiveness of the change, and
> we need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center