At Fri, 1 Mar 2024 08:17:04 +0900, Michael Paquier <mich...@paquier.xyz> wrote in
> On Thu, Feb 29, 2024 at 05:44:25PM +0100, Alexander Kukushkin wrote:
> > On Thu, 29 Feb 2024 at 08:18, Kyotaro Horiguchi <horikyota....@gmail.com>
> > wrote:
> >> In the first place, it's important to note that we do not guarantee
> >> that an async standby can always switch its replication connection to
> >> the old primary or another sibling standby. This is due to the
> >> variations in replication lag among standbys. pg_rewind is required to
> >> adjust such discrepancies.
> >
> > Sure, I know. But in this case the async standby received and flushed
> > absolutely the same amount of WAL as the promoted one.
>
> Ugh.  If it can happen transparently to the user without the user
> knowing directly about it, that does not sound good to me.  I did not
> look very closely at monitoring tools available out there, but if both
> standbys flushed the same WAL locations a rewind should not be
> required.  It is not something that monitoring tools would be able to
> detect because they just look at LSNs.
>
> >> In the repro script, the replication connection of the second standby
> >> is switched from the old primary to the first standby after its
> >> promotion. After the switching, replication is expected to continue
> >> from the beginning of the last replayed segment.
> >
> > Well, maybe, but apparently the standby is busy trying to decode a record
> > that spans multiple pages, and it is just infinitely waiting for the next
> > page to arrive. Also, the restart "fixes" the problem, because indeed it is
> > reading the file from the beginning.
>
> What happens if the continuation record spans multiple segment
> file boundaries in this case?  We would go back to the beginning of
> the segment where the record spanning multiple segments began,
> right?  (I may recall this part of contrecords incorrectly, feel free
> to correct me if necessary.)
After reading this, I came up with a possibility that walreceiver
recovers more quickly than the calling interval of
WaitForWALToBecomeAvailable(). If walreceiver disconnects after a call
to WaitForWAL...() and then somehow recovers the connection before the
next call, the function does not notice the disconnection and wrongly
returns XLREAD_SUCCESS. If this assumption is correct, the correct fix
might be for us to return XLREAD_FAIL when a reconnection has happened
since the last call to WaitForWAL...().

> >> But with the script,
> >> the second standby copies the intentionally broken file, which differs
> >> from the data that should be received via streaming.
>
> > As I already said, this is a simple way to emulate the primary crash while
> > standbys receiving WAL.
> > It could easily happen that a record spanning multiple pages is not fully
> > received and flushed.
>
> I think that's OK to do so at test level to force a test in the
> backend, FWIW, because that's cheaper, and 039_end_of_wal.pl has
> proved that this can be designed to be cheap and stable across the
> buildfarm fleet.

Yeah, I agree that it clearly illustrates the final state after the
issue has happened, but if my assumption above is correct, the test
doesn't manifest the real issue.

> For anything like that, the result had better have solid test
> coverage, where perhaps we'd better refactor some of the routines of
> 039_end_of_wal.pl into a module to use them here, rather than
> duplicate the code.  The other test has a few assumptions with the
> calculation of page boundaries, and I'd rather not duplicate that
> across the tree.

I agree on that point.
> Seeing that Alexander is a maintainer of Patroni, which is very
> probably used by his employer across a large set of PostgreSQL
> instances, if he says that he's seen that in the field, that's good
> enough for me to respond to the problem, especially if reconnecting a
> standby to a promoted node where both flushed the same LSN.  Now the
> level of response also depends on the invasiveness of the change, and
> we need a very careful evaluation here.

I don't mean to say that we should respond with DNF to this "issue" at
all. I simply wanted to make the real issue clear.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center