Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier <mich...@paquier.xyz> wrote:
> Wow. Have you seen that in an actual production environment?

Yes, we see it regularly, and it is reproducible in test environments as well.

>     my $start_page = start_of_page($end_lsn);
>     my $wal_file = write_wal($primary, $TLI, $start_page,
>         "\x00" x $WAL_BLOCK_SIZE);
>     # copy the file we just "hacked" to the archive
>     copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record. Then abuse of WAL archiving to
> force the replay of the last record. That's kind of cool.

Right, at this point that is easier than causing an artificial crash on the
primary right after it has finished writing just one page.

> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm. I suspect that you may be right on a TLI change when reading a
> page. There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader(). Perhaps you
> have an idea of patch to show your point?

Not yet, but hopefully I will get something done next week.

> Nit. In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.

Thanks, I'll look into it.

--
Regards,
--
Alexander Kukushkin