Hi Michael,

On Thu, 29 Feb 2024 at 06:05, Michael Paquier <mich...@paquier.xyz> wrote:

>
> Wow.  Have you seen that in an actual production environment?
>

Yes, we see it regularly, and it is reproducible in test environments as
well.


> my $start_page = start_of_page($end_lsn);
> my $wal_file = write_wal($primary, $TLI, $start_page,
>                          "\x00" x $WAL_BLOCK_SIZE);
> # copy the file we just "hacked" to the archive
> copy($wal_file, $primary->archive_dir);
>
> So you are emulating a failure by filling with zeros the second page
> where the last emit_message() generated a record, and the page before
> that includes the continuation record.  Then abuse of WAL archiving to
> force the replay of the last record.  That's kind of cool.
>

Right, at this point it is easier than causing an artificial crash on the
primary right after it has finished writing just one page.


> > To be honest, I don't know yet how to fix it nicely. I am thinking about
> > returning XLREAD_FAIL from XLogPageRead() if it suddenly switched to a new
> > timeline while trying to read a page and if this page is invalid.
>
> Hmm.  I suspect that you may be right on a TLI change when reading a
> page.  There are a bunch of side cases with continuation records and
> header validation around XLogReaderValidatePageHeader().  Perhaps you
> have an idea of patch to show your point?
>

Not yet, but hopefully I will get something done next week.


>
> Nit.  In your test, it seems to me that you should not call directly
> set_standby_mode and enable_restoring, just rely on has_restoring with
> the standby option included.
>

Thanks, I'll look into it.

-- 
Regards,
--
Alexander Kukushkin
