Re: new option to allow pg_rewind to run without full_page_writes

Andres Freund Mon, 07 Nov 2022 16:34:43 -0800

Hi,

On 2022-11-08 00:07:09 +0100, Jérémie Grauer wrote:
> On 06/11/2022 03:38, Andres Freund wrote:
> > Hi,
> >
> > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> > > Currently pg_rewind refuses to run if full_page_writes is off. This is to
> > > prevent it to run into a torn page during operation.
> > >
> > > This is usually a good call, but some file systems like ZFS are naturally
> > > immune to torn page (maybe btrfs too, but I don't know for sure for this
> > > one).
> >
> > Note that this isn't about torn pages in case of crashes, but about reading
> > pages while they're being written to.


> Like I wrote above, ZFS will prevent torn pages on writes, like
> full_page_writes does.

Unfortunately not relevant for pg_rewind due to the issues mentioned
subsequently.


> > Right now, that definitely allows for torn reads, because of the way
> > pg_read_binary_file() is implemented.  We only ensure a 4k read size from 
> > the
> > view of our code, which obviously can lead to torn 8k page reads, no matter
> > what the filesystem guarantees.
> >
> > Also, for reasons I don't understand we use C streaming IO or
> > pg_read_binary_file(), so you'd also need to ensure that the buffer size 
> > used
> > by the stream implementation can't cause the reads to happen in smaller
> > chunks.  Afaict we really shouldn't use file streams here, then we'd at 
> > least
> > have control over that aspect.
> >
> >
> > Does ZFS actually guarantee that there never can be short reads? As soon as
> > they are possible, full page writes are neededI may be missing something
> > here: how does full_page_writes prevents

> short _reads_ ?

Yes.


> Presumably, if we do something like read the first 4K of a file, then change
> the file, then read the next 4K, the second 4K may be a torn read.

Correct.


> But I fail to see how full_page_writes prevents this since it only act on 
> writes

It ensures the damage is later repaired during WAL replay. Which can only
happen if the WAL contains the necessary information to do so - the full page
writes.


I suspect to avoid the need for this we'd need to atomically read all the
pages involved in a WAL record (presumably by locking the pages against
IO). That'd then safely allow skipping replay of WAL records based on the LSN.a

A slightly easier thing would be to force-enable full page writes just for the
duration of a rewind, similar to what we do during base backups. But that'd
still require a bunch more work than done here.


> > This isn't an fundamental issue - we could have a version of
> > pg_read_binary_file() for relation data that prevents the page being written
> > out concurrently by locking the buffer page. In addition it could often 
> > avoid
> > needing to read the page from the OS / disk, if present in shared buffers
> > (perhaps minus cases where we haven't flushed the WAL yet, but we could also
> > flush the WAL in those).
> > I agree, but this would need a differen patch, which may be beyond my
> skills.
> > Greetings,
> >
> > Andres Freund
> Anyway, ZFS will act like full_page_writes is always active, so isn't the
> proposed modification to pg_rewind valid?

No. This really isn't about the crash safety aspects of full page writes, so
the fact that ZFS is used is just not really relevant.

Regards,

Andres

Re: new option to allow pg_rewind to run without full_page_writes

Reply via email to