Hi, On 2022-11-08 00:07:09 +0100, Jérémie Grauer wrote: > On 06/11/2022 03:38, Andres Freund wrote: > > Hi, > > > > On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote: > > > Currently pg_rewind refuses to run if full_page_writes is off. This is to > > > prevent it to run into a torn page during operation. > > > > > > This is usually a good call, but some file systems like ZFS are naturally > > > immune to torn page (maybe btrfs too, but I don't know for sure for this > > > one). > > > > Note that this isn't about torn pages in case of crashes, but about reading > > pages while they're being written to.
> Like I wrote above, ZFS will prevent torn pages on writes, like > full_page_writes does. Unfortunately not relevant for pg_rewind due to the issues mentioned subsequently. > > Right now, that definitely allows for torn reads, because of the way > > pg_read_binary_file() is implemented. We only ensure a 4k read size from > > the > > view of our code, which obviously can lead to torn 8k page reads, no matter > > what the filesystem guarantees. > > > > Also, for reasons I don't understand we use C streaming IO or > > pg_read_binary_file(), so you'd also need to ensure that the buffer size > > used > > by the stream implementation can't cause the reads to happen in smaller > > chunks. Afaict we really shouldn't use file streams here, then we'd at > > least > > have control over that aspect. > > > > > > Does ZFS actually guarantee that there never can be short reads? As soon as > > they are possible, full page writes are neededI may be missing something > > here: how does full_page_writes prevents > short _reads_ ? Yes. > Presumably, if we do something like read the first 4K of a file, then change > the file, then read the next 4K, the second 4K may be a torn read. Correct. > But I fail to see how full_page_writes prevents this since it only act on > writes It ensures the damage is later repaired during WAL replay. Which can only happen if the WAL contains the necessary information to do so - the full page writes. I suspect to avoid the need for this we'd need to atomically read all the pages involved in a WAL record (presumably by locking the pages against IO). That'd then safely allow skipping replay of WAL records based on the LSN.a A slightly easier thing would be to force-enable full page writes just for the duration of a rewind, similar to what we do during base backups. But that'd still require a bunch more work than done here. > > This isn't an fundamental issue - we could have a version of > > pg_read_binary_file() for relation data that prevents the page being written > > out concurrently by locking the buffer page. In addition it could often > > avoid > > needing to read the page from the OS / disk, if present in shared buffers > > (perhaps minus cases where we haven't flushed the WAL yet, but we could also > > flush the WAL in those). > > I agree, but this would need a differen patch, which may be beyond my > skills. > > Greetings, > > > > Andres Freund > Anyway, ZFS will act like full_page_writes is always active, so isn't the > proposed modification to pg_rewind valid? No. This really isn't about the crash safety aspects of full page writes, so the fact that ZFS is used is just not really relevant. Regards, Andres