Hi,

On 2022-11-03 16:54:13 +0100, Jérémie Grauer wrote:
> Currently pg_rewind refuses to run if full_page_writes is off. This is to
> prevent it from running into a torn page during operation.
>
> This is usually a good call, but some file systems like ZFS are naturally
> immune to torn pages (maybe btrfs too, but I don't know for sure for this
> one).
Note that this isn't about torn pages in case of crashes, but about reading pages while they're being written to. Right now, that definitely allows for torn reads, because of the way pg_read_binary_file() is implemented. We only ensure a 4k read size from the view of our code, which obviously can lead to torn 8k page reads, no matter what the filesystem guarantees.

Also, for reasons I don't understand, we use C streaming IO in pg_read_binary_file(), so you'd also need to ensure that the buffer size used by the stream implementation can't cause the reads to happen in smaller chunks. AFAICT we really shouldn't use file streams here; then we'd at least have control over that aspect.

Does ZFS actually guarantee that there can never be short reads? As soon as they are possible, full page writes are needed.

This isn't a fundamental issue - we could have a version of pg_read_binary_file() for relation data that prevents the page from being written out concurrently by locking the buffer page. In addition, it could often avoid needing to read the page from the OS / disk when it is present in shared buffers (perhaps minus cases where we haven't flushed the WAL yet, but we could also flush the WAL in those cases).

Greetings,

Andres Freund