Summary to date:
It's worse than I thought originally, because:

- Most widely deployed kernels have cases where they don't tell you about losing your writes at all; and
- Information about loss of writes can be masked by closing and re-opening a file.

So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.

It's not clear to me yet why data_err=abort isn't sufficient in data=journal mode on ext3 or ext4; needs more digging. (In my test tools that's: make FSTYPE=ext4 MKFSOPTS="" MOUNTOPTS="errors=remount-ro,data_err=abort,data=journal" as of the current version d7fe802ec.) AFAICS that's because data_err=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get repeated failures on retries of the same write. This post https://lkml.org/lkml/2008/10/10/80 added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.

zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.

So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data-corrupting condition, since NFS doesn't preallocate space before returning from write().

I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants either of those, but I'd rather have either of them than silent data loss.

I'm currently tweaking my test to close and re-open the file between each write() and fsync(), and to support running on NFS. (A minimal sketch of the close/re-open pattern is at the end of this summary.)

I've also just found the device-mapper "flakey" driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the "error" target in a mapping, which lets me remap part of the device to always error, but "flakey" looks very handy for actual PostgreSQL testing.
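To make the failure mode concrete, here's a minimal sketch of the pattern in question: buffered write, close, re-open, fsync. This is not my actual test tool; the path is just an example and is assumed to sit on a device-mapper "error" or "flakey" mapping that fails writeback. The point is that on an affected kernel every call below can report success even though the data never reached disk, because the writeback error was reported (and cleared) at close, or consumed by someone else's sync(2), before the second fsync().

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* Example path; assumed to live on a dm mapping that fails writeback. */
        const char *path = "/mnt/faulty/testfile";
        char        buf[8192];
        int         fd;

        memset(buf, 'x', sizeof(buf));

        fd = open(path, O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffered write succeeds; the I/O error happens later, in writeback. */
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) { perror("write"); return 1; }

        if (close(fd) != 0)
            perror("close");    /* the error may surface (and be cleared) here ... */

        /* ... or be consumed by an unrelated process calling sync(2) in the meantime. */

        fd = open(path, O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        if (fsync(fd) != 0)
        {
            /* This is where we'd want to PANIC rather than retry. */
            perror("fsync");
            return 1;
        }

        /* On affected kernels we can get here even though the write was lost. */
        printf("fsync() reported success after re-open\n");
        close(fd);
        return 0;
    }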
For the sake of Google, these are errors known to be associated with the problem:

ext4, and ext3 mounted with the ext4 driver:

    [42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393)
    [42084.327352] Buffer I/O error on device dm-0, logical block 59393

xfs:

    [42193.771367] XFS (dm-0): writeback error on sector 118784
    [42193.784477] XFS (dm-0): writeback error on sector 118784

jfs:

    (nil, silence in the kernel logs)

You should also beware of "lost page write" or "lost write" errors.