On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
>> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
>> > least on Linux. When fsync() returns success it means "all writes since the
>> > last fsync have hit disk" but we assume it means "all writes since the last
>> > SUCCESSFUL fsync have hit disk".
>>
>> > But then we retried the checkpoint, which retried the fsync(). The retry
>> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
>>
>> Random other thing we should look at: Some filesystems (nfs yes, xfs
>> ext4 no) flush writes at close(2). We check close() return code, just
>> log it... So close() counts as an fsync for such filesystems().
>
> Well, that's interesting. You might remember that NFS does not reserve
> space for writes like local file systems like ext4/xfs do. For that
> reason, we might be able to capture the out-of-space error on close and
> exit sooner for NFS.
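A minimal sketch of the idea (hypothetical code, not PostgreSQL's actual fd handling): treat an error from close() the same way as an error from fsync(), instead of merely logging it, since on an NFS-like filesystem the flush triggered by close(2) may be the first point where ENOSPC or EIO is reported.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: flush and close a file, treating failure of
 * EITHER call as fatal.  On filesystems that flush at close(2) (NFS),
 * the close() error may be our only notification of ENOSPC/EIO.
 * Returns 0 on success, -1 on any error. */
static int close_checked(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "fsync of %s failed: %s\n", path, strerror(errno));
        close(fd);              /* best effort; error already recorded */
        return -1;
    }
    if (close(fd) != 0) {
        /* Not just logged: on NFS the flush can happen here. */
        fprintf(stderr, "close of %s failed: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```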
It seems like some implementations flush on close and therefore discover the ENOSPC problem at that point, unless they have NFSv4 (RFC 3530) "write delegation" with a promise from the server that a certain amount of space is available. It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors, including ENOSPC, returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that a later fsync() doesn't fail), you're in trouble.

Some relevant code:

https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849
https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618

It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general though. It seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?), which probably won't do anything useful here if not also flushed; that fallback strategy might only work on eager-reservation filesystems that don't have direct fallocate support. So there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are. I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing?
(Unless of course it's a COW filesystem, but that's a different can of worms.)

-- 
Thomas Munro
http://www.enterprisedb.com