On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:
>> > TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at
>> > least on Linux. When fsync() returns success it means "all writes since the
>> > last fsync have hit disk" but we assume it means "all writes since the last
>> > SUCCESSFUL fsync have hit disk".
>>
>> > But then we retried the checkpoint, which retried the fsync(). The retry
>> > succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
>>
>> Random other thing we should look at: Some filesystems (nfs yes, xfs
>> ext4 no) flush writes at close(2). We check close() return code, just
>> log it... So close() counts as an fsync for such filesystems().
>
> Well, that's interesting. You might remember that NFS does not reserve
> space for writes like local file systems like ext4/xfs do. For that
> reason, we might be able to capture the out-of-space error on close and
> exit sooner for NFS.
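A minimal sketch of the idea (hypothetical code, not PostgreSQL's actual fd handling): treat an error from close() the same way as an error from fsync(), instead of merely logging it, since on an NFS-like filesystem the flush triggered by close(2) may be the first point where ENOSPC or EIO is reported.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: flush and close a file, treating failure of
 * EITHER call as fatal.  On filesystems that flush at close(2) (NFS),
 * the close() error may be our only notification of ENOSPC/EIO.
 * Returns 0 on success, -1 on any error. */
static int close_checked(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "fsync of %s failed: %s\n", path, strerror(errno));
        close(fd);              /* best effort; error already recorded */
        return -1;
    }
    if (close(fd) != 0) {
        /* Not just logged: on NFS the flush can happen here. */
        fprintf(stderr, "close of %s failed: %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```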
It seems like some implementations flush on close and therefore discover the ENOSPC problem at that point, unless they have NFSv4 (RFC 3530) "write delegation" with a promise from the server that a certain amount of space is available. It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors, including ENOSPC, returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that a later fsync() doesn't fail), you're in trouble.

Some relevant code:

https://github.com/torvalds/linux/commit/5445b1fbd123420bffed5e629a420aa2a16bf849
https://github.com/freebsd/freebsd/blob/master/sys/fs/nfsclient/nfs_clvnops.c#L618

It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general though. It seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?), which probably won't do anything useful here if not also flushed; that fallback strategy might only work on eager-reservation filesystems that don't have direct fallocate support. So there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are. I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing?
(Unless of course it's a COW filesystem, but that's a different can of worms.)

-- 
Thomas Munro
http://www.enterprisedb.com