Re: Refactoring the checkpointer's fsync request queue

Thomas Munro Tue, 22 Jan 2019 12:33:15 -0800

On Wed, Jan 23, 2019 at 9:29 AM Kevin Grittner <[email protected]> wrote:
> Can you point to a post explaining how the inode can be evicted?

Hi Kevin,

To recap the (admittedly confusing) list of problems with Linux fsync
or our usage:

1. On Linux < 4.13, the error state can be cleared in various
surprising ways so that we never hear about it. Jeff Layton
identified and fixed this problem for 4.13+ by switching from an error
flag to an error counter that is tracked in such a way that every fd
hears about every error in the file.

2. Layton's changes originally assumed that you only wanted to hear
about errors that happened after you opened the file (ie it set the
fd's counter to the inode's current level at open time). Craig Ringer
complained about this. Everyone complained about this. A fix was
then made so that one fd also reports errors that happened before
opening, if no one else has seen them yet. This is the change that
was back-patched as far as Linux 4.14. So long as no third program
comes along and calls fsync on a file that we don't have open
anywhere, thereby eating the "not seen" flag before the checkpointer
gets around to opening the file, all is well.

3. Regardless of the above changes, we also learned that pages are
unceremoniously dropped from the page cache after write-back errors,
so that calling fsync() again after a failure is a bad idea (it might
report success, but your dirty data written before the previous
fsync() call is gone). We handled that by introducing a PANIC after
any fsync failure:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1

So did MySQL, MongoDB, and probably everyone else who spat out their
cornflakes while reading articles like "PostgreSQL's fsync() surprise"
in the Linux Weekly News that resulted from Craig's report:

https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a

https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96

4. Regardless of all of the above changes, there is still one way to
lose track of an error, as Andres mentioned: during a period of time
when neither the writing backend nor the checkpointer has the file
open, the kernel may choose to evict the inode from kernel memory, and
thereby forget about an error that we haven't received yet.

Problems 1-3 are solved by changes to Linux and PostgreSQL.

Problem 4 would be solved by this "fd-passing" scheme (file
descriptors are never closed until after fsync has been called,
existing in the purgatory of Unix socket ancillary data until the
checkpointer eventually deals with them), but it's complicated and not
quite fully baked yet.

It could also be solved by the kernel agreeing not to evict inodes
that hold error state, or to promote the error to device level, or
something like that. IIUC those kinds of ideas were rejected so far.

(It can also be solved by using FreeBSD and/or ZFS, so you don't have
problem 3 and therefore don't have the other problems.)

I'm not sure how likely that failure mode actually is, but I guess you
need a large number of active files, a low PostgreSQL max_safe_fds so
we close descriptors aggressively, a kernel that is low on memory or
has a high vfs_cache_pressure setting so that it throws out recently
used inodes aggressively, enough time between checkpoints for all of
the above to happen, and then some IO errors when the kernel is
writing back dirty data asynchronously while you don't have the file
open anywhere.

--
Thomas Munro
http://www.enterprisedb.com

Re: Refactoring the checkpointer's fsync request queue

Reply via email to