On Wed, Jan 23, 2019 at 9:29 AM Kevin Grittner <kgri...@gmail.com> wrote:
> Can you point to a post explaining how the inode can be evicted?

Hi Kevin,

To recap the (admittedly confusing) list of problems with Linux fsync or our usage:

1. On Linux < 4.13, the error state can be cleared in various surprising ways so that we never hear about it. Jeff Layton identified and fixed this problem for 4.13+ by switching from an error flag to an error counter that is tracked in such a way that every fd hears about every error in the file.

2. Layton's changes originally assumed that you only wanted to hear about errors that happened after you opened the file (i.e. it set the fd's counter to the inode's current level at open time). Craig Ringer complained about this. Everyone complained about this. A fix was then made so that one fd also reports errors that happened before opening, if no one else has seen them yet. This is the change that was back-patched as far as Linux 4.14. So long as no third program comes along and calls fsync on a file that we don't have open anywhere, thereby eating the "not seen" flag before the checkpointer gets around to opening the file, all is well.

3. Regardless of the above changes, we also learned that pages are unceremoniously dropped from the page cache after write-back errors, so that calling fsync() again after a failure is a bad idea (it might report success, but your dirty data written before the previous fsync() call is gone). We handled that by introducing a PANIC after any fsync failure:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=9ccdd7f66e3324d2b6d3dec282cfa9ff084083f1

So did MySQL, MongoDB, and probably everyone else who spat out their cornflakes while reading articles like "PostgreSQL's fsync() surprise" in the Linux Weekly News that resulted from Craig's report:

https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a
https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96

4. Regardless of all of the above changes, there is still one way to lose track of an error, as Andres mentioned: during a period of time when neither the writing backend nor the checkpointer has the file open, the kernel may choose to evict the inode from kernel memory, and thereby forget about an error that we haven't received yet.

Problems 1-3 are solved by changes to Linux and PostgreSQL.

Problem 4 would be solved by this "fd-passing" scheme (file descriptors are never closed until after fsync has been called, existing in the purgatory of Unix socket ancillary data until the checkpointer eventually deals with them), but it's complicated and not quite fully baked yet. It could also be solved by the kernel agreeing not to evict inodes that hold error state, or to promote the error to device level, or something like that. IIUC those kinds of ideas have been rejected so far. (It can also be solved by using FreeBSD and/or ZFS, so you don't have problem 3 and therefore don't have the other problems.)

I'm not sure how likely that failure mode actually is, but I guess you need a large number of active files, a low PostgreSQL max_safe_fds so we close descriptors aggressively, a kernel that is low on memory or has a high vfs_cache_pressure setting so that it throws out recently used inodes aggressively, enough time between checkpoints for all of the above to happen, and then some IO errors when the kernel is writing back dirty data asynchronously while you don't have the file open anywhere.

-- 
Thomas Munro
http://www.enterprisedb.com
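
To make point 3 above concrete, here is a minimal sketch in C of the "never retry fsync, PANIC instead" policy. It is not the code from the linked PostgreSQL commit: the helper names write_all() and fsync_or_panic() are hypothetical, and abort() merely stands in for a PANIC followed by crash recovery from the WAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write the whole buffer,
 * continuing after short writes and retrying on EINTR.
 * Returns 0 on success, -1 on error with errno set by write().
 */
static int
write_all(int fd, const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    while (len > 0)
    {
        ssize_t n = write(fd, buf, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing; try again */
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical helper: flush fd to stable storage exactly once.
 *
 * On failure we deliberately do NOT call fsync() a second time.  After a
 * write-back error Linux may already have dropped the dirty pages, so a
 * retry could report success for data that never reached disk.  Aborting
 * here stands in for PANIC, after which recovery would replay the WAL.
 */
static void
fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0)
    {
        fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
                path, strerror(errno));
        abort();
    }
}
```

A caller would use write_all() followed by a single fsync_or_panic() on the same descriptor. Note that this only addresses problem 3; it does nothing about problem 4, where the error is forgotten while no process has the file open.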