On 11/27/2015 02:18 PM, Michael Paquier wrote:
> On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> So, what's going on? The problem is that while the rename() is atomic, it's
>> not guaranteed to be durable without an explicit fsync on the parent
>> directory. And by default we only do fdatasync on the recycled segments,
>> which may not force fsync on the directory (and ext4 does not do that,
>> apparently).

> Yeah, that seems to be how the POSIX spec settles it.
> "If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
> force all currently queued I/O operations associated with the file
> indicated by file descriptor fildes to the synchronized I/O completion
> state. All I/O operations shall be completed as defined for
> synchronized I/O file integrity completion."
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
> If I understand that right, the rename() is guaranteed to be atomic,
> meaning that there will be only one file even if there is a crash, but
> we still need to fsync() the parent directory as mentioned.

>> FWIW this has nothing to do with storage reliability - you may have good
>> drives, a RAID controller with BBU, reliable SSDs or whatever, and you're
>> still not safe. This issue is at the filesystem level, not the storage level.

> The POSIX spec allows this behavior, so the FS is clearly not to blame.
> At least that's what I get from it.

The spec seems a bit vague to me (though maybe it's not; I'm not a POSIX expert), but I think we should be prepared for the less favorable interpretation.
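
Just to make the failure mode concrete, here is a rough standalone sketch, in plain POSIX C, of what a "durable rename" has to look like. rename_durable() is just an illustrative name, not something the backend has, and error handling is minimal:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
rename_durable(const char *oldpath, const char *newpath, const char *parentdir)
{
    int fd;

    /* flush the file's contents before it becomes visible under newpath */
    fd = open(oldpath, O_RDWR);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* the rename itself is atomic: after a crash we see old or new, never both */
    if (rename(oldpath, newpath) != 0)
        return -1;

    /*
     * ... but it is not durable until the parent directory is fsync'ed;
     * without this, the directory entry may be lost on power failure.
     */
    fd = open(parentdir, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    return 0;
}

The point is the last fsync(): rename() alone only guarantees we see either the old or the new name after a crash, while flushing the parent directory is what guarantees we actually see the new one.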


>> I think this issue might also cause various other problems, not just data
>> loss. For example, I wouldn't be surprised by data corruption due to
>> flushing some of the changes in data files to disk (because of contention
>> on shared buffers and hitting vm.dirty_bytes) and then losing the matching
>> WAL segment. Also, while I have only seen 1 to 3 segments get lost, it may
>> be possible to lose more, potentially making recovery impossible. And of
>> course, this might cause problems with WAL archiving, due to archiving the
>> same segment twice (before and after the crash).

> Possibly; the switch to .done happens after renaming the segment in
> xlogarchive.c, so this could happen in theory.

Yes. That's one of the suspicious places in my notes (I haven't posted all the details; the message was long enough already).

>> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty sure
>> it needs to be backpatched to all back branches. I've also attached a patch
>> that adds a pg_current_xlog_flush_location() function, which proved quite
>> useful when debugging this issue.

> Agreed. We should also make sure that the calls to fsync_fname() are
> issued inside a critical section with START/END_CRIT_SECTION(). That does
> not seem to be the case with your patch.

I don't know. I based that on the code in replication/logical/, which calls fsync_fname() in all the interesting places without a critical section.
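
For reference, this is roughly the call sequence I have in mind right after a segment gets installed under its final name (a sketch only, not the exact patch; tmppath and path are placeholder names, fsync_fname() is the existing helper from fd.c, and XLOGDIR is "pg_xlog"):

    if (rename(tmppath, path) < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not rename file \"%s\" to \"%s\": %m",
                        tmppath, path)));

    /* make the rename durable, not just atomic */
    fsync_fname(path, false);       /* the segment file itself */
    fsync_fname(XLOGDIR, true);     /* its parent directory */

Whether those two calls need to sit inside START/END_CRIT_SECTION() is exactly the open question.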

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


