On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith <g...@2ndquadrant.com> wrote:
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too. Why doesn't PostgreSQL run into this
> problem? Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path. Here's
> what the close man page says, from http://linux.die.net/man/2/close :
Theodore Ts'o addresses this *exact* sequence of events, and suggests
that if you want that rename to definitely stick, you must fsync the
directory:

http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync

"""
One argument that has commonly been made on the various comment streams
is that when replacing a file by writing a new file and then renaming
"file.new" to "file", most applications don't need a guarantee that the
new contents of the file are committed to stable store at a certain
point in time; only that either the new or the old contents of the file
will be present on the disk. So the argument is essentially that the
sequence:

fd = open("foo.new", O_WRONLY);
write(fd, buf, bufsize);
fsync(fd);
close(fd);
rename("foo.new", "foo");

… is too expensive, since it provides "atomicity and durability", when
in fact all the application needed was "atomicity" (i.e., either the
new or the old contents of foo should be present after a crash), but
not durability (i.e., the application doesn't need the new version of
foo now, but rather at some indeterminate time in the future when it's
convenient for the OS).

This argument is flawed for two reasons. First of all, the sequence
above provides exactly the desired "atomicity without durability". It
doesn't guarantee which version of the file will appear in the event of
an unexpected crash; if the application needs a guarantee that the new
version of the file will be present after a crash, ***it's necessary to
fsync the containing directory***
"""

Emphasis mine.

So, all in all, I think the creation, deletion, and renaming of files
in the write-ahead log area should be followed by an fsync of pg_xlog.

I think it is also necessary to fsync directories in the cluster
directory at checkpoint time: if a chunk of directory metadata doesn't
make it to disk, a checkpoint occurs, and then there's a crash, it's
possible that replaying the WAL post-checkpoint won't create/move/
delete the file in the cluster.

The fact that this hasn't been happening (or hasn't triggered an error,
which would be scarier) may just be a happy accident of that metadata
being flushed most of the time anyway; if so, it also means that an
fsync() on the directory file descriptor won't cost very much. A sketch
of the full sequence is in the P.S. below.

--
fdr
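
P.S. For concreteness, here is a minimal sketch of the full "atomic and
durable replace" sequence, with the directory fsync Ts'o calls for
added at the end. This is only my illustration of the technique, not
backend code: the helper name durable_replace, the fixed-size path
buffers, and the crude error handling are all made up for the example.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomically and durably replace dir/name with the given contents,
 * via a "name.new" temporary file in the same directory.
 */
static int
durable_replace(const char *dir, const char *name,
                const char *buf, size_t len)
{
    char tmppath[1024], path[1024];
    int  fd, dirfd;

    snprintf(tmppath, sizeof(tmppath), "%s/%s.new", dir, name);
    snprintf(path, sizeof(path), "%s/%s", dir, name);

    /* 1. Write the new contents to a temporary file and fsync it. */
    fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len ||
        fsync(fd) != 0 || close(fd) != 0)
        return -1;

    /* 2. Atomically swap the new file into place. */
    if (rename(tmppath, path) != 0)
        return -1;

    /*
     * 3. fsync the containing directory so that the rename itself
     * (the directory entry, not just the file data) reaches stable
     * storage. On Linux, opening a directory O_RDONLY is enough to
     * be able to fsync it.
     */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) != 0)
    {
        close(dirfd);
        return -1;
    }
    return close(dirfd);
}

Steps 1 and 2 are what the backend already does for the WAL rename
path; step 3 is the part I'm arguing should also follow creation,
deletion, and renaming of files in pg_xlog, and should be applied to
the cluster's directories at checkpoint time.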