On 02/14/2010 03:49 PM, Andres Freund wrote:
> On Sunday 14 February 2010 21:41:02 Mark Mielke wrote:
>> The widely reported problems, though, did not tend to be a problem with
>> directory changes written too late - but directory changes being written
>> too early. That is, the directory change is written to disk, but the
>> file content is not. This is likely because of the "ordered journal"
>> mode widely used in ext3/ext4 where metadata changes are journalled, but
>> file pages are not journalled. Therefore, it is important for some
>> operations that the file pages are pushed to disk using fsync(file)
>> before the metadata changes are journalled.
> Well, but that's not a problem with pg as it fsyncs the file contents.

Exactly. Not a problem.
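
To make the "too early" failure concrete: the usual defence is to fsync the file contents before the rename that publishes the file, so the directory change can never reach disk ahead of the data. Below is a rough Python sketch of that ordering. It is only an illustration, not PostgreSQL code; the function names and the ".tmp" suffix are made up for the example, and it assumes a POSIX-like system.

```python
import os
import tempfile


def _write_all(fd, payload):
    """Loop until every byte is handed to the kernel (os.write may be partial)."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def replace_file_safely(path, payload):
    """Write payload to a temporary file, fsync it, then rename it over path."""
    tmp_path = path + ".tmp"  # hypothetical naming scheme for the example
    fd = os.open(tmp_path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, payload)
        # Push the file pages to disk BEFORE the directory entry changes.
        os.fsync(fd)
    finally:
        os.close(fd)
    # Only now make the metadata change that exposes the new contents.
    os.replace(tmp_path, path)


def demo():
    """Replace a small file and report (new contents, temp file still exists?)."""
    with tempfile.TemporaryDirectory() as directory:
        target = os.path.join(directory, "example.dat")
        with open(target, "wb") as handle:
            handle.write(b"old")
        replace_file_safely(target, b"new contents")
        with open(target, "rb") as handle:
            return handle.read(), os.path.exists(target + ".tmp")
```

Skipping the os.fsync() call is what leaves the window in which the rename is journalled but the data is not. Note that this sketch does nothing about the rename itself being durable; that is the separate directory question discussed below.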

>> If you are concerned, enable dirsync.
> If the filesystem already behaves that way, an fsync on it should be fairly
> cheap. If it doesn't behave that way, doing it is correct...

Well, I disagree, as the whole point of this thread is that fsync() is *not* cheap. :-)

> Besides, there is no reason to fsync the directory before the checkpoint, so
> dirsync would require a higher cost than doing it correctly.

Using "ordered" metadata journaling has approximately the same effect. Provided that the data is fsync()'d before the metadata is required, either the metadata is recorded in the journal, in which case the data is accessible, or the metadata is NOT recorded in the journal, in which case, the files will appear missing. The races that theoretically exist would be in situations where the data of one file references a separate file that does not yet exist.

You said you would try and reproduce - are you going to try and reproduce on ext3/ext4 with ordered journalling enabled? I think reproducing outside of a case such as CREATE DATABASE would be difficult. It would have to be something like:

    open(O_CREAT)/write()/fsync()/close() of new data file, where data gets written, but directory data is not yet written out to journal
    open()/.../write()/fsync()/close() of existing file to point to new data file, but directory data is still not yet written out to journal
    crash

In this case, "dirsync" should be effective at closing this hole.
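
For comparison, the application-level alternative Andres is arguing for (fsync the directory itself once the new file exists) could look roughly like the Python sketch below. Again, this is an illustration only: the function names are invented, it assumes a POSIX-like system where a directory can be opened read-only and passed to fsync(), and it says nothing about when PostgreSQL would actually issue the call.

```python
import os
import tempfile


def _write_all(fd, payload):
    """Loop until every byte is handed to the kernel (os.write may be partial)."""
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def write_durably(directory, name, payload):
    """Create a file, fsync its contents, then fsync the containing directory."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o600)
    try:
        _write_all(fd, payload)
        os.fsync(fd)  # file pages reach disk first
    finally:
        os.close(fd)
    # Then ask for the new directory entry itself to be written out.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


def demo():
    """Write one file durably into a scratch directory and read it back."""
    with tempfile.TemporaryDirectory() as directory:
        path = write_durably(directory, "newfile", b"hello")
        with open(path, "rb") as handle:
            return handle.read()
```

With "dirsync" on the mount, the second fsync() would be redundant, because directory updates are already synchronous; without it, that call is what closes the window before the crash in the sequence above.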

As for cost? Well, most PostgreSQL data is stored within file content, not directory metadata. I think "dirsync" might slow down some operations like CREATE DATABASE or "rm -fr", but I would not expect it to affect day-to-day performance of the database under real load. Many operating systems enable the equivalent of "dirsync" by default. I believe Solaris does this, for example, and other than slowing down "rm -fr", I don't recall any real complaints about the cost of "dirsync".

After writing the above, I'm seriously considering adding "dirsync" to my /db mounts that hold PostgreSQL and MySQL data.
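
For anyone wanting to check which mounts already carry the option, a small Python sketch along these lines might do. It assumes the Linux /proc/mounts layout (device, mount point, filesystem type, comma-separated options) and that the option shows up there as "dirsync"; the function names and the sample devices are invented for the example.

```python
def mounts_with_option(mount_table_text, option="dirsync"):
    """Return the mount points whose option list contains the given option."""
    found = []
    for line in mount_table_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if option in options:
            found.append(mount_point)
    return found


def read_mount_table(path="/proc/mounts"):
    """Read the kernel's mount table (Linux-specific path, an assumption here)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Hypothetical sample in /proc/mounts format, used instead of the live table
# so the example behaves the same everywhere.
SAMPLE = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /db ext3 rw,dirsync,data=ordered 0 0\n"
)
```

On a live Linux system, mounts_with_option(read_mount_table()) would list the mount points that currently have it set.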

Cheers,
mark

--
Mark Mielke <m...@mielke.cc>


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers