I'm not even sure why I'm wading into this; glutton for punishment, I guess.


TL;DR: the assumption that a data-journaled file system guarantees the
atomicity of individual write()s is, in my experience, not a valid one.



Unfortunately this isn't really a topic about which one can draw general
conclusions.  In practice, every file system -- network file systems
especially -- provides subtly different semantics.

But if I *had* to make a general statement, I'd say that even with full
data journaling you cannot trust that writes will necessarily be replayed
in full or in order without explicit fdatasync()ing as sqlite does.  There
are many ways in which the ordering may be disrupted.
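
A minimal sketch of that write-barrier pattern, in C (error handling
abbreviated, and the helper name is mine, not sqlite's): the fdatasync()
forces the data to stable storage before anything issued later can land.

    #include <unistd.h>

    static int write_then_barrier(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n < 0 || (size_t)n != len)
            return -1;              /* failed or short write */
        if (fdatasync(fd) != 0)     /* barrier: durable before we return */
            return -1;
        return 0;
    }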

Most file systems guarantee atomicity of writes only up to a relatively
small block size (4k is popular).  If you write multiple blocks, even in a
single system call, it's usually possible for one to succeed and another to
fail.
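
As an illustration (block size and names assumed, nothing here comes from a
real API): the single 16k write() below spans four 4k file-system blocks.
Even if the call returns the full count, after a crash any subset of those
blocks may actually be on disk; only a write within one block is typically
atomic.

    #include <string.h>
    #include <unistd.h>

    enum { FS_BLOCK = 4096 };

    static ssize_t write_four_blocks(int fd, off_t off)
    {
        char buf[4 * FS_BLOCK];
        memset(buf, 'x', sizeof buf);
        /* one system call, four file-system blocks: not atomic as a whole */
        return pwrite(fd, buf, sizeof buf, off);
    }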

Arguably more importantly, there's an OS page cache that sits between your
application (sqlite) and the file system.  Unless you disable the cache --
the equivalent of doing an fdatasync() after every operation anyway -- or
you have an exceptionally clever file system, the OS will combine separate
writes to the same page before they hit the disk.
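
If you really do want that behaviour, the usual knobs are O_SYNC or O_DSYNC
at open() time (write-through; O_DIRECT bypasses the page cache entirely,
with its own alignment rules): every write() then completes only once the
data is durable, which costs roughly the same as an fdatasync() after each
write.  A sketch, with a made-up file name:

    #include <fcntl.h>

    int open_write_through(void)
    {
        /* O_DSYNC: each write() returns only after the data reaches stable storage */
        return open("example.db", O_RDWR | O_DSYNC);
    }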

The purpose of data journaling for most file systems isn't to provide
strict atomicity OR ordering of application writes.  It's to prevent things
like an unallocated data block (containing whatever random rubbish was
already at that point on the disk platter) appearing in your file if a
power loss or system crash occurs shortly after a write().
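
The classic case is extending a file: the append below allocates a fresh
block, and without data journaling a crash between the size update and the
data reaching the platter can leave that block showing whatever stale bytes
were already there.  (File name and sizes are illustrative only.)

    #include <fcntl.h>
    #include <unistd.h>

    int append_one_block(const char *path)
    {
        char buf[4096] = {0};
        ssize_t n;
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        n = write(fd, buf, sizeof buf);   /* extends the file by one block */
        close(fd);
        return n == (ssize_t)sizeof buf ? 0 : -1;
    }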

Consider the case where:

- sqlite transaction A modifies blocks 1 and 2 of the file
- sqlite transaction B modifies blocks 2 and 3 of the file
- you kick out the power cable

Without an intermediate fdatasync(), it's entirely possible with most file
systems that:

- blocks 1 and 2 are written, containing the changes from both
transactions, but not block 3

- blocks 2 and 3 are written, containing the changes from both
transactions, but not block 1

- blocks 1 and 3 are written, but not block 2

- all three blocks are written, but block 2 is missing the modification
from transaction B

Any of those scenarios results in a corrupted database.
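
Conversely, here is a sketch of what the intermediate barrier buys you
(block numbers and the helper name are mine, purely for illustration): with
the fdatasync() between the two transactions, A is fully on disk before any
block of B can get there, which rules out the mixed outcomes above.  B can
still be half-applied, which is why sqlite pairs this with its journal.

    #include <unistd.h>

    #define BLK 4096

    static int commit_a_then_b(int fd, const char a1[BLK], const char a2[BLK],
                               const char b2[BLK], const char b3[BLK])
    {
        /* transaction A: blocks 1 and 2 */
        if (pwrite(fd, a1, BLK, 1 * BLK) != BLK) return -1;
        if (pwrite(fd, a2, BLK, 2 * BLK) != BLK) return -1;
        if (fdatasync(fd) != 0) return -1;    /* barrier: A durable before B begins */

        /* transaction B: blocks 2 and 3 */
        if (pwrite(fd, b2, BLK, 2 * BLK) != BLK) return -1;
        if (pwrite(fd, b3, BLK, 3 * BLK) != BLK) return -1;
        return fdatasync(fd) == 0 ? 0 : -1;   /* barrier: B durable */
    }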

Maybe you're using a file system that protects against these things and
guarantees strict ordering.  Are you sure it does that even in the face of
a power outage?  Of a disk failure?  Of a disk failure that occurs when
restoring from a power failure?  Of the failure of one of the DIMMs in its
cache?

For 6 years I worked with clients who had a nearly unlimited budget to
throw at hardware, and none of those systems provided those guarantees
unless you disabled the caches (at which point you might as well let sqlite
call fsync).  Are you willing to bet that yours does?

Cheers,

-p

On 28 January 2013 18:57, Shuki Sasson <gur.mons...@gmail.com> wrote:

>
> A *physical journal* logs an advance copy of every block that will later be
> written to the main file system. If there is a crash when the main file
> system is being written to, the write can simply be replayed to completion
> when the file system is next mounted. If there is a crash when the write is
> being logged to the journal, the partial write will have a missing or
> mismatched checksum and can be ignored at next mount.
>
> Physical journals impose a significant performance penalty because every
> changed block must be committed *twice* to storage, but may be acceptable
> when *absolute fault protection is required*.