I'm not even sure why I'm wading into this; glutton for punishment, I guess.
TL;DR: the assumption that a data-journaled file system guarantees the atomicity of individual write()s is, in my experience, not a valid one.

Unfortunately this isn't really a topic about which one can draw general conclusions. In practice, every file system -- especially every network file system -- provides subtly different semantics. But if I *had* to make a general statement, I'd say that even with full data journaling you cannot trust that writes will be replayed in full or in order without explicit fdatasync()ing, as sqlite does.

There are many ways in which the ordering can be disrupted. Most file systems guarantee atomicity of writes only up to a relatively small block size (4k is popular). If you write multiple blocks, even in a single system call, it's usually possible for one to succeed and another to fail.

Arguably more importantly, there's an OS page cache sitting between your application (sqlite) and the file system. Unless you disable that cache -- the equivalent of doing an fdatasync() after every operation anyway -- or you have an exceptionally clever file system, the OS will combine separate writes to the same page before they hit the disk.

The purpose of data journaling for most file systems isn't to provide strict atomicity OR ordering of application writes. It's to prevent things like an unallocated data block (containing whatever random rubbish was already at that point on the disk platter) appearing in your file if a power loss or system crash occurs shortly after a write().

Consider the case where:

- sqlite transaction A modifies blocks 1 and 2 of the file
- sqlite transaction B modifies blocks 2 and 3 of the file
- you kick out the power cable

Without an intermediate fdatasync(), it's entirely possible with most file systems that:

- blocks 1 and 2 are written, containing the changes from both transactions, but not block 3
- blocks 2 and 3 are written, containing the changes from both transactions, but not block 1
- blocks 1 and 3 are written, but not block 2
- all three blocks are written, but block 2 alone is missing the modification from transaction B

Any of those scenarios results in a corrupted database.
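To make that concrete, here's a rough sketch of the barrier I keep coming back to -- this is not sqlite's actual code, just the pattern, and the file name, offsets and 4k block size are all invented for illustration:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char block[4096];
        /* "example.db" and the offsets below are made up for illustration. */
        int fd = open("example.db", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Transaction A modifies blocks 1 and 2 (offsets 0 and 4096). */
        memset(block, 'A', sizeof block);
        if (pwrite(fd, block, sizeof block, 0) < 0 ||
            pwrite(fd, block, sizeof block, 4096) < 0) {
            perror("pwrite"); return 1;
        }

        /* The barrier: A's blocks must reach stable storage before we touch
         * anything B cares about.  Without this, the page cache is free to
         * flush A's and B's pages in whatever order it likes. */
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        /* Transaction B modifies blocks 2 and 3 (offsets 4096 and 8192). */
        memset(block, 'B', sizeof block);
        if (pwrite(fd, block, sizeof block, 4096) < 0 ||
            pwrite(fd, block, sizeof block, 8192) < 0) {
            perror("pwrite"); return 1;
        }
        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

        close(fd);
        return 0;
    }

Note that the fdatasync() only buys you ordering: A's own blocks can still land partially if the power goes before the call returns, which is why sqlite pairs it with a rollback journal. But at least B's changes can never reach the disk ahead of A's.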
Maybe you're using a file system that protects against these things and guarantees strict ordering. Are you sure it does that even in the face of a power outage? Of a disk failure? Of a disk failure that occurs while recovering from a power failure? Of the failure of one of the DIMMs in its cache? For 6 years I worked with clients who had a nearly unlimited budget to throw at hardware, and none of those systems provided those guarantees unless you disabled the caches (at which point you might as well let sqlite call fsync). Are you willing to bet that yours does?

Cheers,
-p

On 28 January 2013 18:57, Shuki Sasson <gur.mons...@gmail.com> wrote:
>
> A *physical journal* logs an advance copy of every block that will later
> be written to the main file system. If there is a crash when the main
> file system is being written to, the write can simply be replayed to
> completion when the file system is next mounted. If there is a crash
> when the write is being logged to the journal, the partial write will
> have a missing or mismatched checksum and can be ignored at next mount.
>
> Physical journals impose a significant performance penalty because every
> changed block must be committed *twice* to storage, but may be acceptable
> when *absolute fault protection is required.*