I was under the impression that setting PRAGMA synchronous=NORMAL also removed the fsync() at step 6. I'm pretty sure when we run with synchronous=NORMAL we see only one fsync() call per COMMIT operation, not 2 (and definitely not 3 as we see with synchronous=FULL).
--Bob -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of D. Richard Hipp Sent: Thursday, May 22, 2008 6:06 PM To: General Discussion of SQLite Database Subject: Re: [sqlite] Sqlite 3.5.9 journal_mode vs ext3 journaling On May 22, 2008, at 6:50 PM, Bob Ebert wrote: > I wonder if anyone is in a position to compare and contrast the > journal_mode and synchronous settings with the various ext3 journal > modes? > > Up until now we've been using ext3 with data=ordered, and sqlite3 with > synchronous=normal, journal_mode=delete. We're on an embedded system > with a very high frequency of random power cycling. In this mode, > we've seen an unusually large number of corrupted databases. > > My best theory right now is that we lose power after the ext3 metadata > has written the delete of the -journal file, but before all of the > page overwrites for the db file are fully flushed, since these are > done by two different processes in normal linux. SQLite calls fsync() before it calls unlink() on the journal file. So if fsync() is working as documented, all database data should be safely on oxide prior to the unlink(). We have seen instances before where fsync() returned long before the data was on oxide, so I would not be surprised by this. You can often see this yourself by mounting a flash memory stick, writing a file on the stick, calling fsync() then watching the LED on the end of the stick continue to flash long after the fsync() has returned. Please recognize that there really is nothing that SQLite can do to correct this problem. SQLite depends on the operating system and/or disk controller living up to its end of the contract. If the disk says all data is on oxide, SQLite has no choice but to believe it because SQLite has no way to independently verify the matter. And if the disk controller and/or operating system is lying, and a power failure follows, the database can be corrupted. There isn't much you can do when your hardware starts lying to you. > I believe if we lose power at > this point, then after a restart ext3 will replay the journal delete, > but will leave the main DB in a semi-written state, and thus corrupt > the DB. > > I'm wondering what impact switching to journal_mode=persist will have > on this scenario. I believe this will change how the last step of the > atomic commit (clearing the journal) is written to disk, and thus > changes how ext3 will recover the file after a power loss. Currently > our IO scheduler doesn't guarantee ordered writes, so in theory the > journal header clearing could still make it to disk before all the > main db pages. We're in a position to adjust the IO scheduler if > necessary to prevent this. Once again, SQLite calls fsync() on the database prior to calling write() to clear the journal header. So *if* fsync is doing its job, you should have no problems. *If*. Note the the whole point of these fsync() calls in SQLite is to act as an I/O barrier operation - to guarantee that all I/O operations that are issued prior to the fsync() complete prior to any I/O operations issued afterwards. We must have an I/O barrier in order to preserve database integrity across a power failure. If your fsync() is not work quite right, then all bets are off. I don't know of anything SQLite can do to make the situation better. > > > Are there any other potential holes or races between ext3 journal data > and sqlite file contents that I should worry about? Is > synchronous=full the only way to guarantee atomic commits under these > conditions? > A synchronous=FULL commit goes like this: 1. Write all pages that will change into the journal 2. fsync() the journal 3. Overwrite the journal header to indicate that it is valid 4. fsync() the journal 5. Write changes into the database file. 6. fsync() the database files 7. Delete or truncate or overwrite the journal header (depending on journaling mode) The difference between FULL and NORMAL is that NORMAL omits the fsync() on step 2. That is the only difference. The step-2 fsync is important on some filesystems, but on ext3 it is probably unnecessary. So I don't think that going to synchronous=FULL is going to help you. D. Richard Hipp [EMAIL PROTECTED] _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users