Re: [sqlite] Does disabling synchronous AND shared cache cause "Error: database disk image is malformed"?

David Barrett Tue, 07 Feb 2012 16:59:51 -0800

On 02/07/2012 04:05 PM, Richard Hipp wrote:

This tells me that the error is occurring at
http://www.sqlite.org/src/artifact/5047fb303cdf6?ln=1362 which occurs right
as SQLite is first starting to decode a page that it as loaded from the
disk.  The error indicates that the shell really is seeing a malformed
database file.


Can you tell me more about your "custom distributed transaction layer"?
Might that have something to do with this?  Are you using a custom VFS?
Are you bypassing the built-in locking mechanisms of SQLite and doing some
kind of custom locking?  Are you running this on a network filesystem?

Our custom distributed transaction layer happens a higher layer -- itdoesn't require any modifications to sqlite itself, and uses thestandard sqlite API. So far as each individual sqlite database isconcerned, it should have no awareness of the existence of cloned copiesof itself on other servers.


(I'm not sure if it's helpful, but to give some more detail on this.)

Each server starts up and creates two sqlite databases: "main.db" and"journal.db". Journal contains a simple three column table: ID, query,and hash. ID is an auto-incrementing counter, query is the raw SQL of asingle read/write query (eg, ignoring SELECTs), and hash is anincremental SHA1 hash of all read/write SQL queries ever performed. Soeach time you do a query on "main.db", the same query is also logged in"journal.db".

We operate five realtime replicated servers spread across georedundantdatacenters, where any slave server can perform read operations on itslocal database, but all writes are escalated to the master. The masterprocesses one read/write query at a time, each time negotiating with amajority of slaves in a semi-synchronous two-phase commit manner (wherethe master commits once a majority of slaves approve).

That's all pretty easy and straightforward, but the tricky part is inbringing new nodes into the system (where the necessary queries arereplayed in order to rebuild the database), or handling master/slavefailovers (where the slaves elect a new master, synchronize, andcontinue -- without dropping any of the original queries).

The upshot of all of this is it maintains linear read-scalability andcontinuous uptime in a WAN-replicated manner, in the face of temporarynetwork hiccups or permanent hardware failure. (A lessor benefit is itallows seamless upgrading without needing to reconfigure anything, aseverything automatically detects when the node goes down and repairsitself when it comes back up.)

Anyway, it's neat stuff that I love to talk about -- we've consideredopen sourcing the whole thing as it's actually quite nice and powerful.Even MySQL replication only just got semi-synchronous replication, and*still* doesn't solve the failover cases. But I think all this is at ahigher level than is really relevant to this particular issue.

Or, perhaps you are running the command-line tool on a
different machine where it is not able to access the WAL's database-shm
file in shared memory. So the command-line tool reads a one page of the
database which indicates the the content it is seeking is found on some
other page X.  But by the time the command-line tool has loaded page X, the
server has already shifted the content to someplace else.  The page that
the command-line tool loaded is no longer formatted as the command-line
tool expects it to be, causing exactly the error shown above.

I think this is a very likely candidate. There is nothing particularlyfancy about the deployment: no VFS, no custom locking, etc. But we usesqlite3_enable_shared_cache() to disable the cache (even though it'sdisabled by default) -- combined with "PRAGMA synchronous=OFF" (and addin WAL just to make it more complex). Could that explain the behaviorwe're seeing?

Basically, I'm curious if it's a known issue that the combination of theabove settings creates a situation where the command-line client can'tbe relied upon -- and that the client will incorrectly report amalformed database.


Thanks again for all your help!

-david
Founder and CEO of Expensify
Follow us at http://twitter.com/expensify
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Does disabling synchronous AND shared cache cause "Error: database disk image is malformed"?

Reply via email to