On Sun, 2005-05-15 at 19:48 -0700, Dan Kennedy wrote:
> Anyone have a quick summary of InnoDB algorithm for comparison? I
> suspect whatever trick they use won't work with SQLite because
> there is no server process, but it could be interesting anyway.

There are several ways the extra fsync() might be avoided:

1. using an internal journal, and an operating system guarantee that an
earlier write() reaches the disk before a later write() at fsync()-time.
I do not know of any operating system that makes this guarantee [and
keeps it!].
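
just to make (1) concrete: if that ordering guarantee did exist, the write
path could look roughly like the sketch below, with the journal image and
the page write sharing a single fsync(). the layout, offsets and names are
all made up for illustration.

#define _XOPEN_SOURCE 700          /* for pwrite() */
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE   1024
#define JOURNAL_OFF 0                        /* internal journal region at the head of the file */
#define DATA_OFF    (64*(off_t)PAGE_SIZE)    /* where the real pages live (simplistic) */

/* journal the old image of a page, then overwrite the page in place.
** a real journal record would also have to store the page number etc. */
int write_page(int fd, int iSlot, int pgno, const char *aOld, const char *aNew){
  if( pwrite(fd, aOld, PAGE_SIZE, JOURNAL_OFF + (off_t)iSlot*PAGE_SIZE)!=PAGE_SIZE ) return -1;
  if( pwrite(fd, aNew, PAGE_SIZE, DATA_OFF + (off_t)pgno*PAGE_SIZE)!=PAGE_SIZE ) return -1;
  return 0;
}

/* commit: one fsync() for the whole transaction.  this is only safe if the
** kernel flushes the earlier journal writes before the later page writes,
** which is exactly the guarantee that does not exist in practice. */
int commit_txn(int fd){
  return fsync(fd);
}

i.e. journal-write / page-write / one fsync(), instead of
journal-write / fsync / page-write / fsync.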

2. write changes to "changeblobs" instead of journals and have sqlite
logically merge the changeblobs when loading (thus, the original db is
modified less frequently).

However, because multiple files have to be searched, this slows down
reads quite a bit (unless the index were redesigned). This is a
tradeoff: writes _might_ be faster, but reads certainly aren't. (a rough
sketch of such a merged read is at the end of this item.)

interestingly enough, Lucene does this. actually, many real databases do
this because it's very easy and allows you to use structures that are
less mutable in the database, which often simplifies the implementation.

this wouldn't require database-structure changes (actually), but it
would mean that the database would have to be recovered/vacuumed back
into a single file before it would be usable on older versions of
sqlite...

... however, hooking into the changeblob system makes single-master
replication very simple and efficient (without mucking about in the
users' code)
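
here's a rough sketch of the merged read path (and the replication hook)
for (2). the structures and names are invented for illustration; this is
not how sqlite actually stores anything.

#include <string.h>

#define PAGE_SIZE 1024

typedef struct ChangeBlob ChangeBlob;
struct ChangeBlob {
  int nRec;                 /* number of changed pages in this blob */
  int *aPgno;               /* page number of each record */
  unsigned char *aData;     /* nRec page images, back to back */
  ChangeBlob *pOlder;       /* next-older changeblob, or NULL */
};

/* read page pgno: search the changeblobs newest-to-oldest, then fall back
** to the (rarely rewritten) base db image. */
void read_page(ChangeBlob *pNewest, const unsigned char *aBaseDb,
               int pgno, unsigned char *aOut){
  ChangeBlob *p;
  int i;
  for(p=pNewest; p; p=p->pOlder){
    for(i=p->nRec-1; i>=0; i--){          /* later records win within a blob */
      if( p->aPgno[i]==pgno ){
        memcpy(aOut, &p->aData[(size_t)i*PAGE_SIZE], PAGE_SIZE);
        return;
      }
    }
  }
  memcpy(aOut, &aBaseDb[(size_t)pgno*PAGE_SIZE], PAGE_SIZE);
}

/* the replication hook: a replica just receives whole changeblobs from the
** master and pushes them onto the front of its own list. */
void apply_changeblob(ChangeBlob **ppNewest, ChangeBlob *pReceived){
  pReceived->pOlder = *ppNewest;
  *ppNewest = pReceived;
}

the loops in read_page() are exactly where the read slowdown comes from:
every page fetch has to consult every outstanding changeblob unless a
separate index is kept.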


3. block-level versioning. use a buddy system to keep 2 copies of each
block. changes to a block are written out to the currently-unused copy.
each block would have a chain to the next "changed block" known about
[for recovery], and a block bitmap would be used to mask out the unused
copies. (a rough sketch of the write path is at the end of this item.)

during recovery, the chain starting from the last changed block
(recorded in the header) is scanned and each change in that set is
checked against the bitmap.

in this way, space is traded away; speed remains "about the same",
although because the file is bigger, the kernel would need to be given
hints to know when a block could be evicted safely. This is similar to
the problem faced by systems like UML (User-Mode Linux) and might
require kernel patches to avoid hurting the VMM.

if you allow (say) 8 copies of each block, you can detect when two
simultaneous writers are about to trample over each other VERY easily,
although each writer is then limited to 4 copies.

generalizing this to "n copies" of each block is NOT just a small
modification, though, as you then have to keep a serialized index of how
many copies of each (ahem) block there are.
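
for what it's worth, the 2-copy buddy write in (3) would look roughly like
this. the layout, sizes and names are all invented, and the "changed block"
chain and header handling are only hinted at in comments.

#define _XOPEN_SOURCE 700          /* for pwrite() */
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 1024
#define N_COPIES  2

/* one bit per block: which of its two buddy copies currently holds the
** live data.  the on-disk copy of this bitmap is what masks out the
** unused copies. */
static unsigned char aBitmap[4096];

static int live_copy(int blk){ return (aBitmap[blk/8] >> (blk%8)) & 1; }
static void flip_copy(int blk){ aBitmap[blk/8] ^= (unsigned char)(1 << (blk%8)); }

/* write a new version of block blk into its currently-unused buddy slot,
** then flip the in-memory bitmap bit.  the old copy is never overwritten,
** so a crash before the bitmap/chain/header reach disk just leaves the
** previous version live. */
int write_block(int fd, int blk, const unsigned char *aNew){
  int dest = !live_copy(blk);                       /* the spare slot */
  off_t off = (off_t)(blk*N_COPIES + dest) * PAGE_SIZE;
  if( pwrite(fd, aNew, PAGE_SIZE, off)!=PAGE_SIZE ) return -1;
  flip_copy(blk);
  /* a real version would now append blk to the "changed block" chain and,
  ** at commit, write the bitmap + header out durably; presumably that
  ** header write is the commit point, so the separate journal file and its
  ** extra fsync() go away. */
  return 0;
}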
