On Sun, 2005-05-15 at 19:48 -0700, Dan Kennedy wrote:
> Anyone have a quick summary of InnoDB algorithm for comparison? I
> suspect whatever trick they use won't work with SQLite because
> there is no server process, but it could be interesting anyway.
There are several ways the extra fsync() might be avoided:

1. Use an internal journal, plus an operating-system guarantee that an
earlier write() reaches the disk before a later write() at fsync()-time.
I do not know of any operating system that makes this guarantee (and
keeps it!).

2. Write changes to "changeblobs" instead of journals, and have SQLite
logically merge the changeblobs when loading, so the original database
file is modified less frequently. However, because multiple files have
to be searched, this slows reads down quite a bit (unless the index were
redesigned). It's a tradeoff: writes _might_ be faster, but reads
certainly aren't. Interestingly enough, Lucene does this. In fact, many
real databases do it, because it's very easy and lets you use
less-mutable structures in the database, which often simplifies the
implementation. It wouldn't require database-structure changes, but it
would mean the database has to be recovered/vacuumed before it is a
single file again, and thus usable by older SQLites. On the other hand,
hooking into the changeblob system makes single-master replication very
simple and efficient, without mucking about in the users' code. (A
sketch of the merge-on-read lookup follows after this list.)

3. Block-level versioning: use a buddy system to keep two copies of each
block. A change to a block is written to the new spot. Each block
carries a chain pointer to the next "changed block" known about (for
recovery), and a block bitmap is used to mask out the unused blocks.
During recovery, the last changed block (recorded in the header) is
scanned and each change in that set is checked against the bitmap. Space
is traded away; speed stays about the same, although because the file is
bigger, the kernel would need some hints to know when a block can be
evicted safely. This is similar to the problem faced by systems like UML
and might require kernel patches to avoid hurting the VMM. If you allow
(say) 8 copies of each block, you can detect very easily when two
simultaneous writers are about to trample over each other, although you
limit yourself to 4 usable copies. Generalizing to "n copies" of each
block is not a simple modification of this, because you would then need
a serialized index of how many copies each block has. (A sketch of the
two-copy scheme follows below.)
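To make 2 concrete, here is a minimal sketch of what the "logical merge
when loading" would look like on the read path. This is plain C and
nothing like the real pager code; Changeblob, xGetPage and xBaseRead are
all made-up names. The point is just that every page read has to walk
the changeblob list, newest to oldest, before falling back to the base
file, which is exactly where the read penalty comes from:

    typedef struct Changeblob Changeblob;
    struct Changeblob {
      Changeblob *pNext;     /* next-older changeblob; NULL for the oldest */
      /* return 1 and fill 'out' with one page if this blob holds pgno */
      int (*xGetPage)(Changeblob*, int pgno, unsigned char *out);
    };

    /* Read page pgno: consult changeblobs newest-to-oldest, then the
    ** unmodified base database file. */
    int readPage(Changeblob *pNewest, int pgno,
                 int (*xBaseRead)(int pgno, unsigned char *out),
                 unsigned char *out){
      Changeblob *p;
      for(p=pNewest; p; p=p->pNext){
        if( p->xGetPage(p, pgno, out) ) return 0;  /* newest version wins */
      }
      return xBaseRead(pgno, out);                 /* page never changed */
    }

That per-read walk is what a redesigned index would be trying to avoid.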
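And a sketch of the two-copy block idea from 3, again with made-up
structures rather than anything in SQLite, and leaving out the
changed-block chain and bitmap for brevity: each logical block owns two
physical slots, a writer always overwrites the *older* slot, and a
reader (or recovery) picks whichever slot holds the newest intact copy,
so a torn write can never destroy the last good version:

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 1024

    typedef struct {
      uint64_t seq;               /* write counter; 0 = never written */
      uint32_t cksum;             /* lets a torn slot be spotted */
      unsigned char data[BLOCK_SIZE];
    } Slot;

    typedef struct {
      Slot slot[2];               /* the two buddy copies of this block */
    } BuddyBlock;

    static uint32_t cksum(const unsigned char *p, int n){
      uint32_t s = 0;
      while( n-- ) s = s*31 + *p++;
      return s;
    }

    static int slotOk(const Slot *s){
      return s->seq!=0 && s->cksum==cksum(s->data, BLOCK_SIZE);
    }

    /* Return the slot with the newest intact version, or -1 if none. */
    int currentSlot(const BuddyBlock *b){
      int ok0 = slotOk(&b->slot[0]), ok1 = slotOk(&b->slot[1]);
      if( ok0 && ok1 ) return b->slot[0].seq > b->slot[1].seq ? 0 : 1;
      if( ok0 ) return 0;
      if( ok1 ) return 1;
      return -1;
    }

    /* Write new content into the older slot, leaving the newest intact. */
    void writeBlock(BuddyBlock *b, const unsigned char *data){
      int cur = currentSlot(b);
      Slot *s = &b->slot[ cur<0 ? 0 : 1-cur ];
      memcpy(s->data, data, BLOCK_SIZE);
      s->cksum = cksum(s->data, BLOCK_SIZE);
      s->seq = (cur<0 ? 0 : b->slot[cur].seq) + 1;
    }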