Trac (http://projects.edgewall.com/trac) is a VCS (Subversion) + Wiki + 
Ticketing system, built on top of a SQLite db.  It has a fair number of 
installation prerequisites, but it has a very clean interface and tries to 
adhere to KISS for its feature set.

It's of course not usable if you're still debating your VCS; Trac (mostly) 
forces you into SVN.  There is currently work to support multiple VCS 
backends.  Monotone is not on the list of supported systems yet... but there 
is a ticket to have it added:

http://projects.edgewall.com/trac/ticket/1492

--paul


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 07, 2006 7:18 AM
To: sqlite-users@sqlite.org
Cc: monotone-devel@nongnu.org
Subject: Re: [sqlite] disk locality (and delta storage)

Nathaniel Smith <[EMAIL PROTECTED]> wrote:
> 
> So rows are basically written to the file in the same order that 
> the INSERT statements are executed?

Right.  If there are no free pages in the database file (which is the usual 
case for Monotone, I expect) then new pages are allocated from the end of the 
file.  If the INSERT is small and will fit on an existing page, then no new 
page allocations are required and the data gets inserted in exactly the right 
spot.
But when inserting large blobs, as Monotone does, you will typically require 
at least one new page, and that page will come at the end of the file.

> 
> Oh, and should I assume that individual row cells are kept together on 
> disk, even if they are (much) larger than a db block?  I assume so, 
> but just want to make sure...

If a table row is too big to fit on a page, then the excess spills onto a 
linked list of overflow pages.  SQLite tries to allocate the base page and the 
overflow pages near each other and in order.

> 
> > After you VACUUM, everything will be on disk in row order.  If
> 
> I assume this means "sorted by primary key"?  (And with tables in some 
> random order relative to each other, but I don't think I care about 
> that at all.)

Tables are always sorted by rowid - which is the same as the INTEGER PRIMARY 
KEY if you have one.

The "true" primary key for every SQLite table is the rowid.  If you specify a 
primary key that is not of type INTEGER, then what SQLite does really is create 
a UNIQUE index on that field.  There is still a rowid which is the "true" 
primary key in the sense that the table is stored in rowid order.
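
For instance (a minimal sketch; the table name is my own), you can see the 
hidden rowid directly in the sqlite3 shell:

   CREATE TABLE demo(id TEXT PRIMARY KEY, data BLOB);
   INSERT INTO demo VALUES('abc123', x'00');
   -- the TEXT primary key became a UNIQUE index; the hidden rowid
   -- is still the key the table btree is sorted by
   SELECT rowid, id FROM demo;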

> 
> > you see a big performance improvement after VACUUMing, then the disk 
> > layout is perhaps an optimization worth looking into.  If however 
> > (as I suspect) your performance is similar after vacuuming, then 
> > changing the way information is added to the disk probably will not 
> > help, since after a VACUUM the information is pretty much optimally 
> > ordered for minimum seeking.
> 
> I think you left out the end of the sentence, "...assuming an in-order 
> access pattern".

You have 64 bits of rowid space.  You could assign rowids to deltas in 
clumps.  Whenever you encounter a new file, assign it a block of (say) a 
billion rowids.  Each delta to that file goes into successive rowids.  Since 
the table is stored in rowid order, all deltas for a particular file are 
therefore close to each other in the table.  This does not guarantee that the 
btree will be laid out on disk in order - it probably will not be unless you 
run a VACUUM - but it will help.  And I suspect it will help a lot.
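
Concretely, the clumping idea might look like this (a sketch; the table and 
column names and the block width are mine):

   CREATE TABLE deltas(localid INTEGER PRIMARY KEY, delta BLOB NOT NULL);
   -- file number 7 gets the block starting at 7*1e9:
   INSERT INTO deltas VALUES(7*1000000000 + 0, x'00');  -- first delta
   INSERT INTO deltas VALUES(7*1000000000 + 1, x'00');  -- next delta
   -- file number 8's deltas live in the next block:
   INSERT INTO deltas VALUES(8*1000000000 + 0, x'00');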


> 
> Unless you just mean, during the btree traversals involved in each key 
> lookup?  Man, there's a whole 'nother part I don't know much about 
> :-).  I guess walking the btrees can obviously be another source of 
> disk latency; I'm not sure whether I should worry about this or not.

The fanout on tables is typically large - about 50 to 75.  Even more if you 
select a larger page size.  Fanout on indices is much smaller, 10 or 20, 
because index keys are typically larger than the integer rowid keys of tables.  
So to reduce your disk latency, you want to try to always search by rowid.

Something you should experiment with, by the way, is increasing the page size 
so that more records fit on one page and you get larger fanout.  Do you get 
better performance if you rebuild your database with say a 16K or 32K page size 
instead of the default 1K?
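
For example, a sketch of the rebuild using the sqlite3 shell (file names are 
illustrative; first run "sqlite3 monotone.db .dump > dump.sql", then in a 
fresh database):

   PRAGMA page_size = 32768;  -- must be set before any table is created
   .read dump.sql             -- reload the dump; all pages are now 32K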

> If I do an INSERT of a row that has some indexes on it, where do those 
> index entries get written?  Next to the actual row data, at the end of 
> the file?  (Assuming there are no free blocks earlier in the file.) 
> And then at VACUUM time each index gets grouped into one spot on disk?

Indices are stored in completely separate btrees from the tables.  An index has 
key only, and the key is the fields being indexed followed by the rowid.  So 
to lookup a record by index, you first do a search of the index btree to find 
the entry with the matching fields.
Then you pull the rowid off of the end of the index entry and use that rowid to 
do a separate search in the table btree to get your actual data.  So an index 
search actually does two binary lookups - one on the index and another on the 
table.
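
Schematically, with your file_deltas table (a sketch of what happens under 
the hood, not extra SQL you need to write):

   -- one query, two binary searches internally:
   SELECT delta FROM file_deltas WHERE id = :id AND base = :base;
   -- 1. search the UNIQUE(id, base) index btree for (:id, :base),
   --    which yields a rowid
   -- 2. search the table btree by that rowid to fetch the delta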

> 
> I was actually thinking more about the cost of looking up many items 
> from a table.  Here, unfortunately, our current access pattern is 
> quite pessimal.  The current schema is:
> 
> CREATE TABLE files (id primary key, data not null);
> 
> 'id' is the SHA1 hash of the file; 'data' is a compressed raw file.
> 
> CREATE TABLE file_deltas
>   (id not null, base not null, delta not null,
>    unique(id, base)
>   );
> 
> 'id' is the SHA1 of the file this delta lets us construct, 'base' is 
> the SHA1 of the file that the delta is against, and 'delta' is the 
> compressed xdelta.
> 
> So, when we traverse delta chains, we go wandering all over this table 
> indexing by the SHA1 of intermediate versions.  Our access isn't just 
> random, it's _cryptographically strongly_ random! :-)

I would consider redoing the table like this:

   CREATE TABLE files(
     localid INTEGER PRIMARY KEY,
     id TEXT UNIQUE,
     data BLOB NOT NULL
   );

The files.localid field is the 64-bit integer rowid.  Lookups by rowid are at 
least twice as fast as lookups by files.id because you only have to do a single 
binary search.  Use the localid in the file_deltas table.  Then, if you also 
use the trick of assigning contiguous blocks of localids to each file, all 
information about a file will be contiguous in the btree - and hopefully 
closer to contiguous on disk.  (Fully contiguous after a VACUUM.)  The name 
"localid" is intended to convey that this ID is used only in the local 
repository.  Separate repositories to which you might push or pull have their 
own localids, which are likely different from yours.  The SHA1 hash id is 
universal; the localid is not.
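
And file_deltas might then become (a sketch; I'm guessing at your intent for 
the column meanings):

   CREATE TABLE file_deltas(
     localid INTEGER NOT NULL,  -- localid of the version this delta rebuilds
     base INTEGER NOT NULL,     -- localid of the version the delta is against
     delta BLOB NOT NULL,
     UNIQUE(localid, base)
   );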

> 
> > Let me propose a radical solution:  I've been experimenting with 
> > adding a VCS component to CVSTrac (http://www.cvstrac.org/) to 
> > replace CVS and thus provide a complete project management system in 
> > a standalone CGI program.  My latest thinking on this (backed up by 
> > experiment) is to
> 
> Entering the VCS game?  Good luck!  It's an interesting (and 
> surprisingly deep) field.
> 
> (Total tangent: I basically know how to make monotone work over a CGI 
> transport; it's some work, but doable, just no-one's picked it up yet.
> It might be worth considering such a solution before trying to build a 
> new system from scratch.  The basic trade-off would be a CGI script 
> plus a statically linked binary instead of just a CGI script, but on 
> the other hand, building Yet Another VCS from scratch is a significant 
> undertaking.  The detailed trade-off would of course be more 
> complicated :-).  Something to throw out there in case it leads to
> discussion...)

What I'm looking for is a VCS + bug-tracker + wiki that I can just drop into a 
cgi-bin of some low cost web hosting site (like Hurricane Electric, which I 
notice we both use) and have a ready-to-roll project management system with no 
setup or other details to worry about.  Kind of a sourceforge-in-a-box, only a 
lot better than sourceforge.  I'm looking for something like monotone with 
CVSTrac enhancements that works over HTTP.  Nothing like that currently exists, 
I'm afraid, so I've been working on my own.

Yes, it is a big job, though perhaps not any bigger than writing an SQL 
database engine ;-)

Monotone is my favorite of the current crop of VCSes by the way.
It has 80% of what I am looking for.  What I'm really after is a better 
monotone.  I have been greatly inspired by your work, as have others, I notice.

> 
> While size is definitely not everything -- I doubt anyone notices 10% 
> here or there -- a factor of 2-3x is probably going to be hard to 
> sell.  Unfortunate, since it's a nice scheme.

The CVS repository for SQLite is 15MiB.  Under a baseline+delta schema it would 
grow to (perhaps) 45MiB.  The cheapest account on Hurricane Electric includes 
1GiB of storage.  Why should I care about the extra 30MiB?

> 
> So, this was a bit of a brain dump, but I hope it helps clarify some 
> of what we're thinking about...
> 

Very helpful.  I am keen to help you get monotone working better.  Let me know 
if there is anything I can do to help or if I can answer any additional 
questions for you.
--
D. Richard Hipp   <[EMAIL PROTECTED]>
