On 11/29/2010 01:05 PM, Markus Wanner wrote:

Each revision typically has 4 certs (author, branch, date, changelog), so the
old index on only rev_id should usually give 5 lookups. Indexing on (rev_id,
name, value) should usually give 2 lookups, and indexing on everything gives
1 lookup.

..at the cost of increasing the size of the index. Which isn't just disk
space, but also costs memory space, bandwidth and affects CPU caching.
These certainly need to be taken into account as well.

All together, forcing it to use a (rev_id, name, value) index was maybe half a second slower than it using a index with everything it needs. It probably helps that it only has to load index data instead of index+table (so one copy of each relevant field, from the index, rather than one copy of each field from the table and a second copy of some from the index).

... strace -c shows 534 calls to read() and 150 to lseek() with everything in
the index, vs 2050 calls to read() and 1666 to lseek() with only (rev_id, name,
value). That doesn't make sense, it shouldn't be more than about twice as much.

Maybe because there are more tuples per block, so you need to read fewer
blocks overall.

On a ~ 380 MiB test database, I measured about 6.1 MiB for the index on
(revision_id, name, value, keypair_id, signature), while the one on just
(revision_id, name, value) weighted 2.2 MiB. So a single block covers
almost 3x as many tuples in the later case.

I think that was actually seeing it using the wrong index and read a quarter of the table.

sqlite>  .schema revision_certs
CREATE TABLE revision_certs
        (
        hash not null unique,   -- hash of remaining fields separated by ":"
        revision_id not null,   -- joins with revisions.id
        name not null,          -- opaque string chosen by user
        value not null,         -- opaque blob
        keypair_id not null,    -- joins with public_keys.id
        signature not null,     -- RSA/SHA1 signature of "[n...@id:val]"
        unique(name, value, revision_id, keypair_id, signature)
        );
CREATE INDEX revision_certs__revnameval ON revision_certs (revision_id,
        name, value, keypair_id, signature);
sqlite>  explain query plan select revision_id, name, value, keypair_id, 
signature from revision_certs where revision_id = 'abc' and name = 'branch';
0|0|TABLE revision_certs WITH INDEX revision_certs__revnameval

sqlite>  explain query plan select revision_id, name, value, keypair_id, 
signature from revision_certs where revision_id = 'abc' and name = 'branch';
0|0|TABLE revision_certs WITH INDEX sqlite_autoindex_revision_certs_2

I cannot reproduce that behavior. In both cases, I see the manually
created index being used - except if I remove all indexes and only keep
the uniqueness constraint. Then it uses the underlying index for that.
That's with sqlite 3.7.3, always using ANALYZE before EXPLAIN.

Monotone never calls ANALYZE. Eventually I suppose we might want to put a rule in place to do this occasionally (every time the revision count has doubled?). Or maybe there aren't enough tables with multiple usable indexes that it would matter much; if revision_certs is the only example...

Looking at the schema it might make sense to simply rearrange the
uniqueness constraint, because the name attribute isn't very selective
(i.e. most tuples have one of few often used values). Instead, we should
put the revision_id at the first place, as that's way more selective.

I don't even think we need an additional index on (name, ...) anymore.
So let's drop that additional index for space and speed!

Some things (such as netsync and the b: selector) need to list all revisions in a particular branch, those look up on (name, value). So we still want an index with those leading.

Using the INDEXED BY with an index on (rev_id, name, value) takes 2.5s for 'mtn 
st',
vs... now I'm seeing 2.0s instead of the 1.1s I got yesterday.

I'd agree with the warning, that 'hinting' the database is messy and not
something we should need (nor want) to deal with.

...time for 'mtn st' varies drastically depending on exactly which revision the
workspace is at (even with inodeprints on and only a small change in the number 
of
files), but INDEXED BY with the 3-field index seems to be consistently .3s - .5s
slower than with everything in the index and the workspace on the same revision.

I bet that's because of the low selectivity of the name attribute using
the uniqueness index.

No, that's using the INDEXED BY requirement to make it use a (rev_id, name, value) index. If it hits the low-selectivity (name, ...) index it takes ~10s.


--
Timothy

Free public monotone hosting: http://mtn-host.prjek.net

_______________________________________________
Monotone-devel mailing list
Monotone-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/monotone-devel

Reply via email to