Re: [sqlite] a string indexing experiment

drh Wed, 12 Apr 2006 17:49:02 -0700

Arun Bhalla <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I performed a quick benchmark of three different string indexing schemes 
> for SQLite3.
> 
>   * Scheme 0 = indexing on the string field
>   * Scheme 1 = indexing on the MD5 sum (as text in hexadecimal 
> representation) of the string
>   * Scheme 2 = indexing on the high 64 bits of the MD5 sum (as int) of 
> the string
> 
> I varied string size and number of strings and evaluated the schemes on 
> database size and a couple insertion and retrieval tests each.  In 
> general, scheme 2 was quite effective for most cases.  Scheme 0 was the 
> best all-around for short strings (16 bytes or less), but in most cases, 
> scheme 2 was not far behind.  When working with larger strings, scheme 2 
> would dominate, with scheme 1 generally having similar performance. 
> SQLite's indexing mechanism (scheme 0) did not scale well in size or 
> performance for large strings.


No it does not.  When I was originally designing the B-Tree logic
for SQLite (version 2.0 back in the spring of 2001) I looked at as
many real-world databases as I could find and I found that fields
that are indexed were always small.  So I designed the indexing 
mechanism in SQLite for the common case - for small index fields.
When the total size of an index entry gets larger than 240 bytes
or so (I forget the exact cutoff) parts of the index entry spill
into overflow pages and performance begins to decline.   Everything
still works right (in constrast to some other database engines
where there are hard limits on the size of index entries) but
there is a performance hit.  On the other hand, it is very, very
unusual to see an index entry that is larger than 50 or 60 bytes.

For those rare occasions when you need a large field to be
indexed, you can use a hashing scheme such as you describe in
(1) or (2) if efficiency is a concern.

> 
> Strangely, scheme 1 always outperformed scheme 2 in the pure retrieval 
> test, but even when scheme 1 was 50-200% faster than scheme 2, scheme 0 
> was an order of magnitude slower.  Scheme 2 was faster for unique 
> insertion, though, so either scheme 1 or scheme 2 could be useful 
> depending upon the usage model.
> 
> Is there a good reason why retrieval with scheme 1 would be faster than 
> retrieval with scheme 2?  Scheme 1 involves an index on 32 bytes while 
> scheme 2 involves an index on 8 bytes.  I would think that scheme 2 
> would always be faster than scheme 1 because fewer bytes are involved. 
> Does SQLite index INT fields differently than TEXT fields?
> 

SQLite stores TEXT fields as a sequence of bytes (just as you would
expect) and unless you specify an alternative collating sequence, it
compares the strings using memcmp(), which is fast.  64-bit integers
are store in 8 bytes, big-endian.  When comparing two integer values,
SQLite has to extract those 8 bytes into a machine-byte-order 64-bit
integer then subtract the two 64-bit integers to get the comparison
result.  Maybe converting big-endian to little-endian (since you are 
running on Intel) is more expensive than a memcmp() on 32-byte 
strings, especially when you consider that memcmp() can quit as soon
as it finds a difference and probably never gets much beyond the 
4th or 5th byte.
--
D. Richard Hipp   <[EMAIL PROTECTED]>

Re: [sqlite] a string indexing experiment

Reply via email to