Arun Bhalla <[EMAIL PROTECTED]> wrote: > Hi, > > I performed a quick benchmark of three different string indexing schemes > for SQLite3. > > * Scheme 0 = indexing on the string field > * Scheme 1 = indexing on the MD5 sum (as text in hexadecimal > representation) of the string > * Scheme 2 = indexing on the high 64 bits of the MD5 sum (as int) of > the string > > I varied string size and number of strings and evaluated the schemes on > database size and a couple insertion and retrieval tests each. In > general, scheme 2 was quite effective for most cases. Scheme 0 was the > best all-around for short strings (16 bytes or less), but in most cases, > scheme 2 was not far behind. When working with larger strings, scheme 2 > would dominate, with scheme 1 generally having similar performance. > SQLite's indexing mechanism (scheme 0) did not scale well in size or > performance for large strings.
No it does not. When I was originally designing the B-Tree logic for SQLite (version 2.0 back in the spring of 2001) I looked at as many real-world databases as I could find and I found that fields that are indexed were always small. So I designed the indexing mechanism in SQLite for the common case - for small index fields. When the total size of an index entry gets larger than 240 bytes or so (I forget the exact cutoff) parts of the index entry spill into overflow pages and performance begins to decline. Everything still works right (in constrast to some other database engines where there are hard limits on the size of index entries) but there is a performance hit. On the other hand, it is very, very unusual to see an index entry that is larger than 50 or 60 bytes. For those rare occasions when you need a large field to be indexed, you can use a hashing scheme such as you describe in (1) or (2) if efficiency is a concern. > > Strangely, scheme 1 always outperformed scheme 2 in the pure retrieval > test, but even when scheme 1 was 50-200% faster than scheme 2, scheme 0 > was an order of magnitude slower. Scheme 2 was faster for unique > insertion, though, so either scheme 1 or scheme 2 could be useful > depending upon the usage model. > > Is there a good reason why retrieval with scheme 1 would be faster than > retrieval with scheme 2? Scheme 1 involves an index on 32 bytes while > scheme 2 involves an index on 8 bytes. I would think that scheme 2 > would always be faster than scheme 1 because fewer bytes are involved. > Does SQLite index INT fields differently than TEXT fields? > SQLite stores TEXT fields as a sequence of bytes (just as you would expect) and unless you specify an alternative collating sequence, it compares the strings using memcmp(), which is fast. 64-bit integers are store in 8 bytes, big-endian. When comparing two integer values, SQLite has to extract those 8 bytes into a machine-byte-order 64-bit integer then subtract the two 64-bit integers to get the comparison result. Maybe converting big-endian to little-endian (since you are running on Intel) is more expensive than a memcmp() on 32-byte strings, especially when you consider that memcmp() can quit as soon as it finds a difference and probably never gets much beyond the 4th or 5th byte. -- D. Richard Hipp <[EMAIL PROTECTED]>