Hi,
I performed a quick benchmark of three different string indexing schemes
for SQLite3.
* Scheme 0 = indexing on the string field
* Scheme 1 = indexing on the MD5 sum (as text in hexadecimal
representation) of the string
* Scheme 2 = indexing on the high 64 bits of the MD5 sum (as int) of
the string
I varied string size and number of strings and evaluated the schemes on
database size and a couple insertion and retrieval tests each. In
general, scheme 2 was quite effective for most cases. Scheme 0 was the
best all-around for short strings (16 bytes or less), but in most cases,
scheme 2 was not far behind. When working with larger strings, scheme 2
would dominate, with scheme 1 generally having similar performance.
SQLite's indexing mechanism (scheme 0) did not scale well in size or
performance for large strings.
Strangely, scheme 1 always outperformed scheme 2 in the pure retrieval
test, but even when scheme 1 was 50-200% faster than scheme 2, scheme 0
was an order of magnitude slower. Scheme 2 was faster for unique
insertion, though, so either scheme 1 or scheme 2 could be useful
depending upon the usage model.
Is there a good reason why retrieval with scheme 1 would be faster than
retrieval with scheme 2? Scheme 1 involves an index on 32 bytes while
scheme 2 involves an index on 8 bytes. I would think that scheme 2
would always be faster than scheme 1 because fewer bytes are involved.
Does SQLite index INT fields differently than TEXT fields?
Some background information, if necessary: the benchmark program was
written in C/C++ and run on a P4 (x86) Linux box. Inserts were
performed in transactions of 1000 each; pure retrievals were not grouped
together in transactions. String sizes ranged from 16-1048576 bytes;
number of rows in a table ranged from 10K to 10M. Synchronous writes
were disabled.
Thanks,
Arun