I forgot to mention -- I was using SQLite 3.3.4.

Arun Bhalla wrote:
Hi,

I performed a quick benchmark of three different string indexing schemes for SQLite3.

 * Scheme 0 = indexing on the string field
* Scheme 1 = indexing on the MD5 sum (as text in hexadecimal representation) of the string * Scheme 2 = indexing on the high 64 bits of the MD5 sum (as int) of the string

I varied string size and number of strings and evaluated the schemes on database size and a couple insertion and retrieval tests each. In general, scheme 2 was quite effective for most cases. Scheme 0 was the best all-around for short strings (16 bytes or less), but in most cases, scheme 2 was not far behind. When working with larger strings, scheme 2 would dominate, with scheme 1 generally having similar performance. SQLite's indexing mechanism (scheme 0) did not scale well in size or performance for large strings.

Strangely, scheme 1 always outperformed scheme 2 in the pure retrieval test, but even when scheme 1 was 50-200% faster than scheme 2, scheme 0 was an order of magnitude slower. Scheme 2 was faster for unique insertion, though, so either scheme 1 or scheme 2 could be useful depending upon the usage model.

Is there a good reason why retrieval with scheme 1 would be faster than retrieval with scheme 2? Scheme 1 involves an index on 32 bytes while scheme 2 involves an index on 8 bytes. I would think that scheme 2 would always be faster than scheme 1 because fewer bytes are involved. Does SQLite index INT fields differently than TEXT fields?

Some background information, if necessary: the benchmark program was written in C/C++ and run on a P4 (x86) Linux box. Inserts were performed in transactions of 1000 each; pure retrievals were not grouped together in transactions. String sizes ranged from 16-1048576 bytes; number of rows in a table ranged from 10K to 10M. Synchronous writes were disabled.

Thanks,
Arun



Reply via email to