On Jun 20, 2005, at 11:24 AM, Yuriy wrote:

CS> What are you actually trying to do? And can you quantify "very slow" and
CS> tell us what you actually expect or what would be acceptable?
100,000 rows: all OK, about 7 seconds.
1,000,000 rows: the software halts :(

CS> Is this representative of what you are trying to do? Are you storing IP
CS> addresses, and you want to discard duplicates? Using the "on conflict"
CS> resolution is probably your fastest course of action.
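
(For illustration only, the "on conflict" route could look roughly like this
with the SQLite C API; the "ips"/"addr" names and the helper function are
made up for the sketch:)

    /* Untested sketch of the "on conflict" approach with the SQLite C API.
       Assumes a table created as: CREATE TABLE ips(addr TEXT PRIMARY KEY);
       "INSERT OR IGNORE" makes SQLite silently skip duplicate keys. */
    #include <sqlite3.h>

    int store_unique(sqlite3 *db, const char **vals, int n)
    {
        sqlite3_stmt *ins;
        int i;

        if (sqlite3_prepare(db,
                "INSERT OR IGNORE INTO ips(addr) VALUES(?);",
                -1, &ins, NULL) != SQLITE_OK)
            return -1;

        for (i = 0; i < n; i++) {
            sqlite3_bind_text(ins, 1, vals[i], -1, SQLITE_STATIC);
            sqlite3_step(ins);     /* duplicates are ignored, not errors */
            sqlite3_reset(ins);
        }
        sqlite3_finalize(ins);
        return 0;
    }
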
I am writing a log analyzer and want to use SQLite as the database.

All the operations in my software involve grouping big lists of strings,
and I need the fastest possible speed.
If I use GROUP BY it is slow :(

If all he's doing is discarding duplicate strings, with no requirement for
persistent storage, it is easily done with a primitive hash table
implementation. It could probably be done efficiently in less than a hundred
lines of C, most of which could be adapted from some example code.
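
Something along these lines, say (rough, untested sketch: fixed bucket
count, no freeing, no error handling):

    /* In-memory de-dup hash table: dedup_insert() returns 1 the first
       time a string is seen and 0 on every repeat. */
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 262144                 /* power of two, tune to taste */

    struct node { char *key; struct node *next; };
    static struct node *buckets[NBUCKETS];

    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;             /* djb2 */
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h & (NBUCKETS - 1);
    }

    int dedup_insert(const char *s)
    {
        unsigned long b = hash(s);
        struct node *n;

        for (n = buckets[b]; n; n = n->next)
            if (strcmp(n->key, s) == 0)
                return 0;                   /* duplicate */

        n = malloc(sizeof *n);
        n->key = strdup(s);
        n->next = buckets[b];
        buckets[b] = n;
        return 1;                           /* first occurrence */
    }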

Or in two or three lines of Perl.

Yes, I need a disk-based hash or B-tree. But SQLite, at its lowest level,
is a disk-based B-tree.


Pre-process the log file, creating a hash with the unique field as the key. Then loop over the hash and insert its keys into your db.

If memory is a constraint, don't even bother creating a hash. Loop over the log file, build an array, sort it, remove the duplicates, then insert the result into the db, making sure that you have AutoCommit off and commit every 10k or 100k records.
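
The commit-batching part might look something like this with the C API
(untested sketch; "entries"/"val" and the keys[] array are stand-ins for
whatever the real schema and de-duped list are):

    /* Bulk insert with explicit transactions, committing every 10,000
       rows instead of once per row. */
    #include <sqlite3.h>

    void bulk_insert(sqlite3 *db, char **keys, int nkeys)
    {
        sqlite3_stmt *ins;
        int i;

        sqlite3_prepare(db, "INSERT INTO entries(val) VALUES(?);",
                        -1, &ins, NULL);

        sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
        for (i = 0; i < nkeys; i++) {
            sqlite3_bind_text(ins, 1, keys[i], -1, SQLITE_STATIC);
            sqlite3_step(ins);
            sqlite3_reset(ins);

            if ((i + 1) % 10000 == 0) {     /* commit in batches */
                sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);
                sqlite3_exec(db, "BEGIN;", NULL, NULL, NULL);
            }
        }
        sqlite3_exec(db, "COMMIT;", NULL, NULL, NULL);

        sqlite3_finalize(ins);
    }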

Should be done in a few seconds. To give you an idea, I once de-duped a file with 320 million rows of duplicate email addresses in about 120 seconds on an ancient, creaking iBook. A million records should be a piece of cake.


--
Puneet Kishor
