Hi Michael,

Two thoughts -- and I hope others will chime in if I'm off-base here:


1) Build just one index after the import:

From the page
    http://www.sqlite.org/lang_createtable.html
"The UNIQUE constraint causes an unique index to be created on the
specified columns."

I think that if you then create your own index, it will be redundant,
and, with millions of records, time-consuming.

However, you may well want to KEEP your own index (and add the UNIQUE
constraint to it), and *remove* the UNIQUE constraint from the table
column definition.  I hear that creating the index after the data has
been imported is faster (and less fragmented) than creating an implicit
one on-the-fly during import.
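To illustrate, here's a minimal sketch of that pattern in Python (the table
and column names are made up for the example, since we haven't seen
Michael's schema): bulk-insert into a plain column inside one transaction,
then build the single unique index at the end.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# No UNIQUE constraint on the column definition -- import is a plain insert.
con.execute("CREATE TABLE probes (seq TEXT)")

rows = [("AAA",), ("AAB",), ("AAC",)]
with con:  # one transaction around the whole bulk insert
    con.executemany("INSERT INTO probes (seq) VALUES (?)", rows)

# Build the one unique index only after the data is loaded.
con.execute("CREATE UNIQUE INDEX idx_probes_seq ON probes (seq)")

# Lookups (and duplicate rejection) now use that index.
found = con.execute(
    "SELECT 1 FROM probes WHERE seq = ?", ("AAB",)
).fetchone() is not None
```

After the index exists, a duplicate insert fails with a constraint error,
so you get the same uniqueness guarantee without paying for the implicit
index during the import itself.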



2) Hashing

If Vinnie [thev...@yahoo.com] was correct in guessing your goal (to
determine if a given string has been examined before), and if
performance remains a problem, I wondered if the design might benefit
from using hashing to "pre-qualify" a string and/or substitute for a
direct match.

A short hash (perhaps a 64-bit integer?) could be stored in a separate
table or even a separate database from the strings themselves.  If the
hash of a new string does not match any hash in the database, then you
know there is no need to search for the string itself.  If the entire
hash index can be kept in RAM, you might get a big benefit.
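A rough sketch of that "pre-qualify" idea (the helper names are mine, and
SHA-1 is just a convenient byte source here, not a recommendation):

```python
import hashlib

def hash64(s):
    # A 64-bit hash taken from the first 8 bytes of SHA-1.  Since this
    # isn't a cryptographic application, any fast hash would do as well.
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

seen_hashes = set()  # small enough to keep entirely in RAM

def maybe_seen(s):
    # False: the string is definitely new -- no database lookup needed.
    # True: the hash matched, so confirm against the stored strings.
    return hash64(s) in seen_hashes

def record(s):
    seen_hashes.add(hash64(s))
```

The payoff is that the common case (a genuinely new string) is answered
by one in-memory set lookup; the database is only consulted on a hash hit.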

You could decide to take this a step further by using a longer hash as a
proxy for the string itself.  The hash would need to be long enough to
make collisions extremely unlikely, yet not so long as to negate the
value of using a proxy for the original strings.  In practice, you'd
probably want to compute a single long hash for each string, then use
the first X bytes as the "pre-qualifying" hash, and the remainder as the
"confirming hash." If your average string length is short, then using
*two* hashes may not be much of a gain. 

(Of course, since this is not a cryptographic application, you don't
need to worry about whether your chosen hash algorithm is "secure" or
not.)

Regards,
   Donald



-----Original Message-----
From: sqlite-users-boun...@sqlite.org
[mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Vinnie
Sent: Sunday, March 29, 2009 6:14 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] creating unique data takes many hours, help


Michael:

While my answer isn't strictly limited to sqlite, the performance nut in
me can't help myself. You have a table with only one column, a string.
I could be wrong here but it seems like you just want to keep a list of
values that you have already tried. After you insert a bunch of strings
into the table you want to be able to quickly look up a string to see if
it exists, so that you can tell if you already probed that sequence
(taking a guess here).

If my guess is right, and the only thing you are doing is looking up
sorted single-column elements, you probably can get away with your own
quick disk-based binary tree implementation and avoid sqlite for this
particular circumstance altogether. The result would be several orders
of magnitude faster, even after you have followed the suggestions others
have given.
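As a rough illustration of the idea (not a full tree, and the fixed record
length is an assumption of mine): if every string is padded to the same
length and the file is kept sorted, membership is a plain binary search
over the file with O(log n) seeks.

```python
import os

REC = 16  # hypothetical fixed record length in bytes

def contains(f, key):
    # Binary search over a sorted file of fixed-length records.
    # `key` must be at most REC bytes; it is padded to record length.
    key = key.ljust(REC, b"\0")
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell() // REC
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * REC)
        rec = f.read(REC)
        if rec < key:
            lo = mid + 1
        elif rec > key:
            hi = mid
        else:
            return True
    return False
```

Inserts into a sorted flat file are the expensive part, which is why a
real implementation would use a tree or batch the inserts; this only
shows how cheap the lookups can be.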




> Hi,
> 
> I am new with sqlite, and I created a program that reads several
> million records and puts them into a sqlite db.
> 
> The table has only one column, indexed and unique, but the import
> takes many hours.
> 
> Is it necessary to pre-allocate the space, or is there anything that
> I can do to reduce the time it takes?
> 
> This is how I create the db, table and index:

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
