Hi,

Yes, I am dealing with DNA sequences. The data is huge: the human genome is about 3.2 billion records, and the others are of similar size.
Before I found SQLite I extracted the strings in native C, testing for uniqueness. With SQLite I did the same and it took about the same time; once I added the index, the time was reduced. I don't need this data to be sorted, just unique. Adding the index made it faster, but I am always looking for ways of reducing processing time. I will try your suggestions and report back. More suggestions are very welcome.

Thanks very much for your help,
Michael

> Hi Michael,
>
> Two thoughts -- and I hope others will chime in if I'm off-base here:
>
> 1) Build just one index after the import:
>
> From the page http://www.sqlite.org/lang_createtable.html:
> "The UNIQUE constraint causes an unique index to be created on the
> specified columns."
>
> I think that if you then create your own index, it will be redundant,
> and, with millions of records, time-consuming.
>
> However, you may well want to KEEP your own index (and add the UNIQUE
> constraint to it), and *remove* the UNIQUE constraint on the table
> column definition. I hear that creating the index after the data has
> been imported is faster (and less fragmented) than creating an implicit
> one on-the-fly during import.
>
> 2) Hashing
>
> If Vinnie [thev...@yahoo.com] was correct in guessing your goal (to
> determine if a given string has been examined before), and if
> performance remains a problem, I wondered if the design might benefit
> from using hashing to "pre-qualify" a string and/or substitute for a
> direct match.
>
> A short hash (perhaps a 64-bit integer?) could be stored in a separate
> table or even a separate database from the strings themselves. If the
> hash of a new string does not match any hash in the database, then you
> know there is no need to search for the string itself. If the entire
> hash index can be kept in RAM, you might get a big benefit.
>
> You could take this a step further by using a longer hash as a proxy
> for the string itself.
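For illustration, a minimal sketch of this pre-qualifying-hash idea (table and column names are hypothetical, and Python's sqlite3 stands in for the original C program; the 64-bit hash here is just the first 8 bytes of SHA-256, though any fast non-cryptographic hash would do):

```python
import hashlib
import sqlite3

def hash64(s: str) -> int:
    # First 8 bytes of SHA-256, folded into a signed 64-bit integer
    # so it fits SQLite's INTEGER type.
    h = int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
    return h - (1 << 64) if h >= (1 << 63) else h

db = sqlite3.connect(":memory:")  # small enough to keep the hash table in RAM
db.execute("CREATE TABLE hashes (h INTEGER)")
db.execute("CREATE INDEX hashes_h ON hashes(h)")

def seen_before(seq: str) -> bool:
    """A hash miss proves the string is new, so no string lookup is
    needed; a hit only means a *possible* match."""
    h = hash64(seq)
    hit = db.execute("SELECT 1 FROM hashes WHERE h = ? LIMIT 1", (h,)).fetchone()
    if hit is None:
        db.execute("INSERT INTO hashes VALUES (?)", (h,))
        return False
    return True  # hash hit: fall back to comparing the stored strings

print(seen_before("ACGTACGT"))  # False: first time seen
print(seen_before("ACGTACGT"))  # True: hash already recorded
```

On a hash hit you would still confirm against the stored strings, since two distinct sequences can, rarely, share a 64-bit hash.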
> The hash would need to be long enough to make collisions extremely
> unlikely, yet not so long as to negate the value of using a proxy for
> the original strings. In practice, you'd probably want to compute a
> single long hash for each string, then use the first X bytes as the
> "pre-qualifying" hash, and the remainder as the "confirming" hash. If
> your average string length is short, then using *two* hashes may not be
> much of a gain.
>
> (Of course, since this is not a cryptographic application, you don't
> need to worry about whether your chosen hash algorithm is "secure" or
> not.)
>
> Regards,
> Donald
>
> -----Original Message-----
> From: sqlite-users-boun...@sqlite.org
> [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Vinnie
> Sent: Sunday, March 29, 2009 6:14 PM
> To: sqlite-users@sqlite.org
> Subject: Re: [sqlite] creating unique data takes many hours, help
>
> Michael:
>
> While my answer isn't strictly limited to SQLite, the performance nut
> in me can't help myself. You have a table with only one column, a
> string. I could be wrong here, but it seems like you just want to keep
> a list of values that you have already tried. After you insert a bunch
> of strings into the table, you want to be able to quickly look up a
> string to see if it exists, so that you can tell whether you have
> already probed that sequence (taking a guess here).
>
> If my guess is right, and the only thing you are doing is looking up
> sorted single-column elements, you can probably get away with your own
> quick disk-based binary tree implementation and avoid SQLite for this
> particular circumstance altogether. The result would be several orders
> of magnitude faster, even after you have followed the suggestions
> others have given.
>
>> Hi,
>>
>> I am new to SQLite, and I created a program that reads several million
>> records and puts them into an SQLite db.
>>
>> The table has one column ONLY, indexed and unique, but it takes many
>> hours.
>> Is it necessary to pre-allocate the space, or is there anything that I
>> can do to reduce the time it takes?
>>
>> This is how I create the db, table and index.
>
> Yes, you're right, but if the data is already in index order, you'll do
> less I/O when creating the index. Whether the sort + create DB time is
> less than "create DB from random input" time is another question.
>
> Jim
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
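To make the "build the index after the import" suggestion concrete, here is a hedged sketch (hypothetical table and column names, Python's sqlite3 standing in for the original C program): define the column without UNIQUE, bulk-insert inside one transaction, and only then build the unique index.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # on disk, PRAGMA tuning may also help
# No UNIQUE constraint, so no index is maintained during the import.
db.execute("CREATE TABLE seqs (seq TEXT)")

seqs = ["ACGT", "TTAA", "GGCC", "ACGT"]  # toy input; real input is millions of rows

with db:  # one transaction around the whole import avoids a commit per row
    db.executemany("INSERT INTO seqs VALUES (?)", ((s,) for s in seqs))

# Building a UNIQUE index fails if duplicates exist, so dedupe first
# when the input may contain repeats.
db.execute("DELETE FROM seqs WHERE rowid NOT IN "
           "(SELECT MIN(rowid) FROM seqs GROUP BY seq)")
db.execute("CREATE UNIQUE INDEX seqs_uq ON seqs(seq)")

print(db.execute("SELECT COUNT(*) FROM seqs").fetchone()[0])  # 3 unique sequences
```

Whether this beats indexing during the import should of course be measured on the real data, as Jim's reply notes for the sorted-input variant.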