Hi,

Yes, I am dealing with DNA sequences. The data is huge: the human genome is about 3.2 billion records, and the others are of similar size.
Before I found SQLite I extracted the strings in native C, testing for uniqueness. With SQLite I did the same and it took about the same time; once I added the index, the time was reduced. I don't need this data to be sorted, just unique. Adding the index made it faster, but I am always looking for ways of reducing processing time. I will try your suggestions and report back. More suggestions are very welcome.

Thanks very much for your help,
Michael

> Hi Michael,
>
> Two thoughts -- and I hope others will chime in if I'm off-base here:
>
> 1) Build just one index after the import:
>
> From the page http://www.sqlite.org/lang_createtable.html:
> "The UNIQUE constraint causes an unique index to be created on the
> specified columns."
>
> I think that if you then create your own index, it will be redundant,
> and, with millions of records, time-consuming.
>
> However, you may well want to KEEP your own index (and add the UNIQUE
> constraint to it), and *remove* the UNIQUE constraint on the table
> column definition. I hear that creating the index after the data has
> been imported is faster (and less fragmented) than creating an implicit
> one on-the-fly during import.
>
> 2) Hashing
>
> If Vinnie [thev...@yahoo.com] was correct in guessing your goal (to
> determine if a given string has been examined before), and if
> performance remains a problem, I wondered if the design might benefit
> from using hashing to "pre-qualify" a string and/or substitute for a
> direct match.
>
> A short hash (perhaps a 64-bit integer?) could be stored in a separate
> table or even a separate database from the strings themselves. If the
> hash of a new string does not match any hash in the database, then you
> know there is no need to search for the string itself. If the entire
> hash index can be kept in RAM, you might get a big benefit.
>
> You could take this a step further by using a longer hash as a proxy
> for the string itself.
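For illustration, a minimal sketch of this pre-qualifying-hash idea (table and column names are hypothetical, and Python's sqlite3 stands in for the original C program; the 64-bit hash here is just the first 8 bytes of SHA-256, though any fast non-cryptographic hash would do):

```python
import hashlib
import sqlite3

def hash64(s: str) -> int:
    # First 8 bytes of SHA-256, folded into a signed 64-bit integer
    # so it fits SQLite's INTEGER type.
    h = int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
    return h - (1 << 64) if h >= (1 << 63) else h

db = sqlite3.connect(":memory:")  # small enough to keep the hash table in RAM
db.execute("CREATE TABLE hashes (h INTEGER)")
db.execute("CREATE INDEX hashes_h ON hashes(h)")

def seen_before(seq: str) -> bool:
    """A hash miss proves the string is new, so no string lookup is
    needed; a hit only means a *possible* match."""
    h = hash64(seq)
    hit = db.execute("SELECT 1 FROM hashes WHERE h = ? LIMIT 1", (h,)).fetchone()
    if hit is None:
        db.execute("INSERT INTO hashes VALUES (?)", (h,))
        return False
    return True  # hash hit: fall back to comparing the stored strings

print(seen_before("ACGTACGT"))  # False: first time seen
print(seen_before("ACGTACGT"))  # True: hash already recorded
```

On a hash hit you would still confirm against the stored strings, since two distinct sequences can, rarely, share a 64-bit hash.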
> The hash would need to be long enough to make collisions extremely
> unlikely, yet not so long as to negate the value of using a proxy for
> the original strings. In practice, you'd probably want to compute a
> single long hash for each string, then use the first X bytes as the
> "pre-qualifying" hash, and the remainder as the "confirming" hash. If
> your average string length is short, then using *two* hashes may not be
> much of a gain.
>
> (Of course, since this is not a cryptographic application, you don't
> need to worry about whether your chosen hash algorithm is "secure" or
> not.)
>
> Regards,
> Donald
>
> -----Original Message-----
> From: sqlite-users-boun...@sqlite.org
> [mailto:sqlite-users-boun...@sqlite.org] On Behalf Of Vinnie
> Sent: Sunday, March 29, 2009 6:14 PM
> To: sqlite-users@sqlite.org
> Subject: Re: [sqlite] creating unique data takes many hours, help
>
> Michael:
>
> While my answer isn't strictly limited to SQLite, the performance nut
> in me can't help myself. You have a table with only one column, a
> string. I could be wrong here, but it seems like you just want to keep
> a list of values that you have already tried. After you insert a bunch
> of strings into the table, you want to be able to quickly look up a
> string to see if it exists, so that you can tell whether you have
> already probed that sequence (taking a guess here).
>
> If my guess is right, and the only thing you are doing is looking up
> sorted single-column elements, you can probably get away with your own
> quick disk-based binary tree implementation and avoid SQLite for this
> particular circumstance altogether. The result would be several orders
> of magnitude faster, even after you have followed the suggestions
> others have given.
>
>> Hi,
>>
>> I am new to SQLite, and I created a program that reads several million
>> records and puts them into an SQLite db.
>>
>> The table has one column ONLY, indexed and unique, but it takes many
>> hours.
>> Is it necessary to pre-allocate the space, or is there anything that I
>> can do to reduce the time it takes?
>>
>> This is how I create the db, table and index.
>
> Yes, you're right, but if the data is already in index order, you'll do
> less I/O when creating the index. Whether the sort + create DB time is
> less than "create DB from random input" time is another question.
>
> Jim
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
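To make the "build the index after the import" suggestion concrete, here is a hedged sketch (hypothetical table and column names, Python's sqlite3 standing in for the original C program): define the column without UNIQUE, bulk-insert inside one transaction, and only then build the unique index.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # on disk, PRAGMA tuning may also help
# No UNIQUE constraint, so no index is maintained during the import.
db.execute("CREATE TABLE seqs (seq TEXT)")

seqs = ["ACGT", "TTAA", "GGCC", "ACGT"]  # toy input; real input is millions of rows

with db:  # one transaction around the whole import avoids a commit per row
    db.executemany("INSERT INTO seqs VALUES (?)", ((s,) for s in seqs))

# Building a UNIQUE index fails if duplicates exist, so dedupe first
# when the input may contain repeats.
db.execute("DELETE FROM seqs WHERE rowid NOT IN "
           "(SELECT MIN(rowid) FROM seqs GROUP BY seq)")
db.execute("CREATE UNIQUE INDEX seqs_uq ON seqs(seq)")

print(db.execute("SELECT COUNT(*) FROM seqs").fetchone()[0])  # 3 unique sequences
```

Whether this beats indexing during the import should of course be measured on the real data, as Jim's reply notes for the sorted-input variant.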