Re: [sqlite] creating unique data takes many hours, help

Ken Mon, 30 Mar 2009 08:53:30 -0700

As others have indicated: 
   1. Create the index at the end after the data is loaded. 
   2. Wrap the inserts in a transaction. Commiting every N transactions.

Also:
   3. Instead of the sqlite3_mprintf/exec, you should prepare,bind, step.
   4. If this is single threaded then maybe recompiling the sqlite library 
disabling threading would likely help some.

      
   

--- On Mon, 3/30/09, [email protected] <[email protected]> wrote:

> From: [email protected] <[email protected]>
> Subject: Re: [sqlite] creating unique data takes many hours, help
> To: "General Discussion of SQLite Database" <[email protected]>
> Date: Monday, March 30, 2009, 7:19 AM
> Hi,
> 
> Yes, I am dealing with dna sequences, the data is huge,
> human genome is
> about 3.2 billion records, others are of similar size.
> 
> Before I found squile, using native C, I extracted the
> strings, testing
> for uniqueness. With sqlite I did the same and  it
> took about the same
> time, once I added the index to sqlite the time was
> reduced.
> 
> I don't need this data to be sorted, just unique, adding
> the index made it
> faster, but I am always looking for ways of reducing
> processing times
> 
> I will try your suggestions and report back.
> 
> More suggestions are very welcome
> 
> Thanks very much for your help
> 
> Michael
> 
> > Hi Michael,
> >
> >
> > Two thoughts -- and I hope others will chime in if I'm
> off-base here:
> >
> >
> > 1) Build just one index after the import:
> >
> >>From page:
> >     http://www.sqlite.org/lang_createtable.html
> > "The UNIQUE constraint causes an unique index to be
> created on the
> > specified columns."
> >
> > I think that if you then create your own index, it
> will be redundant,
> > and, with millions of records, time-consuming.
> >
> > However, you may well want to KEEP your own index (and
> add the UNIQUE
> > constrait to it), and *remove* the UNIQUE constraint
> on the table column
> > definition.   I hear that creating the
> index after the data has been
> > imported is faster (and less fragmented) than creating
> an implicit one
> > on-the-fly during import.
> >
> >
> >
> > 2) Hashing
> >
> > If Vinnie [[email protected]]
> was correct in guessing your goal (to
> > determine if a given string has been examined before),
> and if
> > performance remains a problem, I wondered if the
> design might benefit
> > from using hashing to "pre-qualify" a string and/or
> substitute for a
> > direct match.
> >
> > A short hash (perhaps a 64-bit integer?) could be
> stored in a separate
> > table or even a separate database from the strings
> themselves.  If the
> > hash of a new string does not match any hash in the
> database, then you
> > know there is no need to search for the string
> itself.  If the entire
> > hash index can be kept in RAM, you might get a big
> benefit.
> >
> > You could decide to take this a step further by using
> a longer hash as a
> > proxy for the string itself.  The hash would need
> to be long enough to
> > make collisions extremely unlikely, yet not so long as
> to negate the
> > value of using a proxy for the original strings. 
> In practice, you'd
> > probably want to compute a single long hash for each
> string, then use
> > the first X bytes as the "pre-qualifying" hash, and
> the remainder as the
> > "confirming hash." If your average string length is
> short, then using
> > *two* hashes may not be much of a gain.
> >
> > (Of course, since this is not a cryptographic
> application, you don't
> > need to worry about whether your chosen hash algorithm
> is "secure" or
> > not.)
> >
> > Regards,
> >    Donald
> >
> >
> >
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]]
> On Behalf Of Vinnie
> > Sent: Sunday, March 29, 2009 6:14 PM
> > To: [email protected]
> > Subject: Re: [sqlite] creating unique data takes many
> hours, help
> >
> >
> > Michael:
> >
> > While my answer isn't strictly limited to sqlite, the
> performance nut in
> > me can't help myself. You have a a table with only one
> column, a string.
> > I could be wrong here but it seems like you just want
> to keep a list of
> > values that you have already tried. After you insert a
> bunch of strings
> > into the table you want to be able to quickly look up
> a string to see if
> > it exists, so that you can tell if you already probed
> that sequence
> > (taking a guess here).
> >
> > If my guess is right, and the only thing you are doing
> is looking up
> > sorted single-column elements, you probably can get
> away with your own
> > quick disk-based binary tree implementation and avoid
> sqlite for this
> > particular circumstance altogether. The result would
> be several orders
> > of magnitude faster, even after you have followed the
> suggestions others
> > have given.
> >
> >
> >
> >
> >> Hi,
> >>
> >> I am new with sqlite, and I create a program that
> reads several mllion
> >
> >> records and puts them into a sqlite db using.
> >>
> >> The table has one column ONLY indexed and unique,
> but it takes many
> >> hours.
> >>
> >> Is it necessary to pre-allocate the space, or is
> the anything that I
> >> can do to reduce the time it takes.
> >>
> >> this is how I create the db, table and index.
> >
> >
> 
> 
> 
> Yes, you're right, but if the data is already in index
> order, you'll
> do less I/O when creating the index.  Whether the sort
> + create DB
> time is less than "create DB from random input" time is
> another
> question.
> 
> Jim
> _______________________________________________
> > sqlite-users mailing list
> > [email protected]
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [email protected]
> > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> >
> 
> 
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> 
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] creating unique data takes many hours, help

Reply via email to