On 17/09/2007, John Machin <[EMAIL PROTECTED]> wrote:
> On 17/09/2007 1:07 PM, Joe Wilson wrote:
> > --- "D. Richard Hipp" <[EMAIL PROTECTED]> wrote:
> >>>>> I have been struggling with the performance of insertion in sqlite.
> >>>>>
> >>>>> Here we have a very simple case :
> >>>>>
> >>>>> A table with an integer autoincrement primary key and a text
> >>>>> field that is
> >>>>> unique.
> >>>>>
> >>>>> CREATE TABLE my (id PRIMARY KEY, url);
> >>>>>
> >>>>> CREATE UNIQUE INDEX myurl ON my(url);
> >>>>>
> >>>>>
> >>>>> My application requires inserting up to 10 million records in
> >>>>> batches of
> >>>>> 20 thousand records.
> >> For each group of 20000 records, first insert them into a TEMP table.
> >> Call the temp table t1.  Then transfer the records to the main table
> >> as follows:
> >>
> >>       INSERT OR IGNORE INTO my(url) SELECT url FROM t1 ORDER BY url;
> >
> > I had no performance improvement with that temp store staging table
> > technique in my testing - actually it was slower.
> >
> >   http://www.mail-archive.com/sqlite-users@sqlite.org/msg22143.html
> >
> > Mind you, the table I was testing against had 4 indexes, whereas the above
> > table has 2. I also wasn't using "OR IGNORE". There might be a difference.
> >
> > Just setting pragma cache_size to a huge value and inserting into
> > the table normally in large batches resulted in better performance in
> > my case. It may have already been mentioned, but having a big
> > database page_size value helps minimize the disk writes as well.
>
> A couple of thoughts:
>
> OTTOMH, time to search index is approx O(D * log(K)) where D = depth of
> tree and K = number of keys per block, and K ** D is O(N)
>
> So:
>
> (1) Big block means big K and thus small D
>
> (2) Long keys (like URLs!!) means small K and thus big D
>
> Further on point (2), the OP seems unsure about whether his URLs are
> unique or not. Perhaps storing another column containing a 64-bit hash
> with an index on that column might be the way to go -- the shorter key
> might well outweigh the extra cost of checking for duplicates.
>


A thought along the same lines: can sqlite create a unique index that
is hash-based?  That would provide the UNIQUE support, but it wouldn't
provide a sorted index.

That should resolve the massive-insert-too-slow problem, and
afterwards he can create a sorted index on the column if he needs
ordered lookups.
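As far as I know sqlite only builds b-tree indexes, but the hash idea
can be emulated by hand: store a 64-bit hash in its own column, index
that, and compare the full url only on a hash hit.  A rough sketch
(the helper names and the choice of truncated SHA-1 are mine, not from
the thread):

```python
import sqlite3, hashlib

def url_hash(url):
    # 64-bit hash of the url: first 8 bytes of SHA-1, shifted into
    # sqlite's signed 64-bit integer range.  Any stable hash would do.
    return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big") - 2**63

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, url TEXT, h INTEGER)")
conn.execute("CREATE INDEX myhash ON my(h)")  # short fixed-width keys -> shallow tree

def insert_unique(url):
    h = url_hash(url)
    # duplicates can only hide behind an equal hash, so only those rows get checked
    hit = conn.execute("SELECT 1 FROM my WHERE h = ? AND url = ?", (h, url)).fetchone()
    if hit is None:
        conn.execute("INSERT INTO my(url, h) VALUES (?, ?)", (url, h))

insert_unique("http://example.com/a")
insert_unique("http://example.com/a")  # second copy is ignored
```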



Alternatively, he can go without the UNIQUE index for the initial
inserts and delete the duplicates later.  Off the top of my head, it
might go something like:

create an ordinary (not unique) index on url, then:

SELECT t2.id FROM my AS t1, my AS t2
 WHERE t1.url = t2.url AND t1.id < t2.id;

That would give you the row ids of every row that repeats an earlier
row's url.  Then just build an id list (comma separated) and:

DELETE FROM my WHERE id IN (id list);
NB: some DBs let you fold the select into the delete as a subquery;
not sure about sqlite offhand.

and THEN drop the above index, and create the final unique index.
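Putting the whole flow together -- a sketch in Python's sqlite3 (and
sqlite does in fact accept the select as a subquery of the delete, so
no separate comma-separated id list is needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my (id INTEGER PRIMARY KEY, url TEXT)")

# 1. bulk insert with no UNIQUE constraint slowing things down
urls = ["a", "b", "a", "c", "b"]
conn.executemany("INSERT INTO my(url) VALUES (?)", [(u,) for u in urls])

# 2. plain (non-unique) index so the self-join below isn't O(N^2)
conn.execute("CREATE INDEX myurl ON my(url)")

# 3. delete every row that repeats an earlier row's url
conn.execute("""DELETE FROM my WHERE id IN (
                  SELECT t2.id FROM my AS t1, my AS t2
                  WHERE t1.url = t2.url AND t1.id < t2.id)""")

# 4. swap the scratch index for the final unique one
conn.execute("DROP INDEX myurl")
conn.execute("CREATE UNIQUE INDEX myurl ON my(url)")
```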

see ya
