>No indexes: 21 sec.
>Indexing while inserting: 50 sec.
>Indexing after inserting: 37 sec.
>
You're right: creating the index after inserting is faster. Indeed, I
tried that out, and with a larger set of data the gap is even more
dramatic.
More benchmarks, using the same data: a dump of a 3.7GB database file
(10,602,743 rows) reloaded via sqlite.exe:
Case 1: Just inserts:
    BEGIN TRANSACTION;
    CREATE TABLE ...
    INSERT ...
    -- etc --
    INSERT ...
    COMMIT;
Elapsed: 20 min

Case 2: Inserts first, index after:
    BEGIN TRANSACTION;
    CREATE TABLE ...
    INSERT ...
    -- etc --
    INSERT ...
    CREATE INDEX ...
    COMMIT;
Elapsed: 2 hrs 30 min

Case 3: Index first, inserts after:
    BEGIN TRANSACTION;
    CREATE TABLE ...
    CREATE INDEX ...
    INSERT ...
    -- etc --
    INSERT ...
    COMMIT;
Elapsed: 16 hrs 20 min

Case 4: Index first, inserts after, not in a transaction:
    CREATE TABLE ...
    CREATE INDEX ...
    INSERT ...
    -- etc --
    INSERT ...
Elapsed: 26 hrs 58 min
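
To make Case 2 concrete, the script has this shape (the headers schema
here is invented for illustration; my real table has more columns):

    BEGIN TRANSACTION;
    CREATE TABLE headers (
      msgid   TEXT,      -- message-id
      subject TEXT,
      servers INTEGER    -- how many servers carry the article
    );
    INSERT INTO headers VALUES ('<1@example>', 'hello', 1);
    -- ... 10.6 million more INSERTs ...
    CREATE UNIQUE INDEX headers_msgid ON headers(msgid);
    COMMIT;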
As you may have guessed from the table definition, this data comes from
a USENET newsreader client. It stores article header information coming
in from multiple internet servers via multiple TCP connections run by
multiple threads. The same header will come in multiple times from a
particular server, and also from multiple servers. Part of each record
is a count of how many servers carry the article, so the table has to
be searched and updated while data is being inserted. Thus, deferring
the index until after loading is probably not an option here.
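
Concretely, using the invented headers schema from the Case 2 sketch
above (and leaving out the logic that avoids double-counting the same
server), each incoming header does roughly this:

    -- The unique index must already exist, i.e. Case 3's shape:
    --   CREATE UNIQUE INDEX headers_msgid ON headers(msgid);
    -- Insert the row if the message-id is new, then bump the count.
    INSERT OR IGNORE INTO headers VALUES ('<1@example>', 'hello', 0);
    UPDATE headers SET servers = servers + 1
     WHERE msgid = '<1@example>';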
The newsreader client is operating somewhere between Case 3 and Case 4,
and remember those are benchmarks of the single-threaded sqlite.exe app.
Normally, every few days you would query the news servers about their
current content, delete the rows that have "expired", and add data for
new articles. As it stands, though, it is so slow that it is faster to
delete the database file and reload the whole 3.6GB of data over the
internet at 380 KB/sec cable modem speeds.
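
The expiry pass itself is trivial SQL; with an invented expires column
it would be just something like:

    -- Drop expired articles; the cutoff (unix time) comes from the
    -- client.  New articles then go through the insert-or-update
    -- pattern shown earlier.
    DELETE FROM headers WHERE expires < 1100000000;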
I'm still interested in what resource is limiting the operation. My
disk's random access rate is about 5 MB/sec, yet data is being stored
at only about 50 KB/sec. Does this mean each INSERT is touching the
index data 100 times per record? (Can anyone calculate the average
number of B-tree nodes touched per operation with 10,602,743 records?)
I assume the access pattern jumps around in such a way that both the
disk cache and the DB cache are being defeated.
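
Trying the arithmetic myself, with assumed numbers: at SQLite's default
1 KB page size, figure roughly 50 message-id keys per interior page, so
the tree depth is about

    ln(10,602,743) / ln(50)  ~=  16.2 / 3.9  ~=  4 levels

That is ~4 page touches per probe, and each record needs several probes
(search, update, and possibly an insert with leaf splits). Once the
~600 MB index outgrows the caches, most of those touches become random
reads plus writeback of dirty pages, which seems at least the right
order of magnitude for a 100:1 ratio of I/O to stored data.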
Can the index be configured so that more of it is kept in memory? Can
it be kept in a different file, and would that help (would the index
info be more localized, filling up whole pages)? The index appears to
represent about 600 MB of file data.
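
Partly answering my own question: the one knob I know of is the page
cache size, though I don't know whether a cache large enough for the
whole index is practical here. With the default 1 KB pages, holding the
~600 MB index would take on the order of 600,000 pages:

    PRAGMA cache_size = 600000;          -- this connection only
    PRAGMA default_cache_size = 600000;  -- persisted in the DB file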
Finally, has anyone seen evidence that SELECTs on indexed data are
faster when the records are inserted first and indexed later?