Chris Jones wrote:
I've read elsewhere that this is a data locality issue, which certainly
makes sense.

And in those threads, a suggestion has been made to insert in sorted order. But it's unclear to me exactly what the sorting function would need to be -
it's likely my sorting function (say strcmp()) wouldn't match what sqlite
expects.  It's also a little unclear if I can even sort this many keys
externally with any ease.  After all, that's why I'm trying to use a
database.

Chris,

You are correct that this is a locality of reference issue. Using a presorted list speeds insertions because all the index insertions are made at the end of the index. To make this run faster you need to prepare a presorted list of your strings. The question is how best to do that with a file this large.
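
For example (just a toy sketch typed at the sqlite3 shell, with made-up table and column names), when the keys arrive already sorted, each new index entry goes onto the same rightmost index page, which is almost certainly still in the page cache:

   CREATE TABLE t(s TEXT);
   CREATE INDEX t_s_idx ON t(s);
   BEGIN;
   INSERT INTO t VALUES('aaa');  -- lands on the rightmost leaf of t_s_idx
   INSERT INTO t VALUES('aab');  -- same page, still cached
   INSERT INTO t VALUES('abc');  -- same page again
   COMMIT;

With random keys each insert can land on a different index page, and once the index is bigger than the cache most of those pages have to be read from disk first.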

You didn't say what OS you are running, but these days most basic unix commands are available on Windows as well. I am assuming you have your list of 112M lines in a text file. You can prepare a large sorted file pretty efficiently using a merge sort, since it reads the input files in one pass. The trick is that the input files must be presorted. So you need to split your original data file into many smaller files, sort each in memory (where locality isn't a problem as long as the data fits in real memory), and then merge the individual sorted files. This is best done with a script.

You can use the split command to split your text file into many separate files. If you split your file into chunks of 1M lines you will have 112 files to sort and merge. Each file will be less than 80 MB, so each should be sortable in memory.

   split -l 1000000 infile part

This will generate 112 files named partaa, partab, partac ...

Now you need to sort each of these files. Note that the sort command can also eliminate any duplicate lines within each file (the -u, for unique, option).

   parts=`ls part*`
   for f in $parts
   do
       sort -o $f -u $f
   done

Now you have 112 sorted files that need to be merged. Keep the -u option on the merge step as well, so that duplicates which ended up in different chunk files are also removed.

   mv partaa sorted
   parts=`ls part*`
   for f in $parts
   do
       sort -m -u -o sorted sorted $f
   done

You now have a single large sorted file that contains only unique strings. You should be able to insert the strings from this file into your sqlite database much faster than you could before.
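
To give an idea of the final load (a sketch only; the database file, table, and column names below are made up, and it assumes your strings don't contain the sqlite3 shell's default column separator '|'), the shell's .import command inserts one row per line of the sorted file:

   sqlite3 strings.db 'CREATE TABLE strings(s TEXT PRIMARY KEY);'
   sqlite3 strings.db '.import sorted strings'

Because the file is already sorted and de-duplicated, the index behind the primary key grows by appending at its right edge instead of jumping all over the database file.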

If you want, you can get fancier with the script and merge pairs of sorted files into increasingly larger files until you have a single file. A rough sketch of that follows; I'll leave polishing it as an exercise for the reader.
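
This is an untested sketch of the pairwise merging, assuming the sorted chunks are still named part*:

   # Repeatedly merge the sorted chunks two at a time until one file is left.
   round=0
   while [ `ls part* | wc -l` -gt 1 ]
   do
       round=`expr $round + 1`
       i=0
       set -- part*
       while [ $# -gt 1 ]
       do
           i=`expr $i + 1`
           sort -m -u -o merge.$round.$i "$1" "$2"
           rm "$1" "$2"
           shift 2
       done
       # an odd chunk left over simply moves on to the next round
       if [ $# -eq 1 ]
       then
           i=`expr $i + 1`
           mv "$1" merge.$round.$i
       fi
       # rename the outputs so the next round picks them up as part* again
       for f in merge.$round.*
       do
           mv $f part.$f
       done
   done
   mv part* sorted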

You could also improve the locality in the database file further by running a vacuum command after it has been created. This will move the pages around so that the pages of the table are contiguous and so are the pages of the index, rather than having the two interspersed with each other.
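
For example, from the shell (substitute the name of your actual database file; strings.db here is made up):

   sqlite3 strings.db 'VACUUM;'

A vacuum rebuilds the whole database, so on a file this size it will take a while, but it only has to be done once.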

HTH
Dennis Cote



Here is the whole procedure as a shell script that takes your raw text file as an argument:

#!/bin/sh
# Split the raw text file into 1M-line chunks named partaa, partab, ...
split -l 1000000 "$1" part

# Sort each chunk in memory, dropping duplicate lines within the chunk.
parts=`ls part*`
for f in $parts
do
   sort -o $f -u $f
done

# Merge the sorted chunks one at a time into a single file named sorted,
# again dropping any duplicates that span two chunks.
mv partaa sorted
parts=`ls part*`
for f in $parts
do
   sort -m -u -o sorted sorted $f
done


