On Fri, Jun 18, 2010 at 5:24 PM, Eric Smith <eas....@gmail.com> wrote:
> Jay A.  Kreibich wrote:
>> Yes.  Hence the "and this is the important part" comment.  Most of
>> the time when people are building billion-row files, they're building
>> a new DB by importing a static source of data.  If things go wrong,
>> you just throw out the database and try again.
>
> That's kinda like doing it all in a single big transaction, which I
> wanted to avoid.  :)
>
> But my take-away from this conversation is generally "you're SOL":
> we are backed by a tree, so we have to update existing nodes to point
> to the new ones, so we have to touch existing pages on INSERT.  Is that
> about right?

The problem is that if your incoming data is uniformly unsorted
(essentially random), then for each transaction you'll find yourself
updating approximately every page of the database.  When things are
small and fit in the cache, that's fine, but when they're large, it's
not great because you become entirely seek-dominated.

The old-school solution to this problem is an external sort (*).  The
basic idea is that you process the incoming data in memory-sized
chunks (i.e., fast), then write out sorted subfiles.  Then you process
the sorted files in parallel, merging to the output.  Since the merge
step only needs to see the next record for each input, this can make a
TREMENDOUS amount of difference for large datasets.
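Roughly, the pattern looks like the sketch below.  This is a minimal
illustration assuming newline-delimited text records and a chunk size
chosen to fit in RAM; the names (external_sort, chunk_records) are
made up here, not anything from SQLite itself:

  import heapq
  import itertools
  import os
  import tempfile

  def external_sort(input_path, output_path, chunk_records=1_000_000):
      chunk_paths = []

      # Pass 1: read memory-sized chunks, sort each in RAM,
      # and write out sorted runs.
      with open(input_path) as src:
          while True:
              chunk = list(itertools.islice(src, chunk_records))
              if not chunk:
                  break
              chunk.sort()
              fd, path = tempfile.mkstemp(text=True)
              with os.fdopen(fd, "w") as run:
                  run.writelines(chunk)
              chunk_paths.append(path)

      # Pass 2: k-way merge.  heapq.merge only needs the next record
      # from each run, so memory stays flat regardless of data size.
      runs = [open(p) for p in chunk_paths]
      try:
          with open(output_path, "w") as dst:
              dst.writelines(heapq.merge(*runs))
      finally:
          for f in runs:
              f.close()
          for p in chunk_paths:
              os.remove(p)

The payoff is that the output is written sequentially and each input
run is read sequentially, so the seeks mostly disappear.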

For certain datasets, you can just use GNU sort to accomplish this.
You can also code it up on top of SQLite by loading multiple
sub-tables in a temporary database, and then scanning them in sorted
order merging to an output table in your final database.
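As a rough sketch of that SQLite variant, assuming a simple
(key, value) schema and sub-tables named sub0..subN in an attached
staging database (all names here are illustrative, and this leans on
SQLite's own sorter for the merge step):

  import sqlite3

  def merge_subtables(staging_db, final_db, n_subtables):
      con = sqlite3.connect(final_db)
      con.execute("ATTACH DATABASE ? AS staging", (staging_db,))
      con.execute(
          "CREATE TABLE IF NOT EXISTS big "
          "(key INTEGER PRIMARY KEY, value TEXT)"
      )

      # Emit rows in key order so the final table's B-tree is written
      # mostly append-only instead of being updated at random pages.
      union = " UNION ALL ".join(
          f"SELECT key, value FROM staging.sub{i}"
          for i in range(n_subtables)
      )
      with con:
          con.execute(
              "INSERT INTO big (key, value) "
              f"SELECT key, value FROM ({union}) ORDER BY key"
          )
      con.close()

Whether this beats GNU sort depends on the dataset; the point is just
that the final INSERT sees its input in key order.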

-scott

(*) http://en.wikipedia.org/wiki/External_sorting