On Fri, Jun 18, 2010 at 5:24 PM, Eric Smith <eas....@gmail.com> wrote:
> Jay A. Kreibich wrote:
>> Yes. Hence the "and this is the important part" comment. Most of
>> the time when people are building billion-row files, they're building
>> a new DB by importing a static source of data. If things go wrong,
>> you just throw out the database and try again.
>
> That's kinda like doing it all in a single big transaction, which I
> wanted to avoid. :)
>
> But my take-away from this conversation is generally "you're SOL":
> we are backed by a tree, so we have to update existing nodes to point
> to the new ones, so we have to touch existing pages on INSERT. Is that
> about right?
The problem is that if your incoming data is uniformly unsorted (essentially random), then for each transaction you'll find yourself updating approximately every page of the database. When everything is small enough to fit in the cache, that's fine, but once it isn't, you become entirely seek-dominated.

The old-school solution to this problem is an external sort (*). The basic idea is that you process the incoming data in memory-sized chunks (i.e., fast), then write out sorted sub-files. Then you read the sorted sub-files in parallel, merging them to the output. Since the merge step only needs to see the next record from each input, this can make a TREMENDOUS amount of difference for large datasets.

For certain datasets, you can just use GNU sort to accomplish this. You can also build it on top of SQLite by loading multiple sub-tables in a temporary database, and then scanning them in sorted order, merging into an output table in your final database. A rough sketch of that approach follows the footnote.

-scott

(*) http://en.wikipedia.org/wiki/External_sorting
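Here's a minimal Python sketch of the SQLite-based version, just to make the shape of it concrete. The (key, value) schema, the chunk size, and iter_records() are stand-ins for illustration, not anything from your actual setup:

import heapq
import sqlite3

CHUNK_ROWS = 1_000_000   # tune to whatever comfortably fits in RAM

def iter_records():
    """Placeholder for the real (unsorted) stream of (key, value) tuples."""
    raise NotImplementedError

def build_database(dest_path):
    dest = sqlite3.connect(dest_path)
    dest.execute("CREATE TABLE IF NOT EXISTS data (key INTEGER PRIMARY KEY, value TEXT)")
    dest.commit()

    tmp = sqlite3.connect("")   # "" = anonymous temporary database
    n_chunks = 0

    def flush(chunk):
        # Sort the chunk in memory, then append it to its own sub-table.
        nonlocal n_chunks
        chunk.sort()
        tmp.execute(f"CREATE TABLE chunk{n_chunks} (key INTEGER, value TEXT)")
        tmp.executemany(f"INSERT INTO chunk{n_chunks} VALUES (?, ?)", chunk)
        tmp.commit()
        n_chunks += 1

    # Pass 1: load memory-sized, sorted chunks into separate sub-tables.
    chunk = []
    for rec in iter_records():
        chunk.append(rec)
        if len(chunk) >= CHUNK_ROWS:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)

    # Pass 2: merge the sorted sub-tables.  Each cursor is a sequential scan
    # (rows were inserted in key order, so rowid order == key order), and the
    # destination receives rows in ascending key order, so its b-tree grows by
    # appending pages rather than touching pages all over the file.
    cursors = [tmp.execute(f"SELECT key, value FROM chunk{i} ORDER BY rowid")
               for i in range(n_chunks)]
    dest.executemany("INSERT INTO data VALUES (?, ?)", heapq.merge(*cursors))
    dest.commit()
    tmp.close()
    dest.close()

The merge pass above runs as one transaction; if that's too big for your taste, you can commit every N rows instead -- the inserts stay in key order either way, which is the part that matters.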