Creation time would be a bit faster if you don't create the indexes until AFTER importing the data. That way SQLite doesn't have to keep each index in its proper order on every insert.
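For example, your first script could be reordered so the .imports run against bare, unindexed tables and the indexes are built in one pass at the end. A rough, untested sketch using the same foo/bar schema from your message below (I've double-quoted "group" here only because GROUP is an SQL keyword):

/* Create the tables only: */
CREATE TABLE foo(left string, right string, "group" int, flags int);
CREATE TABLE bar(left string, right string, "group" int);

/* Import the data into the unindexed tables: */
.separator ,
.import foo.csv foo
.import bar.csv bar

/* Build the indexes in one pass, after all rows are in: */
CREATE INDEX left_foo_dict ON foo ( left );
CREATE INDEX right_foo_dict ON foo ( right );
CREATE INDEX left_bar_dict ON bar ( left );
CREATE INDEX right_bar_dict ON bar ( right );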
-Andrea

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Brooks, Phil
Sent: Thursday, June 26, 2008 1:58 PM
To: sqlite-users@sqlite.org
Subject: [sqlite] sqlite DB creation performance

Hi,

I am new to the list and to using sqlite. I am looking at using sqlite for a two-way lookup table in my application. I have a home-grown file-based solution, but was hoping to find better performance and more functionality by putting the lookup tables into sqlite.

I am looking for the fastest way possible to create a large two-way lookup dictionary for an application. While I am creating the database, I don't need it to be in a consistent or usable state; I just need it to be that way when I am done. If something bad happens along the way, that is OK - in that case, the database can be recreated from scratch.

I have been playing around with options (page size, turning off journaling, different indexes) that have improved the database creation phase quite a bit, and am wondering if the group can provide additional insight.

More Detail
-----------

The tables simply provide fast lookup to return a string on the left side when presented with its right-side pair, and vice versa. Both strings (left and right) belong to a group of entries I'll call 'group'. The groups are all independent of one another.

The database contains two of these cross-reference lookup tables and a bunch of smaller stuff that helps make sense of things, but doesn't really matter a lot in terms of size and performance. The two dictionary tables are pretty big: the test case I am using has one table with 43 million entries and another table with 7 million entries.

To figure out the performance of this system under sqlite, I am just putting the two dictionaries into a sqlite database that looks like this:

First Try
---------

/* Create the tables: */
CREATE TABLE foo(left string, right string, group int, flags int);
CREATE TABLE bar(left string, right string, group int);

/* Create the indices: */
CREATE INDEX left_foo_dict on foo ( left );
CREATE INDEX right_foo_dict on foo ( right );
CREATE INDEX left_bar_dict on bar( left );
CREATE INDEX right_bar_dict on bar( right );

/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar

The strings in the data range in size from about 2 characters up to about 60, but they are, theoretically, unlimited in size. My ASCII input csv files are about 5GB for foo.csv and about 500MB for bar.csv. My old database format requires about 7GB to store the database and it takes about 30-40 minutes to create.

So I ran the above script:

$ time sqlite3 db.sql < script1

real    2h26m47.68s
user    49m59.09s
sys     4m47.75s

The resulting database file is about 14GB. OK, that is a starting point: 5-6x the creation time and twice the space of my original format. On the bright side, opening the sqlite database is vastly faster than my old format (instant vs. ~16 min), and lookups are about the same, maybe slightly faster in some cases.
Second Try
----------

I looked at the docs and found a few interesting pragmas. Since the csv data is naturally ordered on the group field, I also decided to see if creating a dual key would help in any way:

/* First I tried some pragmas that looked promising */
PRAGMA journal_mode = OFF;
PRAGMA legacy_file_format = OFF;
PRAGMA page_size = 32768;

/* Create the tables: */
CREATE TABLE foo(left string, right string, group int, flags int);
CREATE TABLE bar( left string, right string, group int );

/* Next, I tried adding an ascending key on group */
CREATE INDEX left_foo_dict on foo ( group asc, left );
CREATE INDEX right_foo_dict on foo ( group asc, right );
CREATE INDEX left_bar_dict on bar( group asc, left );
CREATE INDEX right_bar_dict on bar( group asc, right );

/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar

There was a slight improvement in size - down to about 13GB now. Run time improved a lot - down to about 1 hour - though I don't really know which of the things I tried was responsible for the improvement. I'll try them one at a time at some point to see which actually helped.

Now the Question
----------------

Can anyone suggest additional things I might try to:

1) reduce db creation time, or
2) reduce the size of the file?

I could actually live with a file larger than 7GB if I could beat the 30-40 minute runtime of my original solution by a reasonable margin.

Also, I am assuming that importing a .csv file using .import will be similar in speed to creating the database through the C++ interface using the sqlite3_prepare/sqlite3_bind/sqlite3_step/sqlite3_reset calls -- is that a valid assumption?

Phil

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users