Hi, I am new to the list and to using SQLite. I am looking at
using SQLite for a two-way lookup table in my application.
I have a home-grown, file-based solution, but was hoping to
find better performance and more functionality by putting
the lookup tables into SQLite.
 
I am looking for the fastest way possible to create a large
two-way lookup dictionary for an application. While I am
creating the database, I don't need it to be in a consistent
or usable state; I just need it to be that way when I am done.
If something bad happens along the way, that is OK: the
database can simply be recreated from scratch. I have been
playing around with options (page size, turning off journaling,
different indexes) that have improved the database creation
phase quite a bit, and I am wondering if the group can provide
additional insight.
 
More Detail
-----------
The tables simply provide a fast lookup: given a right-side
string, return its left-side pair, and vice versa. Both strings
(left and right) belong to a group of entries I'll call a
'group'. The groups are all independent of one another. The
database contains two of these cross-reference lookup tables,
plus a bunch of smaller stuff that helps make sense of things
but doesn't really matter much in terms of size and performance.
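
To make the access pattern concrete, here is a rough sketch of the
two lookup directions through the C/C++ API, using the foo table and
column names from the schema below (error handling is abbreviated,
and the group and flags columns are left out for simplicity):

#include <sqlite3.h>
#include <optional>
#include <string>

/* Sketch only: look up the partner string in foo.  If by_left is
 * true, return foo.right for the given foo.left value; otherwise
 * return foo.left for the given foo.right value. */
std::optional<std::string> lookup(sqlite3 *db, const std::string &key,
                                  bool by_left)
{
    const char *sql = by_left
        ? "SELECT right FROM foo WHERE left = ?"
        : "SELECT left FROM foo WHERE right = ?";

    sqlite3_stmt *stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
        return std::nullopt;

    sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);

    std::optional<std::string> result;
    if (sqlite3_step(stmt) == SQLITE_ROW)
        result = reinterpret_cast<const char *>(sqlite3_column_text(stmt, 0));

    sqlite3_finalize(stmt);
    return result;
}

(In practice I would prepare the two statements once and reuse them
with sqlite3_reset(); this just shows what a lookup does.)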
 
The two dictionary tables are pretty big -- the 
testcase I am using has one table with 43 million entries 
and another table with 7 million entries.
 
To figure out the performance of this system under SQLite,
I am just putting the two dictionaries into an SQLite
database that looks like this:
 
First Try
---------
/* Create the tables ("group" is quoted because it is a reserved word): */
CREATE TABLE foo(left TEXT, right TEXT, "group" INTEGER, flags INTEGER);
CREATE TABLE bar(left TEXT, right TEXT, "group" INTEGER);

/* Create the indices: */
CREATE INDEX left_foo_dict ON foo(left);
CREATE INDEX right_foo_dict ON foo(right);
CREATE INDEX left_bar_dict ON bar(left);
CREATE INDEX right_bar_dict ON bar(right);

/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
 
The strings in the data range in size from about 2 characters 
up to about 60, but they are, theoretically, unlimited in size.

My ASCII input CSV files are about 5GB for foo.csv and about
500MB for bar.csv.  My old database format requires about 7GB
to store the database, and it takes about 30-40 minutes to create.
 
So I ran the above script:
 
$ time sqlite3 db.sql < script1

real    2h26m47.68s
user    49m59.09s
sys     4m47.75s

The resulting database file is about 14GB.  OK, that is a starting
point: 5-6x the creation time and twice the space of my original
format.  On the bright side, opening the SQLite database is vastly
faster than my old format (instant vs. ~16 minutes), and lookups
are about the same speed, maybe slightly faster in some cases.
 
Second Try
----------
I looked at the docs and found a few interesting pragmas.  Since
the CSV data is naturally ordered on the group field, I decided
to see if creating a dual key would help in any way:
 
/* First I tried some pragmas that looked promising: */
PRAGMA journal_mode = OFF;
PRAGMA legacy_file_format = OFF;
PRAGMA page_size = 32768;

/* Create the tables: */
CREATE TABLE foo(left TEXT, right TEXT, "group" INTEGER, flags INTEGER);
CREATE TABLE bar(left TEXT, right TEXT, "group" INTEGER);

/* Next, I tried adding an ascending key on "group": */
CREATE INDEX left_foo_dict ON foo("group" ASC, left);
CREATE INDEX right_foo_dict ON foo("group" ASC, right);
CREATE INDEX left_bar_dict ON bar("group" ASC, left);
CREATE INDEX right_bar_dict ON bar("group" ASC, right);

/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
 
A slight improvement in size: down to about 13GB now.  Run time
improved a lot, down to about 1 hour, though I don't really know
which of the changes was responsible for the improvement.  I'll
try them one at a time at some point to see which actually helped.
 
Now the Question
----------------
Can anyone suggest additional things I might try to:

    1) reduce db creation time
- or -
    2) reduce the size of the file?

I could actually live with a file larger than 7GB if I could beat
the 30-40 minute runtime of my original solution by a reasonable
margin.  Also, I am assuming that importing a .csv file using
.import will be similar in speed to creating the database through
the C++ interface with the
sqlite3_prepare/sqlite3_bind/sqlite3_step/sqlite3_reset calls.
Is that a valid assumption?
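
For reference, what I have in mind on the C++ side is roughly the
sketch below (the Row struct and the single big transaction are
just illustration assumptions; error handling is omitted):

#include <sqlite3.h>
#include <string>
#include <vector>

/* Illustration only: one row of foo.csv after parsing. */
struct Row { std::string left, right; int group, flags; };

/* Sketch of the prepare/bind/step/reset loop: one prepared INSERT,
 * reused for every row, inside a single transaction. */
void import_rows(sqlite3 *db, const std::vector<Row> &rows)
{
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);

    sqlite3_stmt *ins = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO foo(left, right, \"group\", flags) VALUES(?,?,?,?)",
        -1, &ins, nullptr);

    for (const Row &r : rows) {
        sqlite3_bind_text(ins, 1, r.left.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(ins, 2, r.right.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_int(ins, 3, r.group);
        sqlite3_bind_int(ins, 4, r.flags);
        sqlite3_step(ins);
        sqlite3_reset(ins);   /* reuse the same prepared statement */
    }

    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}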
 
Phil