Hi.
I have a database with a table having a text column that averages 7.1
characters per row; there are 11 million rows in this table. In one
version of the source file that is used to populate this table, the file
is laid out such that the column data is populated in sorted order; in
another version it is not sorted. In the sorted version, creating the
index takes about a minute; in the unsorted version it takes over 8
hours (I killed it before it finished). I've tried this with a couple
different versions of SQLite (3.6.1 and 3.6.7) and with various PRAGMA
options (cache_size up to 1M, synchronous OFF, etc). I've tried this on
linux 2.6.24 and Darwin 9.6 (MacOS X 10.5.6), and Windows XP with the
same result. Interestingly, on Vista, it's much faster, perhaps 10
minutes to create the index. I suspect this is because of it's superior
disk caching. By increasing the cache_size to 1M, I can get it to index
in about 2.5 minutes, but the process consumes about 600 MB of RAM, and
decreasing the cache_size after doesn't cause sqlite to release all of
the memory it has used.
This TODO item is of course interesting:
* Develop a new sort implementation that does much less disk seeking.
Use to improve indexing performance on large tables.
Is there any thought as to exactly how/when this might happen?
Any other ideas?
Thanks a bunch!
-c
Accessions is the table of interest:
CREATE TABLE AccessionToGoID (accession INT, goid INT);
CREATE TABLE Accessions (id INT IDENTITY, accession VARCHAR(30) NOT
NULL, taxonomy INT);
CREATE TABLE GoID (id INT IDENTITY, qualifier VARCHAR(20), goid INT,
evidence TINYINT, aspect CHAR(1), source INT);
CREATE TABLE Location (url TEXT);
CREATE TABLE Sources (id INT, name VARCHAR(20));
CREATE INDEX ACESSIONS_STRING ON ACCESSIONS("accession");
CREATE INDEX ACESSIONS_TO_GOID ON ACCESSIONTOGOID("accession");
CREATE INDEX GOID_ID ON GOID("id");
Page size in bytes.................... 1024
Pages in the whole file (measured).... 2201999
Pages in the whole file (calculated).. 2201998
Pages that store data................. 2201998 100.000%
Pages on the freelist (per header).... 0 0.0%
Pages on the freelist (calculated).... 1 0.0%
Pages of auto-vacuum overhead......... 0 0.0%
Number of tables in the database...... 6
Number of indices..................... 0
Number of named indices............... 0
Automatically generated indices....... 0
Size of the file in bytes............. 2254846976
Bytes of user payload stored.......... 1406889293 62.4%
*** Page counts for all tables with their indices ********************
ACCESSIONTOGOID....................... 1159789 52.7%
GOID.................................. 775451 35.2%
ACCESSIONS............................ 266755 12.1%
LOCATION.............................. 1 0.0%
SOURCES............................... 1 0.0%
SQLITE_MASTER......................... 1 0.0%
*** All tables *******************************************************
Percentage of total database.......... 100.000%
Number of entries..................... 113096707
Bytes of storage consumed............. -2040121344
Bytes of payload...................... 1406889777 -69.0%
Average payload per entry............. 12.44
Average unused bytes per entry........ 0.21
Average fanout........................ 90.00
Fragmentation......................... 7.8%
Maximum payload per entry............. 135
Entries that use overflow............. 0 0.0%
Index pages used...................... 24343
Primary pages used.................... 2177655
Overflow pages used................... 0
Total pages used...................... 2201998
Unused bytes on index pages........... 2982024 12.0%
Unused bytes on primary pages......... 20271723 -1.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 23253747 -1.1%
*** Table ACCESSIONTOGOID ********************************************
Percentage of total database.......... 52.7%
Number of entries..................... 68037552
Bytes of storage consumed............. 1187623936
Bytes of payload...................... 678849900 57.2%
Average payload per entry............. 9.98
Average unused bytes per entry........ 0.20
Average fanout........................ 90.00
Fragmentation......................... 6.2%
Maximum payload per entry............. 11
Entries that use overflow............. 0 0.0%
Index pages used...................... 12856
Primary pages used.................... 1146933
Overflow pages used................... 0
Total pages used...................... 1159789
Unused bytes on index pages........... 1572145 11.9%
Unused bytes on primary pages......... 12284825 1.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 13856970 1.2%
*** Table ACCESSIONS *************************************************
Percentage of total database.......... 12.1%
Number of entries..................... 11040350
Bytes of storage consumed............. 273157120
Bytes of payload...................... 189809909 69.5%
Average payload per entry............. 17.19
Average unused bytes per entry........ 0.31
Average fanout........................ 91.00
Fragmentation......................... 15.7%
Maximum payload per entry............. 41
Entries that use overflow............. 0 0.0%
Index pages used...................... 2916
Primary pages used.................... 263839
Overflow pages used................... 0
Total pages used...................... 266755
Unused bytes on index pages........... 360811 12.1%
Unused bytes on primary pages......... 3081726 1.1%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 3442537 1.3%
*** Table GOID *******************************************************
Percentage of total database.......... 35.2%
Number of entries..................... 34018776
Bytes of storage consumed............. 794061824
Bytes of payload...................... 538229202 67.8%
Average payload per entry............. 15.82
Average unused bytes per entry........ 0.17
Average fanout........................ 90.00
Fragmentation......................... 7.6%
Maximum payload per entry............. 34
Entries that use overflow............. 0 0.0%
Index pages used...................... 8571
Primary pages used.................... 766880
Overflow pages used................... 0
Total pages used...................... 775451
Unused bytes on index pages........... 1049068 12.0%
Unused bytes on primary pages......... 4903107 0.62%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 5952175 0.75%
--
Christopher Mason Proteome Software (503) 244-6027
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users