Can someone explain to me why, after populating a new table
in a new database file, with no indexes, that the "overhead"
is around 50%?
That's the question.. to understand my justification for the
figure, I've explained it below. Not a complaint, just would
like to understand why and how I may possibly help the
situation.
Thanks for any help!
// CHRIS
..
..
****************
* *
* BACKGROUND *
* *
****************
I've written a program called Power-Grab that downloads binary
files from usenet newsgroups. I keep all my data in basically
flat files of custom-serialized objects. For the most part,
it's simple and suffices.
One problem I face is the sorting of headers (by Subject:) after
I scan a newsgroup. I wrote my own implementation of QuickSort
that sorts objects that actually point to records on disk; so it's
basically a "disk-based" QuickSort, with some very simple caching
to help performance.
I've always wanted to find a good relatively fast database engine
that I could embed into my program without a lot of overhead or cost.
Offerings I found were either free and not useful or useful and
expensive (eg, CodeBase.) Looks like SQLite is exactly what I've
been waiting for (over 4 years now!)
I just started playing around with it, and I'm trying to determine
if I can really use it (performance-wise.)
For Reference:
System: Windows 2000 Pro (With SP4), P4 @ 2.4 GHz, 1GB RAM
Language: Visual C++ 7.0
SQLite: Version 2.8.13, using sqlite.dll from www.sqlite.org
****************
* *
* THE PROBLEM *
* *
****************
I have a data file that represents the real data from one of the larger
newsgroups:
Filename: alt.binaries.cd.image.xbox.XOV
Size: 376,878,446
Records: 1,932,225
The file is basically the raw output from an "XOVER" command to a news
server.
It consists of records like this:
{ int FLAGS, int ID, int LINES, int BYTES, char[] SUBJECT, char[] SENDER,
char[] MSGID }
The goal is relatively simple:
(1) Create a database file
(2) Dump all records to the database file
(3) Create an index
****************
* *
* FIRST TRY *
* *
****************
< Created Database File "XOVER.DB" >
CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT, SUBJECT TEXT,
SENDER TEXT, MSGID TEXT);
PRAGMA synchronous = OFF;
(repeat 1,932,225 times)
INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID)
VALUES (%d,%d,%d,%d,'%s','%s','%s');
CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT);
****************
* *
* RESULTS *
* *
****************
First, reading the XOV data file and doing all the formatting but NOT
calling SQLite,
it took between 48 and 51 seconds (I tested 3 times.) So we can safely
subtract that
from the run-time to get an estimate of how long it takes SQLite to
populate the table.
INSERT 1,932,225 RECORDS
TIME: 3 Hours, 39 Minutes, 24 Seconds (13,164 Sec is about 146.78
INSERT/sec)
CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT)
TIME: 10 Minutes, 24 Seconds
SELECT * FROM XOVER; (using Callback function that only does
"++nRecords;")
TIME: 4 Minutes, 31 Seconds
The SQLite database file was a bit larget than I expected, after all said
an done:
728,733,696 XOVER.DB (Before Index Built)
980,514,816 XOVER.DB (After Index Built)
Compared to the original file:
376,878,446 alt.binaries.cd.image.xbox.XOV
Not excited about the time to create the table;
about 3.5 hours to add about 2 million records.
****************
* *
* SECOND TRY *
* *
****************
Same as the FIRST TRY, but wrapped in a transaction:
< Created Database File "XOVER.DB" >
CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT, SUBJECT TEXT,
SENDER TEXT, MSGID TEXT);
PRAGMA synchronous = OFF;
BEGIN;
(repeat 1,932,225 times)
INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID)
VALUES (%d,%d,%d,%d,'%s','%s','%s');
COMMIT;
BEGIN;
CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT);
COMMIT;
****************
* *
* RESULTS *
* *
****************
Great- Much better performance!
INSERT 1,932,225 RECORDS
TIME: 3 Minutes, 49 Seconds (229 Sec ==> 8,437 INSERT/sec)
CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT)
TIME: 7 Minutes, 14 Seconds
SELECT * FROM XOVER; (using Callback function that only does
"++nRecords;")
TIME: 2 Minutes, 36 Seconds
****************
* *
* CONCLUSION *
* *
****************
As I said, the SQLite database file was a bit larger than I expected,
can anyone shed some light on this?
The original file:
376,878,446 alt.binaries.cd.image.xbox.XOV
Is composed mainly of text. I realize that SQLite stores everything
as text (eg, numbers are in ascii text format). But I didn't expect
such a large increase in size.
Some rough calculations..
The original (XOV) file consists of 1,932,225 records of the following
format:
{
DWORD dwFlags ; Flags
DWORD dwID ; News Server Generated Message Number
DWORD dwLines ; Number of Lines In Article
DWORD dwBytes ; Number of Bytes in Article
DWORD dwSL_Subj (=J) ; String Length Of Subject (No NULL Termination)
DWORD dwSL_Sender (=K) ; String Length Of Sender (No NULL Termination)
DWORD dwSL_MSGID (=L) ; String Length Of Message-ID (No NULL
Termination)
CHAR[J] szSubject ; Article Subject
CHAR[K] szSender ; Sender Name
CHAR[L] szMSGID ; Value of "Message-ID:" Header
}
The file starts with two DWORDS, which give the MIN(dwID) and MAX(dwID) of
all the records in the file. With a file size of 376,878,446, we take
away
8 bytes for the file header and have 376,878,438 bytes of record data.
There are 1,932,225 records so, counting the 7 DWORDS in each record,
there
are 28 bytes of binary DWORD data per record. That's 1,932,225 * 28 is
54,102,300 bytes of DWORD data... and the rest, 322,776,138 bytes, is
TEXT.
So from the SQLite XOVER.DB file, WITHOUT the index built, we have a file
size of 728,733,696 bytes. Take away the known TEXT data of 322,776,138
and we have 405,957,558 bytes. I am only inserting the first four DWORD
values into the SQLite table, so expressed as text, it uses up to
10 characters each - or 40 characters. So that's 40 characters times
1,932,225 records and we get at most 77,289,000 bytes of "numeric" text.
So the original SQLite DB file size of 728,733,696 minutes the known data
sizes (322,776,138 bytes for text and at most 77,289,000 bytes for numeric
values as text) leaves leaves 328,668,558 bytes. I'm not exactly sure
what all that space is used for; I assume B-Tree data/nodes and the like.
It does seem like quite a bit, though -- of the total file size, the
data takes at most ~55% and the SQLite meta-data takes at least ~45% (in
my case, at least.)
Over 45% overhead? Actually, I would say 50% or more.. I know that
most of the DWORD values are not large.. dwFlags is always 0, 1 or 3.
dwLines is almost guaranteed to be 5 digits or less, and bytes is almost
guaranteed to be 7 digits or less. The ID is usually 6 to 8 digits.
//
// THE END!
//
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]