Hi Mark,

have you tried to do a VACUUM on the database?
It helps a lot when it comes to the 'read ahead'
feature of the database.

Michael


Mark Drago wrote:
Hello,

I'm writing a web cache and I want to use SQLite to store the log of all
of the accesses made through the web cache.  The idea is to install this
web cache in large institutions (1000-5000 workstations).  The log
database can grow in size very quickly and can reach in to the gigabytes
after just a few days.

Writing to the database is speedy enough that I haven't seen much of a
problem.  I collect the data for 1000 web requests and then insert them
all in a single transaction using a prepared statement.  This works
rather well.

The problem that I'm encountering has to do with generating reports on
the data in the log database.  SQLite is showing good performance on
some simple queries, but that is not the case once something more
advanced is involved, like an aggregate function for example.  More
over, once the SQLite file is cached in memory it is really quick.
However, I can't count on this file being cached at all when a user goes
to run the report.  So, I've been clearing my file cache before running
a test, and then running the same test again now that everything has
been loaded in to the cache.  Like I said, for most cases SQLite is
fine, but here is one example where it doesn't fare as well.

The system that I'm running these tests on is a P4 2.8GHz HT with 1 GB
of RAM running Fedora Core 5 and using SQLite version 3.3.3 (being as
that is what comes with FC5).  I'm doing my tests with a database that
is 732M in size and contains 1,280,881 records (the DB schema is
included below).

I clear the file cache by running the following command. I wait until it
consumes all of memory and then I kill it:
perl -e '@f[0..100000000]=0'

I'm running the tests by running the following script:
#!/bin/bash
echo "$1;" | sqlite3 log.db > /dev/null

The query I'm running is the following:
select count(host), host from log group by host;

The results include the first time the query is run (when the file is
not cached) and then the times of a few runs after that (when the file
is cached).

SQLite: 221.9s, 1.6s, 1.6s, 1.6s
 MySQL:   2.2s, 1.8s, 1.8s, 1.8s

The MySQL tests were done with the following script:
#!/bin/bash
mysql -u root --database=log -e "$1" > /dev/null

It is apparent that SQLite is reading the entire database off of the
disk and MySQL somehow is not.  The MySQL query cache is not in use on
this machine and MySQL does not claim very much memory for itself before
the test is conducted (maybe 30M).

I've tried looking in to the output from 'explain' to see if SQLite was
using the index that I have on the 'host' column, but I don't think it
is.  The output from 'explain' is included below.  Note that the
'explain' output is from a different machine which is running SQLite
3.3.5 compiled from source as the SQLite on FC5 kept Segfaulting when I
tried to use 'explain'.

Any information or ideas on how to speed up this query are greatly
appreciated.  The only un-implemented idea I have right now is to remove
some of the duplicated data from the schema in an attempt to reduce the
size of the average row in the table.  In some cases I can store just an
integer where I'm storing both the integer and a descriptive string
(category_name and category_no for example).  Some of the other
information in the schema holds data about things that are internal to
the web cache (profile*, ad*, etc.).

Thank you very much for any ideas,
Mark.

TABLE SCHEMA:
CREATE TABLE log(
log_no integer primary key,
add_dte datetime,
profile_name varchar(255),
workstation_ip integer,
workstation_ip_txt varchar(20),
verdict integer,
verdict_reason varchar(255),
category_name varchar(80),
category_no integer,
set_user_name varchar(255),
profile_zone varchar(40),
profile_zone_no integer,
author_user_name varchar(255),
workstation_name varchar(255),
workstation_group_name varchar(255),
profile_association varchar(255),
profile_association_no integer,
protocol varchar(40),
connection_type varchar(255),
connection_type_no integer,
host varchar(255),
url text,
ad_username varchar(255),
ad_groups text,
ad_domain varchar(255),
ad_workstation_name varchar(255),
ad_last_update_dte datetime);

INDEXES:
CREATE INDEX add_dte ON log (add_dte);
CREATE INDEX profile_name ON log(profile_name);
CREATE INDEX workstation_ip ON log(workstation_ip);
CREATE INDEX verdict ON log (verdict);
CREATE INDEX research_zone_no ON log(research_zone_no);
CREATE INDEX profile_zone_no ON log(profile_zone_no);
CREATE INDEX workstation_name ON log(workstation_name);
CREATE INDEX workstation_group_name ON log(workstation_group_name);
CREATE INDEX profile_association_no ON log(profile_association_no);
CREATE INDEX connection_type_no ON log(connection_type_no);
CREATE INDEX host ON log(host);
CREATE INDEX ad_username on log(ad_username);
CREATE INDEX ad_domain on log(ad_domain);
CREATE INDEX ad_workstation_name on log(ad_workstation_name);

EXPLAIN OUTPUT:
sqlite> explain select count(host), host from log group by host;
0|Noop|0|0|
1|MemInt|0|3|
2|MemInt|0|2|
3|Goto|0|16|
4|MemInt|1|3|
5|Return|0|0|
6|IfMemPos|2|8|
7|Return|0|0|
8|AggFinal|0|1|count(1)
9|MemLoad|0|0|
10|MemLoad|1|0|
11|Callback|2|0|
12|Return|0|0|
13|MemNull|1|0|
14|MemNull|0|0|
15|Return|0|0|
16|Gosub|0|13|
17|Goto|0|41|
18|Integer|0|0|
19|OpenRead|2|20|keyinfo(1,BINARY)
20|SetNumColumns|2|2|
21|Rewind|2|38|
22|RowKey|2|0|
23|IdxIsNull|0|37|
24|Column|2|0|
25|MemStore|5|0|
26|MemLoad|4|0|
27|Eq|512|32|collseq(BINARY)
28|MemMove|4|5|
29|Gosub|0|6|
30|IfMemPos|3|40|
31|Gosub|0|13|
32|Column|2|0|
33|AggStep|0|1|count(1)
34|Column|2|0|
35|MemStore|1|1|
36|MemInt|1|2|
37|Next|2|22|
38|Close|2|0|
39|Gosub|0|6|
40|Halt|0|0|
41|Transaction|0|0|
42|VerifyCookie|0|19|
43|Goto|0|18|
44|Noop|0|0|

Reply via email to