I'm using SQLite to do some analysis on very large web application log
files - approaching 3 million lines per day. And what a wonderful tool
it is! It has saved me from writing lots of custom scripts.

I have a Perl script that parses an rsynced copy of this huge log file,
munges and converts each line, and adds the new entries to my database.

The schema currently looks something like this - some of the columns
are there as an aid to development/debugging so I can see what's
going on:

CREATE TABLE log (
  ip VARCHAR,
  datetimestring VARCHAR COLLATE NOCASE,
  uid VARCHAR COLLATE NOCASE,
  timestamp INTEGER,
  operation VARCHAR COLLATE NOCASE,
  target VARCHAR,
  response INTEGER,
  client VARCHAR COLLATE NOCASE,
  completion VARCHAR COLLATE NOCASE,
  port INTEGER,
  connections INTEGER,
  date INTEGER,
  hour INTEGER,
  min INTEGER,
  sec INTEGER
);

I've been experimenting with indices, trying to find the point where
the cost in database size and insert speed is justified by how quickly
I need the query results. For testing purposes, I created indexes
for all the columns used in the selects below.
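
For example, something along these lines (the index names here are
just illustrative):

-- one single-column index per column used in the where clauses below
CREATE INDEX idx_log_timestamp ON log(timestamp);
CREATE INDEX idx_log_port ON log(port);
CREATE INDEX idx_log_uid ON log(uid);
CREATE INDEX idx_log_response ON log(response);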

Most of the queries I've done so far have been pretty straightforward,
and they've worked very well. But now I need to do one that's taking
too long, and there's probably a better way than the one I'm using.

The problem is that I need to produce information for the timeslots
that are missing as well as those that are there.

For example, I need to be able to evaluate the effect of different
load balancers, so I want to compare the number of transactions on
each of the ports. I need to do this on a second-by-second basis so
I can see the full detail and not an average.

I thought one way to do this would be with a self-join, because there
are so many transactions that I'm pretty sure every second will be
present in the table. I also considered creating a second table,
derived from the first whenever it's updated, that would have a Unix
epoch entry for each second within the log file.

Here's what I tried for the first approach:

select distinct
  t.timestamp,
  t.port,
  count(u.timestamp)
from log t
left join log u
  on t.timestamp = u.timestamp
  and t.timestamp >= 1246266000
  and u.uid != '-'
  and (u.response = 207 or u.response = 200)
  and u.port = 8444;

The purpose of the join is to give me all of the timestamps - one for
each second, even those seconds that had no activity on that port -
and then count the matching activity against them. But this query is
taking a very, very, very long time. I'm probably making some newbie
mistake, because that's what I am!
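
In hindsight, I suspect the missing GROUP BY is part of my problem.
Here's a variant I've been sketching that pulls out the distinct
seconds first and then counts the matching rows for one port (just a
sketch - I haven't verified it's any faster):

select
  s.timestamp,
  count(u.timestamp)
from (select distinct timestamp
      from log
      where timestamp >= 1246266000) s
left join log u
  on u.timestamp = s.timestamp
  and u.uid != '-'
  and (u.response = 207 or u.response = 200)
  and u.port = 8444
group by s.timestamp
order by s.timestamp;

Since count(u.timestamp) ignores NULLs, seconds with no matching
activity should come back as 0.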

Is this a valid approach?  Is there a better solution?

I am able to get results extremely quickly by using something like
this:

select
  date,
  hour,
  min,
  sec,
  count(port)
from log
where uid != '-'
  and (response = 207 or response = 200)
  and port = 8444
group by timestamp
order by timestamp;

but the problem is that I also need to know when the ports are not
busy and there are no log entries.  Here are the last few lines of the
result: 

2009-06-29|17|42|0|2
2009-06-29|17|42|7|1
2009-06-29|17|42|8|4

What I need to have in the results are entries for seconds 1-6 with a
value of 0, like this:

2009-06-29|17|42|0|2
2009-06-29|17|42|1|0
2009-06-29|17|42|2|0
2009-06-29|17|42|3|0
2009-06-29|17|42|4|0
2009-06-29|17|42|5|0
2009-06-29|17|42|6|0
2009-06-29|17|42|7|1
2009-06-29|17|42|8|4

Am I on the right path with the self-join?  I also considered adding a
new table to the database that contained all the possible time values
in epoch seconds - I could easily do so with the script that creates
the database and adds the data.  I could then left join this table
with the data table.  But would doing so be substantially faster than
the self-join?
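
To make that concrete, here's a sketch of what I have in mind (the
seconds table name is just a placeholder):

-- filled by the load script with one row per epoch second in the
-- log's time range
CREATE TABLE seconds (timestamp INTEGER PRIMARY KEY);

select
  s.timestamp,
  count(l.timestamp)
from seconds s
left join log l
  on l.timestamp = s.timestamp
  and l.uid != '-'
  and (l.response = 207 or l.response = 200)
  and l.port = 8444
group by s.timestamp
order by s.timestamp;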

The examples above are for only one port - I need to do the same for
16 ports. I wrote a Perl script to run the queries second by second,
but it took about 1 minute to run the 16 per-port queries for a single
1-second interval - far too slow to produce results for 24-hour periods.
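
One thing I've been wondering is whether a single query grouped by
both timestamp and port could cover all 16 ports in one pass instead
of 16 separate queries, something like (again, just a sketch):

select
  timestamp,
  port,
  count(*)
from log
where uid != '-'
  and (response = 207 or response = 200)
group by timestamp, port
order by timestamp, port;

That still wouldn't produce the zero rows by itself, but maybe
combined with the seconds table (and a small table of the 16 port
numbers to join against) it would.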

Thanks for any advice!  I learn a lot from this list.
Leo