Seems to me you might need master and detail tables:
one with the dates/timestamps and one with the data.

CREATE TABLE 'log_time' (
 id INTEGER PRIMARY KEY,
 datetimestring VARCHAR COLLATE NOCASE,
 timestamp INTEGER,
 date INTEGER,
 hour INTEGER,
 min INTEGER,
 sec INTEGER
);

CREATE TABLE 'log_data' (
 id INTEGER PRIMARY KEY,
 ip VARCHAR,
 uid VARCHAR COLLATE NOCASE,
 operation VARCHAR COLLATE NOCASE,
 target VARCHAR,
 response INTEGER,
 client VARCHAR COLLATE NOCASE,
 completion VARCHAR COLLATE NOCASE,
 port INTEGER,
 connections INTEGER
);

Then your queries can be LEFT JOINs based upon "id".
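
For example - untested, and assuming log_time gets a row for every second you
want reported, with each log_data row's id (or a separate time_id column, if
the ids can't line up one-to-one) pointing at its log_time row - a per-second
count for one port could look roughly like:

SELECT t.date, t.hour, t.min, t.sec, count(d.id)
FROM log_time t
LEFT JOIN log_data d
  ON d.id = t.id
 AND d.uid != '-'
 AND (d.response = 200 OR d.response = 207)
 AND d.port = 8444
GROUP BY t.timestamp
ORDER BY t.timestamp;

The LEFT JOIN is what gets you the zero rows: seconds with no matching
log_data rows still come back, just with a count of 0.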


Also, you might look into simplifying all of the datetimestring/timestamp and
date/hour/min/sec columns down to just one or two fields, and using the SQLite
date and time functions:

http://www.sqlite.org/lang_datefunc.html
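
For instance - just a sketch against your existing log table, and assuming
the timestamp column holds unix epoch seconds (which it looks like it does
from your 1246266000 example) - you can rebuild the date/hour/min/sec pieces
at query time instead of storing them:

SELECT date(timestamp, 'unixepoch')           AS date,
       strftime('%H', timestamp, 'unixepoch') AS hour,
       strftime('%M', timestamp, 'unixepoch') AS min,
       strftime('%S', timestamp, 'unixepoch') AS sec,
       count(*)
FROM log
WHERE port = 8444
GROUP BY timestamp;

(Add the 'localtime' modifier after 'unixepoch' if you want local time
instead of UTC.)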




--- On Wed, 7/1/09, yaconsult <sql...@unixguy.fastmail.fm> wrote:

> From: yaconsult <sql...@unixguy.fastmail.fm>
> Subject: [sqlite]  3 million rows, query speeds, and returning zero for rows 
> that don't exist
> To: sqlite-users@sqlite.org
> Date: Wednesday, July 1, 2009, 8:00 PM
> 
> I'm using sqlite to do some analysis on very large web
> application log
> files - approaching 3 million lines per day.  And what
> a wonderful tool
> it is!  It has saved me from writing lots of custom
> scripts.
> 
> I have a perl script that parses an rsynced copy of this
> huge log file,
> munges, converts, and adds new entries to my database.
> 
> The schema currently looks something like this - some of
> the columns
> are there as an aid to development/debugging so I can see
> what's
> going on:
> 
> CREATE TABLE 'log' (
>  ip VARCHAR,
>  datetimestring VARCHAR COLLATE NOCASE,
>  uid VARCHAR COLLATE NOCASE,
>  timestamp INTEGER,
>  operation VARCHAR COLLATE NOCASE,
>  target VARCHAR,
>  response INTEGER,
>  client VARCHAR COLLATE NOCASE,
>  completion VARCHAR COLLATE NOCASE,
>  port INTEGER,
>  connections INTEGER,
>  date INTEGER,
>  hour INTEGER,
>  min INTEGER,
>  sec INTEGER
> )
> 
> I've been experimenting with indices as far as when the
> tradeoff in
> size and speed of adding data is offset sufficiently by how
> fast I
> need to get the results.  For testing purposes, I
> created indexes
> for all the columns used in the selects below.
> 
> Most of the queries I've done so far have been pretty
> straightforward
> and it's worked very well.  But, now I need to do one
> that's taking
> too long.  There's probably a better way than the one
> I'm using.
> 
> The problem is that I need to produce information for the
> timeslots
> that are missing as well as those that are there.
> 
> For example, I need to be able to evaluate the effect of
> different
> load balancers, so I want to compare the numbers of
> transactions on
> all the ports.   I need to do so on a second
> by second basis so I can
> see the most detail and not an average.
> 
> I thought one way to do this would be with a self-join
> because
> there are so many transactions that I'm pretty sure that
> all seconds
> will be present in the table.  I also considered
> creating a second
> table derived from the first whenever it's updated that
> would have a
> unix epoch entry for each second within the log file. 
> 
> Here's what I tried for the first approach:
> 
> select
> distinct t.timestamp,
> t.port,
> count(u.timestamp)
> from
> log t
> left join
> log u
> on
> t.timestamp = u.timestamp
> and
> t.timestamp >= 1246266000
> and u.uid != "-"
> and (u.response = 207
>      or u.response = 200)
> and u.port = 8444;
> 
> The purpose of the join is to give me all of the timestamps
> - one for
> each second - even those seconds that had no activity on
> that port -
> and then use those for the query.  But this query is
> taking a very,
> very, very long time.  I'm probably making some newbie
> mistake,
> because that's what I am!
> 
> Is this a valid approach?  Is there a better
> solution?
> 
> I am able to get results extremely quickly by using
> something like
> this:
> 
> select
> date,
> hour,
> min,
> sec,
> count(port)
> from log
> where
> uid != "-"
> and (response = 207
>      or response = 200)
> and port = 8444
> group by
> timestamp
> order by
> timestamp;
> 
> but the problem is that I also need to know when the ports
> are not
> busy and there are no log entries.  Here are the last
> few lines of the
> result: 
> 
> 2009-06-29|17|42|0|2
> 2009-06-29|17|42|7|1
> 2009-06-29|17|42|8|4
> 
> What I need to have in the results are entries for seconds
> 1-6 with a
> value of 0, like this:
> 
> 2009-06-29|17|42|0|2
> 2009-06-29|17|42|1|0
> 2009-06-29|17|42|2|0
> 2009-06-29|17|42|3|0
> 2009-06-29|17|42|4|0
> 2009-06-29|17|42|5|0
> 2009-06-29|17|42|6|0
> 2009-06-29|17|42|7|1
> 2009-06-29|17|42|8|4
> 
> Am I on the right path with the self-join?  I also
> considered adding a
> new table to the database that contained all the possible
> time values
> in epoch seconds - I could easily do so with the script
> that creates
> the database and adds the data.  I could then left
> join this table
> with the data table.  But would doing so be
> substantially faster than
> the self-join?
> 
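I'd expect the separate table of epoch seconds to beat the self-join by a
lot here. The self-join pairs the big log table against itself, while the
seconds table stays small and each probe into log can use your timestamp/port
indexes. Very roughly (untested, and the table/column names are made up):

CREATE TABLE seconds (ts INTEGER PRIMARY KEY);
-- one row per epoch second covered by the log, filled in by your load script

SELECT s.ts, count(l.timestamp)
FROM seconds s
LEFT JOIN log l
  ON l.timestamp = s.ts
 AND l.uid != '-'
 AND (l.response = 200 OR l.response = 207)
 AND l.port = 8444
GROUP BY s.ts
ORDER BY s.ts;
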
> The examples above are for only one port - I need to do the
> same for
> 16 ports.  I wrote a perl script to do queries for
> each second, but it was
> taking about 1 minute to do queries for the 16 ports for a
> 1 second
> interval - way too slow to be used to produce results for
> 24 hour periods.
> 
> Thanks for any advice!  I learn a lot from this list.
> Leo
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
