Seems to me you might need master and detail tables - one with the dates/timestamps, one with the data:
CREATE TABLE 'log_time' (
    id INTEGER PRIMARY KEY,
    datetimestring VARCHAR COLLATE NOCASE,
    timestamp INTEGER,
    date INTEGER,
    hour INTEGER,
    min INTEGER,
    sec INTEGER
);

CREATE TABLE 'log_data' (
    id INTEGER PRIMARY KEY,
    ip VARCHAR,
    uid VARCHAR COLLATE NOCASE,
    operation VARCHAR COLLATE NOCASE,
    target VARCHAR,
    response INTEGER,
    client VARCHAR COLLATE NOCASE,
    completion VARCHAR COLLATE NOCASE,
    port INTEGER,
    connections INTEGER
);

Then your queries can be left joins based upon "id".

Also, you might take a look into simplifying all of the timestring/timestamp and hour/min/sec stuff down to just one or two fields and using the SQLite date and time functions:

http://www.sqlite.org/lang_datefunc.html
--- On Wed, 7/1/09, yaconsult <sql...@unixguy.fastmail.fm> wrote:

> From: yaconsult <sql...@unixguy.fastmail.fm>
> Subject: [sqlite] 3 million rows, query speeds, and returning zero for rows that don't exist
> To: sqlite-users@sqlite.org
> Date: Wednesday, July 1, 2009, 8:00 PM
>
> I'm using sqlite to do some analysis on very large web application log
> files - approaching 3 million lines per day. And what a wonderful tool
> it is! It has saved me from writing lots of custom scripts.
>
> I have a perl script that parses an rsynced copy of this huge log file,
> munges, converts, and adds new entries to my database.
>
> The schema currently looks something like this - some of the columns
> are there as an aid to development/debugging so I can see what's
> going on:
>
> CREATE TABLE 'log' (
>     ip VARCHAR,
>     datetimestring VARCHAR COLLATE NOCASE,
>     uid VARCHAR COLLATE NOCASE,
>     timestamp INTEGER,
>     operation VARCHAR COLLATE NOCASE,
>     target VARCHAR,
>     response INTEGER,
>     client VARCHAR COLLATE NOCASE,
>     completion VARCHAR COLLATE NOCASE,
>     port INTEGER,
>     connections INTEGER,
>     date INTEGER,
>     hour INTEGER,
>     min INTEGER,
>     sec INTEGER
> )
>
> I've been experimenting with indices as far as when the tradeoff in
> size and speed of adding data is offset sufficiently by how fast I
> need to get the results. For testing purposes, I created indexes
> for all the columns used in the selects below.
>
> Most of the queries I've done so far have been pretty straightforward
> and it's worked very well. But now I need to do one that's taking
> too long. There's probably a better way than the one I'm using.
>
> The problem is that I need to produce information for the timeslots
> that are missing as well as those that are there.
>
> For example, I need to be able to evaluate the effect of different
> load balancers, so I want to compare the numbers of transactions on
> all the ports. I need to do so on a second-by-second basis so I can
> see the most detail and not an average.
>
> I thought one way to do this would be with a self-join, because there
> are so many transactions that I'm pretty sure that all seconds will be
> present in the table. I also considered creating a second table,
> derived from the first whenever it's updated, that would have a unix
> epoch entry for each second within the log file.
>
> Here's what I tried for the first approach:
>
> select distinct
>     t.timestamp,
>     t.port,
>     count(u.timestamp)
> from log t
> left join log u
>     on t.timestamp = u.timestamp
>     and t.timestamp >= 1246266000
>     and u.uid != "-"
>     and (u.response = 207 or u.response = 200)
>     and u.port = 8444;
>
> The purpose of the join is to give me all of the timestamps - one for
> each second - even those seconds that had no activity on that port -
> and then use those for the query. But this query is taking a very,
> very, very long time. I'm probably making some newbie mistake,
> because that's what I am!
>
> Is this a valid approach? Is there a better solution?
>
> I am able to get results extremely quickly by using something like this:
>
> select
>     date,
>     hour,
>     min,
>     sec,
>     count(port)
> from log
> where
>     uid != "-"
>     and (response = 207 or response = 200)
>     and port = 8444
> group by timestamp
> order by timestamp;
>
> but the problem is that I also need to know when the ports are not
> busy and there are no log entries. Here are the last few lines of the
> result:
>
> 2009-06-29|17|42|0|2
> 2009-06-29|17|42|7|1
> 2009-06-29|17|42|8|4
>
> What I need to have in the results are entries for seconds 1-6 with a
> value of 0, like this:
>
> 2009-06-29|17|42|0|2
> 2009-06-29|17|42|1|0
> 2009-06-29|17|42|2|0
> 2009-06-29|17|42|3|0
> 2009-06-29|17|42|4|0
> 2009-06-29|17|42|5|0
> 2009-06-29|17|42|6|0
> 2009-06-29|17|42|7|1
> 2009-06-29|17|42|8|4
>
> Am I on the right path with the self-join? I also considered adding a
> new table to the database that contained all the possible time values
> in epoch seconds - I could easily do so with the script that creates
> the database and adds the data. I could then left join this table
> with the data table. But would doing so be substantially faster than
> the self-join?
>
> The examples above are for only one port - I need to do the same for
> 16 ports. I wrote a perl script to do queries for each second, but it
> was taking about 1 minute to do queries for the 16 ports for a
> 1-second interval - way too slow to be used to produce results for
> 24-hour periods.
>
> Thanks for any advice! I learn a lot from this list.
> Leo