I'm also curious what solutions others have arrived at for this.

I currently have a similar situation to what you are describing: I've got a
large database of timestamped log entries, and need to generate periodic
summaries over them, but don't want to actually mine the entire huge
database every time.
I thus have additional tables that I think of as "caches" of summary info.
In the situation where I need to efficiently compute values such as:

SELECT COUNT(*) FROM main_log WHERE entry_time >= [begin date] AND
entry_time < [end date]

I would construct a (permanent) table along the lines of:

CREATE TABLE summary_cache (num INT, min_time DATETIME, max_time DATETIME)

and then run commands along the lines of

LOCK TABLES summary_cache WRITE, main_log READ
SELECT MAX(max_time) FROM summary_cache
INSERT INTO summary_cache SELECT COUNT(*), [the max time just computed],
NOW() FROM main_log WHERE entry_time >= [the max time just computed] AND
entry_time < NOW()
UNLOCK TABLES

at various times: maybe on a regular schedule, maybe just before doing
computations. It depends on the application and the specifics of the summary
I'm computing.
(Note that there are also a couple of subtleties in here with NOW(): the value
stored in max_time needs to be exactly the same value used in the WHERE clause,
as in the sketch below.)
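
To make that concrete, here is a rough sketch of how I'd write the refresh so
that the cutoff time and the previous maximum are each captured once, in user
variables, and reused (the '1970-01-01' fallback for an empty cache is just an
assumption for illustration):

LOCK TABLES summary_cache WRITE, main_log READ;

-- Capture a single cutoff value and the previous high-water mark.
SELECT @now  := NOW();
SELECT @last := IFNULL(MAX(max_time), '1970-01-01 00:00:00') FROM summary_cache;

-- Summarize only the entries that arrived since the last refresh.
INSERT INTO summary_cache (num, min_time, max_time)
SELECT COUNT(*), @last, @now
  FROM main_log
 WHERE entry_time >= @last AND entry_time < @now;

UNLOCK TABLES;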

Instead of the original query over the entire date range, I can then run

SELECT SUM(num), MIN(min_time), MAX(max_time) FROM summary_cache WHERE 
min_time >= [begin date] AND max_time <= [end date]
SELECT COUNT(*) FROM main_log WHERE entry_time >= [begin date] AND
entry_time < [min date just computed]
SELECT COUNT(*) FROM main_log WHERE entry_time >= [max date just computed]
AND entry_time < [end date]

and combine the results: the answer for the whole range is the sum of the three counts.
Note that if the MIN and MAX you get out of your summary table are equal to
[begin date] and [end date], the main_log WHERE clauses can't match anything,
so you don't even need to run them.
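
Here is a rough, runnable sketch of that read side, again using user
variables; the January 2002 dates are just placeholders, and the IFNULL
defaults handle the case where the cache has nothing in the range:

-- 1. Pull whatever the cache already covers within the range.
SELECT @cached    := IFNULL(SUM(num), 0),
       @cache_min := MIN(min_time),
       @cache_max := MAX(max_time)
  FROM summary_cache
 WHERE min_time >= '2002-01-01' AND max_time <= '2002-02-01';

-- 2. Count entries older than anything the cache covers.
SELECT @head := COUNT(*)
  FROM main_log
 WHERE entry_time >= '2002-01-01'
   AND entry_time < IFNULL(@cache_min, '2002-02-01');

-- 3. Count entries newer than anything the cache covers.
SELECT @tail := COUNT(*)
  FROM main_log
 WHERE entry_time >= IFNULL(@cache_max, '2002-02-01')
   AND entry_time < '2002-02-01';

-- The total for the whole range is just the sum of the three.
SELECT @cached + @head + @tail AS total;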
One important point about the above is that it intentionally uses intervals
which are closed at the bottom and open at the top: the "entry_time < NOW()"
comparison, specifically, won't work correctly if this is changed, because
additional entries may still come in with timestamps equal to NOW(). For
various reasons, I actually don't make direct use of NOW() as described here;
I get a single NOW() value from somewhere and use that explicitly. This also
makes it easier to adjust the INSERT INTO to an explicitly specified end time,
the start of a month, for example.
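
For example, closing out a calendar month might look something like this (the
boundary date is just a placeholder, table locking is omitted for brevity, and
it assumes the cache already has at least one row):

SET @cutoff = '2002-02-01 00:00:00';   -- explicit end time: the start of February
SELECT @last := MAX(max_time) FROM summary_cache;

INSERT INTO summary_cache (num, min_time, max_time)
SELECT COUNT(*), @last, @cutoff
  FROM main_log
 WHERE entry_time >= @last AND entry_time < @cutoff;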

You can use additional columns in the summary table to compute additional
summaries over the same data set (mins and maxes of various fields in
addition to counts, as a trivial example), and you can slap as complex or
specific a GROUP BY onto your INSERT INTO summary_cache as you like, possibly
storing some of the grouping info in still more columns in the summary
table. You can then use these columns to mimic complex additional WHERE
clauses in your original query.
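
As a sketch of that kind of extension (the severity column and some_value
field are made-up names, not from any real schema here), the cache table just
grows extra columns and the INSERT grows a GROUP BY:

CREATE TABLE summary_cache2 (
    num       INT,
    min_val   INT,           -- MIN of some per-entry value (hypothetical)
    max_val   INT,           -- MAX of that value (hypothetical)
    severity  VARCHAR(16),   -- hypothetical grouping column copied from main_log
    min_time  DATETIME,
    max_time  DATETIME
);

INSERT INTO summary_cache2 (num, min_val, max_val, severity, min_time, max_time)
SELECT COUNT(*), MIN(some_value), MAX(some_value), severity, @last, @now
  FROM main_log
 WHERE entry_time >= @last AND entry_time < @now
 GROUP BY severity;

-- Later, a query restricted to one group can be answered from the cache alone:
SELECT SUM(num)
  FROM summary_cache2
 WHERE severity = 'error'
   AND min_time >= '2002-01-01' AND max_time <= '2002-02-01';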

Not only is this a useful and flexible technique for performance
optimization, it can also be very well encapsulated programmatically, so a
nicely coded architecture allows you to add this kind of system to *any*
summary queries you run.

I'd love any feedback people have on this system. Am I being stupid anywhere
here? Any gotchas I maybe haven't foreseen? Further improvements or
extensions?

-Rob


On 2/5/02 at 7:29 pm, Nick Arnett <[EMAIL PROTECTED]> wrote:

> I'm finding that it's not quite as simple as I had imagined to
> maintain a table whose values are calculated by analyzing other
> tables.  The source tables contain time series data, which is
> updated several times a day. To calculate totals by day, week,
> month, etc., I don't want to have to start over from scratch, so I
> want to keep track of which records in the source tables have
> already been analyzed.  Further complicating things, there are
> several analyses that take place and I may add others later.  I
> don't always want to have to do all of them at the same time.
> 
> So, at the moment, I've added some Boolean enum columns to the
> source data tables, showing which analyses have been done.  Do the
> analysis, set the Booleans to true.  I'm not sure I really like that
> approach, since it can mean adding columns to rather large tables at
> times.  I suppose those fields could be in an entirely separate
> table, joined by the primary key, but I don't like having to depart
> from normal form that way.
> 
> I also considered having a table that keeps track of what analysis
> has been done, using a timestamp or auto-increment column, but
> realized that would get messed up if I don't process data in the
> order it arrived.  I could keep track of a range, which would let me
> find earlier stuff that hadn't been done.
> 
> Just to make it all a bit more complicated, the initial analysis is
> done in a temporary table, for huge performance gain, then inserted
> into its final resting spot.
> 
> Any suggestions on how others have solved this sort of problem?

