Perrin, I like #3 the best.

#1 - it's not a good approach to "hope" your database keeps up.  There are
fairly common situations where you never know how long something will
take - unusually high traffic, a table check and repair, a bulk load into
the same or another database on that db host, etc.

#2 - the flag is a good idea on the face of it, but in practice your
process may end up doing large numbers of table scans to find the rows
with the flag set.

#3 allows you to keep track of exactly which rows need post-processing, and
potentially even track when the request came in vs. when it was processed,
if desired.

You could also get even closer to the "near real-time" goal by running the
process continuously: check for new rows, idle for 30 seconds, check
again, and so on.
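That loop is simple enough to sketch - the interval and the check function
here are placeholders, not anything from your system:

```python
import time

def run_forever(check_for_new_rows, interval=30, max_iterations=None):
    """Poll for new rows, process them, then idle `interval` seconds.

    `max_iterations` is only here so the loop can be bounded in tests;
    in production you'd leave it as None and run under cron/supervisord.
    """
    i = 0
    while max_iterations is None or i < max_iterations:
        rows = check_for_new_rows()
        for row in rows:
            ...  # post-process the row here
        time.sleep(interval)
        i += 1
```

A daemon like this avoids the overhead of re-launching from cron every
few minutes, at the cost of needing something to restart it if it dies.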

One interesting gotcha: you'd need to ensure that whatever updates your
batch process makes do not themselves cause additional entries in the
"needs to be post-processed" table, creating an endless loop.  I'm sure
there's a way around it - for example, an extra column called
"is_post_process", so that your trigger doesn't do its thing when that
equals 1 in the update.  Or perhaps you only need an insert trigger - then
you don't have that problem at all.
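Here's a sketch of the flag-column idea, again with sqlite3 standing in
for MySQL.  Note the hedge: sqlite triggers take a WHEN clause, whereas in
MySQL you'd put an equivalent IF check inside the trigger body.  All names
are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT,
                     is_post_process INTEGER DEFAULT 0);
CREATE TABLE needs_post_processing (order_id INTEGER);

-- Only log updates made by normal writers, not by the batch job itself.
CREATE TRIGGER orders_upd AFTER UPDATE ON orders
WHEN NEW.is_post_process = 0
BEGIN
    INSERT INTO needs_post_processing (order_id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO orders (status) VALUES ('new')")
# A normal update: gets queued for post-processing.
conn.execute("UPDATE orders SET status='paid', is_post_process=0 WHERE id=1")
# The batch process's own update sets the flag, so the trigger skips it.
conn.execute("UPDATE orders SET status='done', is_post_process=1 WHERE id=1")

n = conn.execute("SELECT COUNT(*) FROM needs_post_processing").fetchone()[0]
print(n)  # 1: only the normal update was queued, no endless loop
```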

Best,
Dan




On 5/30/07, Perrin Harkins <[EMAIL PROTECTED]> wrote:

Hi,

I'm working on a rewrite of a batch process that operates on a large
InnoDB database.  In the past, it would process the entire database
every night, but now the size of the data is making that impossible,
and there is a desire for the process to operate in near real-time, so
I'm rewriting it to work incrementally.  The idea is to run it from
cron every 5 minutes and process the data from the last 5 minutes.

The problem I'm having is identifying which rows have changed.  I'm
not concerned about deletes in this application, so I thought it would
be enough to just use automatic timestamps and keep track of the last
run time of the program, but then I realized that an uncommitted
transaction could come in later with a timestamp from earlier.  I
haven't seen any way to make the timestamp reflect the actual commit
time.

I have a few ideas of how to handle this:

1) Put in a 5-minute delay and hope this is long enough for all
transactions to commit.  This is simple, but the delay is not
desirable and there's no guarantee that transactions will all be
finished in this time window (although if they didn't it would
certainly indicate a problem for this system).

2) Use a boolean flag on every row in every table to indicate if it
has been seen yet.  This seems like a bad idea, since it would require
the batch process to do tons of updates to the source data as it
processes the rows.

3) Triggers to fill in a logging table.  This seems to be a common
approach.  I'm not sure what the best way to structure the logging
table is, since some of these tables have multi-column primary keys.
Suggestions here would be welcome.  This one is a lot of work, but
sounds pretty safe.
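(For the multi-column primary key question, one common structure - just a
sketch with invented table names, sqlite3 standing in for InnoDB - is to
have the logging table repeat the source table's key columns, plus a
timestamp:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Source table with a composite primary key.
CREATE TABLE line_items (
    order_id INTEGER,
    line_no  INTEGER,
    qty      INTEGER,
    PRIMARY KEY (order_id, line_no)
);

-- The log table simply repeats the key columns.
CREATE TABLE line_items_log (
    order_id  INTEGER,
    line_no   INTEGER,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER line_items_ins AFTER INSERT ON line_items
BEGIN
    INSERT INTO line_items_log (order_id, line_no)
    VALUES (NEW.order_id, NEW.line_no);
END;
""")

conn.execute("INSERT INTO line_items VALUES (1, 1, 5)")
row = conn.execute("SELECT order_id, line_no FROM line_items_log").fetchone()
print(row)
```

The batch job can then join the log back to the source table on the full
composite key.)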

Can anyone offer advice or anecdotes about how you have handled
similar situations?

- Perrin

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

