Hi,

I'm working on a rewrite of a batch process that operates on a large
InnoDB database.  In the past, it would process the entire database
every night, but now the size of the data is making that impossible,
and there is a desire for the process to operate in near real-time, so
I'm rewriting it to work incrementally.  The idea is to run it from
cron every 5 minutes and process the data from the last 5 minutes.

The problem I'm having is identifying which rows have changed.  I'm
not concerned about deletes in this application, so I thought it would
be enough to just use automatic timestamps and keep track of the last
run time of the program, but then I realized that an uncommitted
transaction could come in later with a timestamp from earlier.  I
haven't seen any way to make the timestamp reflect the actual commit
time.

I have a few ideas of how to handle this:

1) Put in a 5-minute delay and hope this is long enough for all
transactions to commit.  This is simple, but the delay is not
desirable and there's no guarantee that transactions will all be
finished in this time window (although if they didn't it would
certainly indicate a problem for this system).

2) Use a boolean flag on every row in every table to indicate if it
has been seen yet.  This seems like a bad idea, since it would require
the batch process to do tons of updates to the source data as it
processes the rows.

3) Triggers to fill in a logging table.  This seems to be a common
approach.  I'm not sure what the best way to structure the logging
table is, since some of these tables have multi-column primary keys.
Suggestions here would be welcome.  This one is a lot of work, but
sounds pretty safe.

Can anyone offer advice or anecdotes about how you have handled
similar situations?

- Perrin

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to