Perrin, I like #3 the best.

#1 - "hoping" your database keeps up is not a good approach. Fairly common situations can arise where you never know how long something will take: unusually high traffic, a table check and repair, a bulk load into the same or another database on that db host, etc.
#2 - the flag is a good idea on the face of it, but in reality your process may end up doing large numbers of table scans to find rows with the flag set.

#3 - this lets you keep track of exactly which rows need post-processing, and potentially even record when the request came in versus when it was processed, if desired. You could also get closer to the "near real-time" goal by running the process constantly: idle for 30 seconds, check for new rows, idle again, and so on.

One interesting gotcha: you need to ensure that whatever updates your batch process makes do not themselves cause additional entries in the "needs to be post-processed" table, creating an endless loop. I'm sure there's a way around it - for example, an extra column called "is_post_process", with the trigger skipping its work when that column equals 1 in the update, or something like that. Or perhaps you only need an insert trigger, in which case you don't have that problem at all.

Best,
Dan

On 5/30/07, Perrin Harkins <[EMAIL PROTECTED]> wrote:
Hi,

I'm working on a rewrite of a batch process that operates on a large InnoDB database. In the past it would process the entire database every night, but the size of the data now makes that impossible, and there is a desire for the process to operate in near real-time, so I'm rewriting it to work incrementally. The idea is to run it from cron every 5 minutes and process the data from the last 5 minutes.

The problem I'm having is identifying which rows have changed. I'm not concerned about deletes in this application, so I thought it would be enough to use automatic timestamps and keep track of the program's last run time, but then I realized that an uncommitted transaction could commit later with a timestamp from earlier. I haven't seen any way to make the timestamp reflect the actual commit time.

I have a few ideas for how to handle this:

1) Put in a 5-minute delay and hope this is long enough for all transactions to commit. This is simple, but the delay is not desirable and there's no guarantee that all transactions will finish in that window (although if they didn't, it would certainly indicate a problem for this system).

2) Use a boolean flag on every row in every table to indicate whether it has been seen yet. This seems like a bad idea, since it would require the batch process to make tons of updates to the source data as it processes the rows.

3) Use triggers to fill in a logging table. This seems to be a common approach. I'm not sure what the best way to structure the logging table is, since some of these tables have multi-column primary keys; suggestions here would be welcome. This one is a lot of work, but sounds pretty safe.

Can anyone offer advice or anecdotes about how you have handled similar situations?

- Perrin

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
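For readers following along, the trigger-based logging approach (option #3), with Dan's insert-only-trigger refinement, might look something like the sketch below. All table and column names here are invented for illustration - the real tables and their multi-column keys would differ - and this is only one way to lay out the logging table: repeat the full primary key of the source table in the log, and let a timestamp column record when the change arrived.

```sql
-- Hypothetical source table with a multi-column primary key.
CREATE TABLE orders (
  customer_id INT NOT NULL,
  order_no    INT NOT NULL,
  status      VARCHAR(20),
  PRIMARY KEY (customer_id, order_no)
) ENGINE=InnoDB;

-- Logging table: one row per change, repeating the source table's
-- full primary key, plus timestamps to track when the request came
-- in vs. when the batch process handled it, as Dan suggests.
CREATE TABLE orders_changelog (
  log_id       INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id  INT NOT NULL,
  order_no     INT NOT NULL,
  requested_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  processed_at TIMESTAMP NULL
) ENGINE=InnoDB;

-- Insert-only trigger: because the batch process only UPDATEs the
-- source rows, an AFTER INSERT trigger never fires for its own
-- work, which avoids the endless-loop gotcha Dan describes.
DELIMITER //
CREATE TRIGGER orders_log_insert AFTER INSERT ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_changelog (customer_id, order_no)
  VALUES (NEW.customer_id, NEW.order_no);
END//
DELIMITER ;

-- Each run of the batch process could then pick up pending work:
--   SELECT customer_id, order_no FROM orders_changelog
--   WHERE processed_at IS NULL;
-- and mark it done afterward:
--   UPDATE orders_changelog SET processed_at = NOW()
--   WHERE processed_at IS NULL;
```

If the application also needs to catch changes made by ordinary UPDATEs on the source table, an AFTER UPDATE trigger would be needed as well, and then some guard (such as Dan's "is_post_process" column idea) becomes necessary to keep the batch process's own updates from re-queueing rows.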