On Sep 21, 2005, at 5:23 AM, Jeff wrote:

I am interested in how you go about doing a "delayed replication" to
protect against operator error.  We've already fallen victim to that
situation here.


To make a long story short, we use the fact that MySQL can run the SQL thread and the I/O thread of replication separately, and control them individually. In practice we use cron and a whole bunch of scripts to keep the I/O thread (the one reading from the master) stopped most of the time, and to manage when the SQL thread replicates... e.g. at 4:00 cron stops the SQL thread. At 4:01 we start the I/O thread (it can read a lot of changes from the master very quickly, so it only needs a short time to catch up). At 4:05 we stop the I/O thread. Then we wait a few minutes to give ourselves a buffer... then finally at 4:15 we start the SQL thread... and the cycle repeats every two hours.
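To make that concrete, here is a rough sketch of what the crontab entries and the thread-control script could look like. The script name, paths, socket and credentials are made up for the example; only the STOP SLAVE / START SLAVE statements themselves are standard MySQL.

    # crontab (illustrative; minutes match the 4:00/4:01/4:05/4:15 example,
    # repeating every two hours)
    0  */2 * * *   /usr/local/bin/repl-ctl stop_sql
    1  */2 * * *   /usr/local/bin/repl-ctl start_io
    5  */2 * * *   /usr/local/bin/repl-ctl stop_io
    15 */2 * * *   /usr/local/bin/repl-ctl start_sql

    #!/bin/sh
    # repl-ctl -- start/stop the two replication threads on one slave instance
    MYSQL="mysql --socket=/var/lib/mysql/slave1.sock -u repl_admin -pSECRET"

    case "$1" in
      stop_sql)  $MYSQL -e "STOP SLAVE SQL_THREAD" ;;
      start_io)  $MYSQL -e "START SLAVE IO_THREAD" ;;
      stop_io)   $MYSQL -e "STOP SLAVE IO_THREAD" ;;
      start_sql) $MYSQL -e "START SLAVE SQL_THREAD" ;;
      *) echo "usage: repl-ctl {stop_sql|start_io|stop_io|start_sql}" >&2; exit 1 ;;
    esac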

The upshot is that at the short end we are 10 minutes behind (the gap between stopping the I/O thread at 4:05 and starting the SQL thread at 4:15), and at the long end we are about two hours behind (at 4:07, for example, the last query the SQL thread could have executed came from the master at 2:05).

Our scripts are a little more complicated than that, so they can tie into our monitoring system without setting off alerts that replication has stopped, and so on (and of course the machine that runs this talks to many masters using many instances of MySQL, so we need to manage this for every instance).

We also allow for an emergency stop: the scripts check for the existence of a specific file, and if the file isn't there they don't start any replication threads. We then have a stop script which tells all the instances to stop whatever they are doing and deletes the file. At that point replication can't resume until we put the file back manually. We tie that emergency script to a TCP port and, hey presto, in the event of an emergency all someone needs to do is hit the right TCP port on the server (telnet to it, hit it with a browser, anything that will cause the port to see some activity) and all the replication comes to a stop.
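The emergency stop amounts to a guard at the top of every "start" action plus a script that anything connecting to a port will trigger. The sketch below uses an invented flag-file path, port number and an inetd entry purely to illustrate the idea; the real setup may well differ.

    # guard added to every start_* action in repl-ctl:
    FLAG=/var/run/replication-enabled
    [ -f "$FLAG" ] || exit 0      # flag gone => emergency stop is in force

    #!/bin/sh
    # repl-emergency-stop -- stop every instance and remove the flag file,
    # so nothing restarts until a human puts the file back
    FLAG=/var/run/replication-enabled
    rm -f "$FLAG"
    for SOCK in /var/lib/mysql/*.sock; do
        mysql --socket="$SOCK" -u repl_admin -pSECRET -e "STOP SLAVE"
    done

    # /etc/inetd.conf -- any connection to this port runs the stop script
    # (needs a matching "repl-stop 9999/tcp" line in /etc/services)
    repl-stop stream tcp nowait root /usr/local/bin/repl-emergency-stop repl-emergency-stop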

Also, as part of our two-hourly cycle we do a lot of binary log flushing on the slave and the masters, so if we ever need to roll back we can roll back to a specific point in time and only have to deal with fixing problems in the logs from that point onwards. If an operator error gets by before we can stop, we can go to yesterday's backup, apply only the binary logs from before the incident, and then deal with the issue in question.
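The flushing itself is just a FLUSH LOGS on each server at a fixed point in the cycle, so every two-hour window lands in its own numbered binary log file. A minimal sketch (credentials and timing are illustrative):

    # run on the slave and on each master, e.g. right after the I/O thread stops
    mysql -u repl_admin -pSECRET -e "FLUSH LOGS"

    # each cycle then starts a fresh log file (SHOW MASTER LOGS lists them),
    # so "roll forward to just before the incident" means replaying whole files
    mysql -u repl_admin -pSECRET -e "SHOW MASTER LOGS"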

This process has reduced our downtime in the event of a total database corruption from four hours (recovering from yesterday's data and missing everything since) to 30 minutes, missing only the data since the last two-hourly rollover. And it doesn't take long to dump the last set of binary logs to a text file, find and fix/remove the corrupting command, and apply that whole log to the database, which gives us almost zero lost data and gets us back online in no time (although when clients are screaming, even 30 minutes feels like an eternity). All of which is, of course, so much better than the four-hour downtime we had before this system.
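For what it's worth, that repair step is basically mysqlbinlog plus a text editor. Something along these lines, with invented log file names:

    # restore yesterday's file-level backup, then dump the suspect logs to text
    mysqlbinlog slave1-bin.000142 slave1-bin.000143 > replay.sql

    # edit replay.sql and delete (or fix) the offending statement, then replay
    mysql -u repl_admin -pSECRET < replay.sql

    # mysqlbinlog's --stop-datetime / --stop-position options can also cut the
    # replay off just before the incident instead of hand-editing the dump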

And there are side benefits... for example, backups are easier to do because the data isn't being changed except for a few minutes every two hours. Instead of co-ordinating timing scripts, locking tables, doing dumps and so on, we can do a simple file-system copy of the data directories.
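In other words, during the quiet part of the cycle (both threads stopped) the data directory is static, so the backup can be as simple as a copy. The path below is only an example, and this assumes the server has nothing left to flush once the threads have been stopped for a while:

    # run inside the quiet window, after stop_io and before start_sql
    rsync -a /var/lib/mysql/slave1/ /backups/slave1-$(date +%Y%m%d-%H%M)/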

Best Regards, Bruce
