On Sep 21, 2005, at 5:23 AM, Jeff wrote:

I am interested in how you go about doing a "delayed replication" to
protect against operator error.  We've already fallen victim to that
situation here.


To make a long story short, we use the fact that MySQL can run the SQL thread and the I/O thread of replication separately, and control them individually. In practice we use cron and a whole bunch of scripts to keep the I/O thread (the one reading from the master) stopped most of the time, and to manage when the SQL thread replicates... e.g. at 4:00 cron stops the SQL thread. At 4:01 we start the I/O thread (it can read a lot of changes from the master very quickly, so it only needs a short time to catch up). At 4:05 we stop the I/O thread. Then we wait a few minutes to give ourselves a buffer... then finally at 4:15 we start the SQL thread... and the cycle repeats every two hours.
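To make that concrete, here is a rough sketch of what the crontab entries and the thread-control script could look like. The script name, paths, socket and credentials are made up for the example; only the STOP SLAVE / START SLAVE statements themselves are standard MySQL.

    # crontab (illustrative; minutes match the 4:00/4:01/4:05/4:15 example,
    # repeating every two hours)
    0  */2 * * *   /usr/local/bin/repl-ctl stop_sql
    1  */2 * * *   /usr/local/bin/repl-ctl start_io
    5  */2 * * *   /usr/local/bin/repl-ctl stop_io
    15 */2 * * *   /usr/local/bin/repl-ctl start_sql

    #!/bin/sh
    # repl-ctl -- start/stop the two replication threads on one slave instance
    MYSQL="mysql --socket=/var/lib/mysql/slave1.sock -u repl_admin -pSECRET"

    case "$1" in
      stop_sql)  $MYSQL -e "STOP SLAVE SQL_THREAD" ;;
      start_io)  $MYSQL -e "START SLAVE IO_THREAD" ;;
      stop_io)   $MYSQL -e "STOP SLAVE IO_THREAD" ;;
      start_sql) $MYSQL -e "START SLAVE SQL_THREAD" ;;
      *) echo "usage: repl-ctl {stop_sql|start_io|stop_io|start_sql}" >&2; exit 1 ;;
    esac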

The upshot is that at the short end we are 10 minutes behind (the gap between stopping the I/O thread at 4:05 and starting the SQL thread at 4:15), and at the long end we are about two hours behind (at 4:07, for example, the last query the SQL thread could have executed came from the master at 2:05).

Our scripts are a little more complicated than that, so they can tie into our monitoring system without setting off alerts that replication has stopped, and so on (and of course the machine that runs this talks to many masters using many instances of MySQL, so we need to manage this for every instance).

We also allow for an emergency stop: the scripts check for the existence of a specific file, and if the file isn't there they don't start any replication threads. We then have a stop script which tells all the instances to stop whatever they are doing and deletes the file. At that point replication can't resume until we put the file back manually. We tie that emergency script to a TCP port and, hey presto, in the event of an emergency all someone needs to do is hit the right TCP port on the server (telnet to it, hit it with a browser, anything that will cause the port to see some activity) and all the replication comes to a stop.
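The emergency stop amounts to a guard at the top of every "start" action plus a script that anything connecting to a port will trigger. The sketch below uses an invented flag-file path, port number and an inetd entry purely to illustrate the idea; the real setup may well differ.

    # guard added to every start_* action in repl-ctl:
    FLAG=/var/run/replication-enabled
    [ -f "$FLAG" ] || exit 0      # flag gone => emergency stop is in force

    #!/bin/sh
    # repl-emergency-stop -- stop every instance and remove the flag file,
    # so nothing restarts until a human puts the file back
    FLAG=/var/run/replication-enabled
    rm -f "$FLAG"
    for SOCK in /var/lib/mysql/*.sock; do
        mysql --socket="$SOCK" -u repl_admin -pSECRET -e "STOP SLAVE"
    done

    # /etc/inetd.conf -- any connection to this port runs the stop script
    # (needs a matching "repl-stop 9999/tcp" line in /etc/services)
    repl-stop stream tcp nowait root /usr/local/bin/repl-emergency-stop repl-emergency-stop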

Also, as part of our two-hourly cycle we do a lot of binary log flushing on the slave and the masters, so if we ever need to roll back we can roll back to a specific point in time and only have to deal with fixing problems in the logs from that point onwards. If an operator error gets by before we can stop, we can go to yesterday's backup, apply only the binary logs from before the incident, and then deal with the issue in question.
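The flushing itself is just a FLUSH LOGS on each server at a fixed point in the cycle, so every two-hour window lands in its own numbered binary log file. A minimal sketch (credentials and timing are illustrative):

    # run on the slave and on each master, e.g. right after the I/O thread stops
    mysql -u repl_admin -pSECRET -e "FLUSH LOGS"

    # each cycle then starts a fresh log file (SHOW MASTER LOGS lists them),
    # so "roll forward to just before the incident" means replaying whole files
    mysql -u repl_admin -pSECRET -e "SHOW MASTER LOGS"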

This process has reduced our downtime in the event of a total database corruption from four hours (recovering from yesterday's data and missing everything since) to 30 minutes, missing only the data since the last two-hourly rollover. And it doesn't take long to dump the last set of binary logs to a text file, find and fix/remove the corrupting command, and apply that whole log to the database, which gives us almost zero lost data and gets us back online in no time (although when clients are screaming, even 30 minutes feels like an eternity). All of which is, of course, so much better than the four-hour downtime we had before this system.
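For what it's worth, that repair step is basically mysqlbinlog plus a text editor. Something along these lines, with invented log file names:

    # restore yesterday's file-level backup, then dump the suspect logs to text
    mysqlbinlog slave1-bin.000142 slave1-bin.000143 > replay.sql

    # edit replay.sql and delete (or fix) the offending statement, then replay
    mysql -u repl_admin -pSECRET < replay.sql

    # mysqlbinlog's --stop-datetime / --stop-position options can also cut the
    # replay off just before the incident instead of hand-editing the dump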

And there are side benefits... for example, backups are easier to do because the data isn't being changed except for a few minutes every two hours. Instead of co-ordinating timing scripts, locking tables, doing dumps and so on, we can do a simple file-system copy of the data directories.
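In other words, during the quiet part of the cycle (both threads stopped) the data directory is static, so the backup can be as simple as a copy. The path below is only an example, and this assumes the server has nothing left to flush once the threads have been stopped for a while:

    # run inside the quiet window, after stop_io and before start_sql
    rsync -a /var/lib/mysql/slave1/ /backups/slave1-$(date +%Y%m%d-%H%M)/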

Best Regards, Bruce
