Something to ask yourself in all this is: how fast can I change out a piece of hardware? If I needed a high availability system on the cheap (i.e., not a million US dollars' worth of Sun hardware) I'd probably go with a bunch of SuperMicro 2U rack mount servers with the hot swap SCSI drives. You can rack up a bunch of these and, in the event of a hardware failure, pull the drives from one machine, put them in another, and be online in under 15 minutes (that's a lot of 9s if it happens once a year; you're going to be down longer than that rebooting each time there's a critical OS update). I've had a lot of success bringing a server up on different hardware by moving the drives into a new unit quickly. If you're racked up and ready to go it can be as quick as the time to swap drives and power up.

This presumes human presence in your datacenter 24/7 and good alarms. The nice thing is that it doesn't require someone to log in and do things, it doesn't require custom scripts, and it doesn't require esoteric high availability software. It requires someone to pull some drives, plug them in somewhere else, and turn the unit on. Is 15 minutes acceptable over the life of one of these systems? (Hint: I've only ever seen one of these systems fail, and I've got a lot of them here. And yes, I pulled the drives, put them in a spare unit, and voila.)
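To put a rough number on "a lot of 9s": one 15 minute outage a year works out to about four nines of uptime. A quick sanity check, just arithmetic, nothing MySQL-specific:

    # one 15-minute outage per year, as a percentage of uptime
    # (525600 = minutes in a year)
    echo "scale=7; (1 - 15/525600) * 100" | bc
    # -> 99.9971500, i.e. roughly four nines

Compare that against what a pile of clustering software actually buys you, and the drive-swap approach starts to look pretty good.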


If you have a lot of data you can look at fibre channel solutions for your data drives. The new unit can attach to the same disks over a fabric (if I'm not using outdated buzzwords here) and voila, a few terabytes of data are on a new system. It's also handy if you need to switch masters. Work out a setup where your data lives on an FC array and you can switch which system handles it: down the server, run a script to change which system attaches that data, and bring the server up on the other machine (complete with IP addresses). Sure, there's some small downtime, but you can usually get away with a well planned couple of seconds at 3 AM.
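To make "run a script" concrete, here's a minimal sketch of the kind of switchover I mean, assuming Linux, an FC LUN that the fabric already presents to both hosts, and a floating service IP. Every hostname, device, and address here is made up, and the init script paths will vary with your setup:

    #!/bin/sh
    # switchover.sh -- hypothetical sketch: move the MySQL data LUN and
    # the service IP from the old master to the machine this runs on.
    # A plain filesystem must only ever be mounted on ONE host at a time,
    # so step 1 has to succeed before we touch the disk.

    OLD_MASTER=db1.example.com
    DATA_DEV=/dev/sdb1          # the FC LUN as this host sees it
    DATA_DIR=/var/lib/mysql
    SERVICE_IP=192.168.1.50

    # 1. make sure the old master has really let go of the disk and IP
    ssh "$OLD_MASTER" \
        '/etc/init.d/mysql stop && umount /var/lib/mysql && ifconfig eth0:1 down' \
        || exit 1

    # 2. attach the data on this host
    mount "$DATA_DEV" "$DATA_DIR" || exit 1

    # 3. take over the service IP and start the server
    ifconfig eth0:1 "$SERVICE_IP" netmask 255.255.255.0 up
    /etc/init.d/mysql start

The ugly part in real life is fencing: being *sure* the old master is dead before you mount, because two hosts mounting the same non-cluster filesystem will eat your data.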

Lots of time and money get put into software solutions where an igor would do just fine (or a NOC tech; and for you NOC techs out there, I've got a lot of respect for igors, they're good with a needle).

Something else to consider in high availability systems is regression testing. Think about what people can and will do to your systems and test against it. (This is also a good way to end up with a lot of extra hardware around your office/lab.) Think of everything you might do to a production system and write a test plan for it; I once did an 1800-line interactive shell script that had 900 test plans for each hardware platform it worked on, of which there were 12. In any case: when I upgrade the version of the OS, what happens? When the code does X (for every X), what happens? When I buy a new switch, what happens? When I upgrade MySQL, what happens? When I introduce code changes, what happens?

While it's not directly related to MySQL it's important, and you should at least be thinking in terms of OS, hardware, and database server, and have a good set of automated test plans from the developers that you can run against your hardware, including load testing. You can put a ton into hardware failover, but it won't mean squat when the code locks all your other queries out for a couple of hours.

I had a situation where converting a client's systems to InnoDB tables caused a problem for one of his scripts that bulk loaded information into the system. It turned out to be a nice little switch in my.cnf, but I had no way to test this before I ran an ALTER TABLE on his stuff, no way to know that his updates would take *that* long before (in the end) failing. Also consider that the default for the option that needed switching had changed between versions... Anyway, get yourself a nice testing lab out of all this if you can; I'm sure we'd all like more hardware to play with :)
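For flavor, here's a toy version of that kind of test harness, nothing like 1800 lines. Every plan name and command in it is a made-up stand-in for your real test plans; the one trick worth stealing is the last line, since diffing the output of 'mysqld --verbose --help' between versions is a cheap way to catch a changed default like the one that bit me:

    #!/bin/sh
    # testplan.sh -- toy test harness; all plan names and commands here
    # are hypothetical stand-ins for your real test plans.

    run_test () {
        name="$1"; shift
        printf "TEST %-10s ... " "$name"
        if "$@" > "/tmp/$name.log" 2>&1; then
            echo "PASS"
        else
            echo "FAIL (see /tmp/$name.log)"
        fi
    }

    # does the server come back after a restart?
    run_test restart   /etc/init.d/mysql restart
    # can we still talk to it?
    run_test connect   mysql -e 'SELECT 1'
    # does the bulk loader still finish in a sane amount of time?
    run_test bulkload  sh ./load_sample_data.sh   # hypothetical loader
    # what are the server defaults on this particular build?
    run_test defaults  sh -c 'mysqld --verbose --help > /tmp/defaults.txt'

Run something like that before and after every OS, hardware, or MySQL change, and diff /tmp/defaults.txt between versions; if an option's default moved under you, you find out in the diff instead of in production.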

--
Michael "Suspenders and Belt" Conlen


