I've just got myself a new job; I'm in a group with something like a 10,000 to 1 ratio of servers to administrators. This is by far the largest installation I've ever had root on.

What concerns me is that while I see some powerful automation, I don't see any safety nets. "Don't screw it up" is the order of the day. That's fine if you have people who are good enough, but Bay Area labor is tight right now (speaking of which, I get referral bonuses): they've hired me, and I see no evidence that I am an anomaly. I need safety nets. And since we're losing the graybeards, we really need to put more effort into preventing a fat finger from doing an incredible amount of damage. That, and I want to do some development on our tools, and I am a little frightened of cowboying it on this scale.

Now, my background is in virtualization; I've been running a VPS provider for the last two years, and I've spent the last year doing Xen consulting. So this might be partly an "I've got a hammer, all problems look like nails" thing, but it is a pretty nice hammer, so I think it's at least worth looking into.

Does anyone else use virtualization to simulate massive systems for the purpose of testing configs? What other approaches do people use to install 'safety nets' that prevent an undercaffeinated admin from taking out a huge number of servers? What is the state of the art here?
_______________________________________________
lssconf-discuss mailing list
lssconf-discuss@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/lssconf-discuss
