On Sat, Aug 3, 2013 at 9:28 PM, John Lauro <john.la...@covenanteyes.com> wrote:
> ----- Original Message -----
>> From: "Nico Kadel-Garcia" <nka...@gmail.com>
>>
>> It's exceedingly dangerous in a production environment. I've helped
>> run, and done OS specifications and installers for, a system of over
>> 10,000 hosts, and you *never*, *never*, *never* auto-update them
>> without warning or outside the maintenance windows. *Never*. If I
>> caught someone else on the team doing that as a matter of policy, I
>> would have campaigned to have them fired ASAP.
>
> If you have to manage 10,000 hosts then you are lucky you never had to
> learn to deal with no maintenance window and 0 downtime, and so most of
> your maintenance had to be possible outside of a maintenance window.
> That is how many IT shops with thousands of machines have to operate
No, you schedule the updates. A maintenance window is not the same as
scheduled downtime, and in larger environments you can schedule one for a
planned set of well-defined updates. For example, before allowing
system-wide changes, you test them in a lab against a variety of the
services and hardware you use in the field. And you don't test "whatever
the upstream vendor happened to publish lately, sight unseen, plus
whatever they added between the test and the permitted update". You set
up a defined set of updates, such as a yum mirror snapshot (for
Scientific Linux or CentOS) or a well-defined RHN configuration.

> these days. You might even want to read up on Netflix's thoughts on
> chaos monkey.

I'm familiar with the concept, and it has its uses. However, having a
chaos monkey in place does not reduce the risk of *network-wide*
auto-updates corrupting every operating system configuration. I'm afraid
I've had that happen: a kernel update introduced a regression and took
down over 1000 systems the same night. (A vendor had changed hardware
without notifying us, and the new kernel didn't have the right drivers
for it; the old kernel did.) Fortunately, it happened during a
well-defined maintenance window. And also fortunately, I'd taken
advantage of the old LILO "default" and "boot once only with a different
setting" tools to boot new kernels in test mode, and to retain the old
kernel as the default after a power cycle if the new kernel failed to
boot.

> Autoupgrades are just another form of random outage you might have to
> deal with. As long as

And you deal with them by *turning them off* and scheduling the updates,
with the chance to assess them first. That's because leaving them
enabled by default is not "random": it's scheduled arbitrarily by the
upstream vendor, and even *they* publish release notes and provide an
entire system of their own (RHN, or Spacewalk if you use the free
versions) to schedule them.
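As a rough sketch of the two safeguards above — a frozen, lab-tested
update set and a boot-once kernel trial. Every hostname, path, repo id,
snapshot date, and boot label here is a made-up illustration, not
anything from the thread:

```shell
# Sketch: freeze the update set by pointing clients at a dated mirror
# snapshot instead of the live upstream repository, so every host pulls
# exactly the packages that were tested in the lab.
cat > /etc/yum.repos.d/updates-frozen.repo <<'EOF'
[updates-frozen]
name=Frozen updates snapshot (lab-tested set)
baseurl=http://mirror.example.com/snapshots/2013-08-01/updates/
enabled=1
gpgcheck=1
EOF

# Sketch: boot the new kernel exactly once. If it fails to come up, a
# power cycle falls back to the old default entry. With LILO this was
# "lilo -R <label>"; GRUB 2 offers a similar one-shot via grub2-reboot
# (which relies on GRUB_DEFAULT=saved).
lilo -R linux-new             # next boot only; default entry unchanged
# grub2-reboot 'new-kernel'   # rough GRUB 2 equivalent
```

The point of the one-shot boot is that a remote, unattended host
recovers by itself: no console access is needed to back out a bad
kernel, only a power cycle.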
> you have different hosts upgrading on different days and times, and you
> have automated routines that test and take servers out of service
> automatically if things fail, then autoupgrades are perfectly fine. If
> things break from the autoupgrades, it becomes real obvious, based on
> the update history, which machines broke from it.

Gee, you mean that you don't let systems update automatically without
planning, and you update different members at different scheduled days
and times? Why didn't I think of something like that? You must be smart!

> Campaigning to have someone fired without even hearing their reason for
> upgrading, or even warning them first that at your location it is
> standard practice never to autoupgrade, because you have a separate QA
> process that even critical security patches must go through, is a very
> bad practice on your part.

Oh, he or she would get a chance to talk. If they spouted the
"auto-updating is safe" mantra and refused to budge, I'd be on them like
white on rice. Touching production servers unannounced is a serious
no-no in any large network.

> I am not going to state what patch policy I use, only that different
> policies work for different environments. Based on your statement, it
> sounds like you could be losing some valuable co-workers by lobbying to
> get people fired who have a different opinion from you, instead of
> trying to educate and/or learn from each other. If you feel you can not
> learn from your peers, you have already proven you are correct in that
> respect, but you have also shown there is much you don't know by being
> incapable of learning new things.

Oh, if they're *trainable*, they might get a shot. But leaving out the
"schedule the updates so they don't all occur at once" part, as you did
at first, is pretty dangerous.

> (Personally I would hate to use Nagios for 10,000 hosts.
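The "different hosts on different days" staggering can be done without
any central scheduler at all, by deriving each host's update slot from
its own name. A minimal sketch — the helper name and the sample hostname
are hypothetical:

```shell
# Hypothetical helper: map a hostname to a stable weekday slot (0-6),
# so a fleet's updates spread across the week instead of firing at once.
stagger_day() {
  # cksum is POSIX and deterministic; the CRC mod 7 picks the slot.
  printf '%s' "$1" | cksum | awk '{print $1 % 7}'
}

# Each host computes its own slot; the same name always yields the same
# day, so a broken update shows up on a predictable subset of machines.
stagger_day "web01.example.com"
```

A cron job can then compare that slot against `date +%w` and only run
the update when they match, giving the staggered schedule (and the
obvious "which day's batch broke" history) with no coordination needed.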
> It didn't really scale that well IMHO, but to be honest I haven't
> bothered looking at it in over 4 years, and maybe it's improved. Not
> familiar with Icinga, but I have had good luck with Zabbix for large
> scale)

Oh, you split it for a network that big!!!! It handled a thousand hosts
reasonably well, even 10 years ago, if you didn't go overboard with
too-frequent sampling and computationally expensive checks. And for
"nagios-plugin-check-updates", you only really need to run it daily.
(And maybe re-run it across a set of hosts after the updates are
installed, to catch any that got missed.)
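A daily run of that check is just a cron fragment; the plugin path and
the cron file name below are assumptions for illustration, not taken
from the thread:

```shell
# /etc/cron.d/check-updates -- sketch; plugin path is assumed.
# Run the pending-updates check once a day instead of at every Nagios
# poll cycle; the result can be fed back as a passive service check.
30 6 * * *  nagios  /usr/lib64/nagios/plugins/check_updates
```

Running it from cron rather than as an active check keeps the expensive
repository query off the Nagios scheduler, which is exactly the kind of
"computationally expensive check" worth throttling at scale.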