Re: Bug in yum-autoupdate
On Sat, Aug 3, 2013 at 9:28 PM, John Lauro wrote:
> ----- Original Message -----
>> From: "Nico Kadel-Garcia"
>>
>> It's exceedingly dangerous in a production environment. I've helped
>> run, and done OS specifications and installers for, a system of over
>> 10,000 hosts, and you *never*, *never*, *never* auto-update them
>> without warning or outside the maintenance windows. *Never*. If I
>> caught someone else on the team doing that as a matter of policy, I
>> would have campaigned to have them fired ASAP.
>
> If you have to manage 10,000 hosts then you are lucky you never had to
> learn to deal with no maintenance window and 0 downtime, and so most
> of your maintenance had to be possible outside of a maintenance
> window. That is how many IT shops with thousands of machines have to
> operate

No, you schedule the updates. A maintenance window is not the same as scheduled downtime, and in larger environments you can schedule windows for a planned, well-defined set of updates. For example, before allowing system-wide changes, you test them in a lab against the variety of services and hardware you use in the field. And you don't test "whatever the upstream vendor happened to publish lately, sight unseen, plus whatever they added between the test and the permitted update". You set up a defined set of updates, such as a yum mirror snapshot (for Scientific Linux or CentOS) or a well-defined RHN configuration.

> these days. You might even want to read up on Netflix's thoughts on
> chaos monkey.

I'm familiar with the concept, and it has its uses. However, having a chaos monkey in place does not reduce the risk of *network-wide* auto-updates corrupting every operating system configuration at once. I'm afraid I've had that happen: a kernel update introduced a regression and took down over 1,000 systems the same night. (A vendor had changed hardware without notifying us, and the new kernel didn't have the right drivers for it; the old kernel did.)
Fortunately, it happened during a well-defined maintenance window. And also fortunately, I'd taken advantage of the old LILO "default" and "boot once only with a different setting" tools to boot new kernels in test mode, and retain the old kernel as the default after a power cycle if the new kernel failed to boot.

> Autoupgrades are just another form of random outage you might have to
> deal with. As long as

And you deal with them by *turning them off* and scheduling the updates, with the chance to assess the updates first. Leaving them enabled by default is not "random": it's scheduled arbitrarily by the upstream vendor, and even *they* publish release notes and provide an entire system (RHN, or Spacewalk if you use the free versions) to schedule them.

> you have different hosts upgrading on different days and times, and
> you have automated routines that test and take servers out of service
> automatically if things fail, then autoupgrades are perfectly fine.
> If things break from the autoupgrades, it becomes real obvious, based
> on the update history, which machines broke from it.

Gee, you mean that you don't let systems automatically update without planning, and update different members at different scheduled days and times? Why didn't I think of something like that? You must be smart!

> Campaigning to have someone fired without even hearing their reason
> for upgrading, or even warning them first that at your location it is
> standard practice never to autoupgrade because you have a separate QA
> process that even critical security patches must go through, is a
> very bad practice on your part.

Oh, he or she would get a chance to talk. If they spouted the "auto-updating is safe" mantra and refused to budge, I'd be on them like white on rice. Touching production servers unannounced is a serious no-no in any large network.

> I am not going to state what patch policy I use, only that different
> policies work for different environments.
> Based on your statement, it sounds like you could be losing some
> valuable co-workers by lobbying to get people fired who have a
> different opinion from you, instead of trying to educate and/or learn
> from each other. If you feel you can not learn from your peers, you
> have already proven you are correct in that respect, but you have
> also shown there is much you don't know by being incapable of
> learning new things.

Oh, if they're *trainable*, they might get a shot. But leaving out the "schedule the updates so they don't all occur at once" part, as you did at first, is pretty dangerous.

> (Personally I would hate to use Nagios for 10,000 hosts. It didn't
> really scale that well IMHO, but to be honest I haven't bothered
> looking at it in over 4 years, and maybe it's improved. Not familiar
> with Icinga, but I have had good luck with Zabbix for large scale.)

Oh, you split it up for a network that big. It handled a thousand hosts reasonably well, even 10 years ago.
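To make the "defined set of updates" idea above concrete: rather than letting every host chase the live upstream repos, you point them all at a frozen, dated snapshot of your local mirror. A minimal sketch, assuming a hypothetical internal mirror at mirror.example.com with dated snapshot directories (host, paths, and date are all made up; adapt to your own layout):

```shell
# Sketch: pin hosts to a dated snapshot of a Scientific Linux mirror
# instead of the live repos. Run as root on each client.
cat > /etc/yum.repos.d/sl-snapshot.repo <<'EOF'
[sl-snapshot]
name=Scientific Linux $releasever - 2013-08-01 snapshot (internal mirror)
baseurl=http://mirror.example.com/snapshots/2013-08-01/sl/$releasever/$basearch/os/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-sl
EOF

# Disable the live upstream repo so only the snapshot is visible:
sed -i 's/^enabled=1/enabled=0/' /etc/yum.repos.d/sl.repo
```

Rolling the snapshot date forward after lab testing then *is* the permitted update: every host sees exactly the package set that was tested, no more and no less.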
Re: Bug in yum-autoupdate
----- Original Message -----
> From: "Nico Kadel-Garcia"
>
> It's exceedingly dangerous in a production environment. I've helped
> run, and done OS specifications and installers for, a system of over
> 10,000 hosts, and you *never*, *never*, *never* auto-update them
> without warning or outside the maintenance windows. *Never*. If I
> caught someone else on the team doing that as a matter of policy, I
> would have campaigned to have them fired ASAP.

If you have to manage 10,000 hosts then you are lucky you never had to learn to deal with no maintenance window and 0 downtime, where most of your maintenance has to be possible outside of a maintenance window. That is how many IT shops with thousands of machines have to operate these days. You might even want to read up on Netflix's thoughts on chaos monkey.

Autoupgrades are just another form of random outage you might have to deal with. As long as you have different hosts upgrading on different days and times, and you have automated routines that test and take servers out of service automatically if things fail, then autoupgrades are perfectly fine. If things break from the autoupgrades, it becomes real obvious, based on the update history, which machines broke from it.

Campaigning to have someone fired without even hearing their reason for upgrading, or even warning them first that at your location it is standard practice never to autoupgrade because you have a separate QA process that even critical security patches must go through, is a very bad practice on your part.

I am not going to state what patch policy I use, only that different policies work for different environments. Based on your statement, it sounds like you could be losing some valuable co-workers by lobbying to get people fired who have a different opinion from you, instead of trying to educate and/or learn from each other.
If you feel you can not learn from your peers, you have already proven you are correct in that respect, but you have also shown there is much you don't know by being incapable of learning new things.

(Personally, I would hate to use Nagios for 10,000 hosts. It didn't really scale that well IMHO, but to be honest I haven't bothered looking at it in over 4 years, and maybe it's improved. I'm not familiar with Icinga, but I have had good luck with Zabbix at large scale.)
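The "different hosts upgrading on different days and times" policy described above can be sketched as a one-line gate at the top of the update cron job: hash the hostname into a weekday slot, so roughly a seventh of the fleet updates on any given night. A hypothetical sketch, not anyone's actual policy:

```shell
#!/bin/sh
# Sketch: derive a stable per-host weekday slot (0-6) from the hostname
# and only proceed with updates when today matches that slot.
host="$(hostname)"
slot=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % 7 ))
today="$(date +%w)"    # 0 = Sunday ... 6 = Saturday

if [ "$slot" -eq "$today" ]; then
    echo "update slot for $host is today; running updates"
    # yum -y update    # deliberately commented out in this sketch
else
    echo "update slot for $host is day $slot; skipping"
fi
```

Because cksum is deterministic, each host always lands on the same weekday, which also makes it obvious afterwards which night's slot a broken machine updated in.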
Re: Bug in yum-autoupdate
On Thu, Aug 1, 2013 at 6:07 PM, Steven Haigh wrote:
> On 02/08/13 02:26, Vincent Liggio wrote:
>> On 08/01/2013 12:16 PM, Elias Persson wrote:
>>>
>>> All the more reason to read up on the differences, and if it's
>>> only one system 'yum remove yum-autoupdate' is hardly a big deal.
>>> If it's 1200 systems, what difference would an option in anaconda
>>> make? It's not like you'll be stepping through that hundreds of
>>> times, right?
>>
>> No, when I have to migrate to a new OS (which won't be a 6.4
>> derivative, it will be a 7.0 one, so probably 8-9 months from now),
>> then I'll worry about the differences. When I'm testing a piece of
>> hardware that requires a specific kernel release on an OS I don't
>> run, whether a new option is installed by default or not is not at
>> the top of my list of things to worry about.
>
> If you really do have 1200 systems to worry about, I'd be looking at
> things like Satellite. I have ~20-25 systems and yum-autoupdate is
> fantastic. It does what it says on the box and relieves me of having
> to watch / check for updates every day. I get an email in the morning
> that tells me what was updated and if there were any problems.

It's exceedingly dangerous in a production environment. I've helped run, and done OS specifications and installers for, a system of over 10,000 hosts, and you *never*, *never*, *never* auto-update them without warning or outside the maintenance windows. *Never*. If I caught someone else on the team doing that as a matter of policy, I would have campaigned to have them fired ASAP.

A simple "yum check-update" cron job reporting to root or to a designated email address is far, far, far safer. Safer yet, if you have Nagios or Icinga up and running for production, is to install and use the "nagios-plugins-check-updates" package from EPEL, which allows graceful remote Nagios monitoring of update status in an organized fashion across the network.
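The "yum check-update" cron job mentioned above needs almost nothing: check-update prints the pending package list (and exits 100) when updates are available, and cron mails any output to MAILTO. A minimal sketch; the file path and schedule are just examples:

```shell
# Sketch: install a report-only cron job. "yum check-update" lists
# pending updates but never applies them; cron mails the output to
# root whenever the list is non-empty.
cat > /etc/cron.d/yum-check-update <<'EOF'
MAILTO=root
# 06:30 daily; exit code 100 from check-update means updates are pending
30 6 * * * root /usr/bin/yum -q check-update
EOF
```

The Nagios route is essentially the same check wrapped as a plugin: check_updates (from the nagios-plugins-check-updates package) runs yum's update check under NRPE and, as I recall, maps pending security updates onto CRITICAL and other pending updates onto WARNING.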
In a production environment, unannounced or unplanned restarts of critical daemons such as httpd, mysql, named, snmpd, nagios, mrtg, or especially the Java-based services such as tomcat6 can cause cascading failures across the whole environment. You may have the spare resources to do "monkeywrench" failures across your environment all the time to try to avoid this sort of thing, but very few facilities do.

It's also nasty when you have software that is incompatible with contemporary versions of upstream-published software. Take a look over at https://github.com/opentusk/OpenTUSK-SRPMS for some work I did last year. Some of the components listed there were more modern than those in SL 6, but many of those whose names start with "tusk-" were built to avoid the automatic update to a contemporary release of that software which was incompatible with the existing codebase.

And some software updates, such as database updates, are *not reversible* without enormous engineering pain. When SL 5 updated from subversion-1.4.2 to subversion-1.6.11, the new client auto-upgraded local Subversion working copies the next time you opened them, and *they can't be turned back!!!* and are incompatible with older versions of Subversion.
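For context on that last point: the 1.6 client silently rewrites the 1.4 working-copy metadata the first time it touches it, and there is no downgrade command. The usual recovery, sketched below with illustrative paths, is to recover the URL from the upgraded copy and check out afresh with the older client:

```shell
# Sketch: recovering after an unwanted working-copy format upgrade.
# There is no "svn downgrade"; re-check-out with the old client instead.
url="$(svn info /srv/checkout | awk '/^URL:/ {print $2}')"  # new client can still read it
mv /srv/checkout /srv/checkout.upgraded                     # keep until the new copy is verified
svn checkout "$url" /srv/checkout                           # run with the subversion-1.4 client
```

Any uncommitted local changes in the upgraded copy still have to be ported over by hand, which is exactly the "enormous engineering pain" multiplied across every affected host.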