Re: Bug in yum-autoupdate

2013-08-03 Thread Nico Kadel-Garcia
On Sat, Aug 3, 2013 at 9:28 PM, John Lauro  wrote:
> - Original Message -
>> From: "Nico Kadel-Garcia" 
>>
>> It's exceedingly dangerous in a production environment. I've helped
>> run, and done OS specifications and installers for, a system of over
>> 10,000 hosts, and you *never*, *never*, *never* auto-update them
>> without warning or outside the maintenance windows. *Never*. If I
>> caught someone else on the team doing that as a matter of policy, I
>> would have campaigned to have them fired ASAP.
>
>
> If you have to manage 10,000 hosts then you are lucky if you never had to 
> learn to deal with no maintenance window and zero downtime, where most of 
> your maintenance has to be possible outside of a maintenance window.  That 
> is how many IT shops with thousands of machines have to operate

No, you schedule the updates. A maintenance window is not the same as
scheduled downtime, and for larger environments, you can schedule them
for a planned set of well defined updates.

For example, before allowing the system-wide changes, you test them
in a lab with a variety of the services and hardware you use in the
field. And you don't test "whatever the upstream vendor happened to
publish lately, sight unseen, plus whatever they added between the
test and the permitted update". You set up a defined set of updates,
such as a yum mirror snapshot (for Scientific Linux or CentOS)
or a well-defined RHN configuration.
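
A rough sketch of what I mean by a snapshot, with a hypothetical
internal mirror host and repository layout (the host name, date, and
paths below are made up; only the mechanism matters): every box points
at the same frozen tree until you deliberately roll the date forward.

    # /etc/yum.repos.d/sl-snapshot.repo -- hypothetical internal snapshot mirror
    [sl-base-snapshot]
    name=Scientific Linux $releasever - base, frozen 2013-07-15
    baseurl=http://mirror.example.com/snapshots/2013-07-15/sl/$releasever/$basearch/os/
    enabled=1
    gpgcheck=1

    [sl-security-snapshot]
    name=Scientific Linux $releasever - security, frozen 2013-07-15
    baseurl=http://mirror.example.com/snapshots/2013-07-15/sl/$releasever/$basearch/updates/security/
    enabled=1
    gpgcheck=1

Disable the stock mirror repos at the same time, so no host can drift
ahead of the tested set.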

> these days.  You might even want to read up on Netflix's thoughts on chaos 
> monkey.

I'm familiar with the concept, and it has its uses. However, having a
chaos monkey in place does not reduce the risk of *network-wide*
auto-updates corrupting every operating system configuration. I'm
afraid I've had that happen: a kernel update introduced a regression
and took down over 1000 systems the same night. (A vendor had
changed hardware without notifying us, and the new kernel didn't have
the right drivers for it; the old kernel did.) Fortunately, it
happened in a well-defined maintenance window. And also fortunately,
I'd taken advantage of the old LILO "default" and "boot once only
with a different setting" tools to boot new kernels in test mode, and
retain the old kernel as the default after a power cycle if the new
kernel failed to boot.
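
From memory, the trick looked roughly like this (the labels and kernel
versions are made up for illustration): the known-good kernel stays the
default in lilo.conf, and "lilo -R" queues the new one for the *next*
boot only, so a failed boot plus a power cycle lands you back on the
old kernel.

    # /etc/lilo.conf (fragment, illustrative versions)
    default=linux-good
    image=/boot/vmlinuz-2.6.18-old
        label=linux-good
        read-only
    image=/boot/vmlinuz-2.6.18-new
        label=linux-test
        read-only

    lilo                # rewrite the boot map after editing lilo.conf
    lilo -R linux-test  # try the new kernel on the next reboot only
    reboot

GRUB has its own fallback machinery these days, but LILO is what those
boxes ran.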

> Autoupgrades are just another form of random outage you might have to deal 
> with.  As long as

And you deal with them by *turning them off* and scheduling the
updates yourself, with a chance to assess them first. Leaving them
enabled by default is not "random", either: it's scheduled arbitrarily
by the upstream vendor, and even *they* publish release notes and
provide their own entire system (RHN, or Spacewalk if you use the free
versions) to schedule them.
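
For Scientific Linux specifically, the blunt version of "turn it off
and schedule it" is something like the sketch below; the cron timing
and log path are placeholders, and the per-group staggering is the
point, not the exact syntax.

    # stop unattended updates (the thread already mentions the simplest way):
    yum remove yum-autoupdate

    # then apply the reviewed update set inside the window, staggered per
    # host group so the whole farm never updates the same night, e.g. for
    # the "Tuesday 02:30" group:
    # /etc/cron.d/scheduled-update  (placeholder schedule and log path)
    30 2 * * 2 root /usr/bin/yum -y update >> /var/log/scheduled-update.log 2>&1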

> you have different hosts upgrading on different days and times, and you have 
> automated routines that test and take servers out of service automatically if 
> things fail, then autoupgrades are perfectly fine. If things break from the 
> autoupgrades, it becomes really obvious, based on the update history, which 
> machines broke from them.

Gee, you mean that you don't let systems automatically update without
planning, and update different members at different scheduled days and
times? Why didn't I think of something like that? You must be smart!

> Campaigning to have someone fired without even hearing their reason for 
> upgrading, or even warning them first that at your location it is standard 
> practice never to autoupgrade because you have a separate QA process that 
> even critical security patches must go through, is a very bad practice on 
> your part.

Oh, he or she would get a chance to talk. If they spouted the
"auto-updating is safe" mantra and refused to budge, I'd be like them
like white on rice.  Touching production servers unannounced is a
serious no-no in large networks or any large network.

> I am not going to state what patch policy I use, only that different policies 
> work for different environments.  Based on your statement, it sounds like you 
> could be losing some valuable co-workers by lobbying to get people fired 
> who have a different opinion from you, instead of trying to educate and/or 
> learn from each other.  If you feel you can not learn from your peers, you 
> have already proven you are correct in that respect, but you have also shown 
> there is much you don't know by being incapable of learning new things.

Oh, if they're *trainable*, they might get a shot. But leaving out the
"schedule the updates so they don't all occur at once" part, as you
did at first, is pretty dangerous.

> (Personally I would hate to use Nagios for 10,000 hosts.  It didn't really 
> scale that well IMHO, but to be honest I haven't bothered looking at it in 
> over 4 years, and maybe it's improved.  Not familiar with Icinga, but I have 
> had good luck with Zabbix for large scale)

Oh, you split it up for a network that big. It handles a thousand
hosts reasonably well, even 10 years ago.

Re: Bug in yum-autoupdate

2013-08-03 Thread John Lauro
- Original Message -
> From: "Nico Kadel-Garcia" 
>
> It's exceedingly dangerous in a production environment. I've helped
> run, and done OS specifications and installers for, a system of over
> 10,000 hosts, and you *never*, *never*, *never* auto-update them
> without warning or outside the maintenance windows. *Never*. If I
> caught someone else on the team doing that as a matter of policy, I
> would have campaigned to have them fired ASAP.


If you have to manage 10,000 hosts then you are lucky if you never had to 
learn to deal with no maintenance window and zero downtime, where most of your 
maintenance has to be possible outside of a maintenance window.  That is how many IT shops 
with thousands of machines have to operate these days.  You might even want to 
read up on Netflix's thoughts on chaos monkey.  Autoupgrades are just another 
form of random outage you might have to deal with.  As long as you have 
different hosts upgrading on different days and times, and you have automated 
routines that test and take servers out of service automatically if things 
fail, then autoupgrades are perfectly fine. If things break from the 
autoupgrades, it becomes really obvious, based on the update history, which 
machines broke from them.

Campaigning to have someone fired without even hearing their reason for 
upgrading, or even warning them first that at your location it is standard 
practice never to autoupgrade because you have a separate QA process that 
even critical security patches must go through, is a very bad practice on 
your part.

I am not going to state what patch policy I use, only that different policies 
work for different environments.  Based on your statement, it sounds like you 
could be losing some valuable co-workers by lobbying to get people fired who 
have a different opinion from you, instead of trying to educate and/or learn 
from each other.  If you feel you can not learn from your peers, you have 
already proven you are correct in that respect, but you have also shown there 
is much you don't know by being incapable of learning new things.


(Personally I would hate to use Nagios for 10,000 hosts.  It didn't really 
scale that well IMHO, but to be honest I haven't bothered looking at it in over 
4 years, and maybe it's improved.  Not familiar with Icinga, but I have had 
good luck with Zabbix for large scale)


Re: Bug in yum-autoupdate

2013-08-03 Thread Nico Kadel-Garcia
On Thu, Aug 1, 2013 at 6:07 PM, Steven Haigh  wrote:
> On 02/08/13 02:26, Vincent Liggio wrote:
>>
>> On 08/01/2013 12:16 PM, Elias Persson wrote:
>>>
>>>
>>> All the more reason to read up on the differences, and if it's
>>> only one system 'yum remove yum-autoupdate' is hardly a big deal.
>>> If it's 1200 systems, what difference would an option in anaconda
>>> make? It's not like you'll be stepping through that hundreds of
>>> times, right?
>>
>>
>> No, when I have to migrate to a new OS (which won't be a 6.4 derivative,
>> it will be a 7.0 one, so probably 8-9 months from now), then I'll worry
>> about the differences. When I'm testing a piece of hardware that
>> requires a specific kernel release on an OS I don't run, whether a new
>> option is installed by default or not is not on the top of my list of
>> things to worry about.
>
>
> If you really do have 1200 systems to worry about, I'd be looking at things
> like satellite. I have ~20-25 systems and yum-autoupdate is fantastic. It
> does what it says on the box and relieves me of having to watch / check for
> updates every day. I get an email in the morning that tells me what was
> updated and if there were any problems.

It's exceedingly dangerous in a production environment. I've helped
run, and done OS specifications and installers for, a system of over
10,000 hosts, and you *never*, *never*, *never* auto-update them
without warning or outside the maintenance windows. *Never*. If I
caught someone else on the team doing that as a matter of policy, I
would have campaigned to have them fired ASAP.

A simple "yum check-update" cron job reporting to root or to a
designated email address  is far, far, far safer. Safer yet, if you
have Nagios or Icinga up and running for production, is to install and
use the "nagios-plugins-check-updates" package from EPEL, which allows
graceful remote Nagios monitoring of the system status in an organized
fashion across the network.
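
The cron job is about as small as it sounds; a sketch, with the alias
as a placeholder. "yum check-update" exits non-zero and lists pending
packages when there is anything to apply, and with -q it prints
nothing when the host is current, so cron only mails when there is
actually something to review.

    # /etc/cron.d/yum-check-update  (the address is a placeholder)
    MAILTO=sysadmin-alerts@example.com
    0 6 * * * root /usr/bin/yum -q check-update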

In a production environment, unannounced or unplanned restarts of
critical daemons such as httpd, mysql, named, snmpd, nagios, mrtg, or
especially Java-based services such as tomcat6, can cause cascading
failures across the whole environment. You may have the spare
resources to do "monkeywrench" failures across your environment all
the time to try to avoid this sort of thing, but very few facilities
do.
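
And if some group insists on leaving an automatic mechanism enabled
somewhere, the least they can do is fence off the packages whose
surprise restart or replacement hurts the most, with yum's stock
exclude knob (the list here is only illustrative):

    # /etc/yum.conf (fragment)
    [main]
    exclude=kernel* httpd* mysql* bind* net-snmp* nagios* tomcat6*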

It's also nasty when you have software that is incompatible with
contemporary versions of upstream-published software. Take a look over
at https://github.com/opentusk/OpenTUSK-SRPMS for some work I did
last year. Some of the components listed there were more modern than
those in SL 6, but many of those whose names start with "tusk-" were
built to avoid automatic updates to contemporary releases of that
software, which were incompatible with the existing codebase.
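
Rebuilding packages under new names was the heavyweight answer there.
For lighter cases, the versionlock plugin (the yum-plugin-versionlock
package, if I remember the name right) can pin the exact versions you
have qualified; the package names below are only examples:

    yum install yum-plugin-versionlock
    yum versionlock perl-HTML-Parser perl-XML-Twig   # pin the installed versions
    yum versionlock list                             # show what is pinned
    yum versionlock clear                            # release the pins once requalified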

And some software updates, such as database updates, are *not
reversible* without enormous engineering pain. When SL 5 updated from
subversion-1.4.2 to subversion-1.6.11, it auto-updated the local
Subversion repositories the next time you opened them, and *they can't
be turned back!!!* They are now incompatible with older versions of
Subversion.
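
The only way back from a repository format bump that I know of is the
dump-and-reload dance: dump with the new tools, load into a repository
created by the old ones. "svnadmin-old" below just stands for whatever
older svnadmin binary you still have around:

    svnadmin dump /srv/svn/repo > repo.dump        # with the new (1.6) tools
    svnadmin-old create /srv/svn/repo-1.4          # with the old (1.4) tools
    svnadmin-old load /srv/svn/repo-1.4 < repo.dump
    # working copies touched by the new client simply have to be checked
    # out again with the old one.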