On Sun, 15 Jun 2014, ammul...@gmail.com wrote:
> I'm just looking at putting in Ansible and wanting to get some use cases
> in. The one thing I really want to do is get Linux updates via Ansible
> going. We have a few servers that need to have things killed/run before and
> after reboot (which, for some reason won't work via rc). So basically I'm
> just wondering if it's possible for Ansible to have a playbook that will:
>
> 1. Execute script (which kills off processes)
> 2. Run yum update
> 3. Reboot
> 4. Execute script (start processes)
>
> Would be great to see any examples. Can't seem to find anything like it on
> the web... Really looking forward to getting Ansible working, the things
> that I've been doing so far look really positive.
The following playbook is what I use at a few customers; in one case
we patch about 2700 servers each month. Before we were able to do this on
a monthly basis, we had quite a few things to clean up and standardize.
The very first time, we used smaller batches so that we could make sure
that all init-scripts were present and to communicate with the various
(internal) customers about problems. Once all systems are aligned to the
same baseline, things become a lot easier: the set of updates is well
defined, and we do batches of 50 systems and execute multiple runs in
parallel.
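As a rough illustration of the batching: one way to get batches of 50 is
the play-level "serial" keyword, which makes Ansible finish the play on
50 hosts before moving on to the next 50. This is only a sketch under
that assumption, the playbook further down does not set it, and the
batch size could equally be controlled in other ways.
----
- name: Update RHEL and schedule reboot
  hosts: RedHat
  # Assumption: process the group in batches of 50 hosts at a time
  serial: 50
  tasks:
  - name: Updating system(s) -- PLEASE DO NOT INTERRUPT
    action: yum name=* state=latest disable_gpg_check=yes
    register: update
----
Multiple runs in parallel can then simply be separate ansible-playbook
invocations, each restricted to a different set of hosts with --limit.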
In summary, we do:
- Check if redhat-lsb is installed
- Clean up stale repository metadata (optional, we needed to remove leftover
Satellite channel data)
- Check free space in /var/cache/yum and /usr (optional, it prevents failures
that require logging in to find out what went wrong)
- Update all packages using yum
- Propose to reboot the systems that have had updates
- Check if the system comes back correctly (we also plan to check the uptime,
pull-request in queue)
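The uptime check is not part of the playbook yet. A minimal sketch of
what it could look like as an extra play after the reboot play, reading
/proc/uptime on the target; the 900 second threshold is an assumption
(chosen to match the wait_for timeout used in the playbook), not
something we run today:
----
- name: Verify that the systems really rebooted
  hosts: reboot_True
  tasks:
  - name: Collect uptime (in seconds) on target system
    action: shell awk '{ print int($1) }' /proc/uptime
    register: uptime
    changed_when: no
  # Assumption: a system that rebooted has been up for less than 15 minutes
  - name: Safeguard - Check if the uptime is low enough to confirm the reboot
    action: fail msg="System has been up for {{ uptime.stdout }}s, it probably did not reboot"
    when: uptime.stdout|int > 900
----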
All our systems are connected to the same frozen channels in a Satellite,
which makes it a lot easier to manage. Every month we start by updating the
frozen channel with the latest updates. We then test the process and the
updates on about 150 internal systems (some of these are crucial
infrastructure, so they get the security updates earlier).
The next day we have a meeting with Change Management, Security Governance
and Linux Operations and we go through the list of updates (we have
a custom tool that compiles the list of updates and, for each update, its
distribution over our 2700 Linux servers). Based on this list and the
discussion, we decide whether patching is useful and rebooting is necessary.
We then spread the patching of all systems over 4 days (2 non-prod days
the first week and 2 prod days the second week), in about 12 different
timeframes. This is useful to ensure that systems in a complex setup are
not patched/rebooted at the same time, and in case of issues we can
reduce the impact and have sufficient time to troubleshoot and resolve.
Each "wave" takes about 20 minutes, so in essence we patch 2700 servers in
roughly 5 hours.
It is essential that all services are properly scripted using init-scripts,
that clean shutdowns work well, and that everything is started correctly.
In the case of MySQL, for example, this may mean tuning the timeout of the
init-script.
It is also essential to get your customers involved in the process and give
them control over which systems are part of which wave, whether they control
the reboots themselves, etc. The key is to not allow any exceptions, but to
look for solutions together. We had very little opposition, and once we had
proven this mechanism worked, only small changes were made in iterations.
We plan to integrate our firmware-patching playbook into this one as well,
twice a year. That coincides with the minor OS updates, when patching takes
longer than 20 minutes anyway.
----
- name: Check pre-requirements
  hosts: all
  tasks:
  - name: Safeguard - Test if system has a working LSB
    action: fail msg="System is lacking working redhat-lsb-core -- FIX THIS YOURSELF"
    when: ansible_lsb is not defined
  - name: Group systems by distribution (e.g. Debian, Ubuntu or RedHat)
    action: group_by key={{ ansible_os_family }}
    changed_when: no

- name: Clean up yum cache and check disk-space
  hosts: RedHat
  tasks:
  - name: Ensure we have a directory /var/cache/yum
    action: file dest=/var/cache/yum state=directory
  - name: Remove old yum cache to free disk space
    action: command find /var/cache/yum/ -depth -mindepth 1 -type d \! -mtime 0 -exec rm -rvf {} \;
    register: remove
    changed_when: remove.stdout
  - name: Collect /var/cache/yum free space on target system
    action: shell df -P /var/cache/yum | awk 'END { print $4 }'
    register: cachesize
    changed_when: no
  - name: Safeguard - Check if /var/cache/yum is large enough to continue
    action: fail msg="Not enough free space on filesystem /var/cache/yum (got {{cachesize.stdout|int/1024|int}}M, need at least 400M)"
    when: cachesize.stdout|int < 400 * 1024
  - name: Collect /usr free space on target system
    action: shell df -P /usr | awk 'END { print $4 }'
    register: usrsize
    changed_when: no
  - name: Safeguard - Check if /usr is large enough to continue
    action: fail msg="Not enough free space on filesystem /usr (got {{usrsize.stdout|int/1024|int}}M, need at least 200M)"
    when: usrsize.stdout|int < 200 * 1024

- name: Update RHEL and schedule reboot
  hosts: RedHat
  tasks:
  - name: Updating system(s) -- PLEASE DO NOT INTERRUPT
    action: yum name=* state=latest disable_gpg_check=yes
    register: update
  - name: Group systems that require a reboot
    action: group_by key=reboot_{{ update.changed }}
    changed_when: no

- name: Reboot systems with updates
  hosts: reboot_True
  tasks:
  - name: Waiting for approval to reboot -- ABORT NOW IF NEEDED
    action: pause
  - name: Performing reboot -- PLEASE DO NOT INTERRUPT
    action: command shutdown -r now "REASON -- Security patch management"
  - name: Waiting for system(s) to go down
    local_action: wait_for host={{ansible_ssh_host}} port=22 state=stopped timeout=360
  - name: Waiting for system(s) to come back up
    local_action: wait_for host={{ansible_ssh_host}} port=22 state=started timeout=900
  - name: Testing whether system is working fine
    action: ping
----
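The playbook above does not contain the kill/start scripts from the
original question. A minimal sketch of how they could be added, assuming
two scripts kept next to the playbook on the control machine (the paths
scripts/stop-processes.sh and scripts/start-processes.sh are
hypothetical). The script module copies a local script to the target and
runs it there:
----
- name: Stop processes before patching
  hosts: RedHat
  tasks:
  # Hypothetical script path, relative to the playbook
  - name: Execute script that kills off processes
    action: script scripts/stop-processes.sh

# ... the update and reboot plays from the playbook above go here ...

- name: Start processes after patching
  hosts: RedHat
  tasks:
  # Hypothetical script path, relative to the playbook
  - name: Execute script that starts processes
    action: script scripts/start-processes.sh
----
The last play targets RedHat rather than reboot_True so that systems
which had no updates, and were therefore not rebooted, also get their
processes started again.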
--
-- dag wieers, d...@wieers.com, http://dag.wieers.com/
-- dagit linux solutions, cont...@dagit.net, http://dagit.net/
[Any errors in spelling, tact or fact are transmission errors]
--
To view this discussion on the web visit
https://groups.google.com/d/msgid/ansible-project/alpine.LRH.2.02.1406201008020.20841%40pikachu.3ti.be.