On Sun, 15 Jun 2014, ammul...@gmail.com wrote:

I'm just looking at putting in Ansible and wanting to get some use cases
in. The one thing I really want to do is get Linux updates via Ansible
going. We have a few servers that need to have things killed/run before and
after reboot (which, for some reason won't work via rc). So basically I'm
just wondering if it's possible for Ansible to have a playbook that will:

1. Execute script (which kills off processes)
2. Run yum update
3. Reboot
4. Execute script (start processes)

Would be great to see any examples. Can't seem to find anything like it on
the web... Really looking forward to getting Ansible working, the things
that I've been doing so far look really positive.

The following playbook is what I use at a few customers. In one case we patch about 2700 servers each month. Before we were able to do this on a monthly basis, we had quite a few things to clean up and standardize.

The very first time, we used smaller batches so that we could make sure that all init-scripts were present and could communicate with the various (internal) customers about problems. Once all systems are aligned to the same baseline, things become a lot easier: the set of updates is very tangible, and we do batches of 50 systems and execute multiple runs in parallel.
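
One way to get batches of that size inside a single run is the play-level `serial` keyword. This is only a sketch and not necessarily how the batching above is done (separate runs limited to different host groups would work too); the value 50 is simply taken from the text.

```yaml
- name: Update RHEL in batches
  hosts: RedHat
  serial: 50    # assumption: process at most 50 hosts at a time
  tasks:
  - name: Updating system(s) -- PLEASE DO NOT INTERRUPT
    action: yum name=* state=latest
```

With `serial` set, Ansible finishes all tasks of the play on one batch of hosts before it starts on the next batch.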

In summary, we do:

 - Check if redhat-lsb is installed
 - Clean up stale repository metadata (optional, we needed to remove leftover Satellite channel data)
 - Check free space in /var/cache/yum and /usr (optional, it prevents failures that require logging in to find out what's going on)
 - Update all packages using yum
 - Propose to reboot the systems that have had updates
 - Check if the system comes back correctly (we also plan to check the uptime, pull-request in queue)
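
The uptime check from the last item could look roughly like the play below. It is a sketch under assumptions, not the pending pull-request: it reads /proc/uptime directly, the 1800 second threshold is an arbitrary value chosen to be larger than the reboot wait timeouts, and it targets the `reboot_True` group that the playbook further down creates.

```yaml
- name: Verify that rebooted systems really restarted
  hosts: reboot_True
  tasks:
  - name: Collect uptime in whole seconds
    action: shell cut -d. -f1 /proc/uptime
    register: uptime
    changed_when: no
  - name: Safeguard - Fail if the uptime is too high for a fresh reboot
    # 1800 is an assumed threshold, adjust it to your own reboot timeouts
    action: fail msg="Uptime is {{ uptime.stdout }}s, system does not seem to have rebooted"
    when: uptime.stdout|int > 1800
```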

All our systems are connected to the same frozen channels in a Satellite, which makes it a lot easier to manage. Every month we start by updating the frozen channel with the latest updates. We then test the process and the updates on about 150 internal systems (some of these are crucial infrastructure, so they get the security updates earlier).

The next day we have a meeting with Change Management, Security Governance and Linux Operations and we go through the list of updates (we have a custom tool to compile a list of updates, and the distribution over our 2700 Linux servers of each update). Based on this list and discussion, we decide if patching is useful and rebooting is necessary.

We then spread the patching of all systems over 4 days (2 non-prod the first week and 2 prod the second week), in about 12 different timeframes. This is useful to ensure that systems in a complex setup are not patched/rebooted at the same time, and in case of issues we can reduce the impact and have sufficient time to troubleshoot and resolve. Each "wave" takes about 20 minutes, so in essence we patch 2700 servers in roughly 5 hours.

It is essential that all services are properly scripted using init-scripts, that clean shutdowns work well, and that everything is started correctly. For MySQL, for example, this may mean tuning the timeout of the init-script.

Also essential is to get your customers involved in the process and give them control over what systems are part of what wave, whether they control the reboots themselves, etc. Key is to not allow any exceptions, but look for solutions together. We had very little opposition, and once we had proven this mechanism worked, only small changes were made in iterations.

We plan to integrate our firmware-patching playbook into this one as well, twice a year. But this coincides with minor OS updates, and in that case patching takes longer than 20 minutes anyway.

----
- name: Check pre-requirements
  hosts: all
  tasks:
  - name: Safeguard - Test if system has a working LSB
    action: fail msg="System is lacking working redhat-lsb-core -- FIX THIS YOURSELF"
    when: ansible_lsb is not defined
  - name: Group systems by distribution (e.g. Debian, Ubuntu or RedHat)
    action: group_by key={{ ansible_os_family }}
    changed_when: no


- name: Clean up yum cache and check disk-space
  hosts: RedHat
  tasks:
  - name: Ensure we have a directory /var/cache/yum
    action: file dest=/var/cache/yum state=directory
  - name: Remove old yum cache to free disk space
    action: command find /var/cache/yum/ -depth -mindepth 1 -type d \! -mtime 0 -exec rm -rvf {} \;
    register: remove
    changed_when: remove.stdout
  - name: Collect /var/cache/yum free space on target system
    action: shell df -P /var/cache/yum | awk 'END { print $4 }'
    register: cachesize
    changed_when: no
  - name: Safeguard - Check if /var/cache/yum is large enough to continue
    action: fail msg="Not enough free space on filesystem /var/cache/yum (got {{ cachesize.stdout|int // 1024 }}M, need at least 400M)"
    when: cachesize.stdout|int < 400 * 1024
  - name: Collect /usr free space on target system
    action: shell df -P /usr | awk 'END { print $4 }'
    register: usrsize
    changed_when: no
  - name: Safeguard - Check if /usr is large enough to continue
    action: fail msg="Not enough free space on filesystem /usr (got {{ usrsize.stdout|int // 1024 }}M, need at least 200M)"
    when: usrsize.stdout|int < 200 * 1024


- name: Update RHEL and schedule reboot
  hosts: RedHat
  tasks:
  - name: Updating system(s) -- PLEASE DO NOT INTERRUPT
    action: yum name=* state=latest disable_gpg_check=yes
    register: update
  - name: Group systems that require a reboot
    action: group_by key=reboot_{{ update.changed }}
    changed_when: no


- name: Reboot systems with updates
  hosts: reboot_True
  tasks:
  - name: Waiting for approval to reboot -- ABORT NOW IF NEEDED
    action: pause
  - name: Performing reboot -- PLEASE DO NOT INTERRUPT
    action: command shutdown -r now "REASON -- Security patch management"
  - name: Waiting for system(s) to go down
    local_action: wait_for host={{ansible_ssh_host}} port=22 state=stopped timeout=360
  - name: Waiting for system(s) to come back up
    local_action: wait_for host={{ansible_ssh_host}} port=22 state=started timeout=900
  - name: Testing whether system is working fine
    action: ping
----
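
For the original question (run a script before the update and another one after the reboot), one option would be two extra plays around the ones above. This is only a sketch and not part of the playbook we run: the script paths are hypothetical, and it assumes the `script` module, which copies a script from the control machine to the target and executes it there. The first play would go before "Update RHEL and schedule reboot", the second after "Reboot systems with updates".

```yaml
- name: Stop application processes before patching
  hosts: RedHat
  tasks:
  - name: Run the stop script (hypothetical path on the control machine)
    action: script scripts/stop-processes.sh


- name: Start application processes after patching
  hosts: RedHat
  tasks:
  - name: Run the start script (hypothetical path on the control machine)
    action: script scripts/start-processes.sh
```

Both plays target the whole RedHat group rather than only `reboot_True`, so that processes stopped on a system that ends up with no updates (and therefore no reboot) are started again as well.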

--
-- dag wieers, d...@wieers.com, http://dag.wieers.com/
-- dagit linux solutions, cont...@dagit.net, http://dagit.net/

[Any errors in spelling, tact or fact are transmission errors]

--
You received this message because you are subscribed to the Google Groups "Ansible 
Project" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ansible-project/alpine.LRH.2.02.1406201008020.20841%40pikachu.3ti.be.