On Wed, Aug 15, 2018 at 2:41 PM Michael Scherer <msche...@redhat.com> wrote:
> Hi folks,
>
> So the Gluster Jenkins disk was full today (because outages do not
> respect public holidays in India (Independence Day) and France
> (Assumption)). Here is the post mortem for your reading pleasure.
>
> Date: 15/08/2018
>
> Service affected:
>   Jenkins for Gluster (jenkins-el7.rht.gluster.org)
>
> Impact:
>
>   No Jenkins job could be triggered.
>
> Root cause:
>
>   The disk filled up, mainly because we got new jobs and more patches,
>   so regular growth.
>
> Resolution:
>
>   Increased the disk by 30G, and investigating if cleanup could be
>   improved. This did require a reboot.
>
> Involved people:
> - misc
> - nigel
>
> Lessons learned
> - What went well:
>   - we had a documented process for that, and it was good enough to be
>     used by a tired admin.
>
> - What went bad:
>   - we weren't proactive enough to see this before it caused an outage
>   - 15 August is a holiday in both France and India. Technically,
>     none of the infra team should have been up.
>
> - When we were lucky
>   - It was a day off in India, so few people were affected, except
>     folks who continue to work on days off
>   - Misc decided to work while in Brno and take the days off later
>
> Timeline (in UTC)
>
> - 05:58 Amar posts a mail saying "smoke job fail" on gluster-infra:
>   https://lists.gluster.org/pipermail/gluster-infra/2018-August/004795.html
>
> - 06:23 Nigel pings Misc on Telegram to deal with it, since Nigel is
>   away from his laptop for the Independence Day celebration.
>
> - 06:24 Misc does not hear the ding since he is asleep.
>
> - 06:55 Sankarshan opens a bug on it:
>   https://bugzilla.redhat.com/show_bug.cgi?id=1616160
>
> - 06:56 Misc does not see the email since he is still asleep.
>
> - 07:13 Misc wakes up, sees a blinking light on the phone and ponders
>   closing his eyes again. He looks at it, and starts to swear.
>
> - 07:14 Investigation reveals that the Jenkins partition is full
>   (100%). A quick investigation does not yield any particular issue.
>   The Jenkins jobs are taking space and that's it.
>
> - 07:19 After discussion with Nigel, it is decided to increase the size
>   of the partition. Misc takes a look at it and tries to increase it,
>   without any luck. The server is rebooted in case that's what was
>   needed. Still not enough.
>
> - 07:25 Misc goes for a quick shower to wake himself up. The warm
>   embrace of water makes him remember that documentation on that
>   process does exist:
>   https://gluster-infra-docs.readthedocs.io/procedures/resize_vm_partition.html
>
> - 07:30 Following the documentation, we discover that the hypervisor
>   is now out of space for future increases. Looking at that will be
>   done after the post mortem.
>
> - 07:37 Jenkins is being restarted, with more space, and seems to work
>   ok.
>
> - 07:38 Misc rushes to his hotel breakfast, which closes at 10.
>
> - 09:09 Post mortem is finished and being sent.
>
> Action items:
> - (misc) see what can be done for myrmicinae (the hypervisor where
>   Jenkins is running) since there is no more space.
>
> Potential improvements to make:
> - we still need to have monitoring in place
> - we need to move munin to the internal LAN to look at the graphs
>   for Jenkins
> - documentation regarding resizing could be clearer, notably on the
>   volume resizing part

This highlights that we need to solve
https://bugzilla.redhat.com/show_bug.cgi?id=1564372 on priority. The lack
of monitoring is affecting day-to-day work.

--
nigelb
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra