Hi,

We had 3 incidents in the last 24h, and while all of them are
different, they are also linked.

We faced several issues, starting with Gerrit returning error 500
last night, around 23:00 Paris time.

That was https://bugzilla.redhat.com/show_bug.cgi?id=1620243 , and it
resulted in a memory upgrade this morning.


Then we started to look at other issues that were uncovered while
investigating the first one, and I looked at the size of the mail
queue. Usually, this is not a problem, but after adding swap, it
became an issue.

So I started to look for a way to blacklist mail sent to
jenk...@build.gluster.org, first by routing this mail domain to
supercolony, then by changing postfix to drop the mail.

And then we got 2 issues at once. The timeline below is in UTC.


Timeline
--------

13:42  misc adds an MX record for build.gluster.org in the zone. To do
that, the DNS zone was changed and build.gluster.org could no longer be
a CNAME.

14:56  kaleb pings misc/nigel saying "there is a message about disk full
on that job"

15:00  misc clicks on the link to build.gluster.org and is greeted by an
SSL certificate error. It seems the DNS now resolves build.gluster.org
to 2 IPs instead of 1.

15:04  misc reverts the DNS change, as there is no time to investigate.

15:05  misc figures out that the server has a full disk because the
logs are stored on /

15:07  misc also starts to swear in 2 languages

15:18  a new partition with more space is created on
http.int.rht.gluster.org, data is copied, httpd is restarted, and the
situation is back to normal
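
The "2 IP instead of 1" symptom seen at 15:00 is the kind of thing a
small check could catch right after a DNS change. Below is a minimal
sketch in Python, assuming only the standard library. The function
names and the expected count of 1 are illustrative choices, not
something we run today. It asks the local resolver, so a cached answer
can hide a recent change, and it only counts IPv4 addresses by default.

```python
import socket


def resolved_addresses(hostname, port=443, family=socket.AF_INET):
    """Return the sorted, distinct IP addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=family, proto=socket.IPPROTO_TCP
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


def check_address_count(hostname, expected_count=1):
    """Return (ok, addresses); ok is False when the count is unexpected."""
    addresses = resolved_addresses(hostname)
    return len(addresses) == expected_count, addresses


if __name__ == "__main__":
    # Hypothetical usage: the hostname is the one from this incident.
    ok, addresses = check_address_count("build.gluster.org")
    print("OK" if ok else "UNEXPECTED", addresses)
```

Run from cron or by hand after a zone change, this would flag a name
that suddenly points to more servers than intended.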

Impact:
- some build logs were lost (likely not much)
- for 1h, some people could have been randomly directed to the wrong
server when going to build.gluster.org


Root cause:
- for DNS, a wrong commit. The syntax looked correct (and was
verified), so I need to check why it did more than required.

- for the full disk, an increase in patches and an oversight in that
server's installation.
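
On the DNS side, one rule that is easy to check automatically is that a
name holding a CNAME cannot hold other records such as MX (RFC 1034,
section 3.6.2), which is why build.gluster.org could no longer be a
CNAME once the MX was added. Here is a rough sketch of such a lint,
assuming the zone has already been parsed into (name, type) pairs. The
function name and the sample records are hypothetical, and this is not
a claim about what the faulty commit actually contained.

```python
from collections import defaultdict

# DNSSEC records are allowed to sit next to a CNAME, so they are ignored.
ALLOWED_WITH_CNAME = {"RRSIG", "NSEC"}


def cname_conflicts(records):
    """Map each name where a CNAME coexists with other data to those types.

    records is an iterable of (name, record_type) tuples.
    """
    types_by_name = defaultdict(set)
    for name, record_type in records:
        # Normalise case and the trailing dot so both spellings match.
        types_by_name[name.lower().rstrip(".")].add(record_type.upper())

    conflicts = {}
    for name, types in types_by_name.items():
        others = types - {"CNAME"} - ALLOWED_WITH_CNAME
        if "CNAME" in types and others:
            conflicts[name] = sorted(others)
    return conflicts


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        ("build.gluster.org.", "CNAME"),
        ("build.gluster.org.", "MX"),
        ("gluster.org.", "MX"),
    ]
    print(cname_conflicts(sample))
```

A check like this could run before a zone commit is pushed, next to the
syntax verification that already exists.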


Resolution:
- dns got reverted
- new partition was added and data were copied

What went well:
- we were quickly able to resolve the issue thanks to automation

Where we were lucky:
- the issue got detected fast by the same person who made the change
(DNS), and people (Kaleb) notified us as soon as something seemed weird
(disk)
- none of us were in Vancouver facing a measles outbreak

What went badly:
- still no monitoring


Potential improvements to make:
- add monitoring
- revise resource usage
- prepare a template for post-mortems
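
As a starting point for the monitoring item, a disk usage check can be
quite small. This is only a sketch, assuming Python's standard library.
The function names and the 80%/90% thresholds are arbitrary
placeholders, and the percentage can differ slightly from what df
reports because of reserved blocks.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used space on the filesystem holding path, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of 0; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def check_disk(path="/", warn_at=80.0, crit_at=90.0):
    """Return (status, percent) where status is OK, WARNING or CRITICAL."""
    percent = disk_usage_percent(path)
    if percent >= crit_at:
        status = "CRITICAL"
    elif percent >= warn_at:
        status = "WARNING"
    else:
        status = "OK"
    return status, percent


if __name__ == "__main__":
    status, percent = check_disk("/")
    print(f"{status}: / is {percent:.1f}% full")
```

Even something this basic, run regularly with a mail or IRC
notification, would likely have warned us before the logs filled /.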

-- 
Michael Scherer
Sysadmin, Community Infrastructure and Platform, OSAS

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel