britney outage from 10:42 to 20:55 (UTC)

At 17:47 (all times UTC) today I was informed by RikMills
that britney had not been running successfully since about
11:00, because autopkgtest.com/results had started returning
503 errors.

By 18:15, I had mistakenly identified this as a swift issue
and asked IS to investigate.

By 18:40 I heard back from IS that they did not see errors,
and took a second look at the autopkgtest-web logs. Investigation
showed that the issue was haproxy disabling both of our cloud
workers, which failed at the same time (well, regularly) due to
"database disk image is malformed" errors from sqlite.

By 19:00 Laney started working on moving the /results proxying
out of the Apache servers and directly into haproxy, and released
that at 20:34.

Meanwhile I started working on fixing the error by replacing
our simple file copying code with the SQLite online backup
API, as waveform had suggested earlier. That work was finished
at 20:56, after we finally figured out how to give me charm
store access (grant --channel unpublished did the trick).
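
For anyone curious what the switch from file copying to the online
backup API looks like, here is a minimal sketch using Python's
sqlite3 module. It is not the actual autopkgtest-web change; the
function name and the paths are hypothetical and only illustrate
the technique.

```python
import sqlite3


def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database with the online backup API.

    Hypothetical helper for illustration, not the autopkgtest-web code.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup() copies the database page by page through
        # SQLite itself, so the copy is a consistent snapshot even if
        # another process writes to the source while it runs.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

A plain file copy can catch the database in the middle of a write,
which is one way to end up with a "database disk image is malformed"
error on the reading side; going through the backup API avoids that.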

I identified some work items for the future:

* We need some alerts for britney failing, as relying on community
  members to report 7 hours after the first failure is not super helpful.

* We probably also need some monitoring that alerts us to the high
  failure rate we had on the web servers.
-- 
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en
