britney outage from 10:42 to 20:55 (UTC)

At 17:47 (all times UTC) today I was informed by RikMills that britney had not been running successfully since about 11:00, because autopkgtest.com/results was eventually 503-ing.
By 18:15, I had mistakenly identified this as a swift issue and asked IS to investigate. By 18:40 I heard back from IS that they did not see errors, and looked at the autopkgtest-web logs a second time.

Investigation showed that the issue was haproxy disabling our cloud workers, which both failed at the same time (well, regularly) due to "database disk image is malformed" errors from sqlite.

By 19:00 Laney started working on moving the /results proxying out of the Apache servers and directly into haproxy, and released that at 20:34.

Meanwhile I started working on fixing the error by replacing our simple file copying code with the SQLite online backup API, as waveform had suggested earlier. That work was finished at 20:56, after we finally figured out how to give me charm store access (grant --channel unpublished did the trick).

I identified some work items for the future:

* Some other alerts for britney failing, as relying on community members to report 7 hours after the first failure is not super helpful.
* We probably also need some monitoring that alerts us of the high failure rate we had on the web servers.

--
debian developer - deb.li/jak | jak-linux.org - free software dev
ubuntu core developer                              i speak de, en

--
Ubuntu-release mailing list
Ubuntu-release@lists.ubuntu.com
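
PS: For anyone curious about the fix, here is a minimal sketch of the SQLite online backup approach mentioned above, written against Python's standard sqlite3 module. It is not the actual autopkgtest-web change: the function name and the file paths are hypothetical and only illustrate the idea. Copying a database file byte for byte while another process is writing to it can produce a torn copy, which is one plausible way to end up with "database disk image is malformed"; the backup API copies pages under SQLite's own locking instead.

```python
import sqlite3


def backup_database(src_path: str, dst_path: str) -> None:
    """Copy a live SQLite database using the online backup API.

    Hypothetical helper for illustration; the paths are supplied by the
    caller and do not reflect the real autopkgtest-web layout.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Connection.backup() (Python 3.7+) wraps sqlite3_backup_*: it
        # copies the database page by page and restarts as needed if the
        # source changes, so the destination is a consistent snapshot.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Writing to a temporary destination and renaming it into place afterwards would additionally keep readers from ever seeing a half-written copy, but that is a design choice on top of this sketch.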