Thanks for the quick response and mitigation!

On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:
> Hello,
>
> today, CI had some issues and I had to cancel all jobs a few minutes ago.
> This was basically caused by the high load that is currently being put on
> our CI system due to the pre-release efforts for this Friday.
>
> It's really unfortunate that we just had outages of three core components
> within the last two days - sorry about that! To recap, we had the
> following outages (which are unrelated to the parallel refactor of the
> Jenkins pipeline):
> - (yesterday evening) The Jenkins master ran out of disk space and thus
>   processed requests at reduced capacity.
> - (this morning) The Jenkins master got updated, which broke our
>   autoscaling's upscaling capabilities.
> - (new, this evening) The Jenkins API was unresponsive: due to the high
>   number of jobs and a bad API design in the Jenkins REST API, the time
>   complexity of a simple create or delete request was quadratic, which
>   resulted in all requests timing out (that was the current outage). This
>   left our auto scaling unable to interface with the Jenkins master.
>
> I have now made improvements to our REST API calls which reduced the
> complexity from O(N^2) to O(1). The reason was an underlying redirect loop
> in the Jenkins createNode and deleteNode REST API in combination with
> unrolling the entire slave and job graph (which got quite huge during
> extensive load) upon every single request. Since we had about 150
> registered slaves and 1000 jobs in the queue, the duration for a single
> REST API call rose to up to 45 seconds (we execute up to a few hundred
> queries per auto scaling loop). This led to our auto scaling timing out.
>
> Everything should be back to normal now. I'm closely observing the
> situation and I'll let you know if I encounter any additional issues.
>
> Again, sorry for any inconvenience caused.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com> wrote:
>
> > Yes, let me add to the kudos, very nice work Marco.
> >
> >
> > "I'm trying real hard to be the shepherd." -Jules Winnfield
> >
> >
> > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:
> > >
> > > Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.
> > >
> > > On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
> > > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> > > that incident.
> > >
> > > If somebody is interested in the details around the outage:
> > >
> > > Due to required maintenance (disk running full), we had to upgrade our
> > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> > > reason, it used to be 16.04) and we needed to install some packages.
> > > Since support for Ubuntu 17.04 had been stopped, all package updates and
> > > installations failed because the repositories had been taken offline.
> > > Due to the unavailable maintenance package and other issues with the
> > > installed OpenJDK8 version, we made the decision to upgrade the Jenkins
> > > master to Ubuntu 18.04 LTS in order to get back to a supported version
> > > with maintenance tools. During this upgrade, Jenkins was automatically
> > > updated by APT as part of the dist-upgrade process.
> > >
> > > In the latest version of Jenkins, some labels that we depend on for our
> > > auto scaling have been changed. To be more specific:
> > >> Waiting for next available executor on mxnetlinux-gpu
> > > has been changed to
> > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > Notice the quote characters.
> > >
> > > Unfortunately, Jenkins does not offer a better way than parsing these
> > > messages - there's no standardized way to express queue items.
> > > Since our parser expected the message without the quote characters, the
> > > new message was discarded.
> > >
> > > We support various queue reasons (5 of them to be exact) that indicate
> > > resource starvation. If we run super low on capacity, the queue reason is
> > > different and we would still be able to scale up, but most of the cases
> > > would have printed the unsupported message. This resulted in reduced
> > > capacity (to be specific, the limit during that time was 1 slave per type).
> > >
> > > We have now fixed our autoscaling to automatically strip these characters
> > > and added that message to our test suite.
> > >
> > > Best regards,
> > > Marco
> > >
> > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com>
> > > wrote:
> > >
> > >> Marco, thanks for your hard work on this. I'm super excited about the new
> > >> Jenkins jobs. This is going to be very helpful and improve sanity for our
> > >> PRs and ourselves!
> > >>
> > >> Cheers,
> > >> Aaron
> > >>
> > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > >> <marco.g.ab...@googlemail.com.invalid> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> the CI is now back up and running. Auto scaling is working as expected
> > >>> and it passed our load tests.
> > >>>
> > >>> Please excuse the inconvenience caused.
> > >>>
> > >>> Best regards,
> > >>> Marco
> > >>>
> > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com>
> > >>> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> I'd like to let you know that our CI was impaired and down for the last
> > >>>> few hours. After getting the CI back up, I noticed that our auto scaling
> > >>>> broke due to a silent update of Jenkins which broke our upscale-detection.
> > >>>> Manual scaling is currently not possible and stopping the scaling won't
> > >>>> help either because there are currently no p3 instances available, which
> > >>>> means that all jobs will fail nonetheless. In a few hours, the auto
> > >>>> scaling will have recycled all slaves through the down-scale mechanism
> > >>>> and we will be out of capacity. This will lead to resource starvation
> > >>>> and thus timeouts.
> > >>>>
> > >>>> Your PRs will be properly registered by Jenkins, but please expect the
> > >>>> jobs to time out and thus fail your PRs.
> > >>>>
> > >>>> I will fix the auto scaling as soon as I'm awake again.
> > >>>>
> > >>>> Sorry for the inconvenience caused.
> > >>>>
> > >>>> Best regards,
> > >>>> Marco
> > >>>>
> > >>>>
> > >>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
> > >>>> 5:30AM now and I've been working for 17 hours.
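
For readers of the archive: the quote-stripping fix described in the thread (the "Waiting for next available executor on ..." message gaining typographic quotes after the Jenkins update) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the actual MXNet CI autoscaling code: the function name `label_from_queue_reason`, the regular expression, and the set of quote characters are all hypothetical, and the thread only confirms the one message shape shown.

```python
import re
from typing import Optional

# Hypothetical pattern: the thread quotes only this one queue message shape.
_WAITING_RE = re.compile(
    r"^Waiting for next available executor on (?P<label>.+)$"
)

# Assumed set of quote characters to strip: the typographic quotes seen in
# the newer Jenkins message, plus plain ASCII quotes for good measure.
_QUOTE_CHARS = "\u2018\u2019'\""


def label_from_queue_reason(reason: str) -> Optional[str]:
    """Return the node label a queued job is waiting for, or None.

    Works for both the old (unquoted) and new (quoted) message variants.
    """
    match = _WAITING_RE.match(reason.strip())
    if match is None:
        # Not a "waiting for executor" message; the caller decides what to do.
        return None
    # Strip surrounding whitespace first, then any surrounding quote characters.
    return match.group("label").strip().strip(_QUOTE_CHARS)


old_style = "Waiting for next available executor on mxnetlinux-gpu"
new_style = "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019"
print(label_from_queue_reason(old_style))
print(label_from_queue_reason(new_style))
```

Normalizing the label before comparing it keeps the parser tolerant of this kind of cosmetic change, and both message variants can then be kept as test cases, as the thread says was done.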