Thanks for the quick response and mitigation!

On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
<marco.g.ab...@googlemail.com.invalid> wrote:

> Hello,
>
> Today, CI had some issues and I had to cancel all jobs a few minutes ago.
> This was basically caused by the high load that is currently being put on
> our CI system due to the pre-release efforts for this Friday.
>
> It's really unfortunate that we just had outages of three core components
> within the last two days. Sorry about that! To recap, we had the
> following outages (which are unrelated to the parallel refactor of the
> Jenkins pipeline):
> - (yesterday evening) The Jenkins master ran out of disk space and thus
> processed requests at reduced capacity.
> - (this morning) The Jenkins master got updated, which broke our
> auto scaling's upscaling capabilities.
> - (new, this evening) The Jenkins API was unresponsive: due to the high
> number of jobs and a bad API design in the Jenkins REST API, the time
> complexity of a simple create or delete request was quadratic, which
> resulted in all requests timing out (that was the current outage). This
> left our auto scaling unable to interface with the Jenkins master.
>
> I have now made improvements to our REST API calls which reduced the
> complexity from O(N^2) to O(1). The cause was an underlying redirect loop
> in the Jenkins createNode and deleteNode REST API in combination with
> unrolling the entire slave and job graph (which grew very large under
> heavy load) upon every single request. Since we had about 150
> registered slaves and 1000 jobs in the queue, the duration of a single
> REST API call rose to as much as 45 seconds (we execute up to a few hundred
> queries per auto scaling loop). This led to our auto scaling timing out.
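
To put the numbers above in perspective, here is a rough back-of-the-envelope
sketch in Python. The 45 seconds per call is the upper bound quoted above; the
200 queries per loop is only an assumed stand-in for "a few hundred", and the
sketch assumes the calls run one after another, which the message does not
state.

```python
# Upper bound for a single REST API call, taken from the message above.
SECONDS_PER_CALL = 45
# ASSUMPTION: "a few hundred" queries per loop, modelled here as 200.
QUERIES_PER_LOOP = 200


def loop_duration_seconds(queries: int, seconds_per_call: int) -> int:
    """Worst-case duration of one auto scaling loop with sequential calls."""
    return queries * seconds_per_call


worst_case = loop_duration_seconds(QUERIES_PER_LOOP, SECONDS_PER_CALL)
print(f"{worst_case} s = {worst_case / 3600:.1f} h per auto scaling loop")
# prints: 9000 s = 2.5 h per auto scaling loop
```

Under these assumptions a single loop would take hours, so a timeout is the
expected outcome whatever the exact query count was.
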
>
> Everything should be back to normal now. I'm closely observing the
> situation and I'll let you know if I encounter any additional issues.
>
> Again, sorry for any inconvenience caused.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com>
> wrote:
>
> > Yes, let me add to the kudos: very nice work, Marco.
> >
> >
> > "I'm trying real hard to be the shepherd." -Jules Winnfield
> >
> >
> > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > > <kell...@amazon.de.INVALID> wrote:
> > >
> > > > Appreciate the big effort in bringing the CI back so quickly. Thanks,
> > > > Marco.
> > >
> > > > On Nov 21, 2018 5:52 AM, Marco de Abreu
> > > > <marco.g.ab...@googlemail.com.INVALID> wrote:
> > > > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated
> > > > to that incident.
> > >
> > > If somebody is interested in the details around the outage:
> > >
> > > > Due to required maintenance (the disk was running full), we had to
> > > > upgrade our Jenkins master because it was running on Ubuntu 17.04 (for
> > > > an unknown reason, it used to be 16.04) and we needed to install some
> > > > packages. Since support for Ubuntu 17.04 has ended, all package updates
> > > > and installations failed because the repositories had been taken
> > > > offline. Due to the unavailable maintenance package and other issues
> > > > with the installed OpenJDK8 version, we made the decision to upgrade the
> > > > Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported
> > > > version with maintenance tools. During this upgrade, Jenkins was
> > > > automatically updated by APT as part of the dist-upgrade process.
> > >
> > > > In the latest version of Jenkins, some labels which we depend on for
> > > > our auto scaling have been changed. To be more specific:
> > >> Waiting for next available executor on mxnetlinux-gpu
> > > has been changed to
> > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > Notice the quote characters.
> > >
> > > > Unfortunately, Jenkins does not offer a better way than parsing these
> > > > messages; there's no standardized way to express queue items. Since
> > > > our parser expected the above message without quote characters, the
> > > > new message was discarded.
> > >
> > > > We support various queue reasons (5 of them, to be exact) that indicate
> > > > resource starvation. If we run very low on capacity, the queue reason
> > > > is different and we would still be able to scale up, but most cases
> > > > produced the unsupported message. This resulted in reduced capacity
> > > > (to be specific, the limit during that time was 1 slave per type).
> > >
> > > > We have now fixed our auto scaling to automatically strip these
> > > > characters and added that message to our test suite.
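
A minimal sketch of how a parser could tolerate both forms of the message
quoted above. The function name `parse_waiting_label` and the regular
expression are hypothetical illustrations, not the project's actual auto
scaling code; only the two message strings come from the mail.

```python
import re
from typing import Optional

# Quote characters to strip from the label: straight quotes plus the
# typographic pair (U+2018, U+2019) seen in the newer Jenkins message.
_QUOTES = "'\"\u2018\u2019"

# HYPOTHETICAL pattern for the queue reason discussed in the mail.
_WAITING = re.compile(r"^Waiting for next available executor on (?P<label>.+)$")


def parse_waiting_label(reason: str) -> Optional[str]:
    """Return the node label from a queue reason, or None if it does not match."""
    match = _WAITING.match(reason.strip())
    if match is None:
        return None
    # Remove surrounding whitespace first, then any surrounding quotes.
    label = match.group("label").strip().strip(_QUOTES)
    return label or None


old = "Waiting for next available executor on mxnetlinux-gpu"
new = "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019"
print(parse_waiting_label(old), parse_waiting_label(new))
```

Keeping both strings as test cases, as described above, means a future
wording change in Jenkins shows up as a failing test instead of a silent
loss of capacity.
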
> > >
> > > Best regards,
> > > Marco
> > >
> > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham
> > > > <aaron.s.mark...@gmail.com> wrote:
> > >
> > > >> Marco, thanks for your hard work on this. I'm super excited about the
> > > >> new Jenkins jobs. This is going to be very helpful and improve sanity
> > > >> for our PRs and ourselves!
> > >>
> > >> Cheers,
> > >> Aaron
> > >>
> > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > > >> <marco.g.ab...@googlemail.com.invalid> wrote:
> > >>
> > >>> Hello,
> > >>>
> > > >>> The CI is now back up and running. Auto scaling is working as
> > > >>> expected and it passed our load tests.
> > > >>>
> > > >>> Please excuse the inconvenience caused.
> > >>>
> > >>> Best regards,
> > >>> Marco
> > >>>
> > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu
> > > >>> <marco.g.ab...@googlemail.com> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > > >>>> I'd like to let you know that our CI was impaired and down for the
> > > >>>> last few hours. After getting the CI back up, I noticed that our auto
> > > >>>> scaling broke due to a silent update of Jenkins, which broke our
> > > >>>> upscale detection. Manual scaling is currently not possible, and
> > > >>>> stopping the scaling won't help either because there are currently no
> > > >>>> p3 instances available, which means that all jobs will fail
> > > >>>> nonetheless. In a few hours, the auto scaling will have recycled all
> > > >>>> slaves through the down-scale mechanism and we will be out of
> > > >>>> capacity. This will lead to resource starvation and thus timeouts.
> > >>>>
> > > >>>> Your PRs will be properly registered by Jenkins, but please expect
> > > >>>> the jobs to time out and thus fail your PRs.
> > >>>>
> > >>>> I will fix the auto scaling as soon as I'm awake again.
> > >>>>
> > > >>>> Sorry for the inconvenience caused.
> > >>>>
> > >>>> Best regards,
> > >>>> Marco
> > >>>>
> > >>>>
> > > >>>> P.S. Sorry for the brief email and my lack of further fixes, but
> > > >>>> it's 5:30AM now and I've been working for 17 hours.
> > >>>>
> > >>>
> > >>
> >
>
