Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <anirudh2...@gmail.com>
wrote:

> Thanks for the quick response and mitigation!
>
> On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
> <marco.g.ab...@googlemail.com.invalid> wrote:
>
> > Hello,
> >
> > Today, CI had some issues and I had to cancel all jobs a few minutes ago.
> > This was largely caused by the high load currently being put on our CI
> > system due to the pre-release efforts for this Friday.
> >
> > It's really unfortunate that we just had outages of three core components
> > within the last two days - sorry about that! To recap, we had the
> > following outages (which are unrelated to the parallel refactor of the
> > Jenkins pipeline):
> > - (yesterday evening) The Jenkins master ran out of disk space and thus
> > processed requests at reduced capacity
> > - (this morning) The Jenkins master got updated, which broke our
> > auto scaling's upscaling capabilities.
> > - (new, this evening) The Jenkins API was unresponsive: due to the high
> > number of jobs and a bad design in the Jenkins REST API, the
> > time-complexity of a simple create or delete request was quadratic, which
> > resulted in all requests timing out (that was the current outage). As a
> > consequence, our auto scaling was unable to interface with the Jenkins
> > master.
> >
> > I have now made improvements to our REST API calls which reduced the
> > complexity from O(N^2) to O(1). The cause was an underlying redirect
> > loop in the Jenkins createNode and deleteNode REST API, in combination
> > with unrolling the entire slave and job graph (which got quite huge
> > under heavy load) upon every single request. Since we had about 150
> > registered slaves and 1000 jobs in the queue, the duration of a single
> > REST API call rose to up to 45 seconds (we execute up to a few hundred
> > queries per auto scaling loop). This led to our auto scaling timing out.
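> >
> > For illustration only, here is a minimal sketch of the idea (the URL,
> > credentials and node config below are placeholders, not our actual auto
> > scaling code): issuing node create/delete calls against the standard
> > Jenkins REST endpoints without following the redirect whose target
> > renders the expensive overview page. A CSRF crumb header may also be
> > required depending on the Jenkins configuration.
> >
> >     # Sketch with assumed endpoints/credentials - not the real CI setup.
> >     import json
> >     import requests
> >
> >     JENKINS_URL = "https://jenkins.example.com"   # hypothetical master
> >     session = requests.Session()
> >     session.auth = ("ci-bot", "api-token")        # hypothetical token
> >
> >     def create_node(name, labels, executors=1):
> >         """Register an agent; do not follow the redirect, since its
> >         target re-renders the full slave/job graph (the O(N^2) part)."""
> >         node_config = {
> >             "name": name,
> >             "nodeDescription": "auto-scaled slave",
> >             "numExecutors": executors,
> >             "remoteFS": "/home/jenkins",
> >             "labelString": labels,
> >             "mode": "EXCLUSIVE",
> >             "type": "hudson.slaves.DumbSlave",
> >             "launcher": {"stapler-class": "hudson.slaves.JNLPLauncher"},
> >         }
> >         resp = session.post(
> >             JENKINS_URL + "/computer/doCreateItem",
> >             params={"name": name, "type": "hudson.slaves.DumbSlave"},
> >             data={"json": json.dumps(node_config)},
> >             allow_redirects=False,  # skip the redirect loop entirely
> >             timeout=10,
> >         )
> >         resp.raise_for_status()
> >
> >     def delete_node(name):
> >         """Remove an agent, again without following the redirect."""
> >         resp = session.post(
> >             JENKINS_URL + "/computer/" + name + "/doDelete",
> >             allow_redirects=False,
> >             timeout=10,
> >         )
> >         resp.raise_for_status()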
> >
> > Everything should be back to normal now. I'm closely observing the
> > situation and I'll let you know if I encounter any additional issues.
> >
> > Again, sorry for any inconvenience caused.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com>
> > wrote:
> >
> > > Yes, let me add to the kudos, very nice work Marco.
> > >
> > >
> > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > >
> > >
> > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > > <kell...@amazon.de.INVALID> wrote:
> > > >
> > > > Appreciate the big effort in bringing the CI back so quickly. Thanks
> > Marco.
> > > >
> > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
> marco.g.ab...@googlemail.com
> > .INVALID>
> > > wrote:
> > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
> unrelated
> > to
> > > > that incident.
> > > >
> > > > If somebody is interested in the details around the outage:
> > > >
> > > > Due to required maintenance (the disk running full), we had to
> > > > upgrade our
> > > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> > > > reason, it used to be 16.04) and we needed to install some packages.
> > > Since
> > > > the support for Ubuntu 17.04 was stopped, this caused all package
> > > > updates and installations to fail because the repositories were taken
> > > > offline. Due to the unavailable maintenance package and other issues
> > with
> > > > the installed OpenJDK8 version, we made the decision to upgrade the
> > > Jenkins
> > > > master to Ubuntu 18.04 LTS in order to get back to a supported
> version
> > > with
> > > > maintenance tools. During this upgrade, Jenkins was automatically
> > updated
> > > > by APT as part of the dist-upgrade process.
> > > >
> > > > In the latest version of Jenkins, some labels which we depend on for
> > > > our auto scaling have been changed. To be more specific:
> > > >> Waiting for next available executor on mxnetlinux-gpu
> > > > has been changed to
> > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > > Notice the quote characters.
> > > >
> > > > Unfortunately, Jenkins does not offer a better way than parsing these
> > > > messages - there's no standardized way to express queue items. Since
> > > > our parser expected the above message without quote characters, this
> > > > message was discarded.
> > > >
> > > > We support various queue reasons (five of them, to be exact) that
> > > > indicate resource starvation. If we run critically low on capacity,
> > > > the queue reason is different and we would still have been able to
> > > > scale up, but most cases would have printed the unsupported message.
> > > > This resulted in reduced capacity (to be specific, the limit during
> > > > that time was one slave per type).
> > > >
> > > > We have now fixed our autoscaling to automatically strip these
> > characters
> > > > and added that message to our test suite.
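> > > >
> > > > As a rough sketch of the idea (the names below are illustrative, not
> > > > the exact parser code), the fix boils down to stripping the
> > > > typographic quotes from the queue reason before matching the label:
> > > >
> > > >     # Sketch only: QUEUE_REASON and extract_starved_label are
> > > >     # placeholder names, not the real auto scaling internals.
> > > >     import re
> > > >
> > > >     QUEUE_REASON = re.compile(
> > > >         r"Waiting for next available executor on (?P<label>.+)")
> > > >
> > > >     def extract_starved_label(reason):
> > > >         """Return the node label a queue item waits for, or None."""
> > > >         match = QUEUE_REASON.match(reason.strip())
> > > >         if not match:
> > > >             return None
> > > >         # Strip the quotes newer Jenkins adds around the label, e.g.
> > > >         # \u2018mxnetlinux-gpu\u2019 -> mxnetlinux-gpu
> > > >         return match.group("label").strip("'\"\u2018\u2019")
> > > >
> > > >     # Old and new wording now resolve to the same label:
> > > >     assert extract_starved_label(
> > > >         "Waiting for next available executor on mxnetlinux-gpu"
> > > >     ) == "mxnetlinux-gpu"
> > > >     assert extract_starved_label(
> > > >         "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019"
> > > >     ) == "mxnetlinux-gpu"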
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
> > aaron.s.mark...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Marco, thanks for your hard work on this. I'm super excited about
> the
> > > new
> > > >> Jenkins jobs. This is going to be very helpful and improve sanity
> for
> > > our
> > > >> PRs and ourselves!
> > > >>
> > > >> Cheers,
> > > >> Aaron
> > > >>
> > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > > >> <marco.g.ab...@googlemail.com.invalid> wrote:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> the CI is now back up and running. Auto scaling is working as
> > expected
> > > >> and
> > > >>> it passed our load tests.
> > > >>>
> > > >>> Please excuse any inconvenience caused.
> > > >>>
> > > >>> Best regards,
> > > >>> Marco
> > > >>>
> > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > > >>> marco.g.ab...@googlemail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hello,
> > > >>>>
> > > >>>> I'd like to let you know that our CI was impaired and down for the
> > > last
> > > >>>> few hours. After getting the CI back up, I noticed that our auto
> > > >>>> scaling was broken by a silent update of Jenkins which disrupted
> > > >>>> our upscale detection.
> > > >>>> Manual scaling is currently not possible and stopping the scaling
> > > won't
> > > >>>> help either because there are currently no p3 instances available,
> > > >>>> which means that all jobs will fail nonetheless. In a few hours, the
> > auto
> > > >>>> scaling will have recycled all slaves through the down-scale
> > mechanism
> > > >>> and
> > > >>>> we will be out of capacity. This will lead to resource starvation
> > and
> > > >>> thus
> > > >>>> timeouts.
> > > >>>>
> > > >>>> Your PRs will be properly registered by Jenkins, but please expect
> > the
> > > >>>> jobs to time out and thus fail your PRs.
> > > >>>>
> > > >>>> I will fix the auto scaling as soon as I'm awake again.
> > > >>>>
> > > >>>> Sorry for the inconvenience caused.
> > > >>>>
> > > >>>> Best regards,
> > > >>>> Marco
> > > >>>>
> > > >>>>
> > > >>>> P.S. Sorry for the brief email and my lack of further fixes, but
> > it's
> > > >>>> 5:30AM now and I've been working for 17 hours.
> > > >>>>
> > > >>>
> > > >>
> > >
> >
>
