Thanks for your effort and help to make CI a better place!

Qing
On 11/21/18, 4:38 PM, "Lin Yuan" <apefor...@gmail.com> wrote:

Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:
> Thanks for the quick response and mitigation!
>
> On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:
> >
> > Hello,
> >
> > today, CI had some issues and I had to cancel all jobs a few minutes
> > ago. This was caused by the high load that is currently being put on
> > our CI system due to the pre-release efforts for this Friday.
> >
> > It's really unfortunate that we had outages of three core components
> > within the last two days - sorry about that! To recap, we had the
> > following outages (which are unrelated to the parallel refactor of
> > the Jenkins pipeline):
> > - (yesterday evening) The Jenkins master ran out of disk space and
> >   thus processed requests at reduced capacity.
> > - (this morning) The Jenkins master got updated, which broke our auto
> >   scaling's upscaling capabilities.
> > - (new, this evening) The Jenkins API was unresponsive: due to the
> >   high number of jobs and a bad API design in the Jenkins REST API,
> >   the time complexity of a simple create or delete request was
> >   quadratic, which resulted in all requests timing out (that was the
> >   current outage). This left our auto scaling unable to interface
> >   with the Jenkins master.
> >
> > I have now made improvements to our REST API calls which reduced the
> > complexity from O(N^2) to O(1). The cause was an underlying redirect
> > loop in the Jenkins createNode and deleteNode REST API, combined with
> > unrolling the entire slave and job graph (which got quite huge under
> > heavy load) on every single request. Since we had about 150
> > registered slaves and 1000 jobs in the queue, the duration of a
> > single REST API call rose to up to 45 seconds (and we execute up to a
> > few hundred queries per auto scaling loop).
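[Editor's note: a minimal sketch of the complexity difference Marco describes. This is hypothetical illustration code, not the actual CI auto-scaling implementation; the function names and cost model are assumptions.]

```python
# Hypothetical cost model for the outage described above (not the real
# CI code): the slow path re-walks the entire slave-and-job graph on
# every create/delete request, so N requests over a graph of size G cost
# N * G units of work -- quadratic once G grows with N. The fixed path
# issues one direct call per request, so N requests cost N units.

def provision_quadratic(num_requests: int, graph_size: int) -> int:
    """Total work when every request unrolls the whole graph."""
    cost = 0
    for _ in range(num_requests):
        cost += graph_size  # full slave/job graph walked per request
    return cost

def provision_constant(num_requests: int) -> int:
    """Total work when each request is a single direct API call."""
    return num_requests  # one unit of work per request
```

With the numbers from the email (about 150 slaves plus 1000 queued jobs), the first model does 150 * 1150 units of work per scaling loop, while the second does only 150 - which matches why single calls blew up to 45 seconds before the fix.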
> > This led to our auto scaling timing out.
> >
> > Everything should be back to normal now. I'm closely observing the
> > situation and I'll let you know if I encounter any additional issues.
> >
> > Again, sorry for any inconvenience caused.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com> wrote:
> > >
> > > Yes, let me add to the kudos, very nice work Marco.
> > >
> > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > >
> > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:
> > > >
> > > > I appreciate the big effort in bringing the CI back so quickly.
> > > > Thanks Marco.
> > > >
> > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
> > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
> > > > unrelated to that incident.
> > > >
> > > > If somebody is interested in the details around the outage:
> > > >
> > > > Due to required maintenance (a disk running full), we had to
> > > > upgrade our Jenkins master because it was running on Ubuntu 17.04
> > > > (for an unknown reason, it used to be 16.04) and we needed to
> > > > install some packages. Since support for Ubuntu 17.04 had ended,
> > > > all package updates and installations failed because the
> > > > repositories were taken offline. Due to the unavailable
> > > > maintenance packages and other issues with the installed OpenJDK8
> > > > version, we made the decision to upgrade the Jenkins master to
> > > > Ubuntu 18.04 LTS in order to get back to a supported version with
> > > > maintenance tools. During this upgrade, Jenkins was automatically
> > > > updated by APT as part of the dist-upgrade process.
> > > >
> > > > In the latest version of Jenkins, some labels that we depend on
> > > > for our auto scaling have been changed.
> > > > To be more specific:
> > > > > Waiting for next available executor on mxnetlinux-gpu
> > > > has been changed to
> > > > > Waiting for next available executor on ‘mxnetlinux-gpu’
> > > > Notice the quote characters.
> > > >
> > > > Unfortunately, Jenkins does not offer a better way than to parse
> > > > these messages - there is no standardized way to express queue
> > > > items. Since our parser expected the above message without quote
> > > > characters, this message was discarded.
> > > >
> > > > We support various queue reasons (five of them, to be exact) that
> > > > indicate resource starvation. If we run very low on capacity, the
> > > > queue reason is different and we would still be able to scale up,
> > > > but most of the cases would have printed the unsupported message.
> > > > This resulted in reduced capacity (to be specific, the limit
> > > > during that time was one slave per type).
> > > >
> > > > We have now fixed our auto scaling to automatically strip these
> > > > characters and added that message to our test suite.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:
> > > > >
> > > > > Marco, thanks for your hard work on this. I'm super excited
> > > > > about the new Jenkins jobs. This is going to be very helpful and
> > > > > improve sanity for our PRs and ourselves!
> > > > >
> > > > > Cheers,
> > > > > Aaron
> > > > >
> > > > > On Wed, Nov 21, 2018, 05:37 Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > the CI is now back up and running. Auto scaling is working as
> > > > > > expected and it passed our load tests.
> > > > > >
> > > > > > Please excuse any inconvenience caused.
> > > > > > Best regards,
> > > > > > Marco
> > > > > >
> > > > > > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I'd like to let you know that our CI was impaired and down
> > > > > > > for the last few hours. After getting the CI back up, I
> > > > > > > noticed that our auto scaling broke due to a silent update
> > > > > > > of Jenkins which broke our upscale detection. Manual scaling
> > > > > > > is currently not possible, and stopping the scaling won't
> > > > > > > help either because there are currently no p3 instances
> > > > > > > available, which means that all jobs will fail nonetheless.
> > > > > > > In a few hours, the auto scaling will have recycled all
> > > > > > > slaves through the down-scale mechanism and we will be out
> > > > > > > of capacity. This will lead to resource starvation and thus
> > > > > > > timeouts.
> > > > > > >
> > > > > > > Your PRs will be properly registered by Jenkins, but please
> > > > > > > expect the jobs to time out and thus fail your PRs.
> > > > > > >
> > > > > > > I will fix the auto scaling as soon as I'm awake again.
> > > > > > >
> > > > > > > Sorry for any inconvenience caused.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Marco
> > > > > > >
> > > > > > > P.S. Sorry for the brief email and my lack of further fixes,
> > > > > > > but it's 5:30 AM now and I've been working for 17 hours.
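[Editor's note: the queue-message fix described in the thread - stripping the quote characters Jenkins started adding around labels - could be sketched as below. This is hypothetical code, not the actual MXNet auto-scaling parser; the function and pattern names are assumptions.]

```python
import re

# Hypothetical quote-tolerant parser for the Jenkins queue message
# discussed above. The new Jenkins wraps the label in quote characters
# ("... executor on 'mxnetlinux-gpu'"), so the pattern optionally
# accepts straight or curly quotes around the label.
QUEUE_PATTERN = re.compile(
    r"Waiting for next available executor on "
    r"['\u2018\u2019\"]?(?P<label>[\w-]+)['\u2018\u2019\"]?"
)

def parse_starved_label(message: str):
    """Return the slave label a queue item is waiting for, or None."""
    match = QUEUE_PATTERN.search(message)
    return match.group("label") if match else None
```

Both the pre-update message and the new quoted variant then map to the same label, which is the behavior the fix (plus the new test-suite case) aims for.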