Thanks, Marco, for the updates and for resolving the issues. However, I still see a number of PRs waiting to be merged with inconsistent PR validation status checks. For example, https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending checks queued, yet when you look at the details, every check has either passed or failed (centos-cpu, edge, unix-cpu, windows-cpu and windows-gpu failed; the required pr-merge, which includes the edge and GPU tests, passed). The same applies to other PRs with the label pr-awaiting-merge ( https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge ). Please advise on a resolution.

Regards,
Steffen
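
One way to cross-check the discrepancy described above is to ask GitHub what it has actually recorded for the PR's head commit, independent of what the PR page renders. The following is only a diagnostic sketch in Python against the public GitHub REST API (unauthenticated calls are rate limited); it uses the example PR referenced above:

    import requests

    # Compare GitHub's recorded commit statuses with what the PR page renders.
    REPO = "apache/incubator-mxnet"
    PR_NUMBER = 13041  # example PR from the message above

    # Look up the PR's head commit, then fetch the combined status for that commit.
    pr = requests.get(f"https://api.github.com/repos/{REPO}/pulls/{PR_NUMBER}").json()
    head_sha = pr["head"]["sha"]
    combined = requests.get(f"https://api.github.com/repos/{REPO}/commits/{head_sha}/status").json()

    print("overall state:", combined["state"])
    for status in combined["statuses"]:
        # Only the most recent status per context is reported here.
        print(status["context"], "->", status["state"])

Because the combined status endpoint returns only the latest status per context, its output makes it easy to see whether stale "pending" entries are still stored on the commit or whether the PR page is simply rendering the checks inconsistently.
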
On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:

> Thanks everybody, I really appreciate it!
>
> Today was a good day: there were no incidents and everything appears to be stable. In the meantime I did a deep dive on why we had such a significant performance decrease in our compilation jobs - which then clogged up the queue and resulted in 1000 jobs waiting to be scheduled.
>
> The reason was the way we use ccache to speed up our compilation jobs. Usually, this yields a huge performance improvement (CPU openblas, for example, goes from 30 minutes down to ~3 min, ARMv7 from 30 minutes down to ~1.5 min, etc.). Unfortunately, in this case ccache was our limiting factor. Here's some background on how we operate our cache:
>
> We use EFS to share a distributed ccache between all of our unrestricted-prod-slaves. EFS is rated for almost unlimited scalability (being consumed by thousands of instances in parallel [1]) with a theoretical throughput of over 10 Gbps. One thing I didn't know when I designed this approach was how that throughput is granted. Similar to T2 CPU credits, EFS uses BurstCredits to allow you higher throughput (the default is 50 MiB/s) [2]. Due to the high load, we consumed all of our credits - here's a very interesting graph: [3].
>
> To avoid similar incidents in the future, I have taken the following actions:
> 1. I switched EFS from burst mode to provisioned throughput with 300 MB/s (in the graph at [3] you can see how our IO immediately increases - and thus our CI gets faster - as soon as I added provisioned throughput).
> 2. I created internal follow-up tickets to add monitoring and automated actions.
>
> First, we should be notified if we are running low on credits so we can kick off an investigation. Second (nice to have), we could have a Lambda function which listens for that event and automatically switches the EFS volume from burst mode to provisioned throughput during high-load times. The required throughput could be retrieved via CloudWatch and then multiplied by a factor. EFS allows you to downgrade the throughput mode 24 hours after the last change (to reduce capacity once the load is over) and always allows you to increase the provisioned capacity (if the load goes even higher). I've been looking for a pre-made CloudFormation template to facilitate that, but so far I haven't been able to find one.
>
> I'm now running additional load tests on our test CI environment to detect other potential bottlenecks.
>
> Thanks a lot for your support!
>
> Best regards,
> Marco
>
> [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
> [2]: https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
> [3]: https://i.imgur.com/nboQLOn.png
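
The Lambda automation described in the email above could look roughly like the following handler. This is a sketch under assumptions, not the actual tooling: the file system ID, lookback window and headroom factor are made up, and the handler is presumed to be triggered by a CloudWatch alarm on the file system's BurstCreditBalance metric (which would also cover the "notify us when credits run low" part):

    import datetime

    import boto3

    # Illustrative values - the real file system ID, window and factor would differ.
    FILE_SYSTEM_ID = "fs-12345678"
    LOOKBACK_MINUTES = 60
    HEADROOM_FACTOR = 1.5

    efs = boto3.client("efs")
    cloudwatch = boto3.client("cloudwatch")


    def handler(event, context):
        """Switch the EFS volume to provisioned throughput sized from recent usage."""
        now = datetime.datetime.utcnow()
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EFS",
            MetricName="MeteredIOBytes",
            Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
            StartTime=now - datetime.timedelta(minutes=LOOKBACK_MINUTES),
            EndTime=now,
            Period=60,
            Statistics=["Sum"],
        )

        # Take the busiest minute of metered IO, convert it to MiB/s and add headroom.
        peak_bytes_per_minute = max((p["Sum"] for p in stats["Datapoints"]), default=0.0)
        required_mibps = max(1.0, peak_bytes_per_minute / 60 / 1024 / 1024 * HEADROOM_FACTOR)

        efs.update_file_system(
            FileSystemId=FILE_SYSTEM_ID,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=required_mibps,
        )

Switching back once the load is over (allowed 24 hours after the last change, as mentioned above) would be a second, similar call with ThroughputMode="bursting".
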
>
> On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <lanking...@live.com> wrote:
>
> > Appreciate your effort and help to make CI a better place!
> >
> > Qing
> >
> > On 11/21/18, 4:38 PM, "Lin Yuan" <apefor...@gmail.com> wrote:
> >
> > > Thanks for your efforts, Marco!
> > >
> > > On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:
> > >
> > > > Thanks for the quick response and mitigation!
> > > >
> > > > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > today, CI had some issues and I had to cancel all jobs a few minutes ago. This was basically caused by the high load that is currently being put on our CI system due to the pre-release efforts for this Friday.
> > > > >
> > > > > It's really unfortunate that we just had outages of three core components within the last two days - sorry about that! To recap, we had the following outages (which are unrelated to the parallel refactor of the Jenkins pipeline):
> > > > > - (yesterday evening) The Jenkins master ran out of disk space and thus processed requests at reduced capacity.
> > > > > - (this morning) The Jenkins master got updated, which broke our auto scaling's upscaling capabilities.
> > > > > - (new, this evening) The Jenkins API was unresponsive: due to the high number of jobs and a bad API design in the Jenkins REST API, the time complexity of a simple create or delete request was quadratic, which resulted in all requests timing out (that was the current outage). This left our auto scaling unable to interface with the Jenkins master.
> > > > >
> > > > > I have now made improvements to our REST API calls which reduced the complexity from O(N^2) to O(1). The reason was an underlying redirect loop in the Jenkins createNode and deleteNode REST API, in combination with unrolling the entire slave and job graph (which got quite huge under heavy load) upon every single request. Since we had about 150 registered slaves and 1000 jobs in the queue, the duration of a single REST API call rose to up to 45 seconds (we execute up to a few hundred queries per auto scaling loop). This led to our auto scaling timing out.
> > > > >
> > > > > Everything should be back to normal now. I'm closely observing the situation and I'll let you know if I encounter any additional issues.
> > > > >
> > > > > Again, sorry for any caused inconveniences.
> > > > >
> > > > > Best regards,
> > > > > Marco
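
The improved REST API calls live in the (internal) auto scaling code, but the underlying technique can be sketched: instead of letting Jenkins unroll the entire slave and job graph on every call, ask it to serialize only the fields the scaler needs (via the JSON API's tree parameter), fetch that state once per scaling loop, and do not follow redirects on node management requests. The URL, credentials and field selection below are illustrative assumptions, not the actual implementation:

    import requests

    # Illustrative endpoint and credentials - the real values live in the auto scaling tooling.
    JENKINS_URL = "https://jenkins.example.com"
    AUTH = ("autoscaler", "api-token")


    def fetch_scaling_state(session):
        """Fetch only the fields the scaler needs, once per loop, instead of the full object graph."""
        # The 'tree' parameter tells Jenkins to serialize just the listed fields.
        computers = session.get(
            f"{JENKINS_URL}/computer/api/json",
            params={"tree": "computer[displayName,offline,idle]"},
        ).json()["computer"]
        queue_items = session.get(
            f"{JENKINS_URL}/queue/api/json",
            params={"tree": "items[why]"},
        ).json()["items"]
        return computers, queue_items


    def delete_node(session, name):
        """Remove a slave without following the redirect back into an expensive page render."""
        session.post(f"{JENKINS_URL}/computer/{name}/doDelete", allow_redirects=False)


    if __name__ == "__main__":
        with requests.Session() as session:
            session.auth = AUTH
            computers, queue_items = fetch_scaling_state(session)
            print(len(computers), "slaves,", len(queue_items), "queued items")
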
> > > > >
> > > > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <gavin.max.b...@gmail.com> wrote:
> > > > >
> > > > > > Yes, let me add to the kudos, very nice work, Marco.
> > > > > >
> > > > > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > > > > >
> > > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <kell...@amazon.de.INVALID> wrote:
> > > > > >
> > > > > > > Appreciate the big effort in bringing the CI back so quickly. Thanks Marco.
> > > > > > >
> > > > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
> > > > > > >
> > > > > > > > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to that incident.
> > > > > > > >
> > > > > > > > If somebody is interested in the details around the outage:
> > > > > > > >
> > > > > > > > Due to a required maintenance (the disk running full), we had to upgrade our Jenkins master because it was running on Ubuntu 17.04 (for an unknown reason, it used to be 16.04) and we needed to install some packages. Since support for Ubuntu 17.04 had ended, all package updates and installations failed because the repositories were taken offline. Due to the unavailable maintenance packages and other issues with the installed OpenJDK 8 version, we made the decision to upgrade the Jenkins master to Ubuntu 18.04 LTS in order to get back to a supported version with maintenance tools. During this upgrade, Jenkins was automatically updated by APT as part of the dist-upgrade process.
> > > > > > > >
> > > > > > > > In the latest version of Jenkins, some labels we depend on for our auto scaling have been changed. To be more specific:
> > > > > > > > > Waiting for next available executor on mxnetlinux-gpu
> > > > > > > > has been changed to
> > > > > > > > > Waiting for next available executor on ‘mxnetlinux-gpu’
> > > > > > > > Notice the quote characters.
> > > > > > > >
> > > > > > > > Unfortunately, Jenkins does not offer a better way than to parse these messages - there's no standardized way to express queue items. Since our parser expected the above message without quote characters, this message was discarded.
> > > > > > > >
> > > > > > > > We support various queue reasons (five of them, to be exact) that indicate resource starvation. If we run super low on capacity, the queue reason is different and we would still be able to scale up, but most of the cases would have printed the unsupported message. This resulted in reduced capacity (to be specific, the limit during that time was 1 slave per type).
> > > > > > > >
> > > > > > > > We have now fixed our auto scaling to automatically strip these characters and added that message to our test suite.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Marco
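
A minimal sketch of the tolerant parsing plus regression test described above, assuming the scaler reads the queue item's 'why' message; the function name and regular expression are made up, and only the single queue reason quoted in the email is covered (the real auto scaling supports five):

    import re
    import unittest

    # The label may or may not be wrapped in the curly quotes newer Jenkins versions add.
    _EXECUTOR_STARVATION = re.compile(
        r"Waiting for next available executor on [\u2018']?(?P<label>[\w-]+)[\u2019']?$"
    )


    def parse_starved_label(why):
        """Return the starved node label from a queue item's 'why' message, or None."""
        match = _EXECUTOR_STARVATION.search(why)
        return match.group("label") if match else None


    class TestParseStarvedLabel(unittest.TestCase):
        def test_old_and_new_jenkins_phrasing(self):
            # Both the pre- and post-upgrade messages should yield the same label.
            for why in (
                "Waiting for next available executor on mxnetlinux-gpu",
                "Waiting for next available executor on \u2018mxnetlinux-gpu\u2019",
            ):
                self.assertEqual(parse_starved_label(why), "mxnetlinux-gpu")


    if __name__ == "__main__":
        unittest.main()
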
> > > > > > > >
> > > > > > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Marco, thanks for your hard work on this. I'm super excited about the new Jenkins jobs. This is going to be very helpful and improve sanity for our PRs and ourselves!
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Aaron
> > > > > > > > >
> > > > > > > > > On Wed, Nov 21, 2018, 05:37 Marco de Abreu <marco.g.ab...@googlemail.com.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hello,
> > > > > > > > > >
> > > > > > > > > > the CI is now back up and running. Auto scaling is working as expected and it passed our load tests.
> > > > > > > > > >
> > > > > > > > > > Please excuse the caused inconveniences.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Marco
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello,
> > > > > > > > > > >
> > > > > > > > > > > I'd like to let you know that our CI was impaired and down for the last few hours. After getting the CI back up, I noticed that our auto scaling broke due to a silent update of Jenkins which broke our upscale detection. Manual scaling is currently not possible, and stopping the scaling won't help either because there are currently no p3 instances available, which means that all jobs will fail nonetheless. In a few hours, the auto scaling will have recycled all slaves through the down-scale mechanism and we will be out of capacity. This will lead to resource starvation and thus timeouts.
> > > > > > > > > > >
> > > > > > > > > > > Your PRs will be properly registered by Jenkins, but please expect the jobs to time out and thus fail your PRs.
> > > > > > > > > > >
> > > > > > > > > > > I will fix the auto scaling as soon as I'm awake again.
> > > > > > > > > > >
> > > > > > > > > > > Sorry for the caused inconveniences.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Marco
> > > > > > > > > > >
> > > > > > > > > > > P.S. Sorry for the brief email and my lack of further fixes, but it's 5:30 AM now and I've been working for 17 hours.