What about dependency pinning?

The cache should not be our mechanism for dependency pinning and
synchronization.
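
For illustration, a minimal sketch of what explicit pinning could look like in a CI install step (package names and versions below are placeholders, not our actual dependencies):

    # Hypothetical sketch: pin CI dependencies explicitly instead of relying
    # on Docker layer caching to freeze them. Versions are placeholders.
    import subprocess

    PINNED = {
        "sphinx": "2.4.4",
        "numpy": "1.18.2",
    }

    def install_pinned():
        # Installing exact versions keeps a rebuild reproducible even when
        # the Docker cache is invalidated.
        pkgs = [f"{name}=={version}" for name, version in PINNED.items()]
        subprocess.run(["pip", "install", "--no-cache-dir", *pkgs], check=True)

    if __name__ == "__main__":
        install_pinned()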

-Marco

Aaron Markham <aaron.s.mark...@gmail.com> wrote on Fri., Mar 27, 2020, 03:45:

> I'm dealing with a Ruby dep breaking the site build right now.
> I wish this happened on an occasion that I choose, not whenever Ruby or some
> other dependency releases a new version. When the cache expires for Jekyll,
> the site won't publish anymore... and CI will be blocked for the website test.
>
> If we built the base OS and main deps once when we do a minor release and
> uploaded that to dockerhub, we'd save build time and avoid things breaking
> randomly. Users could use those docker images too. At release time we'd do a
> round of updates and testing when we're ready. Can we find a balance between
> caching, prebuilt docker images, freshness, and efficiency?
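>
> As a rough sketch of that idea (the image name, Dockerfile name, and release
> tag below are placeholders, not our actual CI layout): build the base image
> with the OS packages and main deps once per minor release and push it, so
> day-to-day CI and users just pull a fixed tag.
>
>     # Hypothetical release-time script for publishing a pinned base image.
>     import subprocess
>
>     def publish_base_image(release="1.7.0", repo="example-org/mxnet-ci-base"):
>         tag = f"{repo}:{release}"
>         # Dockerfile.base would install the OS packages and main deps once.
>         subprocess.run(
>             ["docker", "build", "-f", "Dockerfile.base", "-t", tag, "."],
>             check=True)
>         subprocess.run(["docker", "push", tag], check=True)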
>
>
> On Thu, Mar 26, 2020, 14:31 Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>
> > Correct. But I'm surprised that it takes 2m50s to pull down the images.
> >
> > Maybe it makes sense to use ECR as a mirror?
> >
> > -Marco
> >
> > Joe Evans <joseph.ev...@gmail.com> wrote on Thu., Mar 26, 2020, 22:02:
> >
> > > +1 on rebuilding the containers regularly without caching layers.
> > >
> > > We are both pulling down a bunch of docker layers (when docker pulls an
> > > image) and then building a new container to run the sanity build in.
> > > Pulling down all the layers is what is taking so long (2m50s). Within the
> > > docker build, all the layers are cached, so it doesn't take long. Unless
> > > I'm missing something, it doesn't make much sense to be rebuilding the
> > > image every build.
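> > >
> > > For reference, a hedged sketch of the pull-then-build pattern being
> > > described (the image name is a placeholder, not the real CI repository):
> > > pull the last published image and reuse its layers via --cache-from, so
> > > the docker build step is mostly a no-op and only the pull costs time.
> > >
> > >     # Illustrative only; not the actual CI build script.
> > >     import subprocess
> > >
> > >     def build_with_cache(image="example-org/mxnet-ci-sanity:latest"):
> > >         # Pulling the previous image is the slow part (the ~2m50s above).
> > >         subprocess.run(["docker", "pull", image], check=False)
> > >         subprocess.run(
> > >             ["docker", "build", "--cache-from", image, "-t", image, "."],
> > >             check=True)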
> > >
> > > On Thu, Mar 26, 2020 at 1:12 PM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
> > >
> > > > WRT Docker Cache: We need to add a mechanism to invalidate the cache and
> > > > rebuild the containers on a set schedule. The builds break too often, and
> > > > the breakage is only detected when a contributor touches the Dockerfiles
> > > > (manually causing cache invalidation).
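> > > >
> > > > One possible shape for such a scheduled invalidation (illustrative only;
> > > > it assumes the Dockerfile declares a matching ARG CACHE_BUST before the
> > > > layers that should be refreshed): derive a key from the calendar week and
> > > > pass it as a build arg, so the cache is busted at most once per week.
> > > >
> > > >     # Sketch of a scheduled cache bust; names are placeholders.
> > > >     import datetime
> > > >     import subprocess
> > > >
> > > >     def build_with_weekly_bust(tag="example-org/mxnet-ci:latest"):
> > > >         week = datetime.date.today().strftime("%Y-%W")
> > > >         subprocess.run([
> > > >             "docker", "build",
> > > >             "--build-arg", f"CACHE_BUST={week}",  # changes weekly
> > > >             "-t", tag, ".",
> > > >         ], check=True)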
> > > >
> > > > On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > > > > I think it is a good idea to do the sanity check first. Even at 10
> > > > > minutes. And also try to fix the docker cache situation, but those can
> > > > > be separate tasks.
> > > > >
> > > > > On Thu, Mar 26, 2020, 12:52 Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > >
> > > > > > Jenkins doesn't load for me, so let me ask this way: are we actually
> > > > > > rebuilding every single time or do you mean the docker cache? Pulling
> > > > > > the cache should only take a few seconds from my experience - docker
> > > > > > build should be a no-op in most cases.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > >
> > > > > > Joe Evans <joseph.ev...@gmail.com> wrote on Thu., Mar 26, 2020, 20:46:
> > > > > >
> > > > > > > The sanity-lint check pulls a docker image cache, builds a new
> > > > > > > container and runs inside it. The docker setup is taking around 3
> > > > > > > minutes, at least:
> > > > > > >
> > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > > > > >
> > > > > > > We could improve this by not having to build a new container every
> > > > > > > time. Also, our CI containers are huge, so it takes a while to pull
> > > > > > > them down. I'm sure we could reduce their size by being a bit more
> > > > > > > careful in building them too.
> > > > > > >
> > > > > > > Joe
> > > > > > >
> > > > > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Do you know what's driving the duration for sanity? It used to
> > > > > > > > be 50 sec execution and 60 sec preparation.
> > > > > > > >
> > > > > > > > -Marco
> > > > > > > >
> > > > > > > > Joe Evans <joseph.ev...@gmail.com> wrote on Thu., Mar 26, 2020, 20:31:
> > > > > > > > > Thanks Marco and Aaron for your input.
> > > > > > > > >
> > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > >
> > > > > > > > > The average sanity build time is around 10min, while the
> > > > > > > > > average build time for unix-cpu is about 2 hours, so the entire
> > > > > > > > > build pipeline would increase by about 2 hours if we required
> > > > > > > > > both unix-cpu and sanity to complete in parallel before
> > > > > > > > > triggering the rest.
> > > > > > > > > I took a look at the CloudWatch metrics we're saving for
> > > > > > > > > Jenkins jobs. Here is the failure rate per job, based on builds
> > > > > > > > > triggered by PRs in the past year. As you can see, the sanity
> > > > > > > > > build failure rate is still fairly high, so gating on it would
> > > > > > > > > save a lot of unneeded build jobs.
> > > > > > > > >
> > > > > > > > > Job            Successful  Failed  Failure Rate
> > > > > > > > > sanity         6900        2729    28.34%
> > > > > > > > > unix-cpu       4268        4786    52.86%
> > > > > > > > > unix-gpu       3686        5637    60.46%
> > > > > > > > > centos-cpu     6777        2809    29.30%
> > > > > > > > > centos-gpu     6318        3350    34.65%
> > > > > > > > > clang          7879        1588    16.77%
> > > > > > > > > edge           7654        1933    20.16%
> > > > > > > > > miscellaneous  8090        1510    15.73%
> > > > > > > > > website        7226        2179    23.17%
> > > > > > > > > windows-cpu    6084        3621    37.31%
> > > > > > > > > windows-gpu    5191        4721    47.63%
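> > > > > > > > >
> > > > > > > > > For clarity, the Failure Rate column here is failed /
> > > > > > > > > (successful + failed); a quick check with two rows from the
> > > > > > > > > table above:
> > > > > > > > >
> > > > > > > > >     # Recomputing the failure rate from the raw counts above.
> > > > > > > > >     counts = {"sanity": (6900, 2729), "unix-cpu": (4268, 4786)}
> > > > > > > > >     for job, (ok, failed) in counts.items():
> > > > > > > > >         rate = failed / (ok + failed)
> > > > > > > > >         print(f"{job}: {rate:.2%}")  # 28.34% and 52.86%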
> > > > > > > > >
> > > > > > > > > We can start by requiring only the sanity job to complete
> > > > > > > > > before triggering the rest, and collect data to decide if it
> > > > > > > > > makes sense to change it from there. Any objections to this
> > > > > > > > > approach?
> > > > > > > > >
> > > > > > > > > Thanks.
> > > > > > > > > Joe
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Back then I created a system which exports all Jenkins
> > > > > > > > > > results to CloudWatch. It does not include individual test
> > > > > > > > > > results, but rather stages and jobs. The data for the sanity
> > > > > > > > > > check should be available there.
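> > > > > > > > > >
> > > > > > > > > > As a rough sketch of that kind of export, assuming boto3 (the
> > > > > > > > > > namespace and metric names are hypothetical, not necessarily
> > > > > > > > > > what the existing exporter uses):
> > > > > > > > > >
> > > > > > > > > >     import boto3
> > > > > > > > > >
> > > > > > > > > >     def report_stage_result(job, stage, passed):
> > > > > > > > > >         # One data point per stage run; failures show as 1.0.
> > > > > > > > > >         boto3.client("cloudwatch").put_metric_data(
> > > > > > > > > >             Namespace="MXNetCI",
> > > > > > > > > >             MetricData=[{
> > > > > > > > > >                 "MetricName": "StageFailure",
> > > > > > > > > >                 "Dimensions": [
> > > > > > > > > >                     {"Name": "Job", "Value": job},
> > > > > > > > > >                     {"Name": "Stage", "Value": stage},
> > > > > > > > > >                 ],
> > > > > > > > > >                 "Value": 0.0 if passed else 1.0,
> > > > > > > > > >                 "Unit": "Count",
> > > > > > > > > >             }],
> > > > > > > > > >         )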
> > > > > > > > > >
> > > > > > > > > > Something I'd also be curious about is the percentage of
> > > > > > > > > > failures within one run. That is, if a commit failed, were
> > > > > > > > > > there multiple jobs failing (indicating an error in the code)
> > > > > > > > > > or only one or two (indicating flakiness)? This should give
> > > > > > > > > > us a proper understanding of how unnecessary these runs
> > > > > > > > > > really are.
> > > > > > > > > >
> > > > > > > > > > -Marco
> > > > > > > > > >
> > > > > > > > > > Aaron Markham <aaron.s.mark...@gmail.com> wrote on Wed., Mar 25, 2020, 16:53:
> > > > > > > > > >
> > > > > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > > > > >
> > > > > > > > > > > So my suggestion would be to break the data apart - what's
> > > > > > > > > > > the failure rate on the sanity check and on unix-cpu?
> > > > > > > > > > > Actually, can we get a table of all of the tests with this
> > > > > > > > > > > data?!
> > > > > > > > > > > If the sanity check fails... let's say 20% of the time, but
> > > > > > > > > > > only takes a couple of minutes, then ya, let's stack it and
> > > > > > > > > > > do that one first.
> > > > > > > > > > > I think unix-cpu needs to be broken apart. It's too complex
> > > > > > > > > > > and fails in multiple ways. Isolate the brittle parts. Then
> > > > > > > > > > > we can restart/disable those as needed, while all of the
> > > > > > > > > > > other parts pass and don't have to be rerun.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > > > > > > > We had this structure in the past and the community was
> > > > > > > > > > > > bothered by CI taking more time, thus we moved to the
> > > > > > > > > > > > current model with everything parallelized. We'd
> > > > > > > > > > > > basically revert that then.
> > > > > > > > > > > >
> > > > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > > > > >
> > > > > > > > > > > > Also, we have zero test parallelisation; that is, we are
> > > > > > > > > > > > running one test at a time on 72-core machines (although
> > > > > > > > > > > > with multiple workers). Wouldn't it be way more efficient
> > > > > > > > > > > > to add parallelisation and thus heavily reduce the time
> > > > > > > > > > > > spent on the tasks, instead of staggering?
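> > > > > > > > > > > >
> > > > > > > > > > > > For illustration only, assuming the suites ran under
> > > > > > > > > > > > pytest with the pytest-xdist plugin installed (the test
> > > > > > > > > > > > path is a placeholder), per-suite parallelisation could
> > > > > > > > > > > > be as simple as:
> > > > > > > > > > > >
> > > > > > > > > > > >     import sys
> > > > > > > > > > > >     import pytest
> > > > > > > > > > > >
> > > > > > > > > > > >     # "-n auto" spreads tests across all available cores.
> > > > > > > > > > > >     sys.exit(pytest.main(["-n", "auto", "tests/python/unittest"]))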
> > > > > > > > > > > >
> > > > > > > > > > > > I'm concerned that these cost-saving measures are paid
> > > > > > > > > > > > for in the form of a worse user experience. I see a big
> > > > > > > > > > > > potential to save costs by increasing efficiency while
> > > > > > > > > > > > actually improving the user experience, due to CI being
> > > > > > > > > > > > faster.
> > > > > > > > > > > >
> > > > > > > > > > > > -Marco
> > > > > > > > > > > >
> > > > > > > > > > > > Joe Evans <joseph.ev...@gmail.com> wrote on Wed., Mar 25, 2020, 04:58:
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > First, I just wanted to introduce myself to the MXNet
> > > > > > > > > > > > > community. I'm Joe and will be working with Chai and
> > > > > > > > > > > > > the AWS team to address some issues around MXNet CI.
> > > > > > > > > > > > > One of our goals is to reduce the costs associated with
> > > > > > > > > > > > > running MXNet CI. The task I'm working on now is this
> > > > > > > > > > > > > issue:
> > > > > > > > > > > > >
> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Based on data collected from Jenkins, around 55% of
> > > > > > > > > > > > > the time when the mxnet-validation CI build is
> > > > > > > > > > > > > triggered by a PR, either the sanity or unix-cpu build
> > > > > > > > > > > > > fails. When either of these builds fails, it doesn't
> > > > > > > > > > > > > make sense to run the rest of the pipelines and utilize
> > > > > > > > > > > > > all those resources if we've already identified a build
> > > > > > > > > > > > > or unit test failure.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are proposing changing the MXNet Jenkins CI pipeline
> > > > > > > > > > > > > by requiring the *sanity* and *unix-cpu* builds to
> > > > > > > > > > > > > complete and pass their tests before starting the other
> > > > > > > > > > > > > build pipelines (centos-cpu/gpu, unix-gpu,
> > > > > > > > > > > > > windows-cpu/gpu, etc.). Once the gating builds complete
> > > > > > > > > > > > > successfully, the remaining build pipelines will be
> > > > > > > > > > > > > triggered and run in parallel (as they currently do).
> > > > > > > > > > > > > The purpose of this change is to identify faulty code
> > > > > > > > > > > > > or compatibility issues early and prevent further
> > > > > > > > > > > > > execution of CI builds. This will increase the time
> > > > > > > > > > > > > required to test a PR, but will prevent unnecessary
> > > > > > > > > > > > > builds from running.
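> > > > > > > > > > > > >
> > > > > > > > > > > > > To make the gating concrete, here is an illustrative
> > > > > > > > > > > > > sketch of the logic (not the actual Jenkinsfile; the
> > > > > > > > > > > > > run_job helper and job list are placeholders):
> > > > > > > > > > > > >
> > > > > > > > > > > > >     from concurrent.futures import ThreadPoolExecutor
> > > > > > > > > > > > >
> > > > > > > > > > > > >     def run_job(name):
> > > > > > > > > > > > >         # Stand-in for "trigger this pipeline and wait".
> > > > > > > > > > > > >         print(f"running {name}")
> > > > > > > > > > > > >         return True
> > > > > > > > > > > > >
> > > > > > > > > > > > >     def run_staggered():
> > > > > > > > > > > > >         gate = ["sanity", "unix-cpu"]
> > > > > > > > > > > > >         rest = ["unix-gpu", "centos-cpu", "centos-gpu",
> > > > > > > > > > > > >                 "windows-cpu", "windows-gpu", "clang",
> > > > > > > > > > > > >                 "edge", "website"]
> > > > > > > > > > > > >         with ThreadPoolExecutor() as pool:
> > > > > > > > > > > > >             if not all(pool.map(run_job, gate)):
> > > > > > > > > > > > >                 return False  # fail fast, skip the rest
> > > > > > > > > > > > >             return all(pool.map(run_job, rest))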
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Does anyone have any concerns with this change or
> > > > > > > > > > > > > suggestions?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Joe Evans
> > > > > > > > > > > > >
> > > > > > > > > > > > > joseph.ev...@gmail.com
> > > > > > > > > > > > >
> > > >
> > >
> >
>
