Re: CI and PRs

Pedro Larroy Fri, 16 Aug 2019 11:33:03 -0700

Also, I forgot, another workaround is that I added the -R flag to the build
logic (build.py) so the container is not rebuilt for manual use.
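
A minimal sketch of the short-term workaround for issue 1 in the quoted message below (explicitly pulling the bases before building). This is not the actual build.py logic: the function names, the image tag and the Dockerfile text are made up for illustration, and the code only assembles the `docker pull` and `docker build --cache-from` command lines without running them.

```python
import re
from typing import List

# Matches "FROM [--platform=...] image" at the start of a line (illustrative only;
# it does not try to tell real images apart from earlier multi-stage build names).
_FROM_RE = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE | re.MULTILINE
)


def base_images(dockerfile_text: str) -> List[str]:
    """Return the images named in FROM statements, in order, without duplicates."""
    seen: List[str] = []
    for image in _FROM_RE.findall(dockerfile_text):
        # "scratch" is not a pullable image, so skip it.
        if image.lower() != "scratch" and image not in seen:
            seen.append(image)
    return seen


def pull_and_build_commands(
    dockerfile_text: str, tag: str, context: str = "."
) -> List[List[str]]:
    """One `docker pull` per base image, followed by the cached build command."""
    commands = [["docker", "pull", image] for image in base_images(dockerfile_text)]
    commands.append(["docker", "build", "--cache-from", tag, "-t", tag, context])
    return commands


if __name__ == "__main__":
    # Hypothetical Dockerfile content and tag, for illustration only.
    example = "FROM ubuntu:16.04\nRUN /work/ubuntu_core.sh\n"
    for cmd in pull_and_build_commands(example, "hypothetical/build.ubuntu_cpu"):
        print(" ".join(cmd))
```

Each command list could be passed to `subprocess.check_call` by a build script. `docker build` also has a `--pull` flag that always attempts to pull a newer base image, which may be a simpler option where it fits.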


On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

>
> Hi Aaron.
>
> As Marco explained, if you are on master the cache usually works. There
> are two issues that I have observed:
>
> 1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04),
> so if your cached base, which is used in the FROM statement, becomes
> outdated, your caching won't work. Explicitly pulling it (docker pull
> ubuntu:16.04), or the base images the container uses, helps with this.
>
> 2 - There's another situation where the above doesn't help which seems to
> be an unidentified issue with the docker cache:
> https://github.com/docker/docker.github.io/issues/8886
>
> We can get a short term workaround for #1 by explicitly pulling bases from
> the script, but I think docker should do it when using --cache-from so
> maybe contributing a patch to docker would be the best approach.
>
> Pedro
>
> On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aaron.s.mark...@gmail.com>
> wrote:
>
>> When you create a new Dockerfile and use that on CI, it doesn't seem
>> to cache some of the steps... like this:
>>
>> Step 13/15 : RUN /work/ubuntu_docs.sh
>>  ---> Running in a1e522f3283b
>> + echo 'Installing dependencies...'
>> + apt-get update
>> Installing dependencies.
>>
>> Or this....
>>
>> Step 4/13 : RUN /work/ubuntu_core.sh
>>  ---> Running in e7882d7aa750
>> + apt-get update
>>
>> I'd get it if I was changing those scripts, but then I'd think it should
>> cache after running it once... but, no.
>>
>>
>> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> wrote:
>> >
>> > Do I understand it correctly that you are saying that the Docker cache
>> > doesn't work properly and regularly reinstalls dependencies? Or do you
>> > mean that you only have cache misses when you modify the dependencies,
>> > which would be expected?
>> >
>> > -Marco
>> >
>> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aaron.s.mark...@gmail.com>
>> > wrote:
>> >
>> > > Many of the CI pipelines follow this pattern:
>> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
>> > > repeat steps 1-3 over and over?
>> > >
>> > > Now, some tests use a stashed binary and docker cache. And I see this
>> > > work locally, but for the most part, on CI, you're gonna sit through a
>> > > dependency install.
>> > >
>> > > I noticed that almost all jobs use an ubuntu setup that is fully
>> > > loaded. Without cache, it can take 10 or more minutes to build. So I
>> > > made a lite version. Takes only a few minutes instead.
>> > >
>> > > In some cases archiving worked great to share across pipelines, but as
>> > > Marco mentioned we need a storage solution to make that happen. We
>> > > can't archive every intermediate artifact for each PR.
>> > >
>> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Aaron. Why does that speed things up? What's the difference?
>> > > >
>> > > > Pedro.
>> > > >
>> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.mark...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > The PRs Thomas and I are working on for the new docs and website
>> > > > > share the mxnet binary in the new CI pipelines we made. Speeds
>> > > > > things up a lot.
>> > > > >
>> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > I see it done daily now, and while I can’t share all the
>> > > > > > details, it’s not an incredibly complex thing, and involves not
>> > > > > > much more than nfs/efs sharing and remote ssh commands.  All it
>> > > > > > takes is a little ingenuity and some imagination.
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Sounds good in theory. I think there are complex details with
>> > > > > > > regard to resource sharing during parallel execution. Still I
>> > > > > > > think both ways can be explored. I think some tests run for
>> > > > > > > unreasonably long times for what they are doing. We already
>> > > > > > > scale parts of the pipeline horizontally across workers.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > +1
>> > > > > > > >
>> > > > > > > > Rather than remove tests (which doesn’t scale as a
>> > > > > > > > solution), why not scale them horizontally so that they
>> > > > > > > > finish more quickly? Across processes or even on a pool of
>> > > > > > > > machines that aren’t necessarily the build machine?
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > With regard to time, I would rather have us spend a bit more
>> > > > > > > > > time on maintenance than have somebody run into an error
>> > > > > > > > > that could've been caught with a test.
>> > > > > > > > >
>> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
>> > > > > > > > > for quite some time now, but nobody noticed that. Basically
>> > > > > > > > > my stance on that matter is that as soon as something is not
>> > > > > > > > > blocking, you can also just deactivate it since you don't
>> > > > > > > > > have a forcing function in an open source project. People
>> > > > > > > > > will rarely come back and fix the errors of some nightly
>> > > > > > > > > test that they introduced.
>> > > > > > > > >
>> > > > > > > > > -Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, 14 Aug 2019, 21:59:
>> > > > > > > > >
>> > > > > > > > > > If a language binding test is failing for an unimportant
>> > > > > > > > > > reason, then it is too brittle and needs to be fixed (we
>> > > > > > > > > > have fixed some of these with the Clojure package [1]).
>> > > > > > > > > > But in general, if we think of the MXNet project as one
>> > > > > > > > > > project that spans all the language bindings, then we want
>> > > > > > > > > > to know if some fundamental code change is going to break
>> > > > > > > > > > a downstream package.
>> > > > > > > > > > I can't speak for all the high level package binding
>> > > > > > > > > > maintainers, but I'm always happy to pitch in to provide
>> > > > > > > > > > code fixes to help the base PR get green.
>> > > > > > > > > >
>> > > > > > > > > > The time cost of maintaining such a large CI project
>> > > > > > > > > > obviously needs to be considered as well.
>> > > > > > > > > >
>> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I think
>> > > > > > > > > > > is reasonable. The only issue is that when a binding such
>> > > > > > > > > > > as R, Perl or Clojure fails, some devs are a bit confused
>> > > > > > > > > > > about how to fix it since they are not familiar with the
>> > > > > > > > > > > testing tools and the language.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Great idea Marco! Anything that you think would be
>> > > > > > > > > > > > valuable to share would be good. The duration of each
>> > > > > > > > > > > > node in the test stage sounds like a good start.
>> > > > > > > > > > > >
>> > > > > > > > > > > > - Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > we record a bunch of metrics about run statistics (down
>> > > > > > > > > > > > > to the duration of every individual step). If you tell
>> > > > > > > > > > > > > me which ones you're particularly interested in
>> > > > > > > > > > > > > (probably total duration of each node in the test
>> > > > > > > > > > > > > stage), I'm happy to provide them.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > > > > > - job
>> > > > > > > > > > > > > - branch
>> > > > > > > > > > > > > - stage
>> > > > > > > > > > > > > - node
>> > > > > > > > > > > > > - step
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Unfortunately I don't have the possibility to export
>> > > > > > > > > > > > > them since we store them in CloudWatch Metrics which
>> > > > > > > > > > > > > afaik doesn't offer raw exports.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > Marco
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, 14 Aug 2019, 19:43:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > I would prefer to keep the language binding in the PR
>> > > > > > > > > > > > > > process. Perhaps we could do some analytics to see how
>> > > > > > > > > > > > > > much each of the language bindings is contributing to
>> > > > > > > > > > > > > > overall run time.
>> > > > > > > > > > > > > > If we have some metrics on that, maybe we can come up
>> > > > > > > > > > > > > > with a guideline of how much time each should take.
>> > > > > > > > > > > > > > Another possibility is to leverage the parallel builds
>> > > > > > > > > > > > > > more.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Carin.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > That's a good point. All things considered, would your
>> > > > > > > > > > > > > > > preference be to keep the Clojure tests as part of the
>> > > > > > > > > > > > > > > PR process or in Nightly?
>> > > > > > > > > > > > > > > Some options are having notifications here or in
>> > > > > > > > > > > > > > > Slack. But if we think breakages would go unnoticed,
>> > > > > > > > > > > > > > > maybe it is not a good idea to fully remove bindings
>> > > > > > > > > > > > > > > from the PR process, and we should just streamline the
>> > > > > > > > > > > > > > > process.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think
>> > > > > > > > > > > > > > > > we need to figure out how the community can get proper
>> > > > > > > > > > > > > > > > notifications of failure and success on those nightly
>> > > > > > > > > > > > > > > > runs. Otherwise, I think that breakages would go
>> > > > > > > > > > > > > > > > unnoticed.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > -Carin
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the
>> > > > > > > > > > > > > > > > > following action items to remedy the situation and
>> > > > > > > > > > > > > > > > > accelerate turnaround times in CI, reduce cost,
>> > > > > > > > > > > > > > > > > complexity and probability of failure blocking PRs and
>> > > > > > > > > > > > > > > > > frustrating developers:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > * Upgrade Visual Studio on Windows from VS 2015 to VS
>> > > > > > > > > > > > > > > > > 2017. The build_windows.py infrastructure should easily
>> > > > > > > > > > > > > > > > > work with the new version. Currently some PRs are
>> > > > > > > > > > > > > > > > > blocked by this:
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit
>> > > > > > > > > > > > > > > > > is touching other bindings, the reviewer should ask for
>> > > > > > > > > > > > > > > > > a full run which can be done locally, use the label bot
>> > > > > > > > > > > > > > > > > to trigger a full CI build, or defer to nightly.
>> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on
>> > > > > > > > > > > > > > > > > small models that are run on CI and can be echoed by the
>> > > > > > > > > > > > > > > > > label bot as a comment for PRs.
>> > > > > > > > > > > > > > > > > * Address unit tests that take more than 10-20s:
>> > > > > > > > > > > > > > > > > streamline them, or move them to nightly if that can't
>> > > > > > > > > > > > > > > > > be done.
>> > > > > > > > > > > > > > > > > * Open source the remaining CI infrastructure scripts so
>> > > > > > > > > > > > > > > > > the community can contribute.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I would also like to raise with the community that some
>> > > > > > > > > > > > > > > > > PRs are not being followed up by committers asking for
>> > > > > > > > > > > > > > > > > changes. For example this PR is important and has been
>> > > > > > > > > > > > > > > > > hanging for a long time.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > This is another, less important but more trivial to
>> > > > > > > > > > > > > > > > > review:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think committers requesting changes and not following
>> > > > > > > > > > > > > > > > > up in reasonable time is not healthy for the project. I
>> > > > > > > > > > > > > > > > > suggest configuring GitHub notifications for a good SNR
>> > > > > > > > > > > > > > > > > and following up.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>
