As Marco has open sourced the bulk of the CI infrastructure that Amazon
donated to the community, I would like to recommend that the community
take action to give the volunteers working on CI a better experience. My
impression is that, in the past, little has been done to grant PMC or
committer privileges to engineers other than Marco who volunteer to help
with CI. Granting those privileges would encourage more contributions and
help expedite critical fixes and corrective actions. The current situation
has kept those individuals from being as effective as they could be, and
it has also meant a lack of recognition for such a critical activity. I'm
not sure about the cause, but I believe this is something that should be
rectified if we want future contributions and help on the CI front.

In Spanish we have a saying: "es de bien nacido ser agradecido" (roughly,
"being grateful is the mark of a well-raised person").

Pedro.

On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> Hi Aaron. This is difficult to diagnose, because I don't know what to do
> when the hash of a layer in Docker doesn't match and Docker decides to
> rebuild it. The R script seems not to have changed. I have observed this
> in the past and I think it is due to bugs in Docker. Maybe Kellen is able
> to give some tips here.
>
> In this case you should use -R, which is already in master (you can
> always copy the script on top if you are on an older revision).
>
> Another thing that worked for me in the past was to completely nuke the
> Docker cache so that it redownloads from the CI repository. After that it
> worked fine in some cases.
>
> These two workarounds are not ideal, but they should unblock you.
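>
> For reference, this is roughly what the two workarounds look like on the
> command line. This is only a sketch: -R and the other build.py options
> are the ones mentioned in this thread and I'm assuming they combine like
> this; docker system prune is simply the standard way to clear the local
> cache.
>
>     # Workaround 1: reuse the existing container image, skip the rebuild
>     ./build.py -R --docker-registry mxnetci --platform ubuntu_cpu_r \
>         /work/runtime_functions.sh build_r_docs
>
>     # Workaround 2: nuke the local Docker cache, then let build.py pull
>     # the cached layers from the CI registry again
>     docker system prune --all --force
>     ./build.py --docker-registry mxnetci --platform ubuntu_cpu_r \
>         /work/runtime_functions.sh build_r_docs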
>
> Pedro.
>
> On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <aaron.s.mark...@gmail.com>
> wrote:
>
>> Is -R already in there?
>>
>> Here's an example of it happening to me right now.... I am making
>> minor changes to the runtime_functions logic for handling the R docs
>> output. I pull the fix, then run the container, but I see the R deps
>> layer re-running. I didn't touch that. Why is that running again?
>>
>> From https://github.com/aaronmarkham/incubator-mxnet
>>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
>> origin/new_website_pipeline_2_aaron_rdocs
>> Updating f71cc6d..deec6aa
>> Fast-forward
>>  ci/docker/runtime_functions.sh | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
>> --docker-registry mxnetci --platform ubuntu_cpu_r
>> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
>> build_r_docs
>> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
>> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
>> enabled from registry mxnetci
>> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
>> mxnetci/build.ubuntu_cpu_r from mxnetci
>> Using default tag: latest
>> latest: Pulling from mxnetci/build.ubuntu_cpu_r
>> Digest:
>> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
>> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
>> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
>> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
>> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
>> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
>> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
>> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
>> mxnetci/build.ubuntu_cpu_r docker'
>> Sending build context to Docker daemon  289.8kB
>> Step 1/15 : FROM ubuntu:16.04
>>  ---> 5e13f8dd4c1a
>> Step 2/15 : WORKDIR /work/deps
>>  ---> Using cache
>>  ---> afc2a135945d
>> Step 3/15 : COPY install/ubuntu_core.sh /work/
>>  ---> Using cache
>>  ---> da2b2e7f35e1
>> Step 4/15 : RUN /work/ubuntu_core.sh
>>  ---> Using cache
>>  ---> d1e88b26b1d2
>> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
>>  ---> Using cache
>>  ---> 3aa97dea3b7b
>> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
>>  ---> Using cache
>>  ---> bec503f1d149
>> Step 7/15 : COPY install/ubuntu_r.sh /work/
>>  ---> c5e77c38031d
>> Step 8/15 : COPY install/r.gpg /work/
>>  ---> d8cdbf015d2b
>> Step 9/15 : RUN /work/ubuntu_r.sh
>>  ---> Running in c6c90b9e1538
>> ++ dirname /work/ubuntu_r.sh
>> + cd /work
>> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
>> + apt-key add r.gpg
>> OK
>> + add-apt-repository 'deb [arch=amd64,i386]
>> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
>> + apt-get update
>> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
>>
>> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
>> <pedro.larroy.li...@gmail.com> wrote:
>> >
>> > Also, I forgot: another workaround is that I added the -R flag to the
>> > build logic (build.py) so the container is not rebuilt for manual use.
>> >
>> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
>> pedro.larroy.li...@gmail.com>
>> > wrote:
>> >
>> > >
>> > > Hi Aaron.
>> > >
>> > > As Marco explained, if you are on master the cache usually works.
>> > > There are two issues that I have observed:
>> > >
>> > > 1 - Docker doesn't automatically pull the base image (e.g.
>> > > ubuntu:16.04), so if the cached base used in the FROM statement
>> > > becomes outdated, your caching won't work. Running docker pull
>> > > ubuntu:16.04, or pulling the base images of the containers, helps
>> > > with this.
>> > >
>> > > 2 - There's another situation where the above doesn't help, which
>> > > seems to be an unidentified issue with the Docker cache:
>> > > https://github.com/docker/docker.github.io/issues/8886
>> > >
>> > > We can get a short-term workaround for #1 by explicitly pulling the
>> > > base images from the script, but I think Docker should do it when
>> > > using --cache-from, so maybe contributing a patch to Docker would be
>> > > the best approach.
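>> > >
>> > > For reference, a rough manual version of that workaround (the image
>> > > and Dockerfile names are the ones that appear elsewhere in this
>> > > thread; build args omitted, so treat it as a sketch rather than the
>> > > exact CI invocation):
>> > >
>> > >     # refresh the base image referenced by the FROM line
>> > >     docker pull ubuntu:16.04
>> > >     # refresh the CI cache image, then build against it
>> > >     docker pull mxnetci/build.ubuntu_cpu_r:latest
>> > >     docker build -f docker/Dockerfile.build.ubuntu_cpu_r \
>> > >         --cache-from mxnetci/build.ubuntu_cpu_r \
>> > >         -t mxnetci/build.ubuntu_cpu_r docker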
>> > >
>> > > Pedro
>> > >
>> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
>> aaron.s.mark...@gmail.com>
>> > > wrote:
>> > >
>> > >> When you create a new Dockerfile and use that on CI, it doesn't seem
>> > >> to cache some of the steps... like this:
>> > >>
>> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
>> > >>  ---> Running in a1e522f3283b
>> > >> + echo 'Installing dependencies...'
>> > >> + apt-get update
>> > >> Installing dependencies.
>> > >>
>> > >> Or this....
>> > >>
>> > >> Step 4/13 : RUN /work/ubuntu_core.sh
>> > >>  ---> Running in e7882d7aa750
>> > >> + apt-get update
>> > >>
>> > >> I get it if I was changing those scripts, but then I'd think it
>> > >> should cache after running it once... but, no.
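>> > >>
>> > >> One guess worth checking (the image name below is just a
>> > >> placeholder): --cache-from can only reuse layers that already exist
>> > >> in the image pulled from the registry, so until an image built from
>> > >> the new Dockerfile has been pushed once, every RUN step will execute
>> > >> from scratch. Something like this shows whether any cache is being
>> > >> seeded at all:
>> > >>
>> > >>     # no-op if the new platform's image was never pushed
>> > >>     docker pull mxnetci/build.NEW_PLATFORM || true
>> > >>     docker build -f docker/Dockerfile.build.NEW_PLATFORM \
>> > >>         --cache-from mxnetci/build.NEW_PLATFORM \
>> > >>         -t mxnetci/build.NEW_PLATFORM docker
>> > >>     # layers "created" seconds ago were cache misses
>> > >>     docker history mxnetci/build.NEW_PLATFORM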
>> > >>
>> > >>
>> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
>> marco.g.ab...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > Do I understand it correctly that you are saying that the Docker
>> > >> > cache doesn't work properly and regularly reinstalls dependencies?
>> > >> > Or do you mean that you only have cache misses when you modify the
>> > >> > dependencies - which would be expected?
>> > >> >
>> > >> > -Marco
>> > >> >
>> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
>> > >> aaron.s.mark...@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Many of the CI pipelines follow this pattern:
>> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
>> > >> > > tests. Why repeat steps 1-3 over and over?
>> > >> > >
>> > >> > > Now, some tests use a stashed binary and docker cache. And I see
>> > >> > > this work locally, but for the most part, on CI, you're gonna
>> > >> > > sit through a dependency install.
>> > >> > >
>> > >> > > I noticed that almost all jobs use an ubuntu setup that is fully
>> > >> > > loaded. Without cache, it can take 10 or more minutes to build.
>> > >> > > So I made a lite version. Takes only a few minutes instead.
>> > >> > >
>> > >> > > In some cases archiving worked great to share across pipelines,
>> > >> > > but as Marco mentioned we need a storage solution to make that
>> > >> > > happen. We can't archive every intermediate artifact for each PR.
>> > >> > >
>> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
>> > >> pedro.larroy.li...@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi Aaron. Why does it speed things up? What's the difference?
>> > >> > > >
>> > >> > > > Pedro.
>> > >> > > >
>> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
>> > >> aaron.s.mark...@gmail.com
>> > >> > > >
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > The PRs Thomas and I are working on for the new docs and
>> > >> > > > > website share the mxnet binary in the new CI pipelines we
>> > >> > > > > made. Speeds things up a lot.
>> > >> > > > >
>> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
>> cjolivie...@gmail.com>
>> > >> > > wrote:
>> > >> > > > >
>> > >> > > > > > I see it done daily now, and while I can’t share all the
>> > >> > > > > > details, it’s not an incredibly complex thing, and involves
>> > >> > > > > > not much more than nfs/efs sharing and remote ssh commands.
>> > >> > > > > > All it takes is a little ingenuity and some imagination.
>> > >> > > > > >
>> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> > >> > > > > pedro.larroy.li...@gmail.com
>> > >> > > > > > >
>> > >> > > > > > wrote:
>> > >> > > > > >
>> > >> > > > > > > Sounds good in theory. I think there are complex details
>> > >> > > > > > > with regard to resource sharing during parallel
>> > >> > > > > > > execution. Still, I think both ways can be explored. I
>> > >> > > > > > > think some tests run for unreasonably long times for what
>> > >> > > > > > > they are doing. We already scale parts of the pipeline
>> > >> > > > > > > horizontally across workers.
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
>> > >> > > > cjolivie...@apache.org>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > +1
>> > >> > > > > > > >
>> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a
>> > >> > > > > > > > solution), why not scale them horizontally so that they
>> > >> > > > > > > > finish more quickly? Across processes or even on a pool
>> > >> > > > > > > > of machines that aren’t necessarily the build machine?
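>> > >> > > > > > > >
>> > >> > > > > > > > As a rough illustration of the "across processes" part
>> > >> > > > > > > > (assuming a pytest-based suite with pytest-xdist; the
>> > >> > > > > > > > test path and runner MXNet actually uses may differ):
>> > >> > > > > > > >
>> > >> > > > > > > >     pip install pytest pytest-xdist
>> > >> > > > > > > >     # split the unit tests across 8 worker processes
>> > >> > > > > > > >     pytest -n 8 tests/python/unittest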
>> > >> > > > > > > >
>> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > >> > > > > > marco.g.ab...@gmail.com
>> > >> > > > > > > >
>> > >> > > > > > > > wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > With regards to time I rather prefer us spending a
>> > >> > > > > > > > > bit more time on maintenance than somebody running
>> > >> > > > > > > > > into an error that could've been caught with a test.
>> > >> > > > > > > > >
>> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has
>> > >> > > > > > > > > been broken for quite some time now, but nobody
>> > >> > > > > > > > > noticed that. Basically my stance on that matter is
>> > >> > > > > > > > > that as soon as something is not blocking, you can
>> > >> > > > > > > > > also just deactivate it since you don't have a
>> > >> > > > > > > > > forcing function in an open source project. People
>> > >> > > > > > > > > will rarely come back and fix the errors of some
>> > >> > > > > > > > > nightly test that they introduced.
>> > >> > > > > > > > >
>> > >> > > > > > > > > -Marco
>> > >> > > > > > > > >
>> > >> > > > > > > > > Carin Meier <carinme...@gmail.com> schrieb am Mi.,
>> 14.
>> > >> Aug.
>> > >> > > > 2019,
>> > >> > > > > > > 21:59:
>> > >> > > > > > > > >
>> > >> > > > > > > > > > If a language binding test is failing for an
>> > >> > > > > > > > > > unimportant reason, then it is too brittle and
>> > >> > > > > > > > > > needs to be fixed (we have fixed some of these with
>> > >> > > > > > > > > > the Clojure package [1]).
>> > >> > > > > > > > > > But in general, if we are thinking of the MXNet
>> > >> > > > > > > > > > project as one project across all the language
>> > >> > > > > > > > > > bindings, then we want to know if some fundamental
>> > >> > > > > > > > > > code change is going to break a downstream package.
>> > >> > > > > > > > > > I can't speak for all the high-level package
>> > >> > > > > > > > > > binding maintainers, but I'm always happy to pitch
>> > >> > > > > > > > > > in to provide code fixes to help the base PR get
>> > >> > > > > > > > > > green.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > The time costs to maintain such a large CI project
>> > >> > > > > > > > > > obviously need to be considered as well.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > [1]
>> > >> https://github.com/apache/incubator-mxnet/pull/15579
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > >> > > > > > > > > pedro.larroy.li...@gmail.com
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes,
>> > >> > > > > > > > > > > which I think is reasonable. The only question is
>> > >> > > > > > > > > > > that when a binding such as R, Perl or Clojure
>> > >> > > > > > > > > > > fails, some devs are a bit confused about how to
>> > >> > > > > > > > > > > fix them since they are not familiar with the
>> > >> > > > > > > > > > > testing tools and the language.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > >> > > > > > carinme...@gmail.com
>> > >> > > > > > > >
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > > Great idea Marco! Anything that you think would
>> > >> > > > > > > > > > > > be valuable to share would be good. The duration
>> > >> > > > > > > > > > > > of each node in the test stage sounds like a
>> > >> > > > > > > > > > > > good start.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > - Carin
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu
>> <
>> > >> > > > > > > > > > marco.g.ab...@gmail.com>
>> > >> > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Hi,
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > we record a bunch of metrics about run
>> > >> > > > > > > > > > > > > statistics (down to the duration of every
>> > >> > > > > > > > > > > > > individual step). If you tell me which ones
>> > >> > > > > > > > > > > > > you're particularly interested in (probably
>> > >> > > > > > > > > > > > > total duration of each node in the test
>> > >> > > > > > > > > > > > > stage), I'm happy to provide them.
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
>> > >> > > > > > > > > > > > > - job
>> > >> > > > > > > > > > > > > - branch
>> > >> > > > > > > > > > > > > - stage
>> > >> > > > > > > > > > > > > - node
>> > >> > > > > > > > > > > > > - step
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Unfortunately I don't have the possibility
>> > >> > > > > > > > > > > > > to export them since we store them in
>> > >> > > > > > > > > > > > > CloudWatch Metrics, which afaik doesn't offer
>> > >> > > > > > > > > > > > > raw exports.
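>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Not a raw export, but for anyone who wants to
>> > >> > > > > > > > > > > > > pull aggregated numbers themselves, something
>> > >> > > > > > > > > > > > > along these lines works against CloudWatch
>> > >> > > > > > > > > > > > > (the namespace, metric and dimension values
>> > >> > > > > > > > > > > > > below are placeholders, not the real ones):
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > >     aws cloudwatch get-metric-statistics \
>> > >> > > > > > > > > > > > >         --namespace "OurCINamespace" \
>> > >> > > > > > > > > > > > >         --metric-name "NodeDuration" \
>> > >> > > > > > > > > > > > >         --dimensions Name=job,Value=mxnet-validation \
>> > >> > > > > > > > > > > > >                      Name=branch,Value=master \
>> > >> > > > > > > > > > > > >                      Name=stage,Value=test \
>> > >> > > > > > > > > > > > >         --statistics Average --period 86400 \
>> > >> > > > > > > > > > > > >         --start-time 2019-08-01T00:00:00Z \
>> > >> > > > > > > > > > > > >         --end-time 2019-08-15T00:00:00Z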
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Best regards,
>> > >> > > > > > > > > > > > > Marco
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Carin Meier <carinme...@gmail.com> schrieb
>> am
>> > >> Mi., 14.
>> > >> > > > > Aug.
>> > >> > > > > > > > 2019,
>> > >> > > > > > > > > > > 19:43:
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > I would prefer to keep the language
>> > >> > > > > > > > > > > > > > bindings in the PR process. Perhaps we
>> > >> > > > > > > > > > > > > > could do some analytics to see how much
>> > >> > > > > > > > > > > > > > each of the language bindings is
>> > >> > > > > > > > > > > > > > contributing to overall run time.
>> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe we
>> > >> > > > > > > > > > > > > > can come up with a guideline of how much
>> > >> > > > > > > > > > > > > > time each should take. Another possibility
>> > >> > > > > > > > > > > > > > is to leverage the parallel builds more.
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
>> Larroy <
>> > >> > > > > > > > > > > > > pedro.larroy.li...@gmail.com
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > Hi Carin.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > That's a good point. All things
>> > >> > > > > > > > > > > > > > > considered, would your preference be to
>> > >> > > > > > > > > > > > > > > keep the Clojure tests as part of the PR
>> > >> > > > > > > > > > > > > > > process or in Nightly?
>> > >> > > > > > > > > > > > > > > Some options are having notifications
>> > >> > > > > > > > > > > > > > > here or in Slack. But if we think
>> > >> > > > > > > > > > > > > > > breakages would go unnoticed, maybe it is
>> > >> > > > > > > > > > > > > > > not a good idea to fully remove bindings
>> > >> > > > > > > > > > > > > > > from the PR process, and we should just
>> > >> > > > > > > > > > > > > > > streamline the process.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > Pedro.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin
>> Meier <
>> > >> > > > > > > > > > carinme...@gmail.com>
>> > >> > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > Before any binding tests are moved to
>> > >> > > > > > > > > > > > > > > > nightly, I think we need to figure out
>> > >> > > > > > > > > > > > > > > > how the community can get proper
>> > >> > > > > > > > > > > > > > > > notifications of failure and success on
>> > >> > > > > > > > > > > > > > > > those nightly runs. Otherwise, I think
>> > >> > > > > > > > > > > > > > > > that breakages would go unnoticed.
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > -Carin
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
>> > >> Larroy <
>> > >> > > > > > > > > > > > > > > pedro.larroy.li...@gmail.com
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Hi
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Seems we are hitting some problems in
>> > >> > > > > > > > > > > > > > > > > CI. I propose the following action
>> > >> > > > > > > > > > > > > > > > > items to remedy the situation and
>> > >> > > > > > > > > > > > > > > > > accelerate turnaround times in CI,
>> > >> > > > > > > > > > > > > > > > > and to reduce cost, complexity and
>> > >> > > > > > > > > > > > > > > > > the probability of failures blocking
>> > >> > > > > > > > > > > > > > > > > PRs and frustrating developers:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > * Upgrade Windows Visual Studio from
>> > >> > > > > > > > > > > > > > > > > VS 2015 to VS 2017. The
>> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure
>> > >> > > > > > > > > > > > > > > > > should easily work with the new
>> > >> > > > > > > > > > > > > > > > > version. Currently some PRs are
>> > >> > > > > > > > > > > > > > > > > blocked by this:
>> > >> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
>> > >> > > > > > > > > > > > > > > > > * Move the Gluon Model Zoo tests to
>> > >> > > > > > > > > > > > > > > > > nightly. Tracked at
>> > >> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
>> > >> > > > > > > > > > > > > > > > > * Move the non-Python binding tests
>> > >> > > > > > > > > > > > > > > > > to nightly. If a commit touches other
>> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for
>> > >> > > > > > > > > > > > > > > > > a full run (which can be done
>> > >> > > > > > > > > > > > > > > > > locally), use the label bot to
>> > >> > > > > > > > > > > > > > > > > trigger a full CI build, or defer to
>> > >> > > > > > > > > > > > > > > > > nightly.
>> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
>> > >> > > > > > > > > > > > > > > > > performance tests on small models
>> > >> > > > > > > > > > > > > > > > > that are run on CI and can be echoed
>> > >> > > > > > > > > > > > > > > > > by the label bot as a comment on PRs.
>> > >> > > > > > > > > > > > > > > > > * Address unit tests that take more
>> > >> > > > > > > > > > > > > > > > > than 10-20s; streamline them, or move
>> > >> > > > > > > > > > > > > > > > > them to nightly if that can't be done.
>> > >> > > > > > > > > > > > > > > > > * Open source the remaining CI
>> > >> > > > > > > > > > > > > > > > > infrastructure scripts so the
>> > >> > > > > > > > > > > > > > > > > community can contribute.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I think our goal should be a
>> > >> > > > > > > > > > > > > > > > > turnaround under 30 min.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I would also like to raise with the
>> > >> > > > > > > > > > > > > > > > > community that some PRs are not being
>> > >> > > > > > > > > > > > > > > > > followed up on by the committers who
>> > >> > > > > > > > > > > > > > > > > asked for changes. For example, this
>> > >> > > > > > > > > > > > > > > > > PR is important and has been hanging
>> > >> > > > > > > > > > > > > > > > > for a long time:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > This is another one, less important
>> > >> > > > > > > > > > > > > > > > > but more trivial to review:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I think committers requesting changes
>> > >> > > > > > > > > > > > > > > > > and not following up in a reasonable
>> > >> > > > > > > > > > > > > > > > > time is not healthy for the project.
>> > >> > > > > > > > > > > > > > > > > I suggest configuring GitHub
>> > >> > > > > > > > > > > > > > > > > notifications for a good SNR and
>> > >> > > > > > > > > > > > > > > > > following up.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Regards.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Pedro.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> > >
>>
>
