Hi Aaron. As Marco explained, if you are on master the cache usually works. There are two issues that I have observed:
1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04), so if the cached base that is used in the FROM statement becomes outdated, your caching won't work. Using docker pull ubuntu:16.04, or the base images from the container, helps with this.

2 - There's another situation where the above doesn't help, which seems to be an unidentified issue with the docker cache: https://github.com/docker/docker.github.io/issues/8886

We can get a short term workaround for #1 by explicitly pulling bases from the script, but I think docker should do it when using --cache-from, so maybe contributing a patch to docker would be the best approach.

Pedro

On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:

> When you create a new Dockerfile and use that on CI, it doesn't seem to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
> ---> Running in a1e522f3283b
> + echo 'Installing dependencies...'
> + apt-get update
> Installing dependencies.
>
> Or this....
>
> Step 4/13 : RUN /work/ubuntu_core.sh
> ---> Running in e7882d7aa750
> + apt-get update
>
> I get if I was changing those scripts, but then I'd think it should cache after running it once... but, no.
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache doesn't work properly and regularly reinstalls dependencies? Or do you mean that you only have cache misses when you modify the dependencies - which would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aaron.s.mark...@gmail.com> wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache.
> > > And I see this work locally, but for the most part, on CI, you're gonna sit through a dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully loaded. Without cache, it can take 10 or more minutes to build. So I made a lite version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as Marco mentioned we need a storage solution to make that happen. We can't archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > >
> > > > Hi Aaron. Why does that speed things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.mark...@gmail.com> wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website share the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details, it’s not an incredibly complex thing, and involves not much more than nfs/efs sharing and remote ssh commands. All it takes is a little ingenuity and some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with regard to resource sharing during parallel execution. Still I think both ways can be explored. I think some tests run for unreasonably long times for what they are doing.
> > > > > > > We already scale parts of the pipeline horizontally across workers.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Rather than remove tests (which doesn’t scale as a solution), why not scale them horizontally so that they finish more quickly? Across processes or even on a pool of machines that aren’t necessarily the build machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > With regards to time I rather prefer us spending a bit more time on maintenance than somebody running into an error that could've been caught with a test.
> > > > > > > > >
> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for quite some time now, but nobody noticed that. Basically my stance on that matter is that as soon as something is not blocking, you can also just deactivate it since you don't have a forcing function in an open source project. People will rarely come back and fix the errors of some nightly test that they introduced.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
> > > > > > > > >
> > > > > > > > > > If a language binding test is failing for an unimportant reason, then it is too brittle and needs to be fixed (we have fixed some of these with the Clojure package [1]).
> > > > > > > > > > But in general, if we are thinking of the MXNet project as one project that is across all the language bindings, then we want to know if some fundamental code change is going to break a downstream package.
> > > > > > > > > > I can't speak for all the high level package binding maintainers, but I'm always happy to pitch in to provide code fixes to help the base PR get green.
> > > > > > > > > >
> > > > > > > > > > The time costs to maintain such a large CI project obviously need to be considered as well.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I think is reasonable. The only question is that when a binding such as R, Perl or Clojure fails, some devs are a bit confused about how to fix them since they are not familiar with the testing tools and the language.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great idea Marco! Anything that you think would be valuable to share would be good. The duration of each node in the test stage sounds like a good start.
> > > > > > > > > > > >
> > > > > > > > > > > > - Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > we record a bunch of metrics about run statistics (down to the duration of every individual step). If you tell me which ones you're particularly interested in (probably total duration of each node in the test stage), I'm happy to provide them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > > - job
> > > > > > > > > > > > > - branch
> > > > > > > > > > > > > - stage
> > > > > > > > > > > > > - node
> > > > > > > > > > > > > - step
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unfortunately I don't have the possibility to export them since we store them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I would prefer to keep the language binding in the PR process. Perhaps we could do some analytics to see how much each of the language bindings is contributing to overall run time.
> > > > > > > > > > > > > > If we have some metrics on that, maybe we can come up with a guideline of how much time each should take. Another possibility is to leverage the parallel builds more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good point. All things considered, would your preference be to keep the Clojure tests as part of the PR process or in Nightly?
> > > > > > > > > > > > > > > Some options are having notifications here or in slack.
> > > > > > > > > > > > > > > But if we think breakages would go unnoticed, maybe it is not a good idea to fully remove bindings from the PR process and just streamline the process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think we need to figure out how the community can get proper notifications of failure and success on those nightly runs. Otherwise, I think that breakages would go unnoticed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI.
> > > > > > > > > > > > > > > > > I propose the following action items to remedy the situation and accelerate turn around times in CI, reduce cost, complexity and probability of failure blocking PRs and frustrating developers:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The build_windows.py infrastructure should easily work with the new version. Currently some PRs are blocked by this: https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit is touching other bindings, the reviewer should ask for a full run which can be done locally, use the label bot to trigger a full CI build, or defer to nightly.
> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small models that are run on CI and can be echoed by the label bot as a comment for PRs.
> > > > > > > > > > > > > > > > > * Address unit tests that take more than 10-20s, streamline them or move them to nightly if it can't be done.
> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so the community can contribute.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I would also like to touch base with the community that some PRs are not being followed up by committers asking for changes. For example this PR is important and is hanging for a long time.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think committers requesting changes and not following up in reasonable time is not healthy for the project. I suggest configuring github Notifications for a good SNR and following up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Pedro.
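
For the short-term workaround mentioned at the top of this thread (explicitly pulling the base images from the build script before building with --cache-from), a minimal Python sketch could look like the one below. The function names base_images and pull_base_images are hypothetical and are not taken from the MXNet CI scripts. The parsing is deliberately simple: it skips "scratch", references to earlier build stages, and images parameterized through ARG, since those cannot be resolved without more context.

```python
import subprocess
from typing import List


def base_images(dockerfile_text: str) -> List[str]:
    """Return the external images named in FROM lines, in order, without duplicates."""
    images: List[str] = []
    stage_names = set()
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        # Dockerfile instructions are case-insensitive, so compare in upper case.
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop optional flags such as --platform=linux/amd64.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        image = args[0]
        refers_to_stage = image in stage_names
        # "FROM image AS name" defines a stage that later FROM lines may reference.
        if len(args) >= 3 and args[1].upper() == "AS":
            stage_names.add(args[2])
        # Assumption: ARG-parameterized images ("$VAR") are skipped, not resolved.
        if image == "scratch" or refers_to_stage or "$" in image:
            continue
        if image not in images:
            images.append(image)
    return images


def pull_base_images(dockerfile_path: str) -> None:
    """Run `docker pull` for every base image so the local copy is current."""
    with open(dockerfile_path, encoding="utf-8") as f:
        images = base_images(f.read())
    for image in images:
        # check=True makes a failed pull abort the build script early.
        subprocess.run(["docker", "pull", image], check=True)
```

A build script would call pull_base_images on the Dockerfile path right before invoking docker build, so that an outdated local ubuntu:16.04 does not invalidate the cached layers.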