Also, I forgot to mention another workaround: I added the -R flag to the build logic (build.py) so that the container is not rebuilt for manual use.
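
A minimal sketch of the short-term workaround for issue #1 in the quoted message below (explicitly pulling the base images named in FROM before building with --cache-from). This is not the code in build.py; the function names base_images, pull_commands and pull_base_images are hypothetical, and the only external command assumed is the standard `docker pull`.

```python
import re
import subprocess

# Matches "FROM [--platform=...] image [AS alias]" at the start of a line.
_FROM = re.compile(
    r"^\s*FROM\s+(?:--platform=\S+\s+)?(?P<image>\S+)(?:\s+AS\s+(?P<alias>\S+))?",
    re.IGNORECASE,
)


def base_images(dockerfile_text):
    """Return the external images named in FROM lines, in order, without duplicates."""
    images, aliases = [], set()
    for line in dockerfile_text.splitlines():
        match = _FROM.match(line)
        if not match:
            continue
        image = match.group("image")
        # Skip the empty base, unresolved build args, and earlier build stages.
        skip = image.lower() == "scratch" or "$" in image or image in aliases
        if not skip and image not in images:
            images.append(image)
        if match.group("alias"):
            aliases.add(match.group("alias"))
    return images


def pull_commands(dockerfile_text):
    """Build one 'docker pull' command line per base image."""
    return [["docker", "pull", image] for image in base_images(dockerfile_text)]


def pull_base_images(dockerfile_text, run=subprocess.check_call):
    """Refresh every base image so the FROM layer matches the published cache."""
    for command in pull_commands(dockerfile_text):
        run(command)
```

Calling pull_base_images(open(path).read()) before the docker build step would refresh ubuntu:16.04 (or whatever the Dockerfile names). The `run` parameter exists only so the commands can be inspected without a Docker daemon.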

On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>
> Hi Aaron.
>
> As Marco explained, if you are on master the cache usually works. There are
> two issues that I have observed:
>
> 1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04), so
> if your cached base, which is used in the FROM statement, becomes outdated,
> your caching won't work. Using docker pull ubuntu:16.04, or the base
> images from the container, helps with this.
>
> 2 - There's another situation where the above doesn't help, which seems to
> be an unidentified issue with the docker cache:
> https://github.com/docker/docker.github.io/issues/8886
>
> We can get a short-term workaround for #1 by explicitly pulling bases from
> the script, but I think docker should do it when using --cache-from, so
> maybe contributing a patch to docker would be the best approach.
>
> Pedro
>
> On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aaron.s.mark...@gmail.com>
> wrote:
>
>> When you create a new Dockerfile and use that on CI, it doesn't seem
>> to cache some of the steps... like this:
>>
>> Step 13/15 : RUN /work/ubuntu_docs.sh
>> ---> Running in a1e522f3283b
>> + echo 'Installing dependencies...'
>> + apt-get update
>> Installing dependencies.
>>
>> Or this....
>>
>> Step 4/13 : RUN /work/ubuntu_core.sh
>> ---> Running in e7882d7aa750
>> + apt-get update
>>
>> I'd understand it if I were changing those scripts, but then I'd think it
>> should cache after running it once... but, no.
>>
>>
>> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> wrote:
>> >
>> > Do I understand it correctly that you are saying that the Docker cache
>> > doesn't work properly and regularly reinstalls dependencies? Or do you mean
>> > that you only have cache misses when you modify the dependencies - which
>> > would be expected?
>> >
>> > -Marco
>> >
>> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aaron.s.mark...@gmail.com>
>> > wrote:
>> >
>> > > Many of the CI pipelines follow this pattern:
>> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
>> > > repeat steps 1-3 over and over?
>> > >
>> > > Now, some tests use a stashed binary and docker cache. And I see this work
>> > > locally, but for the most part, on CI, you're gonna sit through a
>> > > dependency install.
>> > >
>> > > I noticed that almost all jobs use an ubuntu setup that is fully loaded.
>> > > Without cache, it can take 10 or more minutes to build. So I made a lite
>> > > version. Takes only a few minutes instead.
>> > >
>> > > In some cases archiving worked great to share across pipelines, but as
>> > > Marco mentioned we need a storage solution to make that happen. We can't
>> > > archive every intermediate artifact for each PR.
>> > >
>> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Aaron. Why does that speed things up? What's the difference?
>> > > >
>> > > > Pedro.
>> > > >
>> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.mark...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > The PRs Thomas and I are working on for the new docs and website share
>> > > > > the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>> > > > >
>> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>> > > > >
>> > > > > > I see it done daily now, and while I can’t share all the details,
>> > > > > > it’s not an incredibly complex thing, and involves not much more than
>> > > > > > nfs/efs sharing and remote ssh commands. All it takes is a little
>> > > > > > ingenuity and some imagination.
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Sounds good in theory. I think there are complex details with regard to
>> > > > > > > resource sharing during parallel execution. Still, I think both ways can
>> > > > > > > be explored. I think some tests run for unreasonably long times for what
>> > > > > > > they are doing. We already scale parts of the pipeline horizontally
>> > > > > > > across workers.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > +1
>> > > > > > > >
>> > > > > > > > Rather than remove tests (which doesn’t scale as a solution), why not
>> > > > > > > > scale them horizontally so that they finish more quickly? Across
>> > > > > > > > processes or even on a pool of machines that aren’t necessarily the
>> > > > > > > > build machine?
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > With regard to time, I'd rather have us spend a bit more time on
>> > > > > > > > > maintenance than have somebody run into an error that could've been
>> > > > > > > > > caught with a test.
>> > > > > > > > >
>> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for quite
>> > > > > > > > > some time now, but nobody noticed that. Basically my stance on that
>> > > > > > > > > matter is that as soon as something is not blocking, you can also just
>> > > > > > > > > deactivate it, since you don't have a forcing function in an open
>> > > > > > > > > source project. People will rarely come back and fix the errors of
>> > > > > > > > > some nightly test that they introduced.
>> > > > > > > > >
>> > > > > > > > > -Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
>> > > > > > > > >
>> > > > > > > > > > If a language binding test is failing for an unimportant reason, then
>> > > > > > > > > > it is too brittle and needs to be fixed (we have fixed some of these
>> > > > > > > > > > with the Clojure package [1]).
>> > > > > > > > > > But in general, if we think of the MXNet project as one project that
>> > > > > > > > > > spans all the language bindings, then we want to know if some
>> > > > > > > > > > fundamental code change is going to break a downstream package.
>> > > > > > > > > > I can't speak for all the high-level package binding maintainers, but
>> > > > > > > > > > I'm always happy to pitch in to provide code fixes to help the base PR
>> > > > > > > > > > get green.
>> > > > > > > > > >
>> > > > > > > > > > The time costs of maintaining such a large CI project obviously need
>> > > > > > > > > > to be considered as well.
>> > > > > > > > > >
>> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > From what I have seen, Clojure is 15 minutes, which I think is
>> > > > > > > > > > > reasonable. The only question is that when a binding such as R, Perl
>> > > > > > > > > > > or Clojure fails, some devs are a bit confused about how to fix it,
>> > > > > > > > > > > since they are not familiar with the testing tools and the language.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Great idea Marco! Anything that you think would be valuable to share
>> > > > > > > > > > > > would be good. The duration of each node in the test stage sounds
>> > > > > > > > > > > > like a good start.
>> > > > > > > > > > > >
>> > > > > > > > > > > > - Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > we record a bunch of metrics about run statistics (down to the
>> > > > > > > > > > > > > duration of every individual step). If you tell me which ones
>> > > > > > > > > > > > > you're particularly interested in (probably total duration of each
>> > > > > > > > > > > > > node in the test stage), I'm happy to provide them.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > > > > > - job
>> > > > > > > > > > > > > - branch
>> > > > > > > > > > > > > - stage
>> > > > > > > > > > > > > - node
>> > > > > > > > > > > > > - step
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Unfortunately I don't have the possibility to export them since we
>> > > > > > > > > > > > > store them in CloudWatch Metrics, which afaik doesn't offer raw
>> > > > > > > > > > > > > exports.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > Marco
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > I would prefer to keep the language binding in the PR process.
>> > > > > > > > > > > > > > Perhaps we could do some analytics to see how much each of the
>> > > > > > > > > > > > > > language bindings is contributing to overall run time.
>> > > > > > > > > > > > > > If we have some metrics on that, maybe we can come up with a
>> > > > > > > > > > > > > > guideline of how much time each should take. Another possibility
>> > > > > > > > > > > > > > is to leverage the parallel builds more.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Carin.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > That's a good point. All things considered, would your preference
>> > > > > > > > > > > > > > > be to keep the Clojure tests as part of the PR process or in
>> > > > > > > > > > > > > > > Nightly?
>> > > > > > > > > > > > > > > Some options are having notifications here or in Slack. But if we
>> > > > > > > > > > > > > > > think breakages would go unnoticed, maybe it is not a good idea to
>> > > > > > > > > > > > > > > fully remove bindings from the PR process, and we should just
>> > > > > > > > > > > > > > > streamline the process.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com>
>> > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I think we need to
>> > > > > > > > > > > > > > > > figure out how the community can get proper notifications of
>> > > > > > > > > > > > > > > > failure and success on those nightly runs. Otherwise, I think
>> > > > > > > > > > > > > > > > that breakages would go unnoticed.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > -Carin
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com>
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the following
>> > > > > > > > > > > > > > > > > action items to remedy the situation and accelerate turnaround
>> > > > > > > > > > > > > > > > > times in CI, reduce cost, complexity and probability of failure
>> > > > > > > > > > > > > > > > > blocking PRs and frustrating developers:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
>> > > > > > > > > > > > > > > > > build_windows.py infrastructure should easily work with the new
>> > > > > > > > > > > > > > > > > version. Currently some PRs are blocked by this:
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit is
>> > > > > > > > > > > > > > > > > touching other bindings, the reviewer should ask for a full run
>> > > > > > > > > > > > > > > > > which can be done locally, use the label bot to trigger a full CI
>> > > > > > > > > > > > > > > > > build, or defer to nightly.
>> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small
>> > > > > > > > > > > > > > > > > models that are run on CI and can be echoed by the label bot as a
>> > > > > > > > > > > > > > > > > comment for PRs.
>> > > > > > > > > > > > > > > > > * Address unit tests that take more than 10-20s: streamline them,
>> > > > > > > > > > > > > > > > > or move them to nightly if that can't be done.
>> > > > > > > > > > > > > > > > > * Open source the remaining CI infrastructure scripts so the
>> > > > > > > > > > > > > > > > > community can contribute.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think our goal should be turnaround under 30min.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I would also like to raise with the community that some PRs are
>> > > > > > > > > > > > > > > > > not being followed up on by the committers who asked for changes.
>> > > > > > > > > > > > > > > > > For example, this PR is important and has been hanging for a long
>> > > > > > > > > > > > > > > > > time.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > This is another, less important but more trivial to review:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think committers requesting changes and not following up in
>> > > > > > > > > > > > > > > > > reasonable time is not healthy for the project. I suggest
>> > > > > > > > > > > > > > > > > configuring github Notifications for a good SNR and following up.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Pedro.
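
For the action item in the original message about unit tests that take more than 10-20s, a rough sketch of how candidates could be listed. It assumes the test runner writes a JUnit-style XML report in which each testcase element carries a time attribute in seconds; the function name slow_tests and the sample test names are hypothetical and not taken from the MXNet CI.

```python
import xml.etree.ElementTree as ET


def slow_tests(junit_xml, threshold_seconds=20.0):
    """Return (test name, seconds) pairs above the threshold, slowest first."""
    root = ET.fromstring(junit_xml)
    slow = []
    for case in root.iter("testcase"):
        # A missing time attribute is treated as zero seconds.
        seconds = float(case.get("time", "0"))
        if seconds > threshold_seconds:
            name = "{}.{}".format(case.get("classname", ""), case.get("name", ""))
            slow.append((name, seconds))
    return sorted(slow, key=lambda item: item[1], reverse=True)


# Hypothetical report with one fast and two slow tests.
SAMPLE_REPORT = """<testsuite>
  <testcase classname="tests.test_a" name="test_fast" time="0.5"/>
  <testcase classname="tests.test_a" name="test_slow" time="42.0"/>
  <testcase classname="tests.test_b" name="test_slower" time="95.25"/>
</testsuite>"""

for name, seconds in slow_tests(SAMPLE_REPORT):
    print("{:8.2f}s  {}".format(seconds, name))
```

The resulting list is only a starting point for deciding which tests to streamline and which to move to nightly; the 20-second default mirrors the upper end of the range suggested above.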