To parallelize across machines: for GluonNLP we started submitting test jobs to AWS Batch. Just adding a for-loop over the units in the Jenkinsfile [1] and submitting a job for each [2] works quite well. Jenkins then just waits for all jobs to finish and retrieves their status. This has been possible since AWS Batch added GPU support this April [3].
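The submit-and-wait loop described above could be sketched roughly as follows. This is not the GluonNLP script referenced as [2]; the function names, the `command_prefix` parameter, and the job naming scheme are illustrative assumptions. The `batch` argument is assumed to behave like a boto3 Batch client, of which only `submit_job` and `describe_jobs` are used.

```python
import re
import time


def submit_units(batch, units, job_queue, job_definition, command_prefix):
    """Submit one AWS Batch job per test unit and return {unit: jobId}.

    `command_prefix` is a hypothetical placeholder for however the tests
    are launched inside the container, e.g. ["python", "-m", "pytest"].
    """
    job_ids = {}
    for unit in units:
        # Batch job names may only contain letters, digits, "-" and "_".
        name = "ci-" + re.sub(r"[^A-Za-z0-9_-]", "-", unit)
        response = batch.submit_job(
            jobName=name,
            jobQueue=job_queue,
            jobDefinition=job_definition,
            containerOverrides={"command": list(command_prefix) + [unit]},
        )
        job_ids[unit] = response["jobId"]
    return job_ids


def wait_for_jobs(batch, job_ids, poll_seconds=30):
    """Poll until every job has SUCCEEDED or FAILED; return {unit: status}."""
    unit_by_id = {job_id: unit for unit, job_id in job_ids.items()}
    pending = set(unit_by_id)
    final = {}
    while pending:
        ids = sorted(pending)
        # describe_jobs accepts at most 100 job IDs per call.
        for start in range(0, len(ids), 100):
            response = batch.describe_jobs(jobs=ids[start:start + 100])
            for job in response["jobs"]:
                if job["status"] in ("SUCCEEDED", "FAILED"):
                    final[unit_by_id[job["jobId"]]] = job["status"]
                    pending.discard(job["jobId"])
        if pending:
            time.sleep(poll_seconds)
    return final
```

With boto3 installed, `batch` would be `boto3.client("batch")`; the calling CI step can then fail the stage if any returned status is `FAILED`. Taking the client as a parameter keeps the sketch testable without AWS credentials.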
For MXNet, naively parallelizing over the files that define the test cases in the longest-running Pipeline stage may already help?

[1]: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
[2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
[3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Marco de Abreu <marco.g.ab...@gmail.com> writes:

> The first step wrt parallelization could certainly be to start adding parallel test execution in nosetests.
>
> -Marco
>
> Aaron Markham <aaron.s.mark...@gmail.com> wrote on Thu, Aug 15, 2019, 05:39:
>
>> The PRs Thomas and I are working on for the new docs and website share the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>
>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>>
>> > I see it done daily now, and while I can’t share all the details, it’s not an incredibly complex thing, and involves not much more than nfs/efs sharing and remote ssh commands. All it takes is a little ingenuity and some imagination.
>> >
>> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> >
>> > > Sounds good in theory. I think there are complex details with regard to resource sharing during parallel execution. Still, I think both ways can be explored. I think some tests run for unreasonably long times for what they are doing. We already scale parts of the pipeline horizontally across workers.
>> > >
>> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Rather than remove tests (which doesn’t scale as a solution), why not scale them horizontally so that they finish more quickly? Across processes or even on a pool of machines that aren’t necessarily the build machine?
>> > > >
>> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>> > > >
>> > > > > With regard to time, I would rather we spend a bit more time on maintenance than have somebody run into an error that could've been caught with a test.
>> > > > >
>> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for quite some time now, but nobody noticed that. Basically my stance on that matter is that as soon as something is not blocking, you can also just deactivate it since you don't have a forcing function in an open source project. People will rarely come back and fix the errors of some nightly test that they introduced.
>> > > > >
>> > > > > -Marco
>> > > > >
>> > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
>> > > > >
>> > > > > > If a language binding test is failing for an unimportant reason, then it is too brittle and needs to be fixed (we have fixed some of these with the Clojure package [1]).
>> > > > > > But in general, if we think of the MXNet project as one project that spans all the language bindings, then we want to know if some fundamental code change is going to break a downstream package.
>> > > > > > I can't speak for all the high-level package binding maintainers, but I'm always happy to pitch in to provide code fixes to help the base PR get green.
>> > > > > >
>> > > > > > The time cost of maintaining such a large CI project obviously needs to be considered as well.
>> > > > > >
>> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> > > > > >
>> > > > > > > From what I have seen Clojure is 15 minutes, which I think is reasonable. The only question is that when a binding such as R, Perl or Clojure fails, some devs are a bit confused about how to fix them since they are not familiar with the testing tools and the language.
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > > Great idea Marco! Anything that you think would be valuable to share would be good. The duration of each node in the test stage sounds like a good start.
>> > > > > > > >
>> > > > > > > > - Carin
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > we record a bunch of metrics about run statistics (down to the duration of every individual step). If you tell me which ones you're particularly interested in (probably total duration of each node in the test stage), I'm happy to provide them.
>> > > > > > > > >
>> > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > - job
>> > > > > > > > > - branch
>> > > > > > > > > - stage
>> > > > > > > > > - node
>> > > > > > > > > - step
>> > > > > > > > >
>> > > > > > > > > Unfortunately I can't export them, since we store them in CloudWatch Metrics, which afaik doesn't offer raw exports.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > > Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
>> > > > > > > > >
>> > > > > > > > > > I would prefer to keep the language bindings in the PR process. Perhaps we could do some analytics to see how much each of the language bindings is contributing to overall run time.
>> > > > > > > > > > If we have some metrics on that, maybe we can come up with a guideline of how much time each should take. Another possibility is to leverage the parallel builds more.
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Carin.
>> > > > > > > > > > >
>> > > > > > > > > > > That's a good point. All things considered, would your preference be to keep the Clojure tests as part of the PR process or in Nightly?
>> > > > > > > > > > > Some options are having notifications here or in slack.
>> > > > > > > > > > > But if we think breakages would go unnoticed, maybe it is not a good idea to fully remove bindings from the PR process, and we should just streamline the process.
>> > > > > > > > > > >
>> > > > > > > > > > > Pedro.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com> wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Before any binding tests are moved to nightly, I think we need to figure out how the community can get proper notifications of failure and success on those nightly runs. Otherwise, I think that breakages would go unnoticed.
>> > > > > > > > > > > >
>> > > > > > > > > > > > -Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose the following action items to remedy the situation and accelerate turnaround times in CI, reduce cost, complexity and probability of failure blocking PRs and frustrating developers:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > * Upgrade Windows Visual Studio from VS 2015 to VS 2017.
>> > > > > > > > > > > > >   The build_windows.py infrastructure should easily work with the new version. Currently some PRs are blocked by this: https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a commit is touching other bindings, the reviewer should ask for a full run which can be done locally, use the label bot to trigger a full CI build, or defer to nightly.
>> > > > > > > > > > > > > * Provide a couple of basic sanity performance tests on small models that are run on CI and can be echoed by the label bot as a comment for PRs.
>> > > > > > > > > > > > > * Address unit tests that take more than 10-20s, streamline them or move them to nightly if it can't be done.
>> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so the community can contribute.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I would also like to touch base with the community: some PRs are not being followed up by the committers who asked for changes. For example, this PR is important and has been hanging for a long time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This is another, less important but more trivial to review:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think committers requesting changes and not following up in reasonable time is not healthy for the project. I suggest configuring GitHub Notifications for a good SNR and following up.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Pedro.