The AWS Batch approach should also help with hardware utilization as machines are launched only when needed :)
-sz

> On Aug 15, 2019, at 9:11 AM, Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>
> Thanks Leonard. Naively dividing by test files would certainly be an easy
> and doable way before going into proper nose parallelization. Great idea!
>
> Scalability in terms of nodes is not an issue. Our system can handle at
> least 600 slaves (didn't want to go higher for obvious reasons). But I
> think we don't even have to go that far, because most of the time our
> machines are heavily underutilized due to the single-threaded nature of
> most tests. Thus, parallel test execution on the same machine would
> already speed up the process significantly.
>
> -Marco
>
> P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
> my approach is considered helpful :)
>
> Leonard Lausen <l-softw...@lausen.nl> wrote on Thu, Aug 15, 2019, 18:59:
>
>> To parallelize across machines: For GluonNLP we started submitting test
>> jobs to AWS Batch. Just adding a for-loop over the units in the
>> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
>> Jenkins just waits for all jobs to finish and retrieves their status.
>> This works since AWS Batch added GPU support this April [3].
>>
>> For MXNet, naively parallelizing over the files defining the test cases
>> that are in the longest running Pipeline stage may already help?
>>
>> [1]: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
>> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
>> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>>
>> Marco de Abreu <marco.g.ab...@gmail.com> writes:
>>
>>> The first start wrt parallelization could certainly be to start adding
>>> parallel test execution in nosetests.
>>>
>>> -Marco
>>>
>>> Aaron Markham <aaron.s.mark...@gmail.com> wrote on Thu, Aug 15, 2019, 05:39:
>>>
>>>> The PRs Thomas and I are working on for the new docs and website share
>>>> the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>>>
>>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>>>>
>>>>> I see it done daily now, and while I can’t share all the details, it’s
>>>>> not an incredibly complex thing, and involves not much more than
>>>>> nfs/efs sharing and remote ssh commands. All it takes is a little
>>>>> ingenuity and some imagination.
>>>>>
>>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>
>>>>>> Sounds good in theory. I think there are complex details with regard
>>>>>> to resource sharing during parallel execution. Still, I think both
>>>>>> ways can be explored. I think some tests run for unreasonably long
>>>>>> times for what they are doing. We already scale parts of the pipeline
>>>>>> horizontally across workers.
>>>>>>
>>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Rather than remove tests (which doesn’t scale as a solution), why not
>>>>>>> scale them horizontally so that they finish more quickly? Across
>>>>>>> processes or even on a pool of machines that aren’t necessarily the
>>>>>>> build machine?
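
A pool of machines that aren't the build machine is essentially what Leonard describes above with AWS Batch. The sketch below illustrates that submit-and-wait pattern with boto3; it is not GluonNLP's actual submit-job.py, and the region, job queue, job definition, test-unit names and run script are made-up placeholders.

# Sketch: fan test units out to AWS Batch and wait for the results.
# Queue name, job definition, unit names and run script are placeholders.
import time
import boto3

batch = boto3.client("batch", region_name="us-west-2")

UNITS = ["unittest_gpu", "integration_gpu", "tutorials"]  # hypothetical test groups

job_ids = []
for unit in UNITS:
    response = batch.submit_job(
        jobName=f"mxnet-ci-{unit}",
        jobQueue="ci-gpu-queue",           # placeholder queue name
        jobDefinition="mxnet-ci-job:1",    # placeholder job definition
        containerOverrides={"command": ["ci/run_tests.sh", unit]},
    )
    job_ids.append(response["jobId"])

# Poll until every job reaches a terminal state; fail the build if any failed.
while True:
    jobs = batch.describe_jobs(jobs=job_ids)["jobs"]
    states = {job["jobId"]: job["status"] for job in jobs}
    if all(s in ("SUCCEEDED", "FAILED") for s in states.values()):
        break
    time.sleep(30)

failed = [job_id for job_id, s in states.items() if s == "FAILED"]
if failed:
    raise SystemExit(f"Batch jobs failed: {failed}")

Because each unit becomes its own Batch job, instances only run while jobs are queued, which is also where the utilization benefit mentioned at the top of this thread comes from.
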
>>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>>
>>>>>>>> With regards to time, I'd rather we spend a bit more time on
>>>>>>>> maintenance than have somebody run into an error that could've been
>>>>>>>> caught with a test.
>>>>>>>>
>>>>>>>> I mean, our publishing pipeline for Scala GPU has been broken for
>>>>>>>> quite some time now, but nobody noticed that. Basically my stance on
>>>>>>>> that matter is that as soon as something is not blocking, you can
>>>>>>>> also just deactivate it, since you don't have a forcing function in
>>>>>>>> an open source project. People will rarely come back and fix the
>>>>>>>> errors of some nightly test that they introduced.
>>>>>>>>
>>>>>>>> -Marco
>>>>>>>>
>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 21:59:
>>>>>>>>
>>>>>>>>> If a language binding test is failing for an unimportant reason,
>>>>>>>>> then it is too brittle and needs to be fixed (we have fixed some of
>>>>>>>>> these with the Clojure package [1]).
>>>>>>>>> But in general, if we think of the MXNet project as one project
>>>>>>>>> that spans all the language bindings, then we want to know if some
>>>>>>>>> fundamental code change is going to break a downstream package.
>>>>>>>>> I can't speak for all the high-level package binding maintainers,
>>>>>>>>> but I'm always happy to pitch in to provide code fixes to help the
>>>>>>>>> base PR get green.
>>>>>>>>>
>>>>>>>>> The time cost of maintaining such a large CI project obviously
>>>>>>>>> needs to be considered as well.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
>>>>>>>>>
>>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> From what I have seen, Clojure is 15 minutes, which I think is
>>>>>>>>>> reasonable. The only question is that when a binding such as R,
>>>>>>>>>> Perl or Clojure fails, some devs are a bit confused about how to
>>>>>>>>>> fix it, since they are not familiar with the testing tools and
>>>>>>>>>> the language.
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Great idea Marco! Anything that you think would be valuable to
>>>>>>>>>>> share would be good. The duration of each node in the test stage
>>>>>>>>>>> sounds like a good start.
>>>>>>>>>>>
>>>>>>>>>>> - Carin
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> we record a bunch of metrics about run statistics (down to the
>>>>>>>>>>>> duration of every individual step). If you tell me which ones
>>>>>>>>>>>> you're particularly interested in (probably the total duration
>>>>>>>>>>>> of each node in the test stage), I'm happy to provide them.
>>>>>>>>>>>>
>>>>>>>>>>>> Dimensions are (in hierarchical order):
>>>>>>>>>>>> - job
>>>>>>>>>>>> - branch
>>>>>>>>>>>> - stage
>>>>>>>>>>>> - node
>>>>>>>>>>>> - step
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately I don't have the possibility to export them, since
>>>>>>>>>>>> we store them in CloudWatch Metrics, which afaik doesn't offer
>>>>>>>>>>>> raw exports.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Marco
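
Even without a raw export, the per-step durations Marco describes could be pulled programmatically through the CloudWatch API to get the per-binding run-time numbers Carin asks about. A minimal sketch along those lines, assuming hypothetical namespace, metric and dimension values rather than the real CI schema:

# Sketch: query per-node test-stage durations from CloudWatch Metrics.
# Namespace, metric name and dimension values are assumptions.
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

response = cloudwatch.get_metric_statistics(
    Namespace="MXNetCI",                    # hypothetical namespace
    MetricName="StageDuration",             # hypothetical metric
    Dimensions=[
        {"Name": "job", "Value": "mxnet-validation"},  # hypothetical values
        {"Name": "branch", "Value": "master"},
        {"Name": "stage", "Value": "test"},
        {"Name": "node", "Value": "unittest-python3-gpu"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=86400,                           # one datapoint per day
    Statistics=["Average", "Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
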
>>>>>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed, Aug 14, 2019, 19:43:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would prefer to keep the language bindings in the PR process.
>>>>>>>>>>>>> Perhaps we could do some analytics to see how much each of the
>>>>>>>>>>>>> language bindings is contributing to overall run time.
>>>>>>>>>>>>> If we have some metrics on that, maybe we can come up with a
>>>>>>>>>>>>> guideline of how much time each should take. Another
>>>>>>>>>>>>> possibility is to leverage the parallel builds more.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Carin.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's a good point. All things considered, would your
>>>>>>>>>>>>>> preference be to keep the Clojure tests as part of the PR
>>>>>>>>>>>>>> process or in Nightly?
>>>>>>>>>>>>>> Some options are having notifications here or in Slack. But if
>>>>>>>>>>>>>> we think breakages would go unnoticed, maybe it is not a good
>>>>>>>>>>>>>> idea to fully remove bindings from the PR process, and we
>>>>>>>>>>>>>> should just streamline the process instead.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I think we need
>>>>>>>>>>>>>>> to figure out how the community can get proper notifications
>>>>>>>>>>>>>>> of failure and success on those nightly runs. Otherwise, I
>>>>>>>>>>>>>>> think that breakages would go unnoticed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Carin
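
For the nightly-notification gap Carin raises (and the Slack option Pedro mentions), one low-effort approach would be a small script run as the final step of the nightly job that posts the result to a Slack incoming webhook. A rough sketch; the webhook secret and the way the result is passed in are assumptions, while BUILD_URL and JOB_NAME are standard Jenkins environment variables.

# Sketch: post the nightly result to Slack via an incoming webhook.
# SLACK_WEBHOOK_URL is assumed to be injected as a secret; the result is
# assumed to be passed as the first command-line argument.
import os
import sys
import requests

webhook_url = os.environ["SLACK_WEBHOOK_URL"]
job_name = os.environ.get("JOB_NAME", "nightly")
build_url = os.environ.get("BUILD_URL", "")
result = sys.argv[1] if len(sys.argv) > 1 else "UNKNOWN"  # e.g. SUCCESS / FAILURE

message = {
    "text": f"{job_name}: nightly run finished with status {result}\n{build_url}"
}
response = requests.post(webhook_url, json=message, timeout=10)
response.raise_for_status()
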
>>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I propose the
>>>>>>>>>>>>>>>> following action items to remedy the situation, accelerate
>>>>>>>>>>>>>>>> turnaround times in CI, and reduce cost, complexity and the
>>>>>>>>>>>>>>>> probability of failures blocking PRs and frustrating
>>>>>>>>>>>>>>>> developers:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
>>>>>>>>>>>>>>>> build_windows.py infrastructure should easily work with the
>>>>>>>>>>>>>>>> new version. Currently some PRs are blocked by this:
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/13958
>>>>>>>>>>>>>>>> * Move Gluon Model zoo tests to nightly. Tracked at
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/15295
>>>>>>>>>>>>>>>> * Move non-Python binding tests to nightly. If a commit is
>>>>>>>>>>>>>>>> touching other bindings, the reviewer should ask for a full
>>>>>>>>>>>>>>>> run, which can be done locally, use the label bot to trigger
>>>>>>>>>>>>>>>> a full CI build, or defer to nightly.
>>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance tests on small
>>>>>>>>>>>>>>>> models that are run on CI and can be echoed by the label bot
>>>>>>>>>>>>>>>> as a comment for PRs.
>>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s: streamline
>>>>>>>>>>>>>>>> them, or move them to nightly if that can't be done.
>>>>>>>>>>>>>>>> * Open source the remaining CI infrastructure scripts so the
>>>>>>>>>>>>>>>> community can contribute.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think our goal should be a turnaround under 30 min.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would also like to touch base with the community about some
>>>>>>>>>>>>>>>> PRs that are not being followed up on by committers who asked
>>>>>>>>>>>>>>>> for changes. For example, this PR is important and has been
>>>>>>>>>>>>>>>> hanging for a long time:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/15051
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is another, less important but more trivial to review:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/14940
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think committers requesting changes and not following up in
>>>>>>>>>>>>>>>> a reasonable time is not healthy for the project. I suggest
>>>>>>>>>>>>>>>> configuring GitHub notifications for a good SNR and following
>>>>>>>>>>>>>>>> up.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Pedro.
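
On Marco's nosetests suggestion and the action item above about tests taking more than 10-20s: nose ships a multiprocess plugin, so independent test modules can opt in to having their tests split across several worker processes on the same machine. A minimal sketch of what that opt-in looks like; the module contents, file path, process count and timeout are illustrative.

# Sketch: opt a test module into nose's multiprocess plugin so its tests can
# be dispatched to worker processes. Example invocation (values illustrative):
#   nosetests --processes=8 --process-timeout=1800 tests/python/unittest/test_example.py
import mxnet as mx

# Tells the multiprocess plugin that tests in this module are independent and
# may be split across worker processes.
_multiprocess_can_split_ = True


def test_elementwise_add():
    a = mx.nd.ones((2, 3))
    b = mx.nd.ones((2, 3))
    assert (a + b).asnumpy().sum() == 12


def test_reshape():
    x = mx.nd.arange(6).reshape((2, 3))
    assert x.shape == (2, 3)

Modules with shared, order-dependent fixtures would need to stay serial, so this is something to roll out module by module rather than flipping on globally.
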