No worries, auto scaling is taking care of that :)

-Marco

Sheng Zha <szha....@gmail.com> wrote on Thu., Aug. 15, 2019, 19:50:

> The AWS Batch approach should also help with hardware utilization, as
> machines are launched only when needed :)
>
> -sz
>
> On Aug 15, 2019, at 9:11 AM, Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>
>> Thanks Leonard. Naively dividing by test files would certainly be an easy
>> and doable way before going into proper nose parallelization. Great idea!
>>
>> Scalability in terms of nodes is not an issue. Our system can handle at
>> least 600 slaves (didn't want to go higher for obvious reasons). But I
>> think we don't even have to go that far, because most of the time our
>> machines are heavily underutilized due to the single-threaded nature of
>> most tests. Thus, parallel test execution on the same machine would
>> already speed up the process considerably.
>>
>> -Marco
>>
>> P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
>> my approach is considered helpful :)
>>
>> Leonard Lausen <l-softw...@lausen.nl> wrote on Thu., Aug. 15, 2019, 18:59:
>>
>>> To parallelize across machines: for GluonNLP we started submitting test
>>> jobs to AWS Batch. Just adding a for-loop over the units in the
>>> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
>>> Jenkins just waits for all jobs to finish and retrieves their status.
>>> This works since AWS Batch added GPU support this April [3].
>>>
>>> For MXNet, naively parallelizing over the files defining the test cases
>>> that are in the longest running pipeline stage may already help?
>>>
>>> [1]: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
>>> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
>>> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>>>
>>> Marco de Abreu <marco.g.ab...@gmail.com> writes:
>>>
>>>> The first step wrt parallelization could certainly be to start adding
>>>> parallel test execution in nosetests.
>>>>
>>>> -Marco
>>>>
>>>> Aaron Markham <aaron.s.mark...@gmail.com> wrote on Thu., Aug. 15, 2019, 05:39:
>>>>
>>>>> The PRs Thomas and I are working on for the new docs and website share
>>>>> the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>>>>
>>>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>>>>>
>>>>>> I see it done daily now, and while I can't share all the details, it's
>>>>>> not an incredibly complex thing, and involves not much more than
>>>>>> nfs/efs sharing and remote ssh commands. All it takes is a little
>>>>>> ingenuity and some imagination.
>>>>>>
>>>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>
>>>>>>> Sounds good in theory. I think there are complex details with regard
>>>>>>> to resource sharing during parallel execution. Still, I think both
>>>>>>> ways can be explored. I think some tests run for unreasonably long
>>>>>>> times for what they are doing. We already scale parts of the pipeline
>>>>>>> horizontally across workers.
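
For concreteness, a rough sketch of the per-test-file AWS Batch submission Leonard describes above, in the spirit of GluonNLP's ci/batch/submit-job.py. The job queue, job definition, test paths and nosetests command are made-up placeholders, not our actual CI configuration:

    import glob
    import boto3

    batch = boto3.client("batch", region_name="us-west-2")

    # One Batch job per test file in the slowest pipeline stage (placeholder path).
    test_files = sorted(glob.glob("tests/python/unittest/test_*.py"))

    job_ids = []
    for test_file in test_files:
        response = batch.submit_job(
            jobName=test_file.replace("/", "-").replace(".", "-"),
            jobQueue="mxnet-ci-gpu",            # hypothetical queue name
            jobDefinition="mxnet-ci-unittest",  # hypothetical job definition
            containerOverrides={
                "command": ["nosetests", "--verbose", test_file],
            },
        )
        job_ids.append(response["jobId"])

    # Jenkins (or this script) would then poll describe_jobs until every job
    # reaches SUCCEEDED or FAILED and propagate the overall status.
    print("Submitted %d jobs: %s" % (len(job_ids), job_ids))
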
>>>>>>>
>>>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Rather than remove tests (which doesn't scale as a solution), why not
>>>>>>>> scale them horizontally so that they finish more quickly? Across
>>>>>>>> processes, or even on a pool of machines that aren't necessarily the
>>>>>>>> build machine?
>>>>>>>>
>>>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> With regards to time, I would rather have us spend a bit more time
>>>>>>>>> on maintenance than have somebody run into an error that could've
>>>>>>>>> been caught with a test.
>>>>>>>>>
>>>>>>>>> I mean, our publishing pipeline for Scala GPU has been broken for
>>>>>>>>> quite some time now, but nobody noticed that. Basically my stance on
>>>>>>>>> that matter is that as soon as something is not blocking, you can
>>>>>>>>> also just deactivate it, since you don't have a forcing function in
>>>>>>>>> an open source project. People will rarely come back and fix the
>>>>>>>>> errors of some nightly test that they introduced.
>>>>>>>>>
>>>>>>>>> -Marco
>>>>>>>>>
>>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed., Aug. 14, 2019, 21:59:
>>>>>>>>>
>>>>>>>>>> If a language binding test is failing for an unimportant reason,
>>>>>>>>>> then it is too brittle and needs to be fixed (we have fixed some of
>>>>>>>>>> these with the Clojure package [1]).
>>>>>>>>>> But in general, if we think of the MXNet project as one project that
>>>>>>>>>> spans all the language bindings, then we want to know if some
>>>>>>>>>> fundamental code change is going to break a downstream package.
>>>>>>>>>> I can't speak for all the high-level package binding maintainers,
>>>>>>>>>> but I'm always happy to pitch in to provide code fixes to help the
>>>>>>>>>> base PR get green.
>>>>>>>>>>
>>>>>>>>>> The time cost of maintaining such a large CI project obviously needs
>>>>>>>>>> to be considered as well.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> From what I have seen, Clojure is 15 minutes, which I think is
>>>>>>>>>>> reasonable. The only question is that when a binding such as R,
>>>>>>>>>>> Perl or Clojure fails, some devs are a bit confused about how to
>>>>>>>>>>> fix it, since they are not familiar with the testing tools and the
>>>>>>>>>>> language.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Great idea, Marco! Anything that you think would be valuable to
>>>>>>>>>>>> share would be good. The duration of each node in the test stage
>>>>>>>>>>>> sounds like a good start.
>>>>>>>>>>>>
>>>>>>>>>>>> - Carin
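
To make the in-process parallelization Marco and Chris mention more concrete: nose ships a multiprocess plugin, so a first experiment could be as small as the sketch below. The test directory and timeout are placeholders, and not every MXNet test is necessarily safe to run concurrently, so this is only an illustration:

    import multiprocessing
    import nose

    # Run the unit test suite with one worker process per CPU core.
    # --process-timeout guards against a single hung test blocking the run.
    argv = [
        "nosetests",
        "tests/python/unittest",              # placeholder test directory
        "--processes=%d" % multiprocessing.cpu_count(),
        "--process-timeout=1800",
        "--verbose",
    ]

    if __name__ == "__main__":
        nose.run(argv=argv)
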
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> we record a bunch of metrics about run statistics (down to the
>>>>>>>>>>>>> duration of every individual step). If you tell me which ones
>>>>>>>>>>>>> you're particularly interested in (probably total duration of
>>>>>>>>>>>>> each node in the test stage), I'm happy to provide them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dimensions are (in hierarchical order):
>>>>>>>>>>>>> - job
>>>>>>>>>>>>> - branch
>>>>>>>>>>>>> - stage
>>>>>>>>>>>>> - node
>>>>>>>>>>>>> - step
>>>>>>>>>>>>>
>>>>>>>>>>>>> Unfortunately I don't have the possibility to export them, since
>>>>>>>>>>>>> we store them in CloudWatch Metrics, which afaik doesn't offer
>>>>>>>>>>>>> raw exports.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Marco
>>>>>>>>>>>>>
>>>>>>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed., Aug. 14, 2019, 19:43:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would prefer to keep the language bindings in the PR process.
>>>>>>>>>>>>>> Perhaps we could do some analytics to see how much each of the
>>>>>>>>>>>>>> language bindings is contributing to overall run time.
>>>>>>>>>>>>>> If we have some metrics on that, maybe we can come up with a
>>>>>>>>>>>>>> guideline of how much time each should take. Another possibility
>>>>>>>>>>>>>> is to leverage the parallel builds more.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Carin.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That's a good point. All things considered, would your
>>>>>>>>>>>>>>> preference be to keep the Clojure tests as part of the PR
>>>>>>>>>>>>>>> process or in nightly? Some options are having notifications
>>>>>>>>>>>>>>> here or in Slack. But if we think breakages would go unnoticed,
>>>>>>>>>>>>>>> maybe it is not a good idea to fully remove bindings from the
>>>>>>>>>>>>>>> PR process, and we should just streamline the process instead.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I think we need
>>>>>>>>>>>>>>>> to figure out how the community can get proper notifications
>>>>>>>>>>>>>>>> of failure and success on those nightly runs. Otherwise, I
>>>>>>>>>>>>>>>> think that breakages would go unnoticed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Carin
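
Regarding Marco's note that CloudWatch Metrics has no raw export: individual datapoints can still be pulled per dimension through the API, for example to get per-node durations for the test stage. The namespace, metric name and dimension values below are hypothetical placeholders, since the actual names used by our CI metrics were not given in this thread:

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

    # Average and maximum duration of one test-stage node over the last week.
    # Namespace, metric and dimension values are made-up placeholders.
    stats = cloudwatch.get_metric_statistics(
        Namespace="MXNetCI",
        MetricName="StageDuration",
        Dimensions=[
            {"Name": "job", "Value": "mxnet-validation"},
            {"Name": "branch", "Value": "master"},
            {"Name": "stage", "Value": "test"},
            {"Name": "node", "Value": "unittest-gpu"},
        ],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=7),
        EndTime=datetime.datetime.utcnow(),
        Period=3600,
        Statistics=["Average", "Maximum"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"], point["Maximum"])
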
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I propose the
>>>>>>>>>>>>>>>>> following action items to remedy the situation and accelerate
>>>>>>>>>>>>>>>>> turnaround times in CI, reduce cost, complexity and the
>>>>>>>>>>>>>>>>> probability of failures blocking PRs and frustrating developers:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> * Upgrade Windows Visual Studio from VS 2015 to VS 2017. The
>>>>>>>>>>>>>>>>> build_windows.py infrastructure should easily work with the
>>>>>>>>>>>>>>>>> new version. Currently some PRs are blocked by this:
>>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/13958
>>>>>>>>>>>>>>>>> * Move Gluon Model Zoo tests to nightly. Tracked at
>>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/15295
>>>>>>>>>>>>>>>>> * Move non-Python binding tests to nightly. If a commit is
>>>>>>>>>>>>>>>>> touching other bindings, the reviewer should ask for a full
>>>>>>>>>>>>>>>>> run, which can be done locally, use the label bot to trigger a
>>>>>>>>>>>>>>>>> full CI build, or defer to nightly.
>>>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance tests on small
>>>>>>>>>>>>>>>>> models that are run on CI and can be echoed by the label bot
>>>>>>>>>>>>>>>>> as a comment for PRs.
>>>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s: streamline
>>>>>>>>>>>>>>>>> them, or move them to nightly if that can't be done.
>>>>>>>>>>>>>>>>> * Open source the remaining CI infrastructure scripts so the
>>>>>>>>>>>>>>>>> community can contribute.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think our goal should be a turnaround under 30 min.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would also like to touch base with the community about some
>>>>>>>>>>>>>>>>> PRs not being followed up on by committers who asked for
>>>>>>>>>>>>>>>>> changes. For example, this PR is important and has been
>>>>>>>>>>>>>>>>> hanging for a long time:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/15051
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is another, less important but more trivial to review:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/14940
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think committers requesting changes and not following up in
>>>>>>>>>>>>>>>>> a reasonable time is not healthy for the project. I suggest
>>>>>>>>>>>>>>>>> configuring GitHub notifications for a good SNR and following up.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Pedro.
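
On Pedro's item about unit tests that take more than 10-20s: one low-effort way to find the offenders is to parse the per-test timings from the xUnit report nosetests can already emit (nosetests --with-xunit --xunit-file=nosetests.xml). A minimal sketch, where the report file name and threshold are just placeholders:

    import xml.etree.ElementTree as ET

    THRESHOLD_SECONDS = 20.0  # placeholder cut-off taken from the 10-20s proposal

    # nosetests --with-xunit --xunit-file=nosetests.xml writes one <testcase>
    # element per test, each with a 'time' attribute in seconds.
    tree = ET.parse("nosetests.xml")

    slow = []
    for case in tree.getroot().iter("testcase"):
        duration = float(case.get("time", "0"))
        if duration > THRESHOLD_SECONDS:
            slow.append((duration, case.get("classname"), case.get("name")))

    # Print the slowest tests first, as candidates for streamlining or nightly.
    for duration, classname, name in sorted(slow, reverse=True):
        print("%8.1fs  %s.%s" % (duration, classname, name))
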