The AWS Batch approach should also help with hardware utilization, since
machines are launched only when needed :)

-sz

> On Aug 15, 2019, at 9:11 AM, Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> 
> Thanks Leonard. Naively dividing by test files would certainly be an easy
> and doable way to start before going into proper nose parallelization.
> Great idea!
> 
> Scalability in terms of nodes is not an issue. Our system can handle at
> least 600 slaves (we didn't want to go higher for obvious reasons). But I
> think we don't even have to go that far: most of the time, our machines
> are heavily underutilized due to the single-threaded nature of most tests.
> Thus, parallel test execution on the same machine would already speed up
> the process considerably.
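> 
> A minimal sketch of what that same-machine parallelism could look like
> with nose's bundled multiprocess plugin (the test path, worker count and
> timeout below are placeholders, not our actual settings):
> 
>     # run the unit test suite across several worker processes using
>     # nose's multiprocess plugin
>     import nose
> 
>     nose.run(argv=[
>         "nosetests",
>         "tests/python/unittest",   # hypothetical test directory
>         "--processes=8",           # number of parallel workers
>         "--process-timeout=600",   # per-test timeout in seconds
>     ])
> 
> The same effect comes from passing --processes on the nosetests command
> line; tests that share global state would need to be marked or excluded.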
> 
> -Marco
> 
> P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
> my approach is considered helpful :)
> 
> Leonard Lausen <l-softw...@lausen.nl> wrote on Thu., 15 Aug. 2019, 18:59:
> 
>> To parallelize across machines: For GluonNLP we started submitting test
>> jobs to AWS Batch. Just adding a for-loop over the units in the
>> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
>> Jenkins just waits for all jobs to finish and retrieves their status.
>> This works since AWS Batch added GPU support this April [3].
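>> 
>> A rough sketch of that submit-and-wait loop (queue and job definition
>> names below are placeholders, not the actual GluonNLP setup):
>> 
>>     # submit one AWS Batch job per test unit, then poll until all
>>     # of them reach a terminal state
>>     import time
>>     import boto3
>> 
>>     batch = boto3.client("batch")
>>     units = ["test_models", "test_data"]  # hypothetical test units
>> 
>>     job_ids = []
>>     for unit in units:
>>         resp = batch.submit_job(
>>             jobName="ci-%s" % unit,
>>             jobQueue="ci-gpu-queue",        # placeholder queue
>>             jobDefinition="ci-gpu-jobdef",  # placeholder definition
>>             containerOverrides={"command": ["nosetests", unit]},
>>         )
>>         job_ids.append(resp["jobId"])
>> 
>>     while True:
>>         jobs = batch.describe_jobs(jobs=job_ids)["jobs"]
>>         if all(j["status"] in ("SUCCEEDED", "FAILED") for j in jobs):
>>             break
>>         time.sleep(30)
>> 
>>     # fail the pipeline if any job did not succeed
>>     assert all(j["status"] == "SUCCEEDED" for j in jobs)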
>> 
>> For MXNet, naively parallelizing over the files defining the test cases
>> that are in the longest running Pipeline stage may already help?
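>> 
>> For that naive split, something along these lines could shard the test
>> files of the slowest stage across parallel workers (the path and shard
>> count are hypothetical):
>> 
>>     # assign test files round-robin to N shards; each shard can then
>>     # run as its own parallel CI node
>>     import glob
>> 
>>     N = 4
>>     files = sorted(glob.glob("tests/python/unittest/test_*.py"))
>>     shards = [files[i::N] for i in range(N)]
>>     for i, shard in enumerate(shards):
>>         print("shard %d: nosetests %s" % (i, " ".join(shard)))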
>> 
>> [1]:
>> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
>> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
>> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>> 
>> Marco de Abreu <marco.g.ab...@gmail.com> writes:
>> 
>>> A first step wrt parallelization could certainly be adding parallel
>>> test execution in nosetests.
>>> 
>>> -Marco
>>> 
>>> Aaron Markham <aaron.s.mark...@gmail.com> wrote on Thu., 15 Aug. 2019, 05:39:
>>> 
>>>> The PRs Thomas and I are working on for the new docs and website share
>>>> the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>>> 
>>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cjolivie...@gmail.com> wrote:
>>>> 
>>>>> I see it done daily now, and while I can’t share all the details, it’s
>>>>> not an incredibly complex thing, and involves not much more than
>>>>> nfs/efs sharing and remote ssh commands. All it takes is a little
>>>>> ingenuity and some imagination.
>>>>> 
>>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>> 
>>>>>> Sounds good in theory. I think there are complex details with regard
>>>>>> to resource sharing during parallel execution. Still, I think both
>>>>>> ways can be explored. I think some tests run for unreasonably long
>>>>>> times for what they are doing. We already scale parts of the pipeline
>>>>>> horizontally across workers.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cjolivie...@apache.org> wrote:
>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> Rather than remove tests (which doesn’t scale as a solution), why
>>>>>>> not scale them horizontally so that they finish more quickly? Across
>>>>>>> processes, or even on a pool of machines that aren’t necessarily the
>>>>>>> build machine?
>>>>>>> 
>>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> With regard to time, I'd rather have us spend a bit more time on
>>>>>>>> maintenance than have somebody run into an error that could've been
>>>>>>>> caught with a test.
>>>>>>>> 
>>>>>>>> I mean, our publishing pipeline for Scala GPU has been broken for
>>>>>>>> quite some time now, but nobody noticed that. Basically, my stance
>>>>>>>> on that matter is that as soon as something is not blocking, you
>>>>>>>> can also just deactivate it, since you don't have a forcing
>>>>>>>> function in an open source project. People will rarely come back
>>>>>>>> and fix the errors of some nightly test that they introduced.
>>>>>>>> 
>>>>>>>> -Marco
>>>>>>>> 
>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed., 14 Aug. 2019, 21:59:
>>>>>>>> 
>>>>>>>>> If a language binding test is failing for an unimportant reason,
>>>>>>>>> then it is too brittle and needs to be fixed (we have fixed some
>>>>>>>>> of these with the Clojure package [1]).
>>>>>>>>> But in general, if we think of the MXNet project as one project
>>>>>>>>> spanning all the language bindings, then we want to know if some
>>>>>>>>> fundamental code change is going to break a downstream package.
>>>>>>>>> I can't speak for all the high-level package binding maintainers,
>>>>>>>>> but I'm always happy to pitch in to provide code fixes to help
>>>>>>>>> the base PR get green.
>>>>>>>>> 
>>>>>>>>> The time cost to maintain such a large CI project obviously needs
>>>>>>>>> to be considered as well.
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> From what I have seen, Clojure is 15 minutes, which I think is
>>>>>>>>>> reasonable. The only question is that when a binding such as R,
>>>>>>>>>> Perl or Clojure fails, some devs are a bit confused about how to
>>>>>>>>>> fix them, since they are not familiar with the testing tools and
>>>>>>>>>> the language.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Great idea Marco! Anything that you think would be valuable to
>>>>>>>>>>> share would be good. The duration of each node in the test
>>>>>>>>>>> stage sounds like a good start.
>>>>>>>>>>> 
>>>>>>>>>>> - Carin
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> we record a bunch of metrics about run statistics (down to the
>>>>>>>>>>>> duration of every individual step). If you tell me which ones
>>>>>>>>>>>> you're particularly interested in (probably the total duration
>>>>>>>>>>>> of each node in the test stage), I'm happy to provide them.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dimensions are (in hierarchical order):
>>>>>>>>>>>> - job
>>>>>>>>>>>> - branch
>>>>>>>>>>>> - stage
>>>>>>>>>>>> - node
>>>>>>>>>>>> - step
>>>>>>>>>>>> 
>>>>>>>>>>>> Unfortunately, I don't have the possibility to export them,
>>>>>>>>>>>> since we store them in CloudWatch Metrics, which afaik doesn't
>>>>>>>>>>>> offer raw exports.
>>>>>>>>>>>> 
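>>>>>>>>>>>> Aggregated statistics can still be queried through the API,
>>>>>>>>>>>> though; a minimal sketch (namespace, metric and dimension
>>>>>>>>>>>> names below are placeholders, not our actual schema):
>>>>>>>>>>>> 
>>>>>>>>>>>>     # query the average duration of one job's test stage
>>>>>>>>>>>>     # over the last week
>>>>>>>>>>>>     from datetime import datetime, timedelta
>>>>>>>>>>>>     import boto3
>>>>>>>>>>>> 
>>>>>>>>>>>>     cw = boto3.client("cloudwatch")
>>>>>>>>>>>>     stats = cw.get_metric_statistics(
>>>>>>>>>>>>         Namespace="CI",              # placeholder namespace
>>>>>>>>>>>>         MetricName="StageDuration",  # placeholder metric
>>>>>>>>>>>>         Dimensions=[
>>>>>>>>>>>>             {"Name": "Job", "Value": "mxnet-validation"},
>>>>>>>>>>>>             {"Name": "Stage", "Value": "test"},
>>>>>>>>>>>>         ],
>>>>>>>>>>>>         StartTime=datetime.utcnow() - timedelta(days=7),
>>>>>>>>>>>>         EndTime=datetime.utcnow(),
>>>>>>>>>>>>         Period=86400,                # one datapoint per day
>>>>>>>>>>>>         Statistics=["Average", "Maximum"],
>>>>>>>>>>>>     )
>>>>>>>>>>>>     for point in sorted(stats["Datapoints"],
>>>>>>>>>>>>                         key=lambda p: p["Timestamp"]):
>>>>>>>>>>>>         print(point["Timestamp"], point["Average"])
>>>>>>>>>>>> 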
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Marco
>>>>>>>>>>>> 
>>>>>>>>>>>> Carin Meier <carinme...@gmail.com> wrote on Wed., 14 Aug. 2019, 19:43:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I would prefer to keep the language bindings in the PR
>>>>>>>>>>>>> process. Perhaps we could do some analytics to see how much
>>>>>>>>>>>>> each of the language bindings is contributing to overall run
>>>>>>>>>>>>> time. If we have some metrics on that, maybe we can come up
>>>>>>>>>>>>> with a guideline of how much time each should take. Another
>>>>>>>>>>>>> possibility is to leverage the parallel builds more.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Carin.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That's a good point. All things considered, would your
>>>>>>>>>>>>>> preference be to keep the Clojure tests as part of the PR
>>>>>>>>>>>>>> process or in Nightly?
>>>>>>>>>>>>>> Some options are having notifications here or in Slack. But
>>>>>>>>>>>>>> if we think breakages would go unnoticed, maybe it is not a
>>>>>>>>>>>>>> good idea to fully remove bindings from the PR process, and
>>>>>>>>>>>>>> we should just streamline the process instead.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <carinme...@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I think we
>>>>>>>>>>>>>>> need to figure out how the community can get proper
>>>>>>>>>>>>>>> notifications of failure and success on those nightly runs.
>>>>>>>>>>>>>>> Otherwise, I think that breakages would go unnoticed.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Carin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.li...@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I propose the
>>>>>>>>>>>>>>>> following action items to remedy the situation and
>>>>>>>>>>>>>>>> accelerate turnaround times in CI, reduce cost and
>>>>>>>>>>>>>>>> complexity, and lower the probability of failures blocking
>>>>>>>>>>>>>>>> PRs and frustrating developers:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> * Upgrade Windows Visual Studio from VS 2015 to VS 2017.
>>>>>>>>>>>>>>>> The build_windows.py infrastructure should easily work
>>>>>>>>>>>>>>>> with the new version. Currently some PRs are blocked by
>>>>>>>>>>>>>>>> this: https://github.com/apache/incubator-mxnet/issues/13958
>>>>>>>>>>>>>>>> * Move Gluon Model Zoo tests to nightly. Tracked at
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/issues/15295
>>>>>>>>>>>>>>>> * Move non-Python binding tests to nightly. If a commit
>>>>>>>>>>>>>>>> touches other bindings, the reviewer should ask for a full
>>>>>>>>>>>>>>>> run, which can be done locally, use the label bot to
>>>>>>>>>>>>>>>> trigger a full CI build, or defer to nightly.
>>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance tests on
>>>>>>>>>>>>>>>> small models that are run on CI and can be echoed by the
>>>>>>>>>>>>>>>> label bot as a comment for PRs.
>>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s:
>>>>>>>>>>>>>>>> streamline them, or move them to nightly if that can't be
>>>>>>>>>>>>>>>> done (see the sketch after this list).
>>>>>>>>>>>>>>>> * Open source the remaining CI infrastructure scripts so
>>>>>>>>>>>>>>>> the community can contribute.
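>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> As a first pass at the 10-20s item, a sketch that times
>>>>>>>>>>>>>>>> each test file separately and lists the slowest ones (the
>>>>>>>>>>>>>>>> directory below is a placeholder):
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>     # run every test file on its own and report the
>>>>>>>>>>>>>>>>     # slowest ones; per-test timing would need a plugin
>>>>>>>>>>>>>>>>     import glob
>>>>>>>>>>>>>>>>     import subprocess
>>>>>>>>>>>>>>>>     import time
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>     timings = []
>>>>>>>>>>>>>>>>     for f in sorted(glob.glob("tests/python/unittest/test_*.py")):
>>>>>>>>>>>>>>>>         start = time.time()
>>>>>>>>>>>>>>>>         subprocess.run(["nosetests", f])
>>>>>>>>>>>>>>>>         timings.append((time.time() - start, f))
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>     for duration, name in sorted(timings, reverse=True)[:20]:
>>>>>>>>>>>>>>>>         print("%7.1fs  %s" % (duration, name))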
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think our goal should be a turnaround under 30 min.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I would also like to touch base with the community on the
>>>>>>>>>>>>>>>> fact that some PRs are not being followed up on by
>>>>>>>>>>>>>>>> committers asking for changes. For example, this PR is
>>>>>>>>>>>>>>>> important and has been hanging for a long time:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/15051
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is another, less important but more trivial to review:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> https://github.com/apache/incubator-mxnet/pull/14940
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think committers requesting changes and not following up
>>>>>>>>>>>>>>>> in reasonable time is not healthy for the project. I
>>>>>>>>>>>>>>>> suggest configuring GitHub notifications for a good SNR
>>>>>>>>>>>>>>>> and following up.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>>>> 
