Re: CI and PRs

2019-08-15 Thread Leonard Lausen
To parallelize across machines: For GluonNLP we started submitting test
jobs to AWS Batch. Just adding a for-loop over the units in the
Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
Jenkins just waits for all jobs to finish and retrieves their status.
This works since AWS Batch added GPU support this April [3].

For MXNet, naively parallelizing over the files defining the test cases
that are in the longest running Pipeline stage may already help?

[1]: 
https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
[2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
[3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
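
For illustration, a rough Python/boto3 sketch of that pattern -- shard the
test files, submit one Batch job per shard, and poll until every job
finishes. This is not the GluonNLP code (that lives in the linked
Jenkinsfile and submit-job.py); the queue name, job definition and paths
are made-up placeholders:

    import glob
    import time
    import boto3

    batch = boto3.client('batch')
    test_files = sorted(glob.glob('tests/python/unittest/test_*.py'))
    shards = [test_files[i::4] for i in range(4)]   # split into 4 parallel jobs

    job_ids = []
    for n, shard in enumerate(shards):
        resp = batch.submit_job(
            jobName='mxnet-unittest-shard-{}'.format(n),
            jobQueue='ci-gpu-queue',                  # assumed queue name
            jobDefinition='mxnet-ci-gpu-test',        # assumed job definition
            containerOverrides={'command': ['nosetests', '-v'] + shard})
        job_ids.append(resp['jobId'])

    # The Jenkins side then just polls until every job reaches a final state.
    while True:
        jobs = batch.describe_jobs(jobs=job_ids)['jobs']
        if all(j['status'] in ('SUCCEEDED', 'FAILED') for j in jobs):
            break
        time.sleep(30)

    failed = [j['jobName'] for j in jobs if j['status'] == 'FAILED']
    if failed:
        raise SystemExit('failed shards: ' + ', '.join(failed))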

Marco de Abreu  writes:

> The first start wrt parallelization could certainly be start adding
> parallel test execution in nosetests.
>
> -Marco
>
> Aaron Markham  schrieb am Do., 15. Aug. 2019,
> 05:39:
>
>> The PRs Thomas and I are working on for the new docs and website share the
>> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>
>> On Wed, Aug 14, 2019, 18:16 Chris Olivier  wrote:
>>
>> > I see it done daily now, and while I can’t share all the details, it’s
>> not
>> > an incredibly complex thing, and involves not much more than nfs/efs
>> > sharing and remote ssh commands.  All it takes is a little ingenuity and
>> > some imagination.
>> >
>> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> pedro.larroy.li...@gmail.com
>> > >
>> > wrote:
>> >
>> > > Sounds good in theory. I think there are complex details with regards
>> of
>> > > resource sharing during parallel execution. Still I think both ways can
>> > be
>> > > explored. I think some tests run for unreasonably long times for what
>> > they
>> > > are doing. We already scale parts of the pipeline horizontally across
>> > > workers.
>> > >
>> > >
>> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Rather than remove tests (which doesn’t scale as a solution), why not
>> > > scale
>> > > > them horizontally so that they finish more quickly? Across processes
>> or
>> > > > even on a pool of machines that aren’t necessarily the build machine?
>> > > >
>> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > marco.g.ab...@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > With regards to time I rather prefer us spending a bit more time on
>> > > > > maintenance than somebody running into an error that could've been
>> > > caught
>> > > > > with a test.
>> > > > >
>> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
>> > quite
>> > > > > some time now, but nobody noticed that. Basically my stance on that
>> > > > matter
>> > > > > is that as soon as something is not blocking, you can also just
>> > > > deactivate
>> > > > > it since you don't have a forcing function in an open source
>> project.
>> > > > > People will rarely come back and fix the errors of some nightly
>> test
>> > > that
>> > > > > they introduced.
>> > > > >
>> > > > > -Marco
>> > > > >
>> > > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
>> > > 21:59:
>> > > > >
>> > > > > > If a language binding test is failing for a not important reason,
>> > > then
>> > > > it
>> > > > > > is too brittle and needs to be fixed (we have fixed some of these
>> > > with
>> > > > > the
>> > > > > > Clojure package [1]).
>> > > > > > But in general, if we thinking of the MXNet project as one
>> project
>> > > that
>> > > > > is
>> > > > > > across all the language bindings, then we want to know if some
>> > > > > fundamental
>> > > > > > code change is going to break a downstream package.
>> > > > > > I can't speak for all the high level package binding maintainers,
>> > but
>> > > > I'm
>> > > > > > always happy to pitch in to provide code fixes to help the base
>> PR
>> > > get
>> > > > > > green.
>> > > > > >
>> > > > > > The time costs to maintain such a large CI project obviously
>> needs
>> > to
>> > > > be
>> > > > > > considered as well.
>> > > > > >
>> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > > > > pedro.larroy.li...@gmail.com
>> > > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
>> > > > > reasonable.
>> > > > > > > The only question is that when a binding such as R, Perl or
>> > Clojure
>> > > > > > fails,
>> > > > > > > some devs are a bit confused about how to fix them since they
>> are
>> > > not
>> > > > > > > familiar with the testing tools and the language.
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > carinme...@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Great idea Marco! Anything that you think would be valuable
>> to
>> > > > share
>> > > > > > > would
>> > > > > > > > be good. The duration of each node in the test stage sounds like a good start.

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Thanks Leonard. Naively dividing by test files would certainly be an easy
and doable way before going into proper nose parallelization. Great idea!

Scalability in terms of nodes is not an issue. Our system can handle at
least 600 slaves (didn't want to go higher for obvious reasons). But I
think we don't even have to go that far because most of the time, our
machines are heavily underutilized due to the single-threaded nature of
most tests. Thus, parallel test execution on the same machine would already
speed up the process considerably.
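
As a minimal sketch of that in-machine parallelism (assuming nose's stock
multiprocess plugin and an illustrative test path -- GPU-exclusive or
otherwise resource-hungry tests would still need to be excluded or
serialized):

    import multiprocessing
    import subprocess
    import sys

    nprocs = multiprocessing.cpu_count()
    cmd = [
        'nosetests', '--verbose',
        '--processes={}'.format(nprocs),   # nose multiprocess plugin: one worker per core
        '--process-timeout=1800',          # seconds to wait on each worker's results
        'tests/python/unittest',
    ]
    sys.exit(subprocess.call(cmd))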

-Marco

P.S. The structure of the Jenkinsfiles seems pretty familiar :P I am glad
my approach is considered helpful :)

Leonard Lausen  schrieb am Do., 15. Aug. 2019, 18:59:

> To parallelize across machines: For GluonNLP we started submitting test
> jobs to AWS Batch. Just adding a for-loop over the units in the
> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> Jenkins just waits for all jobs to finish and retrieves their status.
> This works since AWS Batch added GPU support this April [3].
>
> For MXNet, naively parallelizing over the files defining the test cases
> that are in the longest running Pipeline stage may already help?
>
> [1]:
> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>
> Marco de Abreu  writes:
>
> > The first start wrt parallelization could certainly be start adding
> > parallel test execution in nosetests.
> >
> > -Marco
> >
> > Aaron Markham  schrieb am Do., 15. Aug. 2019,
> > 05:39:
> >
> >> The PRs Thomas and I are working on for the new docs and website share
> the
> >> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >>
> >> On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> wrote:
> >>
> >> > I see it done daily now, and while I can’t share all the details, it’s
> >> not
> >> > an incredibly complex thing, and involves not much more than nfs/efs
> >> > sharing and remote ssh commands.  All it takes is a little ingenuity
> and
> >> > some imagination.
> >> >
> >> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >> pedro.larroy.li...@gmail.com
> >> > >
> >> > wrote:
> >> >
> >> > > Sounds good in theory. I think there are complex details with
> regards
> >> of
> >> > > resource sharing during parallel execution. Still I think both ways
> can
> >> > be
> >> > > explored. I think some tests run for unreasonably long times for
> what
> >> > they
> >> > > are doing. We already scale parts of the pipeline horizontally
> across
> >> > > workers.
> >> > >
> >> > >
> >> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> cjolivie...@apache.org>
> >> > > wrote:
> >> > >
> >> > > > +1
> >> > > >
> >> > > > Rather than remove tests (which doesn’t scale as a solution), why
> not
> >> > > scale
> >> > > > them horizontally so that they finish more quickly? Across
> processes
> >> or
> >> > > > even on a pool of machines that aren’t necessarily the build
> machine?
> >> > > >
> >> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >> > marco.g.ab...@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > With regards to time I rather prefer us spending a bit more
> time on
> >> > > > > maintenance than somebody running into an error that could've
> been
> >> > > caught
> >> > > > > with a test.
> >> > > > >
> >> > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> for
> >> > quite
> >> > > > > some time now, but nobody noticed that. Basically my stance on
> that
> >> > > > matter
> >> > > > > is that as soon as something is not blocking, you can also just
> >> > > > deactivate
> >> > > > > it since you don't have a forcing function in an open source
> >> project.
> >> > > > > People will rarely come back and fix the errors of some nightly
> >> test
> >> > > that
> >> > > > > they introduced.
> >> > > > >
> >> > > > > -Marco
> >> > > > >
> >> > > > > Carin Meier  schrieb am Mi., 14. Aug.
> 2019,
> >> > > 21:59:
> >> > > > >
> >> > > > > > If a language binding test is failing for a not important
> reason,
> >> > > then
> >> > > > it
> >> > > > > > is too brittle and needs to be fixed (we have fixed some of
> these
> >> > > with
> >> > > > > the
> >> > > > > > Clojure package [1]).
> >> > > > > > But in general, if we thinking of the MXNet project as one
> >> project
> >> > > that
> >> > > > > is
> >> > > > > > across all the language bindings, then we want to know if some
> >> > > > > fundamental
> >> > > > > > code change is going to break a downstream package.
> >> > > > > > I can't speak for all the high level package binding
> maintainers,
> >> > but
> >> > > > I'm
> >> > > > > > always happy to pitch in to provide code fixes to help the
> base
> >> PR
> >> > > get
> >> > > > > > green.
> >> > > > > >
>> > > > > > The time costs to maintain such a large CI project obviously needs to be considered as well.

Re: CI and PRs

2019-08-15 Thread Sheng Zha
The AWS Batch approach should also help with hardware utilization as machines 
are launched only when needed :)

-sz

> On Aug 15, 2019, at 9:11 AM, Marco de Abreu  wrote:
> 
> Thanks Leonard. Naively dividing by test files would certainly be an easy
> and doable way before going into to proper nose parallelization. Great idea!
> 
> Scalability in terms of nodes is not an issue. Our system can handle at
> least 600 slaves (didn't want to go higher for obvious reasons). But I
> think we don't even have to go that far because most of the time, our
> machines are heavily under utilized due to the single-threaded nature of
> most tests. Thus, parallel test execution on the same machine would already
> speed up the process by great lengths.
> 
> -Marco
> 
> P.S. the structure of the Jenkinsfiles seems pretty familiar :P i am glad
> my approach is considered helpful :)
> 
> Leonard Lausen  schrieb am Do., 15. Aug. 2019, 18:59:
> 
>> To parallelize across machines: For GluonNLP we started submitting test
>> jobs to AWS Batch. Just adding a for-loop over the units in the
>> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
>> Jenkins just waits for all jobs to finish and retrieves their status.
>> This works since AWS Batch added GPU support this April [3].
>> 
>> For MXNet, naively parallelizing over the files defining the test cases
>> that are in the longest running Pipeline stage may already help?
>> 
>> [1]:
>> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
>> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
>> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>> 
>> Marco de Abreu  writes:
>> 
>>> The first start wrt parallelization could certainly be start adding
>>> parallel test execution in nosetests.
>>> 
>>> -Marco
>>> 
>>> Aaron Markham  schrieb am Do., 15. Aug. 2019,
>>> 05:39:
>>> 
 The PRs Thomas and I are working on for the new docs and website share
>> the
 mxnet binary in the new CI pipelines we made. Speeds things up a lot.
 
 On Wed, Aug 14, 2019, 18:16 Chris Olivier 
>> wrote:
 
> I see it done daily now, and while I can’t share all the details, it’s
 not
> an incredibly complex thing, and involves not much more than nfs/efs
> sharing and remote ssh commands.  All it takes is a little ingenuity
>> and
> some imagination.
> 
> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
 pedro.larroy.li...@gmail.com
>> 
> wrote:
> 
>> Sounds good in theory. I think there are complex details with
>> regards
 of
>> resource sharing during parallel execution. Still I think both ways
>> can
> be
>> explored. I think some tests run for unreasonably long times for
>> what
> they
>> are doing. We already scale parts of the pipeline horizontally
>> across
>> workers.
>> 
>> 
>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
>> cjolivie...@apache.org>
>> wrote:
>> 
>>> +1
>>> 
>>> Rather than remove tests (which doesn’t scale as a solution), why
>> not
>> scale
>>> them horizontally so that they finish more quickly? Across
>> processes
 or
>>> even on a pool of machines that aren’t necessarily the build
>> machine?
>>> 
>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> marco.g.ab...@gmail.com
>>> 
>>> wrote:
>>> 
 With regards to time I rather prefer us spending a bit more
>> time on
 maintenance than somebody running into an error that could've
>> been
>> caught
 with a test.
 
 I mean, our Publishing pipeline for Scala GPU has been broken
>> for
> quite
 some time now, but nobody noticed that. Basically my stance on
>> that
>>> matter
 is that as soon as something is not blocking, you can also just
>>> deactivate
 it since you don't have a forcing function in an open source
 project.
 People will rarely come back and fix the errors of some nightly
 test
>> that
 they introduced.
 
 -Marco
 
 Carin Meier  schrieb am Mi., 14. Aug.
>> 2019,
>> 21:59:
 
> If a language binding test is failing for a not important
>> reason,
>> then
>>> it
> is too brittle and needs to be fixed (we have fixed some of
>> these
>> with
 the
> Clojure package [1]).
> But in general, if we thinking of the MXNet project as one
 project
>> that
 is
> across all the language bindings, then we want to know if some
 fundamental
> code change is going to break a downstream package.
> I can't speak for all the high level package binding
>> maintainers,
> but
>>> I'm
> always happy to pitch in to provide code fixes to help the
>> base
 PR
>> get
> green.
> 
> The time costs to maintain such a large CI project obviously needs to be considered as well.

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
No worries, auto scaling is taking care of that :)

-Marco

Sheng Zha  schrieb am Do., 15. Aug. 2019, 19:50:

> The AWS Batch approach should also help with hardware utilization as
> machines are launched only when needed :)
>
> -sz
>
> > On Aug 15, 2019, at 9:11 AM, Marco de Abreu 
> wrote:
> >
> > Thanks Leonard. Naively dividing by test files would certainly be an easy
> > and doable way before going into to proper nose parallelization. Great
> idea!
> >
> > Scalability in terms of nodes is not an issue. Our system can handle at
> > least 600 slaves (didn't want to go higher for obvious reasons). But I
> > think we don't even have to go that far because most of the time, our
> > machines are heavily under utilized due to the single-threaded nature of
> > most tests. Thus, parallel test execution on the same machine would
> already
> > speed up the process by great lengths.
> >
> > -Marco
> >
> > P.S. the structure of the Jenkinsfiles seems pretty familiar :P i am glad
> > my approach is considered helpful :)
> >
> > Leonard Lausen  schrieb am Do., 15. Aug. 2019,
> 18:59:
> >
> >> To parallelize across machines: For GluonNLP we started submitting test
> >> jobs to AWS Batch. Just adding a for-loop over the units in the
> >> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> >> Jenkins just waits for all jobs to finish and retrieves their status.
> >> This works since AWS Batch added GPU support this April [3].
> >>
> >> For MXNet, naively parallelizing over the files defining the test cases
> >> that are in the longest running Pipeline stage may already help?
> >>
> >> [1]:
> >>
> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> >> [2]:
> https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> >> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
> >>
> >> Marco de Abreu  writes:
> >>
> >>> The first start wrt parallelization could certainly be start adding
> >>> parallel test execution in nosetests.
> >>>
> >>> -Marco
> >>>
> >>> Aaron Markham  schrieb am Do., 15. Aug.
> 2019,
> >>> 05:39:
> >>>
>  The PRs Thomas and I are working on for the new docs and website share
> >> the
>  mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> 
>  On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> >> wrote:
> 
> > I see it done daily now, and while I can’t share all the details,
> it’s
>  not
> > an incredibly complex thing, and involves not much more than nfs/efs
> > sharing and remote ssh commands.  All it takes is a little ingenuity
> >> and
> > some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>  pedro.larroy.li...@gmail.com
> >>
> > wrote:
> >
> >> Sounds good in theory. I think there are complex details with
> >> regards
>  of
> >> resource sharing during parallel execution. Still I think both ways
> >> can
> > be
> >> explored. I think some tests run for unreasonably long times for
> >> what
> > they
> >> are doing. We already scale parts of the pipeline horizontally
> >> across
> >> workers.
> >>
> >>
> >> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> >> cjolivie...@apache.org>
> >> wrote:
> >>
> >>> +1
> >>>
> >>> Rather than remove tests (which doesn’t scale as a solution), why
> >> not
> >> scale
> >>> them horizontally so that they finish more quickly? Across
> >> processes
>  or
> >>> even on a pool of machines that aren’t necessarily the build
> >> machine?
> >>>
> >>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > marco.g.ab...@gmail.com
> >>>
> >>> wrote:
> >>>
>  With regards to time I rather prefer us spending a bit more
> >> time on
>  maintenance than somebody running into an error that could've
> >> been
> >> caught
>  with a test.
> 
>  I mean, our Publishing pipeline for Scala GPU has been broken
> >> for
> > quite
>  some time now, but nobody noticed that. Basically my stance on
> >> that
> >>> matter
>  is that as soon as something is not blocking, you can also just
> >>> deactivate
>  it since you don't have a forcing function in an open source
>  project.
>  People will rarely come back and fix the errors of some nightly
>  test
> >> that
>  they introduced.
> 
>  -Marco
> 
>  Carin Meier  schrieb am Mi., 14. Aug.
> >> 2019,
> >> 21:59:
> 
> > If a language binding test is failing for a not important
> >> reason,
> >> then
> >>> it
> > is too brittle and needs to be fixed (we have fixed some of
> >> these
> >> with
>  the
> > Clojure package [1]).
> > But in general, if we thinking of the MXNet project as one
>  project
> >> that
>  is
> across all the language bindings, then we want to know if some fundamental code change is going to break a downstream package.

MXNet CI repository

2019-08-15 Thread Marco de Abreu
Hello,

I'd like to propose a repository where CI infrastructure code can be
stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that name
or has a better idea?

Best regards
Marco


Re: MXNet CI repository

2019-08-15 Thread Chaitanya Bapat
+1
LGTM!

On Thu, 15 Aug 2019 at 11:01, Marco de Abreu  wrote:

> Hello,
>
> I'd like to propose a repository where CI infrastructure code can be
> stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that name
> or has a better idea?
>
> Best regards
> Marco
>


-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*




Re: MXNet CI repository

2019-08-15 Thread Carin Meier
+1

On Thu, Aug 15, 2019 at 2:37 PM Chaitanya Bapat 
wrote:

> +1
> LGTM!
>
> On Thu, 15 Aug 2019 at 11:01, Marco de Abreu 
> wrote:
>
> > Hello,
> >
> > I'd like to propose a repository where CI infrastructure code can be
> > stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that
> name
> > or has a better idea?
> >
> > Best regards
> > Marco
> >
>
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>
>


Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
I'll start a different thread about the website. Sure, there's a lot
of overlap with CI. I learned a lot in the last few weeks having to
iterate on 7 different docs packages and trying to streamline the
build process in CI.

Here are my notes:

* Stash operations vs. archiving - recommendations in the docs suggest
that large artifacts should be archived; stash is super slow; archived
artifacts seem to be faster and can be used between pipelines. This
is helpful for the MXNet binary and for the Scala package, both of
which are used by various other docs packages. However, there's an
implication with the master server. Archived artifacts are stored
there, so if the pipeline is related to PR validation, this would be
unwieldy. If related to publishing final artifacts for specific
versions, well, that's probably ok.

* It seems that efficiency in development and testing can be gained by
checkpointing the docker containers after the dependencies are
installed. I can't stress enough how much time is lost while watching
`apt-get update` run for the millionth time when testing new CI
routines. It sort of makes me crazy(er).

* A version/branch parameter would be useful for the Jenkins pipelines
for generating docs artifacts from different branches.

* Publishing scripts seem to need a security refactor, or we don't
bother offering stand-alone access to them; running local versus on
Jenkins.

* I don't see any documentation on the S3 publishing steps and how to test this.

* After breaking out each docs package in its own pipeline, I see
opportunities to use the GitHub API to check the PR payload and be
selective about what tests to run.
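
A rough sketch of that last idea (not existing CI code; the repo slug is
the real one, but the path-to-package mapping is made up, and a real job
would authenticate to avoid GitHub rate limits):

    import requests

    def changed_files(pr_number):
        # Hypothetical helper: list the files touched by a PR via the GitHub API.
        url = ('https://api.github.com/repos/apache/incubator-mxnet/pulls/'
               '{}/files'.format(pr_number))
        files, page = [], 1
        while True:
            resp = requests.get(url, params={'page': page, 'per_page': 100})
            resp.raise_for_status()
            chunk = resp.json()
            if not chunk:
                return files
            files += [f['filename'] for f in chunk]
            page += 1

    def docs_packages_to_build(pr_number):
        # Assumed mapping from path prefixes to docs packages -- adjust to the
        # real layout. Returns the set of docs pipelines worth running.
        prefixes = {'docs/': 'jekyll', 'julia/docs/': 'julia', 'R-package/': 'r'}
        touched = changed_files(pr_number)
        return {pkg for prefix, pkg in prefixes.items()
                if any(f.startswith(prefix) for f in touched)}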


On Wed, Aug 14, 2019 at 10:03 PM Zhao, Patric  wrote:
>
> Hi Aaron,
>
> Recently, we are working on improving the documents of CPU backend based on 
> the current website.
>
> I saw there're several PRs to update the new website and it's really great.
>
> Thus, I'd like to know when the new website will be online.
> If it's very near, we will switch our works to the new website.
>
> Thanks,
>
> --Patric
>
>
> > -Original Message-
> > From: Aaron Markham 
> > Sent: Thursday, August 15, 2019 11:40 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: CI and PRs
> >
> > The PRs Thomas and I are working on for the new docs and website share
> > the mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >
> > On Wed, Aug 14, 2019, 18:16 Chris Olivier  wrote:
> >
> > > I see it done daily now, and while I can’t share all the details, it’s
> > > not an incredibly complex thing, and involves not much more than
> > > nfs/efs sharing and remote ssh commands.  All it takes is a little
> > > ingenuity and some imagination.
> > >
> > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy
> > >  > > >
> > > wrote:
> > >
> > > > Sounds good in theory. I think there are complex details with
> > > > regards of resource sharing during parallel execution. Still I think
> > > > both ways can
> > > be
> > > > explored. I think some tests run for unreasonably long times for
> > > > what
> > > they
> > > > are doing. We already scale parts of the pipeline horizontally
> > > > across workers.
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier
> > > > 
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > > > > not
> > > > scale
> > > > > them horizontally so that they finish more quickly? Across
> > > > > processes or even on a pool of machines that aren’t necessarily the
> > build machine?
> > > > >
> > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > marco.g.ab...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > With regards to time I rather prefer us spending a bit more time
> > > > > > on maintenance than somebody running into an error that could've
> > > > > > been
> > > > caught
> > > > > > with a test.
> > > > > >
> > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> > > > > > for
> > > quite
> > > > > > some time now, but nobody noticed that. Basically my stance on
> > > > > > that
> > > > > matter
> > > > > > is that as soon as something is not blocking, you can also just
> > > > > deactivate
> > > > > > it since you don't have a forcing function in an open source 
> > > > > > project.
> > > > > > People will rarely come back and fix the errors of some nightly
> > > > > > test
> > > > that
> > > > > > they introduced.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > Carin Meier  schrieb am Mi., 14. Aug.
> > > > > > 2019,
> > > > 21:59:
> > > > > >
> > > > > > > If a language binding test is failing for a not important
> > > > > > > reason,
> > > > then
> > > > > it
> > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > > > > > > these
> > > > with
> > > > > > the
> > > > > > > Clojure package [1]).
> > > > > > > But in general, if we thinking of the MXNet project as one
> > > > > > > project
> > > > that
> > > > > > is

Another Gluon Brand

2019-08-15 Thread Carin Meier
Gluon came up in one of my feeds in the context https://gluonhq.com/ of
mobile solutions.

Thought I would bring it to the attention of the group in case it's
relevant to branding discussions.

- Carin


Re: Another Gluon Brand

2019-08-15 Thread Aaron Markham
Ouch. Someone either didn't file a trademark, or do a trademark
search. Or they don't care because there's no trademark. I'd forward
this to Apache legal.
+ Hen for advice.

On Thu, Aug 15, 2019 at 11:54 AM Carin Meier  wrote:
>
> Gluon came up in one of my feeds in the context https://gluonhq.com/ of
> mobile solutions.
>
> Thought I would bring it to the attention of the group in case it's
> relevant to branding discussions.
>
> - Carin


Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
Hi,

thanks a lot for these great notes! I'm happy to give my comments about
them :)

* Archiving is *very VERY* bad for the CI master's performance. It floods the
disk with data since archiving persists the data. We are now at the point
where we technically can't extend the volume any further (we exceeded the
4TB limit and had to delete old runs). Thus, stashing is the only option
that's not harmful to the system's performance.

* Yeah, agree. One way is to build an image from a Dockerfile, push it to your
own Docker Hub account, and then in the MXNet Dockerfile just use "FROM
yourdockerhub:blabla".

* We support the GitHub Multi-Branch Pipeline and basically use this across
all jobs. So adhering to that system will result in the git repository
within the workspace being scoped to the correct branch. As a rule of thumb
it's basically a red flag as soon as you call anything with regards to git
(e.g. checking out a different branch, creating a commit, merging another
branch, etc) within your payload. Happy to help if you would like to have
that elaborated.

* Could you elaborate on "Publishing scripts seem to need a security
refactor, or we don't bother offering stand-alone access to them; running
local versus on Jenkins."? I don't really understand what you mean here.

* Basically it's an S3 bucket with a TTL of 30 days that our CI slaves have
permission to push to. We basically just upload the entire folder that is
being created (rough sketch below). Is there anything specifically you're looking for?

* That's awesome!
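
For that S3 step, a minimal sketch of "upload the entire folder" with boto3
-- not the actual publish script; the bucket name and prefix are
placeholders, and the 30-day expiry would be a lifecycle rule on the bucket
rather than anything in the code:

    import os
    import boto3

    s3 = boto3.client('s3')

    def upload_folder(local_dir, bucket, prefix):
        # Walk the build output and push every file under a per-PR prefix.
        for root, _dirs, files in os.walk(local_dir):
            for name in files:
                path = os.path.join(root, name)
                key = prefix + '/' + os.path.relpath(path, local_dir)
                s3.upload_file(path, bucket, key)

    upload_folder('docs/_build/html', 'mxnet-ci-doc-artifacts', 'pr-12345')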

Best regards,
Marco

On Thu, Aug 15, 2019 at 8:52 PM Aaron Markham 
wrote:

> I'll start a different thread about the website. Sure, there's a lot
> of overlap with CI. I learned a lot in the last few weeks having to
> iterate on 7 different docs packages and trying to streamline the
> build process in CI.
>
> Here are my notes:
>
> * Stash operations vs. archiving - recommendations in the docs suggest
> that large artifacts should be archived; stash is super slow; archived
> artifacts seems to be faster and can be used between pipelines. This
> is helpful for the MXNet binary and for the Scala package, both of
> which are used by various other docs packages. However, there's an
> implication with the master server. Archived artifacts are stored
> there, so if the pipeline is related to PR validation, this would be
> unwieldy. If related to publishing final artifacts for specific
> versions, well, that's probably ok.
>
> * It seems that efficiency in development and testing can be gained by
> checkpointing the docker containers after the dependencies are
> installed. I can't stress how much time is lost while watching
> `apt-get update` run for the millionth time when testing new CI
> routines. It sort of makes me crazy(er).
>
> * A version/branch parameter would be useful for the Jenkins pipelines
> for generating docs artifacts from different branches.
>
> * Publishing scripts seem to need a security refactor, or we don't
> bother offering stand-alone access to them; running local versus on
> Jenkins.
>
> * I don't see any documentation on the S3 publishing steps and how to test
> this.
>
> * After breaking out each docs package in its own pipeline, I see
> opportunities to use the GitHub API to check the PR payload and be
> selective about what tests to run.
>
>
> On Wed, Aug 14, 2019 at 10:03 PM Zhao, Patric 
> wrote:
> >
> > Hi Aaron,
> >
> > Recently, we are working on improving the documents of CPU backend based
> on the current website.
> >
> > I saw there're several PRs to update the new website and it's really
> great.
> >
> > Thus, I'd like to know when the new website will be online.
> > If it's very near, we will switch our works to the new website.
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -Original Message-
> > > From: Aaron Markham 
> > > Sent: Thursday, August 15, 2019 11:40 AM
> > > To: dev@mxnet.incubator.apache.org
> > > Subject: Re: CI and PRs
> > >
> > > The PRs Thomas and I are working on for the new docs and website share
> > > the mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > >
> > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> wrote:
> > >
> > > > I see it done daily now, and while I can’t share all the details,
> it’s
> > > > not an incredibly complex thing, and involves not much more than
> > > > nfs/efs sharing and remote ssh commands.  All it takes is a little
> > > > ingenuity and some imagination.
> > > >
> > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy
> > > >  > > > >
> > > > wrote:
> > > >
> > > > > Sounds good in theory. I think there are complex details with
> > > > > regards of resource sharing during parallel execution. Still I
> think
> > > > > both ways can
> > > > be
> > > > > explored. I think some tests run for unreasonably long times for
> > > > > what
> > > > they
> > > > > are doing. We already scale parts of the pipeline horizontally
> > > > > across workers.
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier
> >

Re: Another Gluon Brand

2019-08-15 Thread Marco de Abreu
Thanks for bringing it up, Carin!

I just checked the DNS record:
-
Domain Name: gluonhq.com
Registry Domain ID: 1891214953_DOMAIN_COM-VRSN
Registrar WHOIS Server: whois.godaddy.com
Registrar URL: http://www.godaddy.com
Updated Date: 2018-12-20T14:57:10Z
Creation Date: 2014-12-19T08:48:26Z
Registrar Registration Expiration Date: 2019-12-19T08:48:26Z
Registrar: GoDaddy.com, LLC
---

I also checked with the wayback-machine. The first record is on 21st of
March 2015.

-Marco

On Thu, Aug 15, 2019 at 8:59 PM Aaron Markham 
wrote:

> Ouch. Someone either didn't file a trademark, or do a trademark
> search. Or they don't care because there's no trademark. I'd forward
> this to Apache legal.
> + Hen for advice.
>
> On Thu, Aug 15, 2019 at 11:54 AM Carin Meier  wrote:
> >
> > Gluon came up in one of my feeds in the context https://gluonhq.com/ of
> > mobile solutions.
> >
> > Thought I would bring it to the attention of the group in case it's
> > relevant to branding discussions.
> >
> > - Carin
>


Re: new website (RE: CI and PRs)

2019-08-15 Thread Aaron Markham
For stash vs archive, maybe we need to look into attached storage
options and combine some deployments, so that artifacts get stored in
an accessible way. I notice sometimes the web portal for Jenkins
becomes unresponsive. If disk space is causing this, we need to
address it.

For the Dockerfile situation, build.py is hardcoded to use mxnet-ci
and tags that start with "build" in the name, like
mxnet-ci:build.Dockerfile.. This would need to be changed to
support developer supplied alternatives. While this should be an
option, I really do think that when we do a release, we should create
a stable Docker image that has the deps all pre-installed, and then go
with that until the next release.
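
For illustration only (this is not the real ci/build.py), the kind of
option that could make the image name and tag prefix developer-supplied
instead of hardcoded:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--registry', default='mxnet-ci',
                        help='Docker Hub user/org to pull CI images from')
    parser.add_argument('--tag-prefix', default='build.',
                        help='prefix used for Dockerfile-derived tags')
    args = parser.parse_args()

    def image_name(platform):
        # e.g. mxnet-ci/build.ubuntu_cpu with the defaults above
        return '{}/{}{}'.format(args.registry, args.tag_prefix, platform)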

For the version thing, I want to have the option to trigger the docs
and/or website build using a specified branch. I accomplish this now
by using the settings.ini file, but in the new flow, the docs builds
are orchestrated with Docker/CI and not by Sphinx. This makes it
easier to manage, but now I lost my versioning capability where I
could build any one or all versions of all docs. I'll downplay this
though. We really don't need to do that much anymore with how the new
site is constructed. This is more of a wish list item.
For S3, I get access denied when running tests on CI dev. We need the
secrets/config ported over so I don't have to run tests on prod. Plus,
it would be good to note in the Wiki what account has the bucket, how
to ask for access, and so forth.

Related to S3 and apache-site publishing - I see Maven has some
configuration for secrets in Jenkins, but we don't have this for
apache-site or S3. A consistent way to manage this would be good.
Publishing the website uses username/password that are added to
jenkins. When I tried to break out some of the logic to SCM, jenkins
would blast out my fork username and do other fun security stuff that
broke the scripts. Obviously, I wasn't doing it the way it wants me to
do it.

Cheers,
Aaron

On Thu, Aug 15, 2019 at 12:08 PM Marco de Abreu  wrote:
>
> Hi,
>
> thanks a lot for these great notes! I'm happy to give my comments about
> them :)
>
> * Archiving is *very VERY* bad for the CI master performance. It floods the
> disk with data since archiving persists the data. We are now at the point
> where we technically can't extend the volume any further (we exceeded the
> 4TB limit and had to delete old runs). Thus, stashing is the only option
> that's not harmful to the systems performance.
>
> * Yeah, agree. One way is to build a Dockerfile, push it to your own
> Dockerhub account and then in the MXNet DOckerfile just make "FROM
> yourdockerhub:blabla".
>
> * We support the GitHub Multi-Branch Pipeline and basically use this across
> all jobs. So adhering to that system will result in the git repository
> within the workspace being scoped to the correct branch. As a rule of thumb
> it's basically a red flag as soon as you call anything with regards to git
> (e.g. checking out a different branch, creating a commit, merging another
> branch, etc) within your payload. Happy to help if you would like to have
> that elaborated.
>
> * Could you elaborate on "Publishing scripts seem to need a security
> refactor, or we don't bother offering stand-alone access to them; running
> local versus on Jenkins."? I don't really understand what you mean here.
>
> * Basically it's an s3 bucket with a TTL of 30 days that our CI slaves have
> permission to push to. We basically just upload the entire folder that is
> being created. Is there anything specifically you're looking for?
>
> * That's awesome!
>
> Best regards,
> Marco
>
> On Thu, Aug 15, 2019 at 8:52 PM Aaron Markham 
> wrote:
>
> > I'll start a different thread about the website. Sure, there's a lot
> > of overlap with CI. I learned a lot in the last few weeks having to
> > iterate on 7 different docs packages and trying to streamline the
> > build process in CI.
> >
> > Here are my notes:
> >
> > * Stash operations vs. archiving - recommendations in the docs suggest
> > that large artifacts should be archived; stash is super slow; archived
> > artifacts seems to be faster and can be used between pipelines. This
> > is helpful for the MXNet binary and for the Scala package, both of
> > which are used by various other docs packages. However, there's an
> > implication with the master server. Archived artifacts are stored
> > there, so if the pipeline is related to PR validation, this would be
> > unwieldy. If related to publishing final artifacts for specific
> > versions, well, that's probably ok.
> >
> > * It seems that efficiency in development and testing can be gained by
> > checkpointing the docker containers after the dependencies are
> > installed. I can't stress how much time is lost while watching
> > `apt-get update` run for the millionth time when testing new CI
> > routines. It sort of makes me crazy(er).
> >
> > * A version/branch parameter would be useful for the Jenkins pipelines
> > for generating docs artifacts from different branches.

Re: MXNet CI repository

2019-08-15 Thread Marco de Abreu
Repository has been created: https://github.com/apache/incubator-mxnet-ci

I will fill it soon.

-Marco

On Thu, Aug 15, 2019 at 8:43 PM Carin Meier  wrote:

> +1
>
> On Thu, Aug 15, 2019 at 2:37 PM Chaitanya Bapat 
> wrote:
>
> > +1
> > LGTM!
> >
> > On Thu, 15 Aug 2019 at 11:01, Marco de Abreu 
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to propose a repository where CI infrastructure code can be
> > > stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that
> > name
> > > or has a better idea?
> > >
> > > Best regards
> > > Marco
> > >
> >
> >
> > --
> > *Chaitanya Prakash Bapat*
> > *+1 (973) 953-6299*
> >
> >
>


Re: new website (RE: CI and PRs)

2019-08-15 Thread Marco de Abreu
Hi Aaron,

wrt the Dockerfile situation. What you are saying is right, but my proposal
is the following:

Instead of having a fat Dockerfile like

---
FROM ubuntu:16.04
RUN apt update
...
---

You instead build that fat Dockerfile locally on your laptop and push the
resulting image to "aaronmarkham/elephantmaster9000"

Then, you update the above Dockerfile to the following:

---
FROM aaronmarkham/elephantmaster9000
---

Now, it will only download the prebuilt image from your own repo. Of
course, this is just a temporary solution to allow fast iteration, but it
greatly speeds up your development. Once you're done with developing and
the PR is no longer WIP, you copy the fat Dockerfile content back and
remove the reference to your own repo.


Wrt the website. What's the issue with going to
http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/clang/job/mkldnn-v1.0/
(as
an example) and triggering the job? Jenkins will automatically make sure
that it checks out the appropriate branch etc. A job is always scoped to a
single git-commit. Thus, if you're trying to build multiple versions, you
trigger each job individually. If you need some kind of parent-job, you
kick off the parent-job that does its stuff, kicks off all the branch-jobs
and then continues with its flow to consume whatever the branch-jobs
created. I hope I described it in an understandable fashion.

The dev account doesn't have access to S3. It's purposefully restricted
since it's a non-production system where we don't audit access that closely
and thus try to reduce the blast radius by limiting slave permissions.
Sheng is owning the S3 bucket.

I'm not sure about the SCM stuff, but lets chat offline about it since it
seems to be a more complicated issue.

-Marco


On Thu, Aug 15, 2019 at 9:32 PM Aaron Markham 
wrote:

> For stash vs archive, maybe we need to look into attached storage
> options and combine some deployments, so that artifacts get stored in
> an accessible way. I notice sometimes the web portal for Jenkins
> becomes unresponsive. If disk space is causing this, we need to
> address it.
>
> For the Dockerfile situation, build.py is hardcoded to use mxnet-ci
> and tags that start with "build" in the name, like
> mxnet-ci:build.Dockerfile.. This would need to be changed to
> support developer supplied alternatives. While this should be an
> option, I really do think that when we do a release, we should create
> a stable Docker image that has the deps all pre-installed, and then go
> with that until the next release.
>
> For the version thing, I want to have the option to trigger the docs
> and/or website build using a specified branch. I accomplish this now
> by using the settings.ini file, but in the new flow, the docs builds
> are orchestrated with Docker/CI and not by Sphinx. This makes it
> easier to manage, but now I lost my versioning capability where I
> could build any one or all versions of all docs. I'll downplay this
> though. We really don't need to do that much anymore with how the new
> site is constructed. This is more of a wish list item.
> For S3, I get access denied when running tests on CI dev. We need the
> secrets/config ported over so I don't have to run tests on prod. Plus,
> it would be good to note in the Wiki what account has the bucket, how
> to ask for access, and so forth.
>
> Related to S3 and apache-site publishing - I see Maven has some
> configuration for secrets in Jenkins, but we don't have this for
> apache-site or S3. A consistent way to manage this would be good.
> Publishing the website uses username/password that are added to
> jenkins. When I tried to break out some of the logic to SCM, jenkins
> would blast out my fork username and do other fun security stuff that
> broke the scripts. Obviously, I wasn't doing it the way it wants me to
> do it.
>
> Cheers,
> Aaron
>
> On Thu, Aug 15, 2019 at 12:08 PM Marco de Abreu 
> wrote:
> >
> > Hi,
> >
> > thanks a lot for these great notes! I'm happy to give my comments about
> > them :)
> >
> > * Archiving is *very VERY* bad for the CI master performance. It floods
> the
> > disk with data since archiving persists the data. We are now at the point
> > where we technically can't extend the volume any further (we exceeded the
> > 4TB limit and had to delete old runs). Thus, stashing is the only option
> > that's not harmful to the systems performance.
> >
> > * Yeah, agree. One way is to build a Dockerfile, push it to your own
> > Dockerhub account and then in the MXNet DOckerfile just make "FROM
> > yourdockerhub:blabla".
> >
> > * We support the GitHub Multi-Branch Pipeline and basically use this
> across
> > all jobs. So adhering to that system will result in the git repository
> > within the workspace being scoped to the correct branch. As a rule of
> thumb
> > it's basically a red flag as soon as you call anything with regards to
> git
> > (e.g. checking out a different branch, creating a commit, merging another
> > branch, etc) within your payload. Happy to help if you would like to have that elaborated.

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Aaron. Why does it speed things up? What's the difference?

Pedro.

On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham 
wrote:

> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>
> On Wed, Aug 14, 2019, 18:16 Chris Olivier  wrote:
>
> > I see it done daily now, and while I can’t share all the details, it’s
> not
> > an incredibly complex thing, and involves not much more than nfs/efs
> > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> pedro.larroy.li...@gmail.com
> > >
> > wrote:
> >
> > > Sounds good in theory. I think there are complex details with regards
> of
> > > resource sharing during parallel execution. Still I think both ways can
> > be
> > > explored. I think some tests run for unreasonably long times for what
> > they
> > > are doing. We already scale parts of the pipeline horizontally across
> > > workers.
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > scale
> > > > them horizontally so that they finish more quickly? Across processes
> or
> > > > even on a pool of machines that aren’t necessarily the build machine?
> > > >
> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > marco.g.ab...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > With regards to time I rather prefer us spending a bit more time on
> > > > > maintenance than somebody running into an error that could've been
> > > caught
> > > > > with a test.
> > > > >
> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > quite
> > > > > some time now, but nobody noticed that. Basically my stance on that
> > > > matter
> > > > > is that as soon as something is not blocking, you can also just
> > > > deactivate
> > > > > it since you don't have a forcing function in an open source
> project.
> > > > > People will rarely come back and fix the errors of some nightly
> test
> > > that
> > > > > they introduced.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
> > > 21:59:
> > > > >
> > > > > > If a language binding test is failing for a not important reason,
> > > then
> > > > it
> > > > > > is too brittle and needs to be fixed (we have fixed some of these
> > > with
> > > > > the
> > > > > > Clojure package [1]).
> > > > > > But in general, if we thinking of the MXNet project as one
> project
> > > that
> > > > > is
> > > > > > across all the language bindings, then we want to know if some
> > > > > fundamental
> > > > > > code change is going to break a downstream package.
> > > > > > I can't speak for all the high level package binding maintainers,
> > but
> > > > I'm
> > > > > > always happy to pitch in to provide code fixes to help the base
> PR
> > > get
> > > > > > green.
> > > > > >
> > > > > > The time costs to maintain such a large CI project obviously
> needs
> > to
> > > > be
> > > > > > considered as well.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > reasonable.
> > > > > > > The only question is that when a binding such as R, Perl or
> > Clojure
> > > > > > fails,
> > > > > > > some devs are a bit confused about how to fix them since they
> are
> > > not
> > > > > > > familiar with the testing tools and the language.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > carinme...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Great idea Marco! Anything that you think would be valuable
> to
> > > > share
> > > > > > > would
> > > > > > > > be good. The duration of each node in the test stage sounds
> > like
> > > a
> > > > > good
> > > > > > > > start.
> > > > > > > >
> > > > > > > > - Carin
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > marco.g.ab...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > we record a bunch of metrics about run statistics (down to
> > the
> > > > > > duration
> > > > > > > > of
> > > > > > > > > every individual step). If you tell me which ones you're
> > > > > particularly
> > > > > > > > > interested in (probably total duration of each node in the
> > test
> > > > > > stage),
> > > > > > > > I'm
> > > > > > > > > happy to provide them.
> > > > > > > > >
> > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > - job
> > > > > > > > > - branch
> > > > > > > > > - stage
> > > > > > > > > - node
> > > > > > > > > - step
> > > > > > > > >
> > > > > > > > > Unfortunately I don't have the possibility to export them since we store them in CloudWatch Metrics which afaik doesn't offer raw exports.

Re: CI and PRs

2019-08-15 Thread Pedro Larroy
Hi Chris.
I suggest you send a PR to illustrate your proposal so we have a concrete
example to look into.
Pedro.

On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier  wrote:

> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex thing, and involves not much more than nfs/efs
> sharing and remote ssh commands.  All it takes is a little ingenuity and
> some imagination.
>
> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy  >
> wrote:
>
> > Sounds good in theory. I think there are complex details with regards of
> > resource sharing during parallel execution. Still I think both ways can
> be
> > explored. I think some tests run for unreasonably long times for what
> they
> > are doing. We already scale parts of the pipeline horizontally across
> > workers.
> >
> >
> > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier 
> > wrote:
> >
> > > +1
> > >
> > > Rather than remove tests (which doesn’t scale as a solution), why not
> > scale
> > > them horizontally so that they finish more quickly? Across processes or
> > > even on a pool of machines that aren’t necessarily the build machine?
> > >
> > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> marco.g.ab...@gmail.com
> > >
> > > wrote:
> > >
> > > > With regards to time I rather prefer us spending a bit more time on
> > > > maintenance than somebody running into an error that could've been
> > caught
> > > > with a test.
> > > >
> > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> quite
> > > > some time now, but nobody noticed that. Basically my stance on that
> > > matter
> > > > is that as soon as something is not blocking, you can also just
> > > deactivate
> > > > it since you don't have a forcing function in an open source project.
> > > > People will rarely come back and fix the errors of some nightly test
> > that
> > > > they introduced.
> > > >
> > > > -Marco
> > > >
> > > > Carin Meier  schrieb am Mi., 14. Aug. 2019,
> > 21:59:
> > > >
> > > > > If a language binding test is failing for a not important reason,
> > then
> > > it
> > > > > is too brittle and needs to be fixed (we have fixed some of these
> > with
> > > > the
> > > > > Clojure package [1]).
> > > > > But in general, if we thinking of the MXNet project as one project
> > that
> > > > is
> > > > > across all the language bindings, then we want to know if some
> > > > fundamental
> > > > > code change is going to break a downstream package.
> > > > > I can't speak for all the high level package binding maintainers,
> but
> > > I'm
> > > > > always happy to pitch in to provide code fixes to help the base PR
> > get
> > > > > green.
> > > > >
> > > > > The time costs to maintain such a large CI project obviously needs
> to
> > > be
> > > > > considered as well.
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > >
> > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > reasonable.
> > > > > > The only question is that when a binding such as R, Perl or
> Clojure
> > > > > fails,
> > > > > > some devs are a bit confused about how to fix them since they are
> > not
> > > > > > familiar with the testing tools and the language.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> carinme...@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Great idea Marco! Anything that you think would be valuable to
> > > share
> > > > > > would
> > > > > > > be good. The duration of each node in the test stage sounds
> like
> > a
> > > > good
> > > > > > > start.
> > > > > > >
> > > > > > > - Carin
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > marco.g.ab...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > we record a bunch of metrics about run statistics (down to
> the
> > > > > duration
> > > > > > > of
> > > > > > > > every individual step). If you tell me which ones you're
> > > > particularly
> > > > > > > > interested in (probably total duration of each node in the
> test
> > > > > stage),
> > > > > > > I'm
> > > > > > > > happy to provide them.
> > > > > > > >
> > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > - job
> > > > > > > > - branch
> > > > > > > > - stage
> > > > > > > > - node
> > > > > > > > - step
> > > > > > > >
> > > > > > > > Unfortunately I don't have the possibility to export them
> since
> > > we
> > > > > > store
> > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > exports.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Marco
> > > > > > > >
> > > > > > > > Carin Meier  schrieb am Mi., 14. Aug.
> > > 2019,
> > > > > > 19:43:
> > > > > > > >
> > > > > > > > > I would prefer to keep the language binding in the PR
> > process.
> > > > > > Perhaps
> > > > > > > we
> 

Re: MXNet CI repository

2019-08-15 Thread Pedro Larroy
Nice.

On Thu, Aug 15, 2019 at 12:47 PM Marco de Abreu 
wrote:

> Repository has been created: https://github.com/apache/incubator-mxnet-ci
>
> I will fill it soon.
>
> -Marco
>
> On Thu, Aug 15, 2019 at 8:43 PM Carin Meier  wrote:
>
> > +1
> >
> > On Thu, Aug 15, 2019 at 2:37 PM Chaitanya Bapat 
> > wrote:
> >
> > > +1
> > > LGTM!
> > >
> > > On Thu, 15 Aug 2019 at 11:01, Marco de Abreu 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I'd like to propose a repository where CI infrastructure code can be
> > > > stored. I'd propose "incubator-mxnet-ci". Is everybody fine with that
> > > name
> > > > or has a better idea?
> > > >
> > > > Best regards
> > > > Marco
> > > >
> > >
> > >
> > > --
> > > *Chaitanya Prakash Bapat*
> > > *+1 (973) 953-6299*
> > >
> > >
> >
>


Re: CI and PRs

2019-08-15 Thread Aaron Markham
Many of the CI pipelines follow this pattern:
Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
repeat steps 1-3 over and over?

Now, some tests use a stashed binary and docker cache. And I see this work
locally, but for the most part, on CI, you're gonna sit through a
dependency install.

I noticed that almost all jobs use an ubuntu setup that is fully loaded.
Without cache, it can take 10 or more minutes to build.  So I made a lite
version. Takes only a few minutes instead.

In some cases archiving worked great to share across pipelines, but as
Marco mentioned we need a storage solution to make that happen. We can't
archive every intermediate artifact for each PR.

On Thu, Aug 15, 2019, 13:47 Pedro Larroy 
wrote:

> Hi Aaron. Why speeds things up? What's the difference?
>
> Pedro.
>
> On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham 
> wrote:
>
> > The PRs Thomas and I are working on for the new docs and website share
> the
> > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >
> > On Wed, Aug 14, 2019, 18:16 Chris Olivier  wrote:
> >
> > > I see it done daily now, and while I can’t share all the details, it’s
> > not
> > > an incredibly complex thing, and involves not much more than nfs/efs
> > > sharing and remote ssh commands.  All it takes is a little ingenuity
> and
> > > some imagination.
> > >
> > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > pedro.larroy.li...@gmail.com
> > > >
> > > wrote:
> > >
> > > > Sounds good in theory. I think there are complex details with regards
> > of
> > > > resource sharing during parallel execution. Still I think both ways
> can
> > > be
> > > > explored. I think some tests run for unreasonably long times for what
> > > they
> > > > are doing. We already scale parts of the pipeline horizontally across
> > > > workers.
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> cjolivie...@apache.org>
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Rather than remove tests (which doesn’t scale as a solution), why
> not
> > > > scale
> > > > > them horizontally so that they finish more quickly? Across
> processes
> > or
> > > > > even on a pool of machines that aren’t necessarily the build
> machine?
> > > > >
> > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > marco.g.ab...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > With regards to time I rather prefer us spending a bit more time
> on
> > > > > > maintenance than somebody running into an error that could've
> been
> > > > caught
> > > > > > with a test.
> > > > > >
> > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > > quite
> > > > > > some time now, but nobody noticed that. Basically my stance on
> that
> > > > > matter
> > > > > > is that as soon as something is not blocking, you can also just
> > > > > deactivate
> > > > > > it since you don't have a forcing function in an open source
> > project.
> > > > > > People will rarely come back and fix the errors of some nightly
> > test
> > > > that
> > > > > > they introduced.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > Carin Meier  schrieb am Mi., 14. Aug.
> 2019,
> > > > 21:59:
> > > > > >
> > > > > > > If a language binding test is failing for a not important
> reason,
> > > > then
> > > > > it
> > > > > > > is too brittle and needs to be fixed (we have fixed some of
> these
> > > > with
> > > > > > the
> > > > > > > Clojure package [1]).
> > > > > > > But in general, if we're thinking of the MXNet project as one
> > project
> > > > that
> > > > > > is
> > > > > > > across all the language bindings, then we want to know if some
> > > > > > fundamental
> > > > > > > code change is going to break a downstream package.
> > > > > > > I can't speak for all the high level package binding
> maintainers,
> > > but
> > > > > I'm
> > > > > > > always happy to pitch in to provide code fixes to help the base
> > PR
> > > > get
> > > > > > > green.
> > > > > > >
> > > > > > > The time costs to maintain such a large CI project obviously
> > needs
> > > to
> > > > > be
> > > > > > > considered as well.
> > > > > > >
> > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > pedro.larroy.li...@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > > reasonable.
> > > > > > > > The only question is that when a binding such as R, Perl or
> > > Clojure
> > > > > > > fails,
> > > > > > > > some devs are a bit confused about how to fix them since they
> > are
> > > > not
> > > > > > > > familiar with the testing tools and the language.
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > carinme...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Great idea Marco! Anything that you think would be valuabl

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
Do I understand it correctly that you are saying that the Docker cache
doesn't work properly and regularly reinstalls dependencies? Or do you mean
that you only have cache misses when you modify the dependencies - which
would be expected?

-Marco

On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham 
wrote:

> Many of the CI pipelines follow this pattern:
> Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> repeat steps 1-3 over and over?
>
> Now, some tests use a stashed binary and docker cache. And I see this work
> locally, but for the most part, on CI, you're gonna sit through a
> dependency install.
>
> I noticed that almost all jobs use an ubuntu setup that is fully loaded.
> Without cache, it can take 10 or more minutes to build.  So I made a lite
> version. Takes only a few minutes instead.
>
> In some cases archiving worked great to share across pipelines, but as
> Marco mentioned we need a storage solution to make that happen. We can't
> archive every intermediate artifact for each PR.
>
> On Thu, Aug 15, 2019, 13:47 Pedro Larroy 
> wrote:
>
> > Hi Aaron. Why does it speed things up? What's the difference?
> >
> > Pedro.
> >
> > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham  >
> > wrote:
> >
> > > The PRs Thomas and I are working on for the new docs and website share
> > the
> > > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > >
> > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> wrote:
> > >
> > > > I see it done daily now, and while I can’t share all the details,
> it’s
> > > not
> > > > an incredibly complex thing, and involves not much more than nfs/efs
> > > > sharing and remote ssh commands.  All it takes is a little ingenuity
> > and
> > > > some imagination.
> > > >
> > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > pedro.larroy.li...@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Sounds good in theory. I think there are complex details with
> regards
> > > of
> > > > > resource sharing during parallel execution. Still I think both ways
> > can
> > > > be
> > > > > explored. I think some tests run for unreasonably long times for
> what
> > > > they
> > > > > are doing. We already scale parts of the pipeline horizontally
> across
> > > > > workers.
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > cjolivie...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > not
> > > > > scale
> > > > > > them horizontally so that they finish more quickly? Across
> > processes
> > > or
> > > > > > even on a pool of machines that aren’t necessarily the build
> > machine?
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > marco.g.ab...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > With regards to time I rather prefer us spending a bit more
> time
> > on
> > > > > > > maintenance than somebody running into an error that could've
> > been
> > > > > caught
> > > > > > > with a test.
> > > > > > >
> > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> for
> > > > quite
> > > > > > > some time now, but nobody noticed that. Basically my stance on
> > that
> > > > > > matter
> > > > > > > is that as soon as something is not blocking, you can also just
> > > > > > deactivate
> > > > > > > it since you don't have a forcing function in an open source
> > > project.
> > > > > > > People will rarely come back and fix the errors of some nightly
> > > test
> > > > > that
> > > > > > > they introduced.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Carin Meier  schrieb am Mi., 14. Aug.
> > 2019,
> > > > > 21:59:
> > > > > > >
> > > > > > > > If a language binding test is failing for a not important
> > reason,
> > > > > then
> > > > > > it
> > > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > these
> > > > > with
> > > > > > > the
> > > > > > > > Clojure package [1]).
> > > > > > > > But in general, if we're thinking of the MXNet project as one
> > > project
> > > > > that
> > > > > > > is
> > > > > > > > across all the language bindings, then we want to know if
> some
> > > > > > > fundamental
> > > > > > > > code change is going to break a downstream package.
> > > > > > > > I can't speak for all the high level package binding
> > maintainers,
> > > > but
> > > > > > I'm
> > > > > > > > always happy to pitch in to provide code fixes to help the
> base
> > > PR
> > > > > get
> > > > > > > > green.
> > > > > > > >
> > > > > > > > The time costs to maintain such a large CI project obviously
> > > needs
> > > > to
> > > > > > be
> > > > > > > > considered as well.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > pedro.larroy.li...@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > >

MxNet/XLA

2019-08-15 Thread Chris Olivier
TensorFlow and PyTorch seem to have XLA compatibility (PyTorch is probably
not as stable as TensorFlow in this respect, I imagine), and maybe others
that I don’t know about directly. Is anyone currently working on XLA
support for MXNet?


-Chris


Re: CI and PRs

2019-08-15 Thread Aaron Markham
When you create a new Dockerfile and use that on CI, it doesn't seem
to cache some of the steps... like this:

Step 13/15 : RUN /work/ubuntu_docs.sh
 ---> Running in a1e522f3283b
+ echo 'Installing dependencies...'
+ apt-get update
Installing dependencies.

Or this

Step 4/13 : RUN /work/ubuntu_core.sh
 ---> Running in e7882d7aa750
+ apt-get update

I get it if I were changing those scripts, but then I'd think it should
cache after running it once... but no.
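
A quick way to check the layer cache locally (the Dockerfile path below is
a guess): build the same image twice in a row; on the second run every
unchanged step should print " ---> Using cache" instead of re-running the
install scripts.

# path and tag are examples only
docker build -f docker/Dockerfile.docs -t mxnet-docs-ci .
docker build -f docker/Dockerfile.docs -t mxnet-docs-ci .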


On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu  wrote:
>
> Do I understand it correctly that you are saying that the Docker cache
> doesn't work properly and regularly reinstalls dependencies? Or do you mean
> that you only have cache misses when you modify the dependencies - which
> would be expected?
>
> -Marco
>
> On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham 
> wrote:
>
> > Many of the CI pipelines follow this pattern:
> > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > repeat steps 1-3 over and over?
> >
> > Now, some tests use a stashed binary and docker cache. And I see this work
> > locally, but for the most part, on CI, you're gonna sit through a
> > dependency install.
> >
> > I noticed that almost all jobs use an ubuntu setup that is fully loaded.
> > Without cache, it can take 10 or more minutes to build.  So I made a lite
> > version. Takes only a few minutes instead.
> >
> > In some cases archiving worked great to share across pipelines, but as
> > Marco mentioned we need a storage solution to make that happen. We can't
> > archive every intermediate artifact for each PR.
> >
> > On Thu, Aug 15, 2019, 13:47 Pedro Larroy 
> > wrote:
> >
> > > Hi Aaron. Why does it speed things up? What's the difference?
> > >
> > > Pedro.
> > >
> > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham  > >
> > > wrote:
> > >
> > > > The PRs Thomas and I are working on for the new docs and website share
> > > the
> > > > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > > >
> > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> > wrote:
> > > >
> > > > > I see it done daily now, and while I can’t share all the details,
> > it’s
> > > > not
> > > > > an incredibly complex thing, and involves not much more than nfs/efs
> > > > > sharing and remote ssh commands.  All it takes is a little ingenuity
> > > and
> > > > > some imagination.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > pedro.larroy.li...@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Sounds good in theory. I think there are complex details with
> > regards
> > > > of
> > > > > > resource sharing during parallel execution. Still I think both ways
> > > can
> > > > > be
> > > > > > explored. I think some tests run for unreasonably long times for
> > what
> > > > > they
> > > > > > are doing. We already scale parts of the pipeline horizontally
> > across
> > > > > > workers.
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > cjolivie...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > > not
> > > > > > scale
> > > > > > > them horizontally so that they finish more quickly? Across
> > > processes
> > > > or
> > > > > > > even on a pool of machines that aren’t necessarily the build
> > > machine?
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > marco.g.ab...@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > With regards to time I rather prefer us spending a bit more
> > time
> > > on
> > > > > > > > maintenance than somebody running into an error that could've
> > > been
> > > > > > caught
> > > > > > > > with a test.
> > > > > > > >
> > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> > for
> > > > > quite
> > > > > > > > some time now, but nobody noticed that. Basically my stance on
> > > that
> > > > > > > matter
> > > > > > > > is that as soon as something is not blocking, you can also just
> > > > > > > deactivate
> > > > > > > > it since you don't have a forcing function in an open source
> > > > project.
> > > > > > > > People will rarely come back and fix the errors of some nightly
> > > > test
> > > > > > that
> > > > > > > > they introduced.
> > > > > > > >
> > > > > > > > -Marco
> > > > > > > >
> > > > > > > > Carin Meier  schrieb am Mi., 14. Aug.
> > > 2019,
> > > > > > 21:59:
> > > > > > > >
> > > > > > > > > If a language binding test is failing for a not important
> > > reason,
> > > > > > then
> > > > > > > it
> > > > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > > these
> > > > > > with
> > > > > > > > the
> > > > > > > > > Clojure package [1]).
> > > > > > > > > But in general, if we're thinking of the MXNet project as one
> > > > project
> > > > > > that
> > > > > > > > is
> > > > > > > > > across all the language bindings, then we

Re: CI and PRs

2019-08-15 Thread Marco de Abreu
It reruns as soon as that particular script has been modified. Since the
following steps depend on it, once step 4 has a cache miss, steps 5-15 are
also no longer valid.

Our cache is always controlled by master. This means the only thing that
matters is the diff between your branch and master, not whether it has
already been run once. A single Jenkins run juggles over 100 GB of Docker
images; if we kept a cache that recorded every single occurrence, the
storage requirements and traffic would be very expensive. Thus, the most
efficient and least error-prone approach was to make master the branch
that defines the cache.
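
For illustration, roughly what that looks like in practice (registry and
image names are made up): the images built on master get pushed, and every
PR build seeds its layer cache from them, so only the diff against master
produces cache misses.

# pull the cache image produced by the last master run (names made up)
docker pull registry.example.com/mxnet-ci/build.ubuntu_cpu:master || true
# build the PR image, reusing any layers that still match master
docker build \
  --cache-from registry.example.com/mxnet-ci/build.ubuntu_cpu:master \
  -f docker/Dockerfile.build.ubuntu_cpu -t mxnet-ci/build.ubuntu_cpu:pr .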

-Marco

Aaron Markham  schrieb am Fr., 16. Aug. 2019,
04:06:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
> + echo 'Installing dependencies...'
> + apt-get update
> Installing dependencies.
>
> Or this
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
> + apt-get update
>
> I get it if I were changing those scripts, but then I'd think it should
> cache after running it once... but no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu 
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> mean
> > that you only have cache misses when you modify the dependencies - which
> > would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> aaron.s.mark...@gmail.com>
> > wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache. And I see this
> work
> > > locally, but for the most part, on CI, you're gonna sit through a
> > > dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully
> loaded.
> > > Without cache, it can take 10 or more minutes to build.  So I made a
> lite
> > > version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as
> > > Marco mentioned we need a storage solution to make that happen. We
> can't
> > > archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy  >
> > > wrote:
> > >
> > > > Hi Aaron. Why does it speed things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> aaron.s.mark...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website
> share
> > > > the
> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier 
> > > wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details,
> > > it’s
> > > > > not
> > > > > > an incredibly complex thing, and involves not much more than
> nfs/efs
> > > > > > sharing and remote ssh commands.  All it takes is a little
> ingenuity
> > > > and
> > > > > > some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > > pedro.larroy.li...@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with
> > > regards
> > > > > of
> > > > > > > resource sharing during parallel execution. Still I think both
> ways
> > > > can
> > > > > > be
> > > > > > > explored. I think some tests run for unreasonably long times
> for
> > > what
> > > > > > they
> > > > > > > are doing. We already scale parts of the pipeline horizontally
> > > across
> > > > > > > workers.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > > cjolivie...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Rather than remove tests (which doesn’t scale as a
> solution), why
> > > > not
> > > > > > > scale
> > > > > > > > them horizontally so that they finish more quickly? Across
> > > > processes
> > > > > or
> > > > > > > > even on a pool of machines that aren’t necessarily the build
> > > > machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > > marco.g.ab...@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > With regards to time I rather prefer us spending a bit more
> > > time
> > > > on
> > > > > > > > > maintenance than somebody running into an error that
> could've
> > > > been
> > > > > > > caught
> > > > > > > > > with a test.
> > > > > > > > >
> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> broken
> > > for
> > > > > > quite
> > > > > > > >