Hi Chris, you are reading confrontation or negativity into things where
there is no bad intention, just diverse opinions and different ways of
expressing them.
Marco and I went out for a beer and dinner together, talked about this, and
had a good exchange of technical ideas and opinions with mutual respect.
Pedro,
I don’t see where Marco says that he “designed and implemented all aspects
of CI by himself”. I do think, however, that it’s fair to say that Marco
was in charge of the design and most likely made the majority of design
decisions as the CI was being built, especially around those tenets t
Thanks for your response, Marco. I think you have totally missed my original
point, which was basically that someone volunteering effort on the CI is as
important as someone contributing a feature. From my perspective this
hasn't been the case, and we had to rely a lot on you and Sheng to submit
fixes
I've heard this request multiple times and so far, I'm having issues
understanding the direct correlation between having committer permissions
and being able to manage CI.
When I designed the CI, one of the tenets was maintainability and
accessibility for the community: I wanted to avoid a situation where somebody
As Marco has open sourced the bulk of the CI infrastructure donated by
Amazon to the community, I would like to raise the recommendation that the
community take action to help volunteers working on the CI have a better
experience. It's my impression that, in the past, there hasn't been much
action
Hi Aaron. This is difficult to diagnose, because I don't know what to do
when the hash of a layer in Docker doesn't match and Docker decides to
rebuild it. The R script seems not to have changed. I have observed this in
the past and I think it is due to bugs in Docker. Maybe Kellen is able to
give some t
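For what it's worth, here are the standard Docker commands I would use to at
least narrow it down (image and file names below are only examples, not
necessarily the exact ones our CI uses):

    # inspect the layer history of the cached image to see where the IDs diverge
    docker history --no-trunc mxnetci/build.ubuntu_cpu
    # if the cache really is corrupt, force a clean rebuild of the image
    docker build --no-cache -f docker/Dockerfile.build.ubuntu_cpu .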
Is -R already in there?
Here's an example of it happening to me right now: I am making
minor changes to the runtime_functions logic for handling the R docs
output. I pull the fix, then run the container, but I see the R deps
layer re-running. I didn't touch that. Why is that running again?
Also, I forgot: another workaround is that I added the -R flag to the build
logic (build.py) so the container is not rebuilt for manual use.
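In case it helps, a rough sketch of how I use it locally (only the -R flag is
real here; the platform and build function names are just placeholders, not
necessarily the exact ones from our Jenkinsfiles):

    # skip rebuilding the container image and reuse the cached one
    ci/build.py -R --platform ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu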
On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy
wrote:
>
> Hi Aaron.
>
> As Marco explained, if you are in master the cache usually works, there's
> two issu
Hi Aaron.
As Marco explained, if you are in master the cache usually works. There are
two issues that I have observed:
1 - Docker doesn't automatically pull the base image (e.g. ubuntu:16.04), so
if your cached base image, which is used in the FROM statement, becomes
outdated, your caching won't work. (Using
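A plain-Docker way to work around the stale base image (these are standard
Docker flags; the Dockerfile path below is just an example):

    # refresh the base image so the FROM layer matches what CI uses
    docker pull ubuntu:16.04
    # or ask Docker to refresh it as part of the build
    docker build --pull -f docker/Dockerfile.build.ubuntu_cpu .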
It's rerunning as soon as that particular script has been modified. Since
the following steps depend on it, once step 4 has a cache mismatch, steps
5-15 are no longer valid either.
Our cache is always controlled by master. This means that the only thing
that matters is the diff between
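To make the mechanics concrete (the paths here are made up; the behaviour is
just standard Docker layer caching):

    # touching a script that gets COPY'd into the image changes its checksum,
    # so that layer and every later RUN layer are rebuilt; earlier layers stay cached
    touch docker/install/ubuntu_r.sh
    docker build -f docker/Dockerfile.build.ubuntu_cpu .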
When you create a new Dockerfile and use that on CI, it doesn't seem
to cache some of the steps... like this:
Step 13/15 : RUN /work/ubuntu_docs.sh
---> Running in a1e522f3283b
+ echo 'Installing dependencies...'
+ apt-get update
Installing dependencies...
Or this
Step 4/13 : RUN /wo
Do I understand it correctly that you are saying that the Docker cache
doesn't work properly and regularly reinstalls dependencies? Or do you mean
that you only have cache misses when you modify the dependencies - which
would be expected?
-Marco
On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham
wrote:
Many of the CI pipelines follow this pattern:
Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
repeat steps 1-3 over and over?
Now, some tests use a stashed binary and docker cache. And I see this work
locally, but for the most part, on CI, you're gonna sit through a
dependency
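The "build once, reuse the binary" idea roughly looks like this (paths and
function names below are placeholders; on Jenkins this is what stash/unstash
does for us):

    # build stage: compile once and archive the artifacts
    ci/build.py --platform ubuntu_cpu /work/runtime_functions.sh build_ubuntu_cpu
    tar -czf mxnet_binary.tar.gz lib/ python/
    # test stages: unpack the prebuilt binary instead of repeating steps 1-3
    tar -xzf mxnet_binary.tar.gz
    nosetests tests/python/unittest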
Hi Chris.
I suggest you send a PR to illustrate your proposal so we have a concrete
example to look into.
Pedro.
On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier wrote:
> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex thing, and involves not much
Hi Aaron. Why does it speed things up? What's the difference?
Pedro.
On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham
wrote:
> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>
> On Wed, Aug 14, 2019, 18:16
any documentation on the S3 publishing steps and how to test this.
> > >
> > > * After breaking out each docs package in its own pipeline, I see
> > > opportunities to use the GitHub API to check the PR payload and be
> > > selective about wha
2019 at 10:03 PM Zhao, Patric
> > wrote:
> > >
> > > Hi Aaron,
> > >
> > > Recently, we are working on improving the documentation of the CPU backend
> > > based on the current website.
> > >
> > > I saw there are several PRs to update the new website
when the new website will be online.
> > If it's very near, we will switch our work to the new website.
> >
> > Thanks,
> >
> > --Patric
> >
> >
> > > -Original Message-
> > > From: Aaron Markham
> > > Sent: Thursday, Augu
our work to the new website.
>
> Thanks,
>
> --Patric
>
>
> > -Original Message-
> > From: Aaron Markham
> > Sent: Thursday, August 15, 2019 11:40 AM
> > To: dev@mxnet.incubator.apache.org
> > Subject: Re: CI and PRs
> >
> > The PRs Thomas
No worries, auto scaling is taking care of that :)
-Marco
Sheng Zha wrote on Thu., Aug. 15, 2019, 19:50:
> The AWS Batch approach should also help with hardware utilization as
> machines are launched only when needed :)
>
> -sz
>
> > On Aug 15, 2019, at 9:11 AM, Marco de Abreu
> wrote:
> >
>
The AWS Batch approach should also help with hardware utilization as machines
are launched only when needed :)
-sz
> On Aug 15, 2019, at 9:11 AM, Marco de Abreu wrote:
>
> Thanks Leonard. Naively dividing by test files would certainly be an easy
> and doable way before going into to proper nos
Thanks Leonard. Naively dividing by test files would certainly be an easy
and doable way before going into proper nose parallelization. Great idea!
Scalability in terms of nodes is not an issue. Our system can handle at
least 600 slaves (didn't want to go higher for obvious reasons). But I
think
To parallelize across machines: For GluonNLP we started submitting test
jobs to AWS Batch. Just adding a for-loop over the units in the
Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
Jenkins just waits for all jobs to finish and retrieves their status.
This works since AWS
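For reference, the shape of that submission step if it were done straight
through the AWS CLI (the queue, job definition and parameter names here are
invented; the GluonNLP Jenkinsfile wraps essentially the same call):

    # submit one Batch job per test unit and let Jenkins poll for the results
    for unit in unittest_cpu unittest_gpu integration; do
      aws batch submit-job \
        --job-name "mxnet-ci-${unit}" \
        --job-queue ci-test-queue \
        --job-definition ci-test-definition \
        --parameters test_unit="${unit}"
    done
    # afterwards, poll the submitted job ids until they reach SUCCEEDED/FAILED
    aws batch describe-jobs --jobs <job-ids> --query 'jobs[].status'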
The first step wrt parallelization could certainly be to start adding
parallel test execution in nosetests.
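Nose already ships a multiprocess plugin, so something along these lines
would be the starting point (the process count, timeout and test path are
just example values):

    # run test modules in parallel worker processes
    nosetests --processes=4 --process-timeout=900 tests/python/unittest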
-Marco
Aaron Markham wrote on Thu., Aug. 15, 2019,
05:39:
> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speed
switch our work to the new website.
Thanks,
--Patric
> -Original Message-
> From: Aaron Markham
> Sent: Thursday, August 15, 2019 11:40 AM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: CI and PRs
>
> The PRs Thomas and I are working on for the new docs and websi
The PRs Thomas and I are working on for the new docs and website share the
mxnet binary in the new CI pipelines we made. Speeds things up a lot.
On Wed, Aug 14, 2019, 18:16 Chris Olivier wrote:
> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex t
I see it done daily now, and while I can’t share all the details, it’s not
an incredibly complex thing, and involves not much more than nfs/efs
sharing and remote ssh commands. All it takes is a little ingenuity and
some imagination.
On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy
wrote:
> Sounds
Sounds good in theory. I think there are complex details with regard to
resource sharing during parallel execution. Still, I think both ways can be
explored. I think some tests run for unreasonably long times for what they
are doing. We already scale parts of the pipeline horizontally across
worker
+1
Rather than remove tests (which doesn’t scale as a solution), why not scale
them horizontally so that they finish more quickly? Across processes or
even on a pool of machines that aren’t necessarily the build machine?
On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu
wrote:
> With regards to t
Hi Marco.
I have to agree with you on that, from past experience.
What do you suggest for maintenance? Do we need a watermark that fails the
validation if the total runtime exceeds a high threshold?
Pedro.
On Wed, Aug 14, 2019 at 1:03 PM Marco de Abreu
wrote:
> With regards to time I rather p
With regard to time, I'd rather have us spend a bit more time on
maintenance than have somebody run into an error that could've been caught
by a test.
I mean, our Publishing pipeline for Scala GPU has been broken for quite
some time now, but nobody noticed that. Basically my stance on that matter
If a language binding test is failing for an unimportant reason, then it
is too brittle and needs to be fixed (we have fixed some of these with the
Clojure package [1]).
But in general, if we think of the MXNet project as one project that spans
all the language bindings, then we want to know
From what I have seen, Clojure is 15 minutes, which I think is reasonable.
The only question is that when a binding such as R, Perl or Clojure fails,
some devs are a bit confused about how to fix it since they are not
familiar with the testing tools and the language.
On Wed, Aug 14, 2019 at 11:5
Great idea Marco! Anything that you think would be valuable to share would
be good. The duration of each node in the test stage sounds like a good
start.
- Carin
On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu
wrote:
> Hi,
>
> we record a bunch of metrics about run statistics (down to the durati
Hi,
we record a bunch of metrics about run statistics (down to the duration of
every individual step). If you tell me which ones you're particularly
interested in (probably total duration of each node in the test stage), I'm
happy to provide them.
Dimensions are (in hierarchical order):
- job
- b
I would prefer to keep the language bindings in the PR process. Perhaps we
could do some analytics to see how much each of the language bindings is
contributing to overall run time.
If we have some metrics on that, maybe we can come up with a guideline of
how much time each should take. Another poss
Yes, another point is that pushing again to the PR should cancel previous
builds, which is not happening now and wastes resources.
Any ideas on how to make connection errors more robust? The Ivy cache for JVM
packages, for example, could be pre-populated on the workers. It's a balance
between complexity
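As a sketch of the pre-population idea (the message above mentions the Ivy
cache; whichever resolver the Scala package actually uses, the point is to
warm it while the worker image is baked, so the goal and path here are only
illustrative):

    # hypothetical: resolve all JVM dependencies once while building the worker image,
    # so PR builds don't depend on the network being healthy
    cd scala-package && mvn -q dependency:go-offline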
Hi Carin.
That's a good point. All things considered, would your preference be to keep
the Clojure tests as part of the PR process or in Nightly?
Some options are having notifications here or in Slack. But if we think
breakages would go unnoticed, maybe it is not a good idea to fully remove
bindings from
Pedro,
great job of summarizing the set of tasks to restore CI's glory!
As far as your list goes,
> * Address unit tests that take more than 10-20s, streamline them or move
> them to nightly if it can't be done.
I would like to call out this request specifically. I'm tracking # of
timeouts that
Before any binding tests are moved to nightly, I think we need to figure
out how the community can get proper notifications of failure and success
on those nightly runs. Otherwise, I think that breakages would go unnoticed.
-Carin
On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy
wrote:
> Hi
>
> See