A few comments/suggestions:

1) Can we have this nice list of to-do items on the Apache MXNet wiki page
so we can track them better?

2) Can we have a set of owners for each set of tests and each source code
directory? One problem I have observed is that when there is a test
failure, it is difficult to find an owner who will take responsibility for
fixing the test or for identifying the culprit code promptly -- this causes
master to keep failing for many days.

3) Specifically, we need an owner for the Windows setup -- nobody seems to
know much about it -- please feel free to correct me if I am wrong.

4) +1 to having a list of all feature requests in Jira or a similarly
common and easily accessible system.

5) -1 to the branching model -- at Informix I was the gatekeeper for the
branching model used to merge database kernel code to master, alongside my
day job as a database kernel engineer, for around 9 months, and my opinion
from that experience is that a branching model just shifts the burden from
one place to another. We don't have a dedicated team to run such a model.
If we really need a buildable master every day, then we could simply tag
every successful build on master as last_clean_build and use this tag to
get a clean master at any time (a minimal sketch of this follows below).
How many Apache projects do their development on separate branches?
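
To make that concrete, here is a minimal sketch in Python of what such a
post-build tagging step could look like -- the tag name, the remote, and
the assumption that this runs as a CI step with push rights to the repo
are all mine, not an existing part of our setup:

# Hypothetical post-build step: move a "last_clean_build" tag to the
# current master commit after all tests have passed on that commit.
# Assumes git is on PATH and the CI job has push rights to "origin".
import subprocess
import sys


def tag_last_clean_build(tag="last_clean_build", remote="origin"):
    """Force-move `tag` to the current commit and push it to `remote`."""
    try:
        subprocess.run(["git", "tag", "-f", tag], check=True)
        subprocess.run(["git", "push", "-f", remote, tag], check=True)
    except subprocess.CalledProcessError as err:
        print("Failed to update %s: %s" % (tag, err), file=sys.stderr)
        return False
    return True


if __name__ == "__main__":
    # Intended to be called only after the full test suite passed on master.
    sys.exit(0 if tag_last_clean_build() else 1)

Anyone who needs a known-good tree could then check out the
last_clean_build tag instead of the HEAD of master.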

6) FYI: Rahul (rahul003@) has fixed various warnings in this PR:
https://github.com/apache/incubator-mxnet/pull/7109 and has added a test
that fails when any warning is found. We can build on top of his work.
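
I have not studied the details of that test, so purely as a hypothetical
illustration of the idea, a CI check along these lines could scan a
captured build log and fail whenever a warning slips through (the log path
and the warning pattern below are my assumptions, not what the PR does):

# Hypothetical check: fail if the captured build log contains compiler
# warnings. The log location and the warning pattern are assumptions;
# the real test in PR #7109 may work differently.
import re
import sys

WARNING_RE = re.compile(r"\bwarning\s*[:C]", re.IGNORECASE)


def find_warnings(log_path):
    """Return the lines in `log_path` that look like compiler warnings."""
    with open(log_path) as log:
        return [line.rstrip() for line in log if WARNING_RE.search(line)]


if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "build.log"
    warnings = find_warnings(log_file)
    for line in warnings:
        print(line)
    # A non-zero exit marks the CI step as failed when any warning is found.
    sys.exit(1 if warnings else 0)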

7) FYI: For the unit-test problems, Meghna identified that some of the
unit-test run times have increased significantly in recent builds. We need
volunteers to help diagnose the root cause here (a rough sketch for
comparing per-test run times between two builds follows the table):

Unit Test Task       Build #337   Build #500   Build #556
Python 2: GPU Win            25           38           40
Python 3: GPU Win            15           38           46
Python 2: CPU                25           35           80
Python 3: CPU                14           28           72
R: CPU                       20           34           24
R: GPU                        5           24           24

8) Ensure that all PRs submitted have corresponding documentation on
http://mxnet.io. It may be fine for documentation to follow the code
changes, as long as there is ownership to ensure this task gets done in a
timely manner. For example, I have requested the Nvidia team to submit PRs
to update the documentation on http://mxnet.io for the Volta changes to
MXNet.


9) Ensure that mega-PRs have some level of design or architecture
document(s) shared on the Apache MXNet wiki. A mega-PR must include both
unit tests and nightly/integration tests to demonstrate a high level of
quality.


10) Finally, how do we get ownership for code submitted to MXNet? When
something fails in a code segment that only a small set of folks know
about, what is the expected SLA for a response from them? When users deploy
MXNet in production environments, they will expect some form of SLA for
support and a patch release.


Regards,
Bhavin Thaker.






On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pedro.larroy.li...@gmail.com>
wrote:

> +1  That would be great.
>
> On Mon, Oct 30, 2017 at 5:35 PM, Hen <bay...@apache.org> wrote:
> > How about we ask for a new mxnet repo to store all the config in?
> >
> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <pedro.larroy.li...@gmail.com>
> > wrote:
> >
> >> Just to provide a high level overview of the ideas and proposals
> >> coming from different sources for the requirements for testing and
> >> validation of builds:
> >>
> >> * Have terraform files for the testing infrastructure. Infrastructure
> >> as code (IaC). Minus not emulated / nor cloud based, embedded
> >> hardware. ("single command" replication of the testing infrastructure,
> >> no manual steps).
> >>
> >> * CI software based on Jenkins, unless someone thinks there's a better
> >> alternative.
> >>
> >> * Use autoscaling groups and improve staggered build + test steps to
> >> achieve higher parallelism and shorter feedback times.
> >>
> >> * Switch to a branching model based on stable master + integration
> >> branch. PRs are merged into dev/integration which runs extended
> >> nightly tests, which are
> >> then merged into master, preferably in an automated way after
> >> successful extended testing.
> >> Master is always tested, and always buildable. Release branches or
> >> tags in master as usual for releases.
> >>
> >> * Build + test feedback time targeting less than 15 minutes.
> >> (Currently a build in a 16x core takes 7m). This involves lot of
> >> refactoring of tests, move expensive tests / big smoke tests to
> >> nightlies on the integration branch, also tests on IoT devices / power
> >> and performance regressions...
> >>
> >> * Add code coverage and other quality metrics.
> >>
> >> * Eliminate warnings and treat warnings as errors. We have spent time
> >> tracking down "undefined behaviour" bugs that could have been caught
> >> by compiler warnings.
> >>
> >> Is there something I'm missing or additional things that come to your
> >> mind that you would wish to add?
> >>
> >> Pedro.
> >>
>
