Re: [Proposal] Stabilizing Apache MXNet CI build system

Hen Wed, 01 Nov 2017 11:06:56 -0700

Some inline thoughts.

On Wed, Nov 1, 2017 at 9:41 AM, Bhavin Thaker <bhavintha...@gmail.com>
wrote:


> Few comments/suggestions:
>
> 1) Can we have this nice list of todo items on the Apache MXNet wiki page
> to track them better?
>
> 2) Can we have a set of owners for each set of tests and source code
> directory? One of the problems I have observed is that when there is a test
> failure, it is difficult to find an owner who will take the responsibility
> of fixing the test OR identifying the culprit code promptly -- this causes
> the master to continue to fail for many days.
>

On this one, we're all volunteers and there shouldn't be situations of
"Bob's permission is needed to edit this file", or "We're waiting on Alice
to do that work". The project as a whole owns this.

Agreed that this can cause a tragedy of the commons, but raising the bar on
being a committer to someone who has the privilege of 24/7 time on the
project is worse.

As an employer of contributors, something you could do internally at Amazon
is to identify experts who own (from Amazon's point of view) contributions
to that area and they can be the ones you poke on an issue (internally).


>
> 3) Specifically, we need an owner for the Windows setup -- nobody seems to
> know much about it -- please feel free to correct me if required.
>

If there's no one in the community who can support it, then a) we should
seek someone (help wanted etc) on the lists/website/twitter, and b) if that
fails, we should move it to a contrib/deprecated path.


>
> 4) +1 to have a list of all feature requests on Jira or a similar commonly
> and easily accessible system.
>
> 5) -1 to the branching model -- I was the gatekeeper for the branching
> model at Informix for the database kernel code to be merged to master along
> with my day-job of being a database kernel engineer for around 9 months and
> hence have the opinion that a branching model just shifts the burden from
> one place to another. We don't have a dedicated team to do the branching
> model. If we really need a buildable master everyday, then we could just
> tag every successful build as last_clean_build on master -- use this tag to
> get a clean master at any time. How many Apache projects are doing
> development on separate branches?
>

Typically I would expect separate-branch development to happen when a
project is experimenting with multiple futures. Most projects do have
multiple branches (I'd guess typically only 2), though, to support bugfixes
to older versions and new code on newer versions.
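
For what it's worth, the last_clean_build tagging idea in (5) could be
sketched roughly as below. This is only an illustration, not something the
project has today: the function names are made up, and it assumes the CI job
already knows the commit hash and whether the build passed. The only real
command used is `git tag --force`.

```python
import subprocess

# Tag name taken from the suggestion in item 5; everything else is hypothetical.
TAG = "last_clean_build"


def tag_command(commit, tag=TAG):
    """Build the git command that moves `tag` to `commit`."""
    return ["git", "tag", "--force", tag, commit]


def mark_clean_build(commit, build_passed, run=subprocess.run):
    """Move the tag only when the build succeeded; return True if it moved."""
    if not build_passed:
        return False
    # check=True raises CalledProcessError if git exits with a non-zero status.
    run(tag_command(commit), check=True)
    return True
```

A CI job would also have to publish the moved tag to the shared repository,
which is left out here. Passing `run` as a parameter is just so the logic can
be exercised without touching a real repository.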


>
> 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> https://github.com/apache/incubator-mxnet/pull/7109 and has a test added
> that fails for any warning found. We can build on top of his work.
>
> 7) FYI: For the unit-tests problems, Meghna identified that some of the
> unit-test run times have increased significantly in the recent builds. We
> need volunteers to help diagnose the root-cause here:
>
> Unit Test Task       Build #337   Build #500   Build #556
> Python 2: GPU Win        25           38           40
> Python 3: GPU Win        15           38           46
> Python 2: CPU            25           35           80
> Python 3: CPU            14           28           72
> R: CPU                   20           34           24
> R: GPU                    5           24           24
>
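
As a starting point for whoever picks up (7), the run times quoted above
could be ranked by how much each task grew between the first and last build
listed. A rough sketch follows; the 3x threshold and the function names are
my own assumptions, and the numbers are simply copied from the message, in
whatever units it reports.

```python
# Run times copied from the quoted message (Build #337, #500, #556).
RUN_TIMES = {
    "Python 2: GPU Win": (25, 38, 40),
    "Python 3: GPU Win": (15, 38, 46),
    "Python 2: CPU": (25, 35, 80),
    "Python 3: CPU": (14, 28, 72),
    "R: CPU": (20, 34, 24),
    "R: GPU": (5, 24, 24),
}


def growth(times):
    """Ratio of the latest run time to the earliest one."""
    return times[-1] / times[0]


def slowest_growing(run_times, threshold=3.0):
    """Tasks whose run time grew by at least `threshold`x, worst first."""
    flagged = [
        (task, growth(times))
        for task, times in run_times.items()
        if growth(times) >= threshold
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


for task, ratio in slowest_growing(RUN_TIMES):
    print(f"{task}: {ratio:.1f}x")
```

On these numbers the Python 3 CPU task grew the most, followed by R on GPU,
so those look like reasonable places to start digging.
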
> 8) Ensure that all PRs submitted have corresponding documentation on
> http://mxnet.io. It may be fine to have documentation follow the
> code changes as long as there is ownership that this task will be done in a
> timely manner. For example, I have requested the Nvidia team to submit PRs
> to update documentation on http://mxnet.io for the Volta changes to MXNet.
>

Why not expect documentation as a part of the PR?


>
>
> 9) Ensure that mega-PRs have some level of design or architecture
> document(s) shared on the Apache MXNet wiki. The mega-PR must have both
> unit-tests and nightly/integration tests submitted to demonstrate
> high-quality level.
>

+1. These are the ones that should be having a dev@ discussion.


>
>
> 10) Finally, how do we get ownership for code submitted to MXNet? When
> something fails in a code segment that only a small set of folks know
> about, what is the expected SLA for a response from them? When users deploy
> MXNet in production environments, they will expect some form of SLA for
> support and a patch release.
>

Users can expect what they want. What they get is best effort/good
intentions. If they want someone to supply an SLA, then they can pay a
vendor who repackages MXNet/builds upon MXNet for that service.

Part of the value of Open Source is that users can always fix the issue
themselves; they are not beholden to a third party to fix it for them (and
thus need an SLA). For something like OpenOffice there is an obvious issue
there: many of its users would need longer to come up to speed to fix the
issue and the likely reply; but for MXNet, many of its users do know how to
code and don't need to go learn a programming language before starting to
look at the bug. This is also why it's very important that the MXNet
documentation explains how to get the source, how to build the source, and
how to contribute.

Security vulnerabilities are a little different. While good intentions
remain, it's assumed that a healthy project can fulfill the good
intentions, and repeated security issues without resolution will quickly
raise the question of whether the project is mature enough for its user
base.

Hen
