I'm going to split this out and raise it as a separate issue.

On 29 August 2012 19:35, Jun Ping Du <j...@vmware.com> wrote:

> Hi Chris and all,
>    Thanks for initiating the discussion. Can I say something from the
> perspective of a contributor, rather than a committer or PMC member?
>    First, I have a feeling that the current Hadoop project process works
> well for contributors delivering a bug fix, but is not so easy for
> delivering a big feature. I have had a great experience with bug-fixing
> work, which gets a quick response from committers and is checked in.
> However, I felt a little frustrated delivering a feature (~5K LOC, very
> important for Hadoop running well on virtualization infrastructure) that
> spans common, HDFS, MapReduce and YARN. First, you have to work out
> which committers to turn to for help on each component, then convince
> them of your ideas and work with them on reviewing and committing the
> code. Each committer has to understand the complete story and learn the
> code pending review as well as the code already checked in. If some
> committers are super busy, the feature can stay pending forever. So,
> based on my experience so far, I have to say this process is not so
> friendly to contributors who come from different organizations with
> different backgrounds but share the same wish to contribute more to
> Apache Hadoop.
>
>

One of the problems here is that a 5KLOC patch is a major change -and
regardless of whether you are a committer or not, you're going to hit a
lot of inertia. My fairly large service lifecycle patch
(https://issues.apache.org/jira/browse/HDFS-326) never survived, and I put
a lot of effort in there as a committer. That was with something I was
visibly doing in a branch of Apache SVN: merging and regression testing
every week, syncing things, testing on my own infrastructure, etc.

Turning up with a large diff without any previous involvement in the
project or collaborative development is going to hit a wall in pretty much
every OSS project, the big issues being not just "why?" and "what does it
break?", but "how is a patch this big going to be maintained?" and "how is
it going to be tested on anything other than the specific platform it was
developed on?". Any test plan that requires custom hardware,
infrastructure &c is tricky. It's hard enough making the jump from the
normal test suite to testing with real workloads on production-scale
clusters; if you start needing specific CPU designs, GPUs, or a
non-standard OS/JVM, then it becomes impossible to regression test these
for a release.

To make things worse, Hadoop is a critical piece of so many companies'
infrastructure: Yahoo!, Facebook, Twitter, LinkedIn, &c. The value of the
code is not the cost of implementation, it is the value of all the data
stored in HDFS.

This is why the barrier to entry of code is much, much lower in contrib/
than it is for the core -and the normal way to isolate work is to design
another extension point into which these things can go, where people can
be confident that changes won't break things, and where someone else takes
on the costs of maintaining and testing their custom extensions.
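
To make this concrete, here is a minimal sketch of what such an extension
point tends to look like. The PlacementPolicy interface and the
example.placement.policy.class key below are invented for illustration,
but the loading pattern (Configuration.getClass() plus
ReflectionUtils.newInstance()) is the one Hadoop's existing plugin points,
such as the rack-awareness DNSToSwitchMapping, already use:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.util.ReflectionUtils;

  /** Hypothetical contract the core codes against; out-of-tree jars
   *  implement it without patching the core. */
  interface PlacementPolicy {
    /** Pick a node for new work; real signature elided for brevity. */
    String chooseNode(String[] candidates);
  }

  /** Trivial fallback used when nothing custom is configured. */
  class DefaultPlacementPolicy implements PlacementPolicy {
    public String chooseNode(String[] candidates) {
      return candidates[0];
    }
  }

  class PlacementPolicies {
    /** Hypothetical config key naming the implementation class. */
    static final String POLICY_KEY = "example.placement.policy.class";

    /** Load whichever implementation the deployer configured. */
    static PlacementPolicy load(Configuration conf) {
      Class<? extends PlacementPolicy> impl = conf.getClass(
          POLICY_KEY, DefaultPlacementPolicy.class, PlacementPolicy.class);
      return ReflectionUtils.newInstance(impl, conf);
    }
  }

A team with virtualization-specific placement logic then sets
example.placement.policy.class to their class name and drops a jar on the
classpath; the core never has to review, maintain or regression-test that
code.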


>    Based on this, for spinning out Hadoop sub-projects to TLPs, I would
> be glad to see a concise committer list for each project; committers
> could then be more focused (with more bandwidth, maybe?) and
> contributors would know whom to turn to for a quick response and help.
> On the other hand, I am concerned it may add complexity to dependencies
> for features that cross sub-projects today, as you would have to work
> out branches for each TLP, and it is hard to estimate when code will go
> live in each TLP's branch (which may create similar complexity for
> committers as well).
>    I don't have many good suggestions, but would be glad to see the
> process made smoother for contributors' work, whatever decision we make
> today. Just my 2 cents.



I do agree we need a better way of developing larger activities that span
more of the system, and then getting them successfully committed.

Some of the what-to-do & what-not-to-do is hinted at near the bottom of
Defining Hadoop (http://wiki.apache.org/hadoop/Defining%20Hadoop), but
there's no formalisation of how to do more major works within the Hadoop
codebase.

The big changes that have worked are:


   1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
   with incremental changes going into trunk, RTC (review-then-commit)
   with lots of tests. This isn't finished, and the test problem there is
   that functional testing of all failure modes requires
   software-controlled fencing devices and switches -and tests to generate
   the expected failure space.
   2. YARN: Arun on his own branch, CTR (commit-then-review), merged once
   mostly stable, and completely replacing MRv1.

How then do we get (a) more dev projects worked on and integrated by the
current committers, and (b) a process in which people who are not yet
contributors/committers can develop non-trivial changes to the project
with the knowledge, support and mentorship of the rest of the community?

This topic has arisen before -and never reached a good answer. How can we
incubate new pieces of work in the project and mentor external
contributions?


-steve
