Hi Matei,

Good call on scaling the project itself. Identifying domain experts in
different areas is a good thing. But I have some questions about the
implementation. Here's my understanding of the proposal:

(1) The PMC votes on a list of components and their maintainers. Changes to
that list require PMC approval.
(2) No committer shall commit changes to a component without a +1 from a
maintainer of that component.

I see good reasons for #1, to help people navigate the project and identify
expertise. For #2, I'd like to understand what problem it's trying to
solve. Do we have rogue committers committing to areas that they don't know
much about? If that's the case, we should address it directly, instead of
adding new processes.

To point out the obvious, this completely changes what "committer" means in
Spark. Do we have clear promotion criteria from "committer" to
"maintainer"? Is there a maximum number of maintainers per area? Currently, as
committers gain expertise in new areas, they can start reviewing code in
those areas and give +1. This encourages more contributions and
cross-component knowledge sharing. Under the new proposal, they would have to
be promoted to "maintainer" first. That reduces our review bandwidth.

Again, if there is a quality issue with code reviews, let's talk to those
committers and help them do better. There are non-process ways to solve the
problem.

So I think we shouldn't require "maintainer +1". I do like the idea of
having explicit maintainers on a volunteer basis. These maintainers should
watch their JIRA and PR traffic, and be very active in design & API
discussions. That leads to better consistency and long-term design choices.

Cheers,
bc

On Wed, Nov 5, 2014 at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi all,
>
> I wanted to share a discussion we've been having on the PMC list, as well
> as call for an official vote on it on a public list. Basically, as the
> Spark project scales up, we need to define a model to make sure there is
> still great oversight of key components (in particular internal
> architecture and public APIs), and to this end I've proposed implementing a
> maintainer model for some of these components, similar to other large
> projects.
>
> As background on this, Spark has grown a lot since joining Apache. We've
> had over 80 contributors/month for the past 3 months, which I believe makes
> us the most active project in contributors/month at Apache, as well as over
> 500 patches/month. The codebase has also grown significantly, with new
> libraries for SQL, ML, graphs and more.
>
> In this kind of large project, one common way to scale development is to
> assign "maintainers" to oversee key components, where each patch to that
> component needs to get sign-off from at least one of its maintainers. Most
> existing large projects do this -- at Apache, some large ones with this
> model are CloudStack (the second-most active project overall), Subversion,
> and Kafka, and other examples include Linux and Python. This is also
> by and large how Spark operates today -- most components have a de facto
> maintainer.
>
> IMO, adopting this model would have two benefits:
>
> 1) Consistent oversight of design for that component, especially regarding
> architecture and API. This process would ensure that the component's
> maintainers see all proposed changes and consider whether they fit together
> well.
>
> 2) More structure for new contributors and committers -- in particular, it
> would be easy to look up who’s responsible for each module and ask them for
> reviews, etc., rather than having patches slip through the cracks.
>
> We'd like to start in a lightweight manner, where the model only
> applies to certain key components (e.g. scheduler, shuffle) and user-facing
> APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
> it if we deem it useful. The specific mechanics would be as follows:
>
> - Some components in Spark will have maintainers assigned to them, where
> one of the maintainers needs to sign off on each patch to the component.
> - Each component with maintainers will have at least 2 maintainers.
> - Maintainers will be assigned from the most active and knowledgeable
> committers on that component by the PMC. The PMC can vote to add / remove
> maintainers, and maintained components, through consensus.
> - Maintainers are expected to be active in responding to patches for their
> components, though they do not need to be the main reviewers for them (e.g.
> they might just sign off on architecture / API). To prevent inactive
> maintainers from blocking the project, if a maintainer isn't responding in
> a reasonable time period (say 2 weeks), other committers can merge the
> patch, and the PMC will want to discuss adding another maintainer.
>
> If you'd like to see examples for this model, check out the following
> projects:
> - CloudStack:
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
> - Subversion:
> https://subversion.apache.org/docs/community-guide/roles.html
>
> Finally, I wanted to list our current proposal for initial components and
> maintainers. It would be good to get feedback on other components we might
> add, but please note that personnel discussions (e.g. "I don't think Matei
> should maintain *that* component") should only happen on the private list.
> The initial components were chosen to include all public APIs and the main
> core components, and the maintainers were chosen from the most active
> contributors to those modules.
>
> - Spark core public API: Matei, Patrick, Reynold
> - Job scheduler: Matei, Kay, Patrick
> - Shuffle and network: Reynold, Aaron, Matei
> - Block manager: Reynold, Aaron
> - YARN: Tom, Andrew Or
> - Python: Josh, Matei
> - MLlib: Xiangrui, Matei
> - SQL: Michael, Reynold
> - Streaming: TD, Matei
> - GraphX: Ankur, Joey, Reynold
>
> I'd like to formally call a [VOTE] on this model, to last 72 hours. The
> [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>
> Matei
