+1, Sounds good.

Now I know whom to ping for what, even if I did not follow the whole
history of the project very carefully.

Prashant Sharma



On Thu, Nov 6, 2014 at 7:01 AM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> Hi all,
>
> I wanted to share a discussion we've been having on the PMC list, as well
> as call for an official vote on it on a public list. Basically, as the
> Spark project scales up, we need to define a model to make sure there is
> still great oversight of key components (in particular internal
> architecture and public APIs), and to this end I've proposed implementing a
> maintainer model for some of these components, similar to other large
> projects.
>
> As background on this, Spark has grown a lot since joining Apache. We've
> had over 80 contributors/month for the past 3 months, which I believe makes
> us the most active project in contributors/month at Apache, as well as over
> 500 patches/month. The codebase has also grown significantly, with new
> libraries for SQL, ML, graphs and more.
>
> In this kind of large project, one common way to scale development is to
> assign "maintainers" to oversee key components, where each patch to that
> component needs to get sign-off from at least one of its maintainers. Most
> existing large projects do this -- at Apache, some large ones with this
> model are CloudStack (the second-most active project overall), Subversion,
> and Kafka, and other examples include Linux and Python. This is also
> by-and-large how Spark operates today -- most components have a de-facto
> maintainer.
>
> IMO, adopting this model would have two benefits:
>
> 1) Consistent oversight of design for that component, especially regarding
> architecture and API. This process would ensure that the component's
> maintainers see all proposed changes and check that they fit together
> well.
>
> 2) More structure for new contributors and committers -- in particular, it
> would be easy to look up who’s responsible for each module and ask them for
> reviews, etc., rather than having patches slip through the cracks.
>
> We'd like to start in a lightweight manner, where the model only
> applies to certain key components (e.g. scheduler, shuffle) and user-facing
> APIs (MLlib, GraphX, etc). Over time, as the project grows, we can expand
> it if we deem it useful. The specific mechanics would be as follows:
>
> - Some components in Spark will have maintainers assigned to them, where
> one of the maintainers needs to sign off on each patch to the component.
> - Each component with maintainers will have at least 2 maintainers.
> - Maintainers will be assigned from the most active and knowledgeable
> committers on that component by the PMC. The PMC can vote to add / remove
> maintainers, and maintained components, through consensus.
> - Maintainers are expected to be active in responding to patches for their
> components, though they do not need to be the main reviewers for them (e.g.
> they might just sign off on architecture / API). To prevent inactive
> maintainers from blocking the project, if a maintainer isn't responding in
> a reasonable time period (say 2 weeks), other committers can merge the
> patch, and the PMC will want to discuss adding another maintainer.
>
> If you'd like to see examples for this model, check out the following
> projects:
> - CloudStack:
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/CloudStack+Maintainers+Guide
> - Subversion:
> https://subversion.apache.org/docs/community-guide/roles.html
>
> Finally, I wanted to list our current proposal for initial components and
> maintainers. It would be good to get feedback on other components we might
> add, but please note that personnel discussions (e.g. "I don't think Matei
> should maintain *that* component") should only happen on the private list.
> The initial components were chosen to include all public APIs and the main
> core components, and the maintainers were chosen from the most active
> contributors to those modules.
>
> - Spark core public API: Matei, Patrick, Reynold
> - Job scheduler: Matei, Kay, Patrick
> - Shuffle and network: Reynold, Aaron, Matei
> - Block manager: Reynold, Aaron
> - YARN: Tom, Andrew Or
> - Python: Josh, Matei
> - MLlib: Xiangrui, Matei
> - SQL: Michael, Reynold
> - Streaming: TD, Matei
> - GraphX: Ankur, Joey, Reynold
>
> I'd like to formally call a [VOTE] on this model, to last 72 hours. The
> [VOTE] will end on Nov 8, 2014 at 6 PM PST.
>
> Matei
