Most of Hadoop's subprojects have discussed becoming top-level Apache projects (TLPs) in the last few weeks. Most have expressed a desire to remain in Hadoop. The salient parts of the discussions I've read tend to focus on three aspects: a technical dependence on Hadoop, additional overhead as a TLP, and visibility both within the Hadoop ecosystem and in the open source community generally.
Life as a TLP: this is not much harder than being a Hadoop subproject, and the Apache criteria being tossed around- particularly "insufficiently diverse"- are not blockers. Every subproject already writes a section of the report Hadoop sends to the board; a TLP sends almost the same report to a new address. The initial cost is similarly light: copy bylaws, send a few notes to INFRA, and follow some directions. I think the estimated costs are far higher than they will be in practice. Inertia is a powerful force, but it should be overcome. The directions are here, and should not be intimidating: http://apache.org/dev/project-creation.html

Visibility: the Hadoop site does not need to change. For each subproject, we can literally change the hyperlinks to point to the new page and be done. Long-term, linking to every ASF project that runs on Hadoop from a prominent page is something we all want. So particularly in the medium term that most are considering, visibility through the website will not change: each subproject will still be linked from the front page.

Hadoop would not be nearly as popular as it is without ZooKeeper, HBase, Hive, and Pig. All the statistics on work in shared MapReduce clusters show that users vastly prefer running Pig and Hive queries to writing raw MapReduce jobs. HBase continues to push features into HDFS that increase its adoption and relevance outside MapReduce, while sharing some of its NoSQL limelight. ZooKeeper is not only a linchpin in real workloads; many proposals for future features require it. The bottom line: MapReduce and HDFS need these projects for visibility and adoption in precisely the same way. I don't think separate TLPs will uncouple the broader community from one another.

Technical dependence: this has two dimensions. First, influencing MapReduce and HDFS. This is nonsense.
Earning influence by contributing to a subproject is the only way to push code changes; nobody from any of these projects has violated that by unilaterally committing to HDFS or MapReduce, anyway. And anyone cynical enough to believe that MapReduce and HDFS would deliberately screw over or ignore dependent projects because they lack PMC members is plainly unsuited to community-driven development. I understand that these projects need to protect their users, but lobbying rights are not an actual benefit.

Second, being a coherent part of the Hadoop ecosystem. It is (mostly) true that Hadoop currently offers a set of mutually compatible frameworks. It is not true that moving them to separate Apache projects would make solutions less coherent, or affect existing or future users at all. The cohesion between the projects' governance is weak enough to justify independent units, but the real dependencies between the projects are strong enough to keep us engaged with one another. And it's not as if other projects- Cascading, for example- aren't also organisms adapted and specialized for life in Hadoop. Arguments from technical dependence ignore the nature of the existing interactions. Besides, weak technical dependencies are not a prerequisite for a subproject's independence.

As for what was *not* said in these discussions: nobody disputed that every one of these subprojects has a distinct, autonomous community. Nobody argued that the Hadoop PMC offers any valuable oversight, given that the representatives of its fiefdoms are too consumed by provincial matters to participate in neighboring governance. For most releases I've voted on, I run the unit tests, check the signature, verify the checksum, and know literally nothing else about the contents. I have often never heard the names of many proposed committers and even some proposed PMC members.
Right now, subprojects with enough PMC members essentially approve their own releases and elect their own committers: TLPs in all but name. The Hadoop club- in conferences, meetups, technical debates, etc.- is broad, diverse, and intertwined, but communities of developers have already clustered around subprojects. Acknowledging that each cluster should govern itself is a dry, practical matter, not an existential crisis. -C
