+1 Thank you for writing this, Chris.
Tom

On Tue, Apr 13, 2010 at 11:46 PM, Chris Douglas <[email protected]> wrote:
> Most of Hadoop's subprojects have discussed becoming top-level Apache
> projects (TLPs) in the last few weeks. Most have expressed a desire to
> remain in Hadoop. The salient parts of the discussions I've read tend
> to focus on three aspects: a technical dependence on Hadoop,
> additional overhead as a TLP, and visibility both within the Hadoop
> ecosystem and in the open source community generally.
>
> Life as a TLP: this is not much harder than being a Hadoop subproject,
> and the Apache preferences being tossed around- particularly
> "insufficiently diverse"- are not blockers. Every subproject needs to
> write a section of the report Hadoop sends to the board; almost the
> same report, sent to a new address. The initial cost is similarly
> light: copy bylaws, send a few notes to INFRA, and follow some
> directions. I think the estimated costs are far higher than they will
> be in practice. Inertia is a powerful force, but it should be
> overcome. The directions are here, and should not be intimidating:
>
> http://apache.org/dev/project-creation.html
>
> Visibility: the Hadoop site does not need to change. For each
> subproject, we can literally change the hyperlinks to point to the new
> page and be done. Long-term, linking to all ASF projects that run on
> Hadoop from a prominent page is something we all want. So particularly
> in the medium-term that most are considering: visibility through the
> website will not change. Each subproject will still be linked from the
> front page.
>
> Hadoop would not be nearly as popular as it is without ZooKeeper,
> HBase, Hive, and Pig. All statistics on work in shared MapReduce
> clusters show that users vastly prefer running Pig and Hive queries to
> writing MapReduce jobs. HBase continues to push features in HDFS that
> increase its adoption and relevance outside MapReduce, while sharing
> some of its NoSQL limelight.
> ZooKeeper is not only a linchpin in real
> workloads, but many proposals for future features require it. The
> bottom line is that MapReduce and HDFS need these projects for
> visibility and adoption in precisely the same way. I don't think
> separate TLPs will uncouple the broader community from one another.
>
> Technical dependence: this has two dimensions. First, influencing
> MapReduce and HDFS. This is nonsense. Earning influence by
> contributing to a subproject is the only way to push code changes;
> nobody from any of these projects has violated that by unilaterally
> committing to HDFS or MapReduce, anyway. And anyone cynical enough to
> believe that MapReduce and HDFS would deliberately screw over or
> ignore dependent projects because they don't have PMC members is
> plainly unsuited to community-driven development. I understand that
> these projects need to protect their users, but lobbying rights are
> not an actual benefit.
>
> Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
> true that Hadoop currently offers a set of mutually compatible
> frameworks. It is not true that moving them to separate Apache
> projects would make solutions less coherent or affect existing or
> future users at all. The cohesion between projects' governance is
> sufficiently weak to justify independent units, but the real
> dependencies between the projects are strong enough to keep us engaged
> with one another. And it's not as if other projects- Cascading, for
> example- aren't also organisms adapted and specialized for life in
> Hadoop.
>
> Arguments on technical dependence are ignoring the nature of the
> existing interactions. Besides, weak technical dependencies are not a
> necessary prerequisite for a subproject's independence.
>
> As for what was *not* said in these discussions, there is no argument
> that every one of these subprojects has a distinct, autonomous
> community.
> There was also no argument that the Hadoop PMC offers any
> valuable oversight, given that the representatives of its fiefdoms are
> too consumed by provincial matters to participate in neighboring
> governance. Most releases I've voted on: I run the unit tests, check
> the signature, verify the checksum, and know literally nothing else
> about their content. I have often never heard the names of many proposed
> committers and even some proposed PMC members. Right now, subprojects
> with enough PMC members essentially vote out their own releases and
> vote in their own committers: TLPs in all but name.
>
> The Hadoop club- in conferences, meetups, technical debates, etc.- is
> broad, diverse, and intertwined, but communities of developers have
> already clustered around subprojects. Allowing that each cluster
> should govern itself is a dry, practical matter, not an existential
> crisis. -C
>
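[For readers unfamiliar with the release check Chris describes, the checksum and signature steps amount to something like the following sketch. The filenames are stand-ins, not real artifact names, and the exact checksum algorithm varies by project and era:]

```shell
# Stand-in for a staged release artifact; real artifact names vary per release.
printf 'release contents' > project-X.Y.Z.tar.gz

# The release manager publishes a checksum file alongside the artifact...
sha512sum project-X.Y.Z.tar.gz > project-X.Y.Z.tar.gz.sha512

# ...and a voter re-verifies it before casting a vote; prints "project-X.Y.Z.tar.gz: OK"
sha512sum -c project-X.Y.Z.tar.gz.sha512

# The signature check uses the detached .asc signature and the release
# manager's public key (key import not reproduced here):
#   gpg --verify project-X.Y.Z.tar.gz.asc project-X.Y.Z.tar.gz
```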
