Most of Hadoop's subprojects have discussed becoming top-level Apache projects (TLPs) in the last few weeks. Most have expressed a desire to remain in Hadoop. The salient parts of the discussions I've read tend to focus on three aspects: a technical dependence on Hadoop, additional overhead as a TLP, and visibility both within the Hadoop ecosystem and in the open source community generally.
Life as a TLP: this is not much harder than being a Hadoop subproject, and the Apache criteria being tossed around- particularly "insufficiently diverse"- are not blockers. Every subproject already writes a section of the report Hadoop sends to the board; a TLP sends almost the same report to a new address. The initial cost is similarly light: copy bylaws, send a few notes to INFRA, and follow some directions. I think the estimated costs are far higher than they will be in practice. Inertia is a powerful force, but it should be overcome. The directions are here, and should not be intimidating: http://apache.org/dev/project-creation.html

Visibility: the Hadoop site does not need to change. For each subproject, we can literally change the hyperlinks to point to the new page and be done. Long-term, linking to every ASF project that runs on Hadoop from a prominent page is something we all want. So particularly in the medium term that most are considering, visibility through the website will not change: each subproject will still be linked from the front page.

Hadoop would not be nearly as popular as it is without ZooKeeper, HBase, Hive, and Pig. All the statistics on work in shared MapReduce clusters show that users vastly prefer running Pig and Hive queries to writing raw MapReduce jobs. HBase continues to push features into HDFS that increase its adoption and relevance outside MapReduce, while sharing some of its NoSQL limelight. ZooKeeper is not only a linchpin in real workloads; many proposals for future features require it. The bottom line: MapReduce and HDFS need these projects for visibility and adoption in precisely the same way. I don't think separate TLPs will uncouple the broader community from one another.

Technical dependence: this has two dimensions. First, influencing MapReduce and HDFS. This is nonsense.
Earning influence by contributing to a subproject is the only way to push code changes; nobody from any of these projects has violated that by unilaterally committing to HDFS or MapReduce, anyway. And anyone cynical enough to believe that MapReduce and HDFS would deliberately screw over or ignore dependent projects because they lack PMC members is plainly unsuited to community-driven development. I understand that these projects need to protect their users, but lobbying rights are not an actual benefit.

Second, being a coherent part of the Hadoop ecosystem. It is (mostly) true that Hadoop currently offers a set of mutually compatible frameworks. It is not true that moving them to separate Apache projects would make solutions less coherent, or affect existing or future users at all. The cohesion between the projects' governance is weak enough to justify independent units, but the real dependencies between the projects are strong enough to keep us engaged with one another. And it's not as if other projects- Cascading, for example- aren't also organisms adapted and specialized for life in Hadoop. Arguments from technical dependence ignore the nature of the existing interactions. Besides, weak technical dependencies are not a prerequisite for a subproject's independence.

As for what was *not* said in these discussions: nobody disputed that every one of these subprojects has a distinct, autonomous community. Nobody argued that the Hadoop PMC offers any valuable oversight, given that the representatives of its fiefdoms are too consumed by provincial matters to participate in neighboring governance. For most releases I've voted on, I run the unit tests, check the signature, verify the checksum, and know literally nothing else about the contents. I have often never heard the names of many proposed committers and even some proposed PMC members.
Right now, subprojects with enough PMC members essentially approve their own releases and elect their own committers: TLPs in all but name. The Hadoop club- in conferences, meetups, technical debates, etc.- is broad, diverse, and intertwined, but communities of developers have already clustered around subprojects. Acknowledging that each cluster should govern itself is a dry, practical matter, not an existential crisis. -C
