[x] +1, bring Crunch into Incubator (non-binding)

Regards,
Mike

On Wednesday, May 23, 2012 at 11:45 AM, Josh Wills wrote:
> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below. We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
>
> Please cast your vote:
>
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
>
> http://wiki.apache.org/incubator/CrunchProposal
>
> Proposal text from the wiki:
> ----------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
>
> == Abstract ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
>
> == Proposal ==
>
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages. It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
>
> == Background ==
>
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series.
> Its design was based
> on a paper Google published about a Java library they developed called
> !FlumeJava that was created in order to solve a similar class of
> problems. Crunch was open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. Since then Crunch has been
> formally released twice, as versions 0.1.0 (October 2011) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012). These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
>
> == Rationale ==
>
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
>
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
>
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured binary data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
>
> == Initial Goals ==
>
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
>
> Some goals include:
> * To stand up a sustaining Apache-based community around the Crunch codebase.
> * Improved documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Performance, usability, and robustness improvements.
> * Improving diagnostic reporting and debugging for individual !MapReduce jobs.
> * Providing a centralized place for contributed extensions and
> domain-specific applications.
>
> = Current Status =
>
> == Meritocracy ==
>
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
>
> == Community ==
>
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
>
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
>
> == Core Developers ==
>
> The core developers for Crunch are:
> * Brock Noland: Wrote many of the test cases, user documentation, and
> contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert did the left/right/outer join implementations
> for Crunch and fixed several bugs in the runtime configuration logic.
>
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community driven software development.
>
> == Alignment ==
>
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation, which makes
> !MapReduce jobs execute quickly and efficiently.
>
> = Known Risks =
>
> == Orphaned Products ==
>
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
>
> == Inexperience with Open Source ==
>
> All code developed for Crunch has been open sourced by Cloudera under
> the Apache 2.0 license. All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
>
> == Homogeneous Developers ==
>
> The initial set of committers is from a reduced set of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
>
> == Reliance on Salaried Developers ==
>
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
>
> == Relationships with Other Apache Products ==
>
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
>
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive but has several
> significant differences in terms of their user community and the types
> of data they are designed to work with. Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
>
> == An Excessive Fascination with the Apache Brand ==
>
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
>
> = Documentation =
>
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch JIRA at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
>
> = Initial Source =
>
> * https://github.com/cloudera/crunch/tree/
>
> == Source and Intellectual Property Submission Plan ==
>
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
>
> == External Dependencies ==
>
> The required external dependencies are all under the Apache License or
> compatible licenses. The components with non-Apache licenses are
> enumerated below:
>
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
>
> Non-Apache build tools that are used by Crunch are as follows:
>
> * Cobertura: GNU GPLv2
>
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
>
> == Cryptography ==
>
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
>
> = Required Resources =
>
> == Mailing lists ==
>
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
>
> == GitHub Repositories ==
>
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git (http://git.apache.org/crunch.git)
>
> == Issue Tracking ==
>
> JIRA Crunch (CRUNCH)
>
> == Other Resources ==
>
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
>
> = Initial Committers =
>
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
>
> = Affiliations =
>
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Patrick Hunt
>
> == Nominated Mentors ==
>
> * Tom White
> * Patrick Hunt
> * Arun Murthy
>
> == Sponsoring Entity ==
>
> * Apache Incubator PMC