Re: [VOTE] Accept Crunch into the Apache Incubator

Vinod Kumar Vavilapalli Wed, 23 May 2012 14:34:34 -0700

+1 (non-binding)

+Vinod


On May 23, 2012, at 11:45 AM, Josh Wills wrote:

> I would like to call a vote for accepting "Apache Crunch" for
> incubation in the Apache Incubator. The full proposal is available
> below.  We ask the Incubator PMC to sponsor it, with phunt as
> Champion, and phunt, tomwhite, and acmurthy volunteering to be
> Mentors.
> 
> Please cast your vote:
> 
> [ ] +1, bring Crunch into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Crunch into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.
> 
> http://wiki.apache.org/incubator/CrunchProposal
> 
> Proposal text from the wiki:
> ----------------------------------------------------------------------------------------------------------------------
> = Crunch - Easy, Efficient MapReduce Pipelines in Java and Scala =
> 
> == Abstract ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop.
> 
> == Proposal ==
> 
> Crunch is a Java library for writing, testing, and running pipelines
> of !MapReduce jobs on Apache Hadoop. Its main goal is to provide a
> high-level API for writing and testing complex !MapReduce jobs that
> require multiple processing stages.  It has a simple, flexible, and
> extensible data model that makes it ideal for processing data that
> does not naturally fit into a relational structure, such as time
> series and serialized object formats like JSON and Avro. It supports
> running pipelines either as a series of !MapReduce jobs on an Apache
> Hadoop cluster or in memory on a single machine for fast testing and
> debugging.
> 
> == Background ==
> 
> Crunch was initially developed by Cloudera to simplify the process of
> creating sequences of dependent !MapReduce jobs, especially jobs that
> processed non-relational data like time series. Its design was based
> on a paper Google published about !FlumeJava, a Java library they
> developed to solve a similar class of problems. Crunch was
> open-sourced by Cloudera on !GitHub as an Apache
> 2.0 licensed project in October 2011. Since then Crunch has been
> formally released twice, as versions 0.1.0 (October 2011) and 0.2.0
> (February 2012), with an incremental update to version 0.2.1 (March
> 2012). These releases are also distributed by Cloudera as source and
> binaries from Cloudera's Maven repository.
> 
> == Rationale ==
> 
> Most of the interesting analytical and data processing tasks that are
> run on an Apache Hadoop cluster require a series of !MapReduce jobs to
> be executed in sequence. Developers who are creating these pipelines
> today need to manually assign the sequence of tasks to perform in a
> dependent chain of !MapReduce jobs, even though there are a number of
> well-known patterns for fusing dependent computations together into a
> single !MapReduce stage and for performing common types of joins and
> aggregations. This results in !MapReduce pipelines that are more
> difficult to test, maintain, and extend to support new functionality.
> 
> Furthermore, the type of data that is being stored and processed using
> Apache Hadoop is evolving. Although Hadoop was originally used for
> storing large volumes of structured text in the form of webpages and
> log files, it is now common for Hadoop to store complex, structured
> data formats such as JSON, Apache Avro, and Apache Thrift. These
> formats allow developers to work with serialized objects in
> programming languages like Java, C++, and Python, and allow for new
> types of analysis to be performed on complex data types. Hadoop has
> also been adopted by the scientific research community, who are using
> Hadoop to process time series data, structured binary files in the
> HDF5 format, and large medical and satellite images.
> 
> Crunch addresses these challenges by providing a lightweight and
> extensible Java API for defining the stages of a data processing
> pipeline, which can then be run on an Apache Hadoop cluster as a
> sequence of dependent !MapReduce jobs, or in-memory on a single
> machine to facilitate fast testing and debugging. Crunch relies on a
> small set of primitive abstractions that represent immutable,
> distributed collections of objects. Developers define functions that
> are applied to those objects in order to generate new immutable,
> distributed collections of objects. Crunch also provides a library of
> common !MapReduce patterns for performing efficient joins and
> aggregation operations over these distributed collections that
> developers may integrate into their own pipelines. Crunch also
> provides native support for processing structured data formats
> like JSON, Apache Avro, and Apache Thrift, and is designed to be
> extensible to support working with any kind of data format that Java
> supports in its native form.
> 
> == Initial Goals ==
> 
> Crunch is currently in its first major release with a considerable
> number of enhancement requests, tasks, and issues recorded towards its
> future development. The initial goal of this project will be to
> continue to build community in the spirit of the "Apache Way", and to
> address the highly requested features and bug-fixes towards the next
> dot release.
> 
> Some goals include:
> * Stand up a sustaining Apache-based community around the Crunch codebase.
> * Improve documentation of Java libraries and best practices.
> * Support the ability to "fuse" logically independent pipeline stages
> that aggregate the same data in different ways into a single
> !MapReduce job.
> * Improve performance, usability, and robustness.
> * Improve diagnostic reporting and debugging for individual !MapReduce jobs.
> * Provide a centralized place for contributed extensions and
> domain-specific applications.
> 
> = Current Status =
> 
> == Meritocracy ==
> 
> Crunch was initially developed by Josh Wills in September 2011 at
> Cloudera. Developers external to Cloudera provided feedback, suggested
> features and fixes and implemented extensions of Crunch. Cloudera's
> engineering team has since maintained the project with Josh Wills, Tom
> White, and Brock Noland dedicated towards its improvement.
> Contributors to Crunch include developers from multiple organizations,
> including businesses and universities.
> 
> == Community ==
> 
> Crunch is currently used by a number of organizations all over the
> world. Crunch has an active and growing user and developer community
> with active participation in
> [[https://groups.google.com/a/cloudera.org/group/crunch-users/topics|user]]
> and 
> [[https://groups.google.com/a/cloudera.org/group/crunch-dev/topics|developer]]
> mailing lists.
> 
> Since open sourcing the project, there have been eight individuals
> from five organizations who have contributed code.
> 
> == Core Developers ==
> 
> The core developers for Crunch are:
> * Brock Noland: Brock wrote many of the test cases and user
> documentation, and contributed several bug fixes.
> * Josh Wills: Josh wrote much of the original Crunch code.
> * Gabriel Reid: Gabriel significantly improved Crunch's handling of
> Avro data and has contributed several bug fixes for the core planner.
> * Tom White: Tom added several libraries for common !MapReduce
> pipeline operations, including the sort library and a library of set
> operations.
> * Christian Tzolov: Christian has contributed several bug fixes for
> the Avro serialization module and the unit testing framework.
> * Robert Chu: Robert implemented the left/right/outer joins
> for Crunch and fixed several bugs in the runtime configuration logic.
> 
> Several of the core developers of Crunch have contributed towards
> Hadoop or related Apache projects and are familiar with Apache
> principles and philosophy for community-driven software development.
> 
> == Alignment ==
> 
> Crunch complements several current Apache projects. It complements
> Hadoop !MapReduce by providing a higher-level API for developing
> complex data processing pipelines that require a sequence of
> !MapReduce jobs to perform. Crunch also supports Apache HBase in order
> to simplify the process of writing !MapReduce jobs that execute over
> HBase tables. Crunch makes extensive use of the Apache Avro data
> format as an internal data representation that helps
> !MapReduce jobs execute quickly and efficiently.
> 
> = Known Risks =
> 
> == Orphaned Products ==
> 
> Crunch is already deployed in production at multiple companies and
> they are actively participating in creating new features. Crunch is
> getting traction with developers and thus the risks of it being
> orphaned are minimal.
> 
> == Inexperience with Open Source ==
> 
> All code developed for Crunch has been open sourced by Cloudera under
> the Apache 2.0 license.  All committers to Crunch are intimately familiar
> with the Apache model for open-source development and are experienced
> with working with new contributors.
> 
> == Homogeneous Developers ==
> 
> The initial set of committers comes from a small number of organizations.
> However, we expect that once approved for incubation, the project will
> attract new contributors from diverse organizations and will thus grow
> organically. The submission of patches from developers from several
> different organizations is a strong indication that Crunch will be
> widely adopted.
> 
> == Reliance on Salaried Developers ==
> 
> It is expected that Crunch will be developed on salaried and volunteer
> time, although all of the initial developers will work on it mainly on
> salaried time.
> 
> == Relationships with Other Apache Products ==
> 
> Crunch depends upon other Apache Projects: Apache Hadoop, Apache
> HBase, Apache Log4J, Apache Thrift, Apache Avro, and multiple Apache
> Commons components. Its build depends upon Apache Maven.
> 
> Crunch's functionality has some indirect or direct overlap with the
> functionality of Apache Pig and Apache Hive, but Crunch differs
> significantly in its user community and the types of data it is
> designed to work with.  Both Hive and Pig are
> high-level languages that are designed to allow non-programmers to
> quickly create and run !MapReduce jobs. Crunch is a Java library whose
> primary community is Java developers who are creating scalable data
> pipelines and !MapReduce-based applications. Additionally, Hive and
> Pig both employ a relational, tuple-oriented data model on top of
> HDFS, which introduces overhead and limits expressive power for
> developers who are working with serialized objects and non-relational
> data types. Crunch uses a lower-level data model that gives developers
> the freedom to work with data in a format that is optimized for the
> problem they are trying to solve.
> 
> == An Excessive Fascination with the Apache Brand ==
> 
> We would like Crunch to become an Apache project to further foster a
> healthy community of contributors and consumers around the project.
> Since Crunch directly interacts with many Apache Hadoop-related
> projects and solves an important problem of many Hadoop users,
> residing in the Apache Software Foundation will increase interaction
> with the larger community.
> 
> = Documentation =
> 
> * Crunch wiki at GitHub: https://github.com/cloudera/crunch/wiki
> * Crunch jira at Cloudera: https://issues.cloudera.org/browse/crunch
> * Crunch javadoc at GitHub: http://cloudera.github.com/crunch/apidocs/
> 
> = Initial Source =
> 
> * https://github.com/cloudera/crunch/tree/
> 
> == Source and Intellectual Property Submission Plan ==
> 
> * The initial source is already licensed under the Apache License,
> Version 2.0. https://github.com/cloudera/crunch/blob/master/LICENSE.txt
> 
> == External Dependencies ==
> 
> The required external dependencies are all under the Apache License or
> compatible licenses. The following components have non-Apache
> licenses:
> 
> * com.google.protobuf : New BSD
> * org.hamcrest: New BSD
> * org.slf4j: MIT-like License
> 
> Non-Apache build tools that are used by Crunch are as follows:
> 
> * Cobertura: GNU GPLv2
> 
> Note that Cobertura is optional and is only used for calculating unit
> test coverage.
> 
> == Cryptography ==
> 
> Crunch uses standard APIs and tools for SSH and SSL communication
> where necessary.
> 
> = Required Resources =
> 
> == Mailing lists ==
> 
> * crunch-private (with moderated subscriptions)
> * crunch-dev
> * crunch-commits
> * crunch-user
> 
> == Github Repositories ==
> 
> http://github.com/apache/crunch
> git://git.apache.org/crunch.git
> 
> == Issue Tracking ==
> 
> JIRA Crunch (CRUNCH)
> 
> == Other Resources ==
> 
> The existing code already has unit and integration tests so we would
> like a Jenkins instance to run them whenever a new patch is submitted.
> This can be added after project creation.
> 
> = Initial Committers =
> 
> * Brock Noland (brock at cloudera dot com)
> * Josh Wills (jwills at cloudera dot com)
> * Gabriel Reid (gabriel dot reid at gmail dot com)
> * Tom White (tom at cloudera dot com)
> * Christian Tzolov (christian dot tzolov at gmail dot com)
> * Robert Chu (robert at wibidata dot com)
> * Vinod Kumar Vavilapalli (vinodkv at hortonworks dot com)
> 
> = Affiliations =
> 
> * Brock Noland, Cloudera
> * Josh Wills, Cloudera
> * Gabriel Reid, !TomTom
> * Tom White, Cloudera
> * Christian Tzolov, !TomTom
> * Robert Chu, !WibiData
> * Vinod Kumar Vavilapalli, Hortonworks
> 
> = Sponsors =
> 
> == Champion ==
> 
> * Patrick Hunt
> 
> == Nominated Mentors ==
> 
> * Tom White
> * Patrick Hunt
> * Arun Murthy
> 
> == Sponsoring Entity ==
> 
> * Apache Incubator PMC
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
> 


