[VOTE] accept Pig into Incubator

Doug Cutting Tue, 25 Sep 2007 10:20:38 -0700

I would like to call the Incubator PMC to vote to incubate the proposed Pig project. Discussion on this list evidenced broad interest in this project, which bodes well for its ability to build a diverse developer community.


http://wiki.apache.org/incubator/PigProposal


+1

Doug

-----------------------------------------------------------

= Proposal for Pig Project =

== Abstract ==

Pig is a platform for analyzing large data sets.

== Proposal ==

The Pig project consists of high-level languages for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

 1. ''Ease of programming''. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
 2. ''Optimization opportunities''. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
 3. ''Extensibility''. Users can create their own functions to do special-purpose processing.
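
As a rough illustration of what "explicitly encoded as data flow sequences" and user-supplied functions can look like, here is a minimal sketch in plain Python. It is not Pig Latin and says nothing about how Pig is implemented; the record layout, the step names (`filter_step`, `group_step`, `aggregate_step`), and the predicate `is_long_visit` are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, seconds_on_page).
visits = [
    ("alice", "a.example/x", 12),
    ("bob", "a.example/y", 3),
    ("alice", "b.example/z", 40),
    ("carol", "a.example/x", 7),
]


def is_long_visit(record, threshold=5):
    """Stand-in for a user-defined function plugged into a filter step."""
    return record[2] >= threshold


def filter_step(records, predicate):
    """Keep only the records accepted by the predicate."""
    return [r for r in records if predicate(r)]


def group_step(records, key):
    """Collect records into lists that share the same key."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)


def aggregate_step(groups, fn):
    """Reduce each group of records to a single value."""
    return {k: fn(rs) for k, rs in groups.items()}


# The analysis reads as a sequence of named transformations, one per line.
long_visits = filter_step(visits, is_long_visit)
by_user = group_step(long_visits, key=lambda r: r[0])
total_seconds = aggregate_step(by_user, lambda rs: sum(r[2] for r in rs))
```

Each step consumes the output of the previous one, so the order of the transformations is visible in the program text, and each step works on records or groups independently of the others, which is the property that makes this style amenable to parallel execution.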


== Background ==

Pig started as a research project at Yahoo! in May of 2006 to combine ideas in parallel databases and distributed computing. The first internal release took place in July 2006. The first release was a simple front-end to the Hadoop Map/Reduce framework. The following releases added new features and evolved the language based on user feedback. In July 2007, Pig was taken over by a development team and the first production version is due to be released on 9/28/07.

Since its inception, we have observed steady growth of the user community within Yahoo!. In April 2007, Pig was released under a BSD-type license. Several external parties are using this version and have expressed interest in collaborating on its development.


== Rationale ==

In an information-centric world, innovation is driven by ad-hoc analysis of large data sets. For example, search engine companies routinely deploy and refine services based on analyzing the recorded behavior of users, publishers, and advertisers. The rate of innovation depends on the efficiency with which data can be analyzed.

To analyze large data sets efficiently, one needs parallelism. The cheapest and most scalable form of parallelism is cluster computing. Unfortunately, programming for a cluster computing environment is difficult and time-consuming. Pig makes it easy to harness the power of cluster computing for ad-hoc data analysis.

While other languages exist that try to achieve the same goals, we believe that Pig provides more flexibility and gives more control to the end user.

SQL typically requires (1) importing data from a user's preferred format into a database system's internal format, (2) well-structured, normalized data with a declared schema, and (3) programs expressed in declarative SELECT-FROM-WHERE blocks. In contrast, Pig Latin facilitates (1) interoperability, i.e. data may be read/written in a format accepted by other applications such as text editors or graph generators, (2) flexibility, i.e. data may be loosely structured or have structure that is defined operationally, and (3) adoption by programmers who find procedural programming more natural than declarative programming.

Sawzall is a scripting language used at Google on top of Map-Reduce. A Sawzall program has a fairly rigid structure consisting of a filtering phase (the map step) followed by an aggregation phase (the reduce step). Furthermore, only the filtering phase can be written by the user, and only a pre-built set of aggregations is available (new ones are non-trivial to add). While Pig Latin has similar higher-level primitives like filtering and aggregation, an arbitrary number of them can be flexibly chained together in a Pig Latin program, and all primitives can use user-defined functions with equal ease. Pig Latin also has additional primitives, such as cogrouping, that allow operations such as joins (which require multiple programs in Sawzall) to be written in a single line in Pig Latin. In addition, Pig Latin is designed to be embedded into other languages, and can use functions written in other languages. Thus, in contrast to Sawzall, it directly caters to a large community of developers without having to make them learn an entirely new programming language.
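
To make the relationship between cogrouping and joins concrete, the following is a minimal sketch in plain Python. It is neither Pig Latin syntax nor a description of Pig's implementation; the helper names `cogroup` and `join_from_cogroup` and the sample data are assumptions made for this example. The idea shown is that a cogroup collects, per key, the matching records from each input, and an equi-join is then the per-key cross product of those two lists.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, left_key, right_key):
    """Group two inputs by key; each key maps to (left_records, right_records)."""
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[left_key(rec)][0].append(rec)
    for rec in right:
        groups[right_key(rec)][1].append(rec)
    return dict(groups)


def join_from_cogroup(cogrouped):
    """Derive an inner equi-join: per key, pair every left record with every right record."""
    joined = []
    for _, (lefts, rights) in sorted(cogrouped.items()):
        for l, r in product(lefts, rights):
            joined.append(l + r)  # concatenate the two tuples
    return joined


# Hypothetical inputs: (name, age) and (name, site).
users = [("alice", 30), ("bob", 25)]
visits = [("alice", "a.example"), ("alice", "b.example"), ("carol", "c.example")]

grouped = cogroup(users, visits, left_key=lambda u: u[0], right_key=lambda v: v[0])
joined = join_from_cogroup(grouped)
```

Keys that appear in only one input (here `bob` and `carol`) still show up in the cogroup result with one empty list, which is why the same grouping step can serve as a building block for other operations than the inner join.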


== Current Status ==

=== Meritocracy ===

Pig was started as a project developed by a Yahoo! research team. Recently we have added a development team that works in harmony with the research team, with both teams actively and successfully contributing to the project. We are planning to create an environment that encourages meritocracy and is consistent with the meritocracy principles of Apache. Within the team we have people actively participating in the Hadoop subproject.


=== Community ===

Pig has an active user community within Yahoo! that has been steadily growing. Pig has also attracted external users since its release under a BSD-type license. Several external parties are using the product and have expressed interest in collaborating on its development.

Also, since the current version of Pig is built on top of Hadoop, we believe that we will be able to quickly extend our community by attracting both Hadoop users and developers to the project.


=== Core Developers ===

Our contributors come from both the research and development worlds, and most have backgrounds in database internals and large-scale distributed systems.


=== Alignment ===

Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently. Apache offers the best legal and social framework for such community-based software development.

Also, the current version of Pig runs on top of Hadoop's Map-Reduce infrastructure, which is part of Apache. We believe there would be a lot of synergy between the projects, both in terms of users and developers.


== Known Risks ==
=== Orphaned products ===

All current contributors are part of Yahoo!, which is a major player in the space and is committed to grid computing. Also, we expect a high degree of synergy with the Hadoop subproject.


=== Inexperience with Open Source ===

Two of the committers have extensive experience with open source and Apache. The rest are new to open source and will be guided through the process by the team members with experience.


=== Homogenous Developers ===

The current list of committers is confined to Yahoo! employees. Our plan is to recruit more committers once the project gets under way.


=== Reliance on Salaried Developers ===

Currently, all contributors are Yahoo! employees. By extending the development community we are hoping to mitigate this risk.


=== Relationships with Other Apache Products ===

Pig is built on top of Hadoop and we expect deep collaboration with the Hadoop subproject.


=== An Excessive Fascination with the Apache Brand ===

Yahoo! already has a strong brand and is not interested in Apache as a way to gain visibility. Yahoo! seeks to develop Pig collaboratively with others, not to control and maintain it independently. Apache offers the best legal and social framework for such community-based software development.


== Documentation ==

http://research.yahoo.com/project/pig

== Initial Source ==

The initial source will be donated by Yahoo! Inc. The donating company will contribute the initial code base once the proposal is accepted and the necessary infrastructure has been set up.


== External Dependencies ==

 1. bzip2: http://www.kohsuke.org/bzip2/: Apache license
 2. javacc: https://javacc.dev.java.net/: BSD license
 3. hadoop: http://lucene.apache.org/hadoop/: Apache license
 4. log4j: http://logging.apache.org/log4j/: Apache license
 5. jsch: http://www.jcraft.com/jsch: BSD-style license: http://www.jcraft.com/jsch/LICENSE.txt


== Required Resources ==
=== Mailing lists ===

We would need the following mailing lists:
 1. pig-private (with moderated subscriptions)
 2. pig-dev
 3. pig-commits
 4. pig-user

=== Subversion Directory ===

https://svn.apache.org/repos/asf/incubator/pig

=== Issue Tracking ===

JIRA PIG (PIG)

== Initial Committers ==

 1. Nigel Daley ([EMAIL PROTECTED])
 2. Alan Gates ([EMAIL PROTECTED])
 3. Olga Natkovich ([EMAIL PROTECTED])
 4. Chris Olston ([EMAIL PROTECTED])
 5. Owen O'Malley ([EMAIL PROTECTED])
 6. Ben Reed ([EMAIL PROTECTED])
 7. Utkarsh Srivastava ([EMAIL PROTECTED])

== Affiliation ==

All initial committers are affiliated with Yahoo!

== Sponsors ==

=== Champion ===

Doug Cutting

=== Nominated Mentors ===

   1. Doug Cutting
   2. Torsten Curdt
   3. Bertrand Delacretaz
   4. Yoav Shapira
   5. Sylvain Wallez

=== Sponsoring Entity ===

Incubator

