Re: Incubator Proposal: Pig

Brian McCallister Sun, 23 Sep 2007 09:22:11 -0700

+1 -- I'd offer to help as much as I can, but I know how little that is right now :-(


Definitely support (and will probably use at least ;-)


-Brian

On Sep 18, 2007, at 12:52 PM, Olga Natkovich wrote:

Hi,
Yahoo! research and development teams have developed the proposal below. The
proposal is also available on the wiki at
http://wiki.apache.org/incubator/PigProposal.
We would like to ask that the ASF consider forming a podling according to
the proposal.

Thanks,

Olga Natkovich
[EMAIL PROTECTED]
-------------------------------------------------------------------------------------

= Pig Open Source Proposal =

== Abstract ==

Pig is a platform for analyzing large data sets.

== Proposal ==

The Pig project consists of high-level languages for expressing data
analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables them to
handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has
the following key properties:
 1. ''Ease of programming''. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
composed of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand, and
maintain.
 2. ''Optimization opportunities''. The way in which tasks are encoded
permits the system to optimize their execution automatically, allowing the
user to focus on semantics rather than efficiency.
 3. ''Extensibility''. Users can create their own functions to do
special-purpose processing.
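
As a rough illustration of the "data flow sequence" idea in point 1, the
following Python sketch chains a filter step, a grouping step, and an
aggregation step, which is the general shape of one map phase followed by
one reduce phase. It is not Pig Latin and not how Pig is implemented; the
record layout and all function names are hypothetical and chosen only for
this example.

```python
from collections import defaultdict

# Hypothetical input records: (user, url, seconds_spent).
visits = [
    ("alice", "a.com", 12),
    ("bob", "a.com", 3),
    ("alice", "b.com", 40),
    ("carol", "a.com", 25),
]


def filter_step(records, min_seconds):
    """Keep records whose time spent meets the threshold (map-like step)."""
    return [r for r in records if r[2] >= min_seconds]


def group_step(records, key_index):
    """Group records by one field (the shuffle between map and reduce)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key_index]].append(r)
    return dict(groups)


def count_step(groups):
    """Reduce each group to a record count (reduce-like step)."""
    return {key: len(rows) for key, rows in groups.items()}


# Each step consumes the output of the previous one, so the whole
# analysis reads as a sequence of named transformations.
long_visits = filter_step(visits, 10)
by_url = group_step(long_visits, 1)
counts = count_step(by_url)
print(counts)
```

Because every step only looks at one record or one group at a time, each
step could in principle be run in parallel over partitions of the data.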

== Background ==
Pig started as a research project at Yahoo! in May of 2006 to combine ideas
in parallel databases and distributed computing. The first internal release
took place in July 2006. The first release was a simple front-end to the
Hadoop Map/Reduce framework. The following releases added new features and
evolved the language based on user feedback. In July 2007, Pig was taken
over by a development team, and the first production version is due to be
released on 9/28/07.
Since its inception, we have observed steady growth of the user community
within Yahoo!. In April 2007, Pig was released under a BSD-type license.
Several external parties are using this version and have expressed interest
in collaborating on its development.

== Rationale ==
In an information-centric world, innovation is driven by ad-hoc analysis of
large data sets. For example, search engine companies routinely deploy and
refine services based on analyzing the recorded behavior of users,
publishers, and advertisers. The rate of innovation depends on the
efficiency with which data can be analyzed.
To analyze large data sets efficiently, one needs parallelism. The cheapest
and most scalable form of parallelism is cluster computing. Unfortunately,
programming for a cluster computing environment is difficult and
time-consuming. Pig makes it easy to harness the power of cluster computing
for ad-hoc data analysis.
While other languages exist that try to achieve the same goals, we believe
that Pig provides more flexibility and gives more control to the end user.
SQL typically requires (1) importing data from a user's preferred format
into a database system's internal format, (2) well-structured, normalized
data with a declared schema, and (3) programs expressed in declarative
SELECT-FROM-WHERE blocks. In contrast, Pig Latin facilitates (1)
interoperability, i.e. data may be read/written in a format accepted by
other applications such as text editors or graph generators, (2) flexibility,
i.e. data may be loosely structured or have structure that is
defined operationally, and (3) adoption by programmers who find procedural
programming more natural than declarative programming.

Sawzall is a scripting language used at Google on top of Map-Reduce. A
Sawzall program has a fairly rigid structure consisting of a filtering phase
(the map step) followed by an aggregation phase (the reduce step).
Furthermore, only the filtering phase can be written by the user, and only a
pre-built set of aggregations is available (new ones are non-trivial to
add). While Pig Latin has similar higher-level primitives like filtering and
aggregation, an arbitrary number of them can be flexibly chained together in
a Pig Latin program, and all primitives can use user-defined functions with
equal ease. Further, Pig Latin has additional primitives, such as cogrouping,
that allow operations such as joins (which require multiple programs in
Sawzall) to be written in a single line in Pig Latin. Finally, Pig Latin is
designed to be embedded into other languages, and can use functions written
in other languages. Thus, in contrast to Sawzall, it directly caters to a
large community of developers without having to make them learn an entirely
new programming language.
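
To make the cogrouping remark above more concrete, here is a minimal Python
sketch of what a cogroup does and how an equi-join can be derived from it.
This is an illustration only, not Pig Latin syntax and not Pig's
implementation; the function names `cogroup` and `join`, and the sample
data, are assumptions made for this example.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group two inputs by key; each key maps to (left_rows, right_rows)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[left_key(row)][0].append(row)
    for row in right:
        groups[right_key(row)][1].append(row)
    return dict(groups)


def join(left, right, left_key, right_key):
    """Equi-join built on cogroup: pair up the two sides within each key."""
    result = []
    for lrows, rrows in cogroup(left, right, left_key, right_key).values():
        for lrow in lrows:
            for rrow in rrows:
                result.append(lrow + rrow)
    return result


# Hypothetical sample data: (name, age) and (name, url).
users = [("alice", 30), ("bob", 25)]
clicks = [("alice", "a.com"), ("alice", "b.com"), ("carol", "c.com")]

grouped = cogroup(users, clicks, lambda u: u[0], lambda c: c[0])
joined = join(users, clicks, lambda u: u[0], lambda c: c[0])
print(joined)
```

Keys that appear on only one side (here "bob" and "carol") still show up in
the cogroup result with an empty list on the other side, and simply produce
no rows in the join.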

== Current Status ==

=== Meritocracy ===
Pig was started as a project developed by the Yahoo! research team.
Recently we have added a development team that works in harmony with the
research team, with both teams actively and successfully contributing to the
project. We are planning to create an environment that encourages
meritocracy and is consistent with the meritocracy principles of Apache.
Within the team we have people actively participating in the Hadoop
subproject.

=== Community ===

Pig has an active user community within Yahoo! that has been steadily
growing. Pig has also attracted external users since its release under a
BSD-type license. Several external parties are using the product and have
expressed interest in collaborating on its development.
Also, since the current version of Pig is built on top of Hadoop, we
believe that we will be able to quickly extend our community by attracting
both Hadoop users and developers to the project.

=== Core Developers ===
Our contributors come from both the research and development worlds, and
most have a background in database internals and large-scale distributed
systems.

=== Alignment ===
Yahoo! seeks to develop Pig collaboratively with others, not to control and
maintain it independently. Apache offers the best legal and social
framework for such community-based software development.
Also, the current version of Pig runs on top of Hadoop's Map-Reduce
infrastructure, which is part of Apache. We believe there would be a lot of
synergy between the projects both in terms of users and developers.

== Known Risks ==
=== Orphaned products ===
All current contributors are part of Yahoo, which is a major player in the
space and is committed to grid computing. Also, we expect a high degree of
synergy with the Hadoop subproject.

=== Inexperience with Open Source ===
Two of the committers have extensive experience with open source and Apache.
The rest are new to open source and will be guided through the process by
the team members with experience.

=== Homogenous Developers ===
The current list of committers is confined to Yahoo employees. Our plan is
to recruit more committers once the project gets under way.

=== Reliance on Salaried Developers ===

Currently, all contributors are Yahoo employees. By extending the
development community we are hoping to mitigate this risk.

=== Relationships with Other Apache Products ===
Pig is built on top of Hadoop and we expect deep collaboration with the
Hadoop subproject.

=== An Excessive Fascination with the Apache Brand ===
Yahoo already has a strong brand and is not interested in Apache as a way
to gain visibility. Yahoo! seeks to develop Pig collaboratively with others,
not to control and maintain it independently. Apache offers the best legal
and social framework for such community-based software development.

== Documentation ==

http://research.yahoo.com/project/pig

== Initial Source ==
The initial source will be donated by Yahoo Inc. The donating company will
contribute the initial code base once the proposal is accepted and the
necessary infrastructure has been set up.

== External Dependencies ==

 1. bzip2: http://www.kohsuke.org/bzip2/ (Apache license)
 2. javacc: https://javacc.dev.java.net/ (BSD license)
 3. hadoop: http://lucene.apache.org/hadoop/ (Apache license)
 4. log4j: http://logging.apache.org/log4j/ (Apache license)
 5. jsch: http://www.jcraft.com/jsch (BSD-style license:
http://www.jcraft.com/jsch/LICENSE.txt)

== Required Resources ==
=== Mailing lists ===

We would need the following mailing lists:
 1. pig-private (with moderated subscriptions)
 2. pig-dev
 3. pig-commits
 4. pig-user

=== Subversion Directory ===

https://svn.apache.org/repos/asf/incubator/pig

=== Issue Tracking ===

JIRA PIG (PIG)

== Initial Committers ==

 1. Nigel Daley ([EMAIL PROTECTED])
 2. Alan Gates ([EMAIL PROTECTED])
 3. Olga Natkovich ([EMAIL PROTECTED])
 4. Chris Olston ([EMAIL PROTECTED])
 5. Owen O'Malley ([EMAIL PROTECTED])
 6. Ben Reed ([EMAIL PROTECTED])
 7. Utkarsh Srivastava ([EMAIL PROTECTED])

== Affiliation ==

All initial committers are affiliated with Yahoo!

== Sponsors ==

=== Champion ===

Doug Cutting

=== Nominated Mentors ===

Doug Cutting

=== Sponsoring Entity ===

Incubator



