Re: [Pig Wiki] Update of "ProposedProjects" by AlanGates

Alan Gates Thu, 16 Apr 2009 09:27:24 -0700

Your understanding of the proposal is correct. The goal would be to produce Java code rather than a pipeline configuration. But the reasoning is not so that users can then take that code and modify it themselves. There's nothing preventing them from doing so, but it has a couple of major drawbacks.

1) Code generators generally generate horrific looking code, because they are going for speed and compactness, not human maintainability. Trying to work in that code would be very difficult.

2) If you start adding code to generated code, you can no longer use the original Pig Latin. You are from that point forward stuck in Java, since you can't backport your Java into the Pig Latin.

The proposal is designed to test the performance of Pig based on generated Java (or for that matter any other language, it need not be Java). For the idea you suggest, the NATIVE keyword (proposed here: https://issues.apache.org/jira/browse/PIG-506) is a better solution.
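
As a rough illustration of the difference being tested, here is a minimal sketch that contrasts a generic operator pipeline with job-specific code for a made-up script that filters on the second field and projects the first. Everything in it (the Operator interface, the class and method names, the tuple layout) is a hypothetical stand-in for illustration, not Pig's actual operator classes or the output of any real code generator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineVsGenerated {

    /** Hypothetical pre-built operator: returns the transformed tuple, or null to drop it. */
    interface Operator {
        Object[] apply(Object[] tuple);
    }

    /** Interpreted style: every tuple is pushed through a generic chain of operators. */
    static List<Object[]> runPipeline(List<Operator> operators, List<Object[]> input) {
        List<Object[]> output = new ArrayList<>();
        for (Object[] tuple : input) {
            Object[] current = tuple;
            for (Operator op : operators) {
                current = op.apply(current);
                if (current == null) {
                    break; // tuple was filtered out
                }
            }
            if (current != null) {
                output.add(current);
            }
        }
        return output;
    }

    /** Generated style: the same filter and projection fused into one job-specific loop. */
    static List<Object[]> runGenerated(List<Object[]> input) {
        List<Object[]> output = new ArrayList<>();
        for (Object[] tuple : input) {
            if ((Integer) tuple[1] > 10) {
                output.add(new Object[] { tuple[0] });
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Object[]> input = Arrays.asList(
            new Object[] { "a", 5 },
            new Object[] { "b", 20 });
        Operator filter = t -> (Integer) t[1] > 10 ? t : null;
        Operator project = t -> new Object[] { t[0] };
        System.out.println(runPipeline(Arrays.asList(filter, project), input).size());
        System.out.println(runGenerated(input).size());
    }
}
```

Both methods compute the same result. The second avoids the per-tuple operator dispatch, which is the kind of overhead the proposal would measure, and it also shows why such code is awkward to maintain by hand: the script's logic is baked into the loop.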


Alan.

On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote:

Hi
Can you briefly explain what is required in the first project? After reading the description, my impression is that currently, when we execute commands in the Pig shell, Pig first converts them to map-reduce jobs and then feeds them to Hadoop. In this project, are we proposing that the execution plan made by Pig will first be converted to a Java file for the map-reduce procedure and then fed to the Hadoop network?
If this is the case then I am sure it will be a great help to users, as this functionality can be used to write complicated map-reduce jobs very easily. Initially the user can write the Pig scripts / commands required for his job and get the map-reduce Java files. Then he can edit the map-reduce files to extend the functionality and add extra procedures that are not provided by Pig but can be executed over Hadoop.

--nitesh
On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki <wikidi...@apache.org> wrote:
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for
change notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/ProposedProjects

New page:
= Proposed Pig Projects =
This page describes projects that we (the committers) would like to see added to Pig. The scale of these projects varies, but they are larger projects, usually on the weeks or months scale. We have not yet filed [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these because they are still in the vague idea stage. As they become more concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed for them.

We welcome contributors to take on one of these projects. If you would like to do so, please file a JIRA (if one does not already exist for the project) with a proposed solution. Pig's committers will work with you from there to help refine your solution. Once a solution is agreed upon, you can begin implementation.

If you see a project here that you would like to see Pig implement but you are not in a position to implement the solution right now, feel free to vote for the project. Add your name to the list of supporters. This will help contributors looking for a project to select one that will benefit many users.

If you would like to propose a project for Pig, feel free to add to this list. If it is a smaller project, or something you plan to begin work on immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is a better route.

|| Category || Project || JIRA || Proposed By || Votes For ||
|| Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs. We need to investigate instead having Pig generate Java code specific to a job, and then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
|| Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH. All operators should be allowed in FOREACH. (Limit is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || ||
|| Optimization || Speed up comparison of tuples during shuffle for ORDER BY || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
|| Optimization || Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. || || gates || ||
|| Optimization || Often in a Pig script that produces a chain of MR jobs, the map phases of the 2nd and subsequent jobs do very little. What little they do should be pushed into the preceding reduce and the map replaced by the identity mapper. Initial tests showed that the identity mapper was 50% faster than using a Pig mapper (because Pig uses the loader to parse out tuples even if the map itself is empty). || [https://issues.apache.org/jira/browse/PIG-480 480] || olgan || gates ||
|| Optimization || Use hand crafted calls to do string to integer or float conversions. Initial tests showed these could be done about 8x faster than String.toIntger() and String.toFloat(). || [https://issues.apache.org/jira/browse/PIG-482 482] || olgan || gates ||
|| Optimization || Currently Pig always samples for an ORDER BY to determine how to partition, and then runs another job to do the sort. For small enough inputs, it should just sort with a single reducer. || [https://issues.apache.org/jira/browse/PIG-483 483] || olgan || ||
|| Optimization || In many cases data to be joined is already sorted and partitioned on the same key. Pig needs to be able to take advantage of this and do these joins in the map. The join could be done by sampling one input to determine the value of the join key at the beginning of every HDFS block. This would form an index. Then a second MR job can be run with the other input. Based on the key seen in the second input, the appropriate blocks of the first input can also be loaded into the map and the join done. || || gates || ||
|| Optimization || The combiner is not currently used if FILTER is in the FOREACH. In some cases it could still be used. || [https://issues.apache.org/jira/browse/PIG-479 479] || olgan || ||
|| Optimization || Currently when types of data are declared Pig inserts a FOREACH immediately after the LOAD that does the conversions. These conversions should be delayed until the field is actually used. || [https://issues.apache.org/jira/browse/PIG-410 410] || olgan || gates ||
|| Optimization || When an order by is not the only operation in a pig script, it is done in two additional MR jobs. The first job samples using a sampling loader, the second does the sort. The sample is used to construct a partitioner that equally balances the data in the sort. The sampler needs to be changed to be a !EvalFunc instead of a loader. This way a split can be put in the preceding MR job, with the main data being written out and the other part flowing to the sampler func, which can then write out the sample. The final MR job can then be the sort. || || gates || ||
|| Optimization || When an order by is the only operation in a pig script it is currently done in 3 MR jobs. The first converts it to BinStorage format (because the sample loader reads that format), the second samples, and the third sorts. Once the changes mentioned above to make the sampler an !EvalFunc are done it should be changed to be done in 2 MR jobs instead of 3. || [https://issues.apache.org/jira/browse/PIG-460 460] || gates || ||
|| Optimization || The Pig optimizer should be used to determine when fields in a record are no longer needed and put in FOREACH statements to project out the unnecessary data as early as possible. || [https://issues.apache.org/jira/browse/PIG-466 466] || olgan || ||
|| Optimization || The Pig optimizer needs to call fieldsToRead so that Load functions that can do column skipping do it. || || gates || ||
|| Scalability || Pig's default join (symmetric hash) currently depends on being able to fit all of the values for a given join key for one of the inputs into memory. (It does try to spill to disk in the case where it cannot fit them all into memory. In practice this often fails as it is not good at understanding when memory is low enough that it should spill. Even in the case where it does not fail, spilling to disk and rereading from disk is very slow.) Instead, instances of keys with a large number of values could be broken up so that the row set fits in memory and then shipped to multiple reducers. A sampling pass would need to be done first to determine which keys to break up. || || chris olston || gates ||
--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
