Re: Avoiding serialization/de-serialization in pig

2010-06-30 Thread Thejas Nair



On 6/28/10 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 
 I have a feeling that propagating schemas when known, and using them for
 (de)serialization instead of reflecting on every field, would also be a big
 win.
 
 Thoughts on just using Avro for the internal PigStorage?

When I profiled pig queries, I did not see much time being spent in
DataType.findType(Object o), where the type of an object is determined using
instanceof. (I am assuming you were referring to that.)

But we can still optimize the case where the schema is known (i.e., all rows
have the same schema) by not storing the type with each field in the
serialization format. Avro stores the schema separately, so I assume it has
this optimization. But in the case where the schema is not known, we would
need to store the type information with every row.
When the query plan is generated, we would need to determine which
serialization format is to be used.
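
To make the trade-off concrete, here is a minimal hypothetical sketch -- not
pig's actual serialization code; the type tags, class, and format are invented
for illustration -- contrasting a self-describing row format (one type byte
per field) with a schema-up-front format (types written once, rows carry only
values):

import java.io.*;

public class SedesSketch {
    static final byte INT = 1, CHARARRAY = 2;

    // Schema unknown: every field carries its own type tag on every row.
    static void writeSelfDescribing(DataOutput out, Object[] row) throws IOException {
        for (Object f : row) {
            if (f instanceof Integer) { out.writeByte(INT); out.writeInt((Integer) f); }
            else { out.writeByte(CHARARRAY); out.writeUTF((String) f); }
        }
    }

    // Schema known: types were written once up front, so each row saves a
    // byte per field and skips the instanceof checks.
    static void writeWithSchema(DataOutput out, byte[] schema, Object[] row) throws IOException {
        for (int i = 0; i < row.length; i++) {
            if (schema[i] == INT) out.writeInt((Integer) row[i]);
            else out.writeUTF((String) row[i]);
        }
    }

    public static void main(String[] args) throws IOException {
        Object[] row = { 42, "pig" };
        ByteArrayOutputStream a = new ByteArrayOutputStream(), b = new ByteArrayOutputStream();
        writeSelfDescribing(new DataOutputStream(a), row);
        writeWithSchema(new DataOutputStream(b), new byte[]{ INT, CHARARRAY }, row);
        System.out.println(a.size() + " vs " + b.size() + " bytes per row");
    }
}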

-Thejas





Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki page that puts together some ideas that can help
improve performance by avoiding/delaying serialization/de-serialization.

http://wiki.apache.org/pig/AvoidingSedes

These are ideas that don't involve changes to the optimizer. Most of them
involve changes in the load/store functions.
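
As one example of the flavor of those ideas: delay deserialization so that a
field is parsed only when the query actually uses it. A minimal sketch of
that pattern with invented names (the wiki page describes the real designs):

import java.nio.charset.StandardCharsets;

// Sketch: the loader hands out raw bytes; a field is parsed on first use only.
class LazyField {
    private final byte[] raw;   // bytes as read from the input
    private Object parsed;      // filled in on first access only

    LazyField(byte[] raw) { this.raw = raw; }

    Object get() {
        if (parsed == null) {
            parsed = new String(raw, StandardCharsets.UTF_8); // parse on demand
        }
        return parsed;
    }

    byte[] rawBytes() { return raw; }  // store functions can pass this through
}

A store function handed such untouched fields can write rawBytes() straight
back out, skipping both the parse and the re-serialize.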

Your feedback is welcome.

Thanks,
Thejas



Re: Begin a discussion about Pig as a top level project

2010-04-02 Thread Thejas Nair
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and
heavily influenced by its roadmap. I think it makes sense to continue as a
sub-project of hadoop.

-Thejas



On 3/31/10 4:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Over time, Pig is increasing its coupling to Hadoop (for good reasons),
 rather than decreasing it. If and when Pig becomes a viable entity without
 hadoop around, it might make sense as a TLP. As is, I think becoming a TLP
 will only introduce unnecessary administrative and bureaucratic headaches.
 So my vote is also -1.
 
 -Dmitriy
 
 
 
 On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates ga...@yahoo-inc.com wrote:
 
 So far I haven't seen any feedback on this.  Apache has asked the Hadoop
 PMC to submit input in April on whether some subprojects should be promoted
 to TLPs.  We, the Pig community, need to give feedback to the Hadoop PMC on
 how we feel about this.  Please make your voice heard.
 
 So now I'll heed my own call and give my thoughts on it.
 
 The biggest advantage I see to being a TLP is a direct connection to
 Apache.  Right now all of the Pig team's interaction with Apache is through
 the Hadoop PMC.  Being directly connected to Apache would benefit Pig team
 members who would have a better view into Apache.  It would also raise our
 profile in Apache and thus make other projects more aware of us.
 
 However, I am concerned about losing Pig's explicit connection to Hadoop.
  This concern has a couple of dimensions.  One, Hadoop and MapReduce are the
 current flavor of the month in computing.  Given that Pig shares a name with
 the common farm animal, it's hard to be sure based on search statistics.
  But Google trends shows that hadoop is searched on much more frequently
 than hadoop pig or apache pig (see
 http://www.google.com/trends?q=hadoop%2Chadoop+pig).  I am guessing that
 most Pig users come from Hadoop users who discover Pig via Hadoop's website.
  Losing that subproject tab on Hadoop's front page may radically lower the
 number of users coming to Pig to check out our project.  I would argue that
 that visibility benefits Hadoop as well, since high level languages like Pig Latin have
 the potential to greatly extend the user base and usability of Hadoop.
 
 Two, being explicitly connected to Hadoop keeps our two communities aware
 of each other's needs.  There are features proposed for MR that would greatly
 help Pig.  By staying in the Hadoop community Pig is better positioned to
 advocate for and help implement and test those features.  The response to
 this will be that Pig developers can still subscribe to Hadoop mailing
 lists, submit patches, etc.  That is, they can still be part of the Hadoop
 community, which reinforces my point that it makes more sense to leave Pig
 in the Hadoop community since Pig developers will need to be part of that
 community anyway.
 
 Finally, philosophically it makes sense to me that projects that are
 tightly connected belong together.  It strikes me as strange to have Pig as
 a TLP completely dependent on another TLP.  Hadoop was originally a
 subproject of Lucene.  It moved out to be a TLP when it became obvious that
 Hadoop had become independent of and useful apart from Lucene.  Pig is not
 in that position relative to Hadoop.
 
 So, I'm -1 on Pig moving out.  But this is a soft -1.  I'm open to being
 persuaded that I'm wrong or my concerns can be addressed while still having
 Pig as a TLP.
 
 Alan.
 
 
 On Mar 19, 2010, at 10:59 AM, Alan Gates wrote:
 
  You have probably heard by now that there is a discussion going on in the
 Hadoop PMC as to whether a number of the subprojects (HBase, Avro,
 Zookeeper, Hive, and Pig) should move out from under the Hadoop umbrella and
 become top level Apache projects (TLP).  This discussion has picked up
 recently since the Apache board has clearly communicated to the Hadoop PMC
 that it is concerned that Hadoop is acting as an umbrella project with many
 disjoint subprojects underneath it.  They are concerned that this gives
 Apache little insight into the health and happenings of the subproject
 communities which in turn means Apache cannot properly mentor those
 communities.
 
 The purpose of this email is to start a discussion within the Pig
 community about this topic.  Let me cover first what becoming TLP would mean
 for Pig, and then I'll go into what options I think we as a community have.
 
 Becoming a TLP would mean that Pig would itself have a PMC that would
 report directly to the Apache board.  Who would be on the PMC would be
 something we as a community would need to decide.  Common options would be
 to say all active committers are on the PMC, or all active committers who
 have been a committer for at least a year.  We would also need to elect a
 chair of the PMC.  This lucky person would have no additional power, but
 would have the additional responsibility of writing quarterly reports on
 Pig's status for Apache board meetings, as well as coordinating with 

Re: LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
Yes, that should work. I will use InputFormat.getNext from the SampleLoader
to skip the records.
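
A rough sketch of that approach (SampleSkipper is a hypothetical name; only
RecordReader.nextKeyValue() is the actual Hadoop API):

import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordReader;

// Advance the underlying reader numSkip times without asking the wrapped
// LoadFunc to turn the records into pig tuples.
class SampleSkipper {
    static boolean skip(RecordReader<?, ?> reader, int numSkip)
            throws IOException, InterruptedException {
        for (int i = 0; i < numSkip; i++) {
            if (!reader.nextKeyValue()) {  // reads a record, no tuple parsing
                return false;              // ran out of input
            }
        }
        return true;  // the next getNext() call parses one sampled tuple
    }
}
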
Thanks,
Thejas


On 11/3/09 6:39 PM, Alan Gates ga...@yahoo-inc.com wrote:

 We definitely want to avoid parsing every tuple when sampling.  But do
 we need to implement a special function for it?  Pig will have access
 to the InputFormat instance, correct?  Can it not call
 InputFormat.getNext the desired number of times (which will not parse
 the tuple) and then call LoadFunc.getNext to get the next parsed tuple?
 
 Alan.
 
 On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote:
 
 In the new implementation of SampleLoader subclasses (used by order-by,
 skew-join, etc.) as part of the loader redesign, we are not only reading all
 the input records but also parsing them as pig tuples.
 
 This is because the SampleLoaders are wrappers around the actual input
 loaders specified in the query. We can make things much faster by having a
 skipNext() function (or skipNext(int numSkip)) which will avoid parsing the
 record into a pig tuple.
 LoadFunc could optionally implement this (easy-to-implement) function (which
 will be part of an interface) to improve the speed of queries such as
 order-by.
 
 -Thejas
 
 



LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
In the new implementation of SampleLoader subclasses (used by order-by,
skew-join, etc.) as part of the loader redesign, we are not only reading all
the input records but also parsing them as pig tuples.

This is because the SampleLoaders are wrappers around the actual input
loaders specified in the query. We can make things much faster by having a
skipNext() function (or skipNext(int numSkip)) which will avoid parsing the
record into a pig tuple.
LoadFunc could optionally implement this (easy-to-implement) function (which
will be part of an interface) to improve the speed of queries such as
order-by.
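
A sketch of what such an optional interface might look like (hypothetical,
not a committed API):

// Loaders that can cheaply advance past records implement this; samplers
// would check for it with instanceof and fall back to getNext() otherwise.
interface SkippableLoader {
    // Skip numSkip records without materializing pig tuples; returns the
    // number actually skipped (fewer at end of input).
    int skipNext(int numSkip) throws java.io.IOException;
}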

-Thejas



Definition of equality of bags

2009-11-02 Thread Thejas Nair
I could not find any documentation (in the piglatin manual) on what the
definition of equality of bags is (or what it should be). Does the order of
tuples in the bag matter? The definition of a bag does not imply any
ordering.

This has implications for the definition of join/cogroup/group on bags.
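
For illustration, order-insensitive (multiset) equality can be expressed by
comparing element counts -- a sketch of the semantics only, not pig's
implementation:

import java.util.*;

// Two bags are equal as multisets iff every element occurs the same number
// of times in both, regardless of order.
class BagEquality {
    static <T> boolean multisetEquals(List<T> bag1, List<T> bag2) {
        if (bag1.size() != bag2.size()) return false;
        Map<T, Integer> counts = new HashMap<>();
        for (T t : bag1) counts.merge(t, 1, Integer::sum);
        for (T t : bag2) {
            Integer c = counts.get(t);
            if (c == null) return false;
            if (c == 1) counts.remove(t); else counts.put(t, c - 1);
        }
        return true;
    }
}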

Thanks,
Thejas




Re: Definition of equality of bags

2009-11-02 Thread Thejas Nair
Looks like join/cogroup/group are not defined on bags. I assume this is
because equality on bags is not defined.

It gives an error in map-reduce mode, but does not in local mode.
Since pig is likely to get rid of the custom local mode implementation and
use hadoop local mode, which should fix it, I am not filing a jira.
-Thejas



On 11/2/09 9:19 AM, Thejas Nair te...@yahoo-inc.com wrote:

 I could not find any documentation (in the piglatin manual) on what the
 definition of equality of bags is (or what it should be). Does the order of
 tuples in the bag matter? The definition of a bag does not imply any
 ordering.
 
 This has implications for the definition of join/cogroup/group on bags.
 
 Thanks,
 Thejas
 
 



Re: switching to different parser in Pig

2009-08-25 Thread Thejas Nair
Jflex is covered by the GPL, but code generated by it is not. Only the code
that is generated by Jflex goes into pig.jar.
We can't check in Jflex.jar into svn; ivy will be set up to download it from
the maven repository.
-Thejas



On 8/25/09 11:57 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:

 Santosh,
 Am I missing something about Jflex licensing? I thought that, it being
 GPL, we can't package it with apache-licensed software, which prevents
 it from being a viable option (regardless of technical merits).
 
 -Dmitriy
 
 On Tue, Aug 25, 2009 at 1:58 PM, Santhosh Srinivasans...@yahoo-inc.com 
 wrote:
 It's been 6 months since this topic was discussed but we don't have
 closure on it.
 For SQL on top of Pig, we are using Jflex and CUP
 (https://issues.apache.org/jira/browse/PIG-824). If we have decided on
 the right parser, can we have a plan to move the other parsers in Pig to
 the same technology?
 
 Thanks,
 Santhosh
 
 PS: I am assuming we are not moving to Antlr.
 
 
 -Original Message-
 From: Alan Gates [mailto:ga...@yahoo-inc.com]
 Sent: Tuesday, February 24, 2009 10:17 AM
 To: pig-dev@hadoop.apache.org; pi.so...@gmail.com
 Subject: Re: switching to different parser in Pig
 
 Sorry, after I sent that email yesterday I realized I was not very
 clear.  I did not mean to imply that antlr didn't have good
 documentation or good error handling.  What I wanted to say was we
 want all three of those things, and it didn't appear that antlr
 provided all three, since it doesn't separate out scanner and parser.
 Also, from my viewpoint, I prefer bottom up LALR(1) parsers like yacc
 to top down parsers like javacc.  My understanding is that antlr is
 top down like javacc.  My reasoning for this preference is that parser
 books and classes have used those for decades, so there are a large
 number of engineers out there (including me :) ) who know how to work
 with them.  But maybe antlr is close enough to what we need.  I'll
 take a deeper look at it before I vote officially on which way we
 should go.
 
 As for loops and branches, I'm not saying we need those in Pig Latin.
 We need them somehow.  Whether it's better to put them in Pig Latin or
 embed pig in an existing scripting language is an ongoing debate.  I don't
 want to make a decision now that effectively ends that debate without
 buy in from those who feel strongly that Pig Latin should include
 those constructs.
 
 I agree with you that we should modify the logical plan to support
 this rather than add another layer.  As for active development, the
 only thing I'm aware of is we hope to start working on a more robust
 optimizer for pig soon, and that will require some additional
 functionality out of the logical operators, but it shouldn't cause any
 fundamental architectural changes.
 
 Alan.
 
 
 On Feb 24, 2009, at 1:27 AM, pi song wrote:
 
 (1) Lack of good documentation, which makes it hard and time-consuming
 to learn javacc and make changes to the Pig grammar
 == ANTLR is very very well documented.
 http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
 http://media.pragprog.com/titles/tpantlr/toc.pdf
 http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home
 
 (2) No easy way to customize error handling and error messages
 == ANTLR has very extensive error handling support
 http://media.pragprog.com/titles/tpantlr/errors.pdf
 
 (3) Single path that performs both tokenizing and parsing
 == What is the advantage of decoupling tokenizing and parsing?
 
 In addition, Composite Grammar is very useful for keeping the parser
 modular. Things that can be treated as sub-languages, such as bag schema
 definition, can be developed and unit tested separately.
 
 ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar
 development very efficient. Think of an IDE that helps you debug your code
 (which, here, is your grammar).
 
 One question: is there any use case for branching and loops? The current
 Pig is more like a declarative query language; I don't really see how loop
 constructs would fit. I think what Ted mentioned is more about embedding
 Pig in other languages and using those languages to do loops.
 
 We should think about how the logical plan layer can be made simpler for
 external use, so we don't have to introduce a new layer. Is there any major
 active development on it? Currently I have more spare time and should be
 able to help out. (BTW, I'm slow because this is just my hobby. I don't
 want to drag you guys down.)
 
 Pi Song
 
 On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia
 niteshbhatia...@gmail.com
 wrote:
 
 Hi
 I got this info from javacc mailing lists. This may prove helpful:
 
 
 
 
 
 
 -Original Message- From: Ken Beesley
 [mailto:ken@xrce.xerox.com] Sent: Wednesday, August 18, 2004 2:56
 PM To: javacc Subject: [JavaCC] 

Re: Proposal to create a branch for contrib project Zebra

2009-08-18 Thread Thejas Nair
I think we are creating unnecessary bureaucratic hurdles here by preventing a
contrib project from having a branch. I don't see why zebra has to use the
pig release branch, as the new pig release does not include it.

The decisions are supposed to help keep things open, but this seems to be
forcing Raghu to keep things in a private git.

-Thejas

On 8/18/09 10:56 AM, Raghu Angadi rang...@yahoo-inc.com wrote:

 
 Right. I just noticed the mails on Pig 0.4.0. I joined the pig-dev list
 just yesterday. Waiting for 0.4.0 might be good enough if it is just a
 couple of weeks; will keep a watch on it.
 
 I think we will wait for a few days and attach any new feature patches
 to jiras. Those patches can certainly wait there. For interdependencies
 of the patches, we might maintain a private git.
 
 Raghu.
 
 Santhosh Srinivasan wrote:
 I would recommend that zebra wait for Pig 0.4.0 (a couple of weeks?). A
 branch will be created for the 0.4.0 release and zebra will
 automatically benefit.
 
 Santhosh



Re: [Pig Wiki] Update of ProposedProjects by AlanGates

2009-04-16 Thread Thejas Nair
This paper seems very relevant to the proposal -
Compiled Query Execution Engine using JVM
http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40

From the abstract -
Our experimental results on the TPC-H data set show that, despite both
engines benefiting from JIT, the compiled engine runs on average about twice
as fast as the interpreted one, and significantly faster than an in-memory ...

(I don't have access to the full paper though).

-Thejas


On 4/16/09 9:26 AM, Alan Gates ga...@yahoo-inc.com wrote:

 Your understanding of the proposal is correct.  The goal would be to
 produce Java code rather than a pipeline configuration.  But the
 reasoning is not so that users can then take that code and modify it
 themselves.  There's nothing preventing them from doing so, but it has
 a couple of major drawbacks.
 
 1) Code generators generally generate horrific-looking code, because
 they are going for speed and compactness, not human maintainability.
 Trying to work in that code would be very difficult.
 
 2) If you start adding code to generated code, you can no longer use
 the original Pig Latin.  You are from that point forward stuck in
 Java, since you can't backport your Java into the Pig Latin.
 
 The proposal is designed to test the performance of Pig based on
 generated Java (or for that matter any other language, it need not be
 Java).  For the idea you suggest, the NATIVE keyword (proposed here:
 https://issues.apache.org/jira/browse/PIG-506) is a better solution.
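 
 To make the contrast concrete, a purely illustrative sketch (hypothetical
 classes, not pig's actual operators): today a generic operator evaluates an
 expression tree for every tuple, while a compiled engine would emit
 query-specific code with the logic inlined, which the JIT can then optimize
 well.
 
 // Interpreted: a generic filter walks an expression tree per tuple.
 interface Expr { boolean eval(Object[] tuple); }
 
 class GreaterThan implements Expr {
     final int field; final int constant;
     GreaterThan(int field, int constant) { this.field = field; this.constant = constant; }
     public boolean eval(Object[] tuple) { return (Integer) tuple[field] > constant; }
 }
 
 // Generated: for FILTER a BY $0 > 5, the generator would emit something
 // like this -- no tree walk, no virtual dispatch per tuple.
 class GeneratedFilter {
     static boolean accept(Object[] tuple) { return (Integer) tuple[0] > 5; }
 }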
 
 Alan.
 
 On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote:
 
 Hi
 Can you briefly explain what is required in the first project? After
 reading the description my impression is: currently, when we execute
 commands on the Pig shell, Pig first converts them to map-reduce jobs and
 then feeds them to hadoop. In this project, are we proposing that the
 execution plan made by Pig will first be converted to a java file for the
 map-reduce procedure and then fed to the hadoop network?
 
 If this is the case then I am sure it will be of great help to users, as
 this functionality can be used to write complicated map-reduce jobs very
 easily. Initially a user can write the Pig scripts / commands required for
 his job and get the map-reduce java files. Then he can edit the map-reduce
 files to extend the functionality and add extra procedures that are not
 provided by Pig but can be executed over hadoop.
 
 --nitesh
 
 On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki wikidi...@apache.org
 wrote:
 
 Dear Wiki user,
 
 You have subscribed to a wiki page or wiki category on Pig Wiki for
 change notification.
 
 The following page has been changed by AlanGates:
 http://wiki.apache.org/pig/ProposedProjects
 
 New page:
 = Proposed Pig Projects =
 This page describes projects that we (the committers) would like to see
 added to Pig.  The scale of these projects varies, but they are larger
 projects, usually on the weeks or months scale.  We have not yet filed
 [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these
 because they are still in the vague idea stage.  As they become more
 concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed
 for them.
 
 We welcome contributors to take on one of these projects.  If you would
 like to do so, please file a JIRA (if one does not already exist for the
 project) with a proposed solution.  Pig's committers will work with you
 from there to help refine your solution.  Once a solution is agreed upon,
 you can begin implementation.
 
 If you see a project here that you would like to see Pig implement but you
 are not in a position to implement the solution right now, feel free to
 vote for the project.  Add your name to the list of supporters.  This will
 help contributors looking for a project to select one that will benefit
 many users.
 
 If you would like to propose a project for Pig, feel free to add to this
 list.  If it is a smaller project, or something you plan to begin work on
 immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is
 a better route.
 
 || Category || Project || JIRA || Proposed By || Votes For ||
 || Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs.  We need to investigate instead having Pig generate java code specific to a job, and then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
 || Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH.  All operators should be allowed in FOREACH. (LIMIT is being worked on: [https://issues.apache.org/jira/browse/PIG-741 741]) || || gates || ||
 || Optimization || Speed up comparison of tuples during shuffle for ORDER BY || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
 || Optimization || Order by should be changed to not use POPackage to put all of the tuples in a

Re: scope string in OperatorKey

2009-03-11 Thread Thejas Nair
The id in OperatorKey helps distinguish between multiple operators of the
same type.
What I am proposing is just changing toString() in OperatorKey to make the
explain output more readable. (We can change it back later, or look at other
options, if any future requirement makes printing the scope necessary.)

I.e.,

public String toString() {
    return scope + "-" + id;
}

changes to

public String toString() {
    return Integer.toString(id);
}

so that an operator prints as, e.g., 12 instead of scope-12 in explain output.


Thanks,
Thejas


On 3/11/09 10:55 AM, Alan Gates ga...@yahoo-inc.com wrote:

 The purpose of the scope string is to allow us to have multiple
 sessions of pig running and distinguish the operators.  It's one of
 those things that was put in before an actual requirement, so whether
 it will prove useful or not remains to be seen.
 
 As for removing it from explain, is it still reasonably easy to
 distinguish operators without it?  IIRC the OperatorKey includes an
 operator number.  When looking at the explain plans this is useful for
 cases where there is more than one of a given type of operator and you
 want to be able to distinguish between them.
 
 Alan.
 
 On Mar 6, 2009, at 3:14 PM, Thejas Nair wrote:
 
 What is the purpose of the scope string in
 org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a
 pig daemon process?
 
 Is it ok to stop printing the scope part in explain output? It does not
 seem to add value and makes the output more verbose.
 
 Thanks,
 Thejas
 
 



scope string in OperatorKey

2009-03-06 Thread Thejas Nair
What is the purpose of the scope string in
org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a
pig daemon process?

Is it ok to stop printing the scope part in explain output? It does not seem
to add value and makes the output more verbose.

Thanks,
Thejas