Re: Avoiding serialization/de-serialization in pig
On 6/28/10 5:51 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? When I profiled pig queries, I didn't see much time being spent in DataType.findType(Object o), where the type of an object is determined using instanceof. (I am assuming you were referring to that.) But we can still optimize the case where the schema is known (i.e. all rows have the same schema) by not storing the type with each field in the serialization format. Avro stores the schema separately, so I assume it has this optimization. But in the case where the schema is not known, we would need to store the type information for every row. When the query plan is generated, we would need to determine which serialization format is to be used. -Thejas
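The per-field type tag trade-off described above can be sketched as follows. This is an illustrative toy, not Pig's actual serialization code; the class name, constants, and methods are made up for the example:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch: per-field type tags vs. schema-known encoding.
public class SedesSketch {
    static final byte TYPE_INT = 1;
    static final byte TYPE_LONG = 2;

    // Schema unknown: every field carries a type byte, since each row
    // may have a different layout.
    static byte[] writeUntyped(int a, long b) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeByte(TYPE_INT);  out.writeInt(a);
        out.writeByte(TYPE_LONG); out.writeLong(b);
        return bos.toByteArray();
    }

    // Schema known to be (int, long) for all rows: the schema is stored
    // once elsewhere, so each row contains only the raw values.
    static byte[] writeTyped(int a, long b) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(a);
        out.writeLong(b);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        int saved = writeUntyped(7, 9L).length - writeTyped(7, 9L).length;
        System.out.println("bytes saved per row: " + saved);
    }
}
```

With a fixed (int, long) schema the toy saves one tag byte per field per row; over wide rows and large datasets that overhead adds up, which is the optimization Avro gets by storing the schema separately from the data.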
Avoiding serialization/de-serialization in pig
I have created a wiki page that puts together some ideas that can help improve performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
Re: Begin a discussion about Pig as a top level project
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of hadoop. -Thejas On 3/31/10 4:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Over time, Pig is increasing its coupling to Hadoop (for good reasons), rather than decreasing it. If and when Pig becomes a viable entity without hadoop around, it might make sense as a TLP. As is, I think becoming a TLP will only introduce unnecessary administrative and bureaucratic headaches. So my vote is also -1. -Dmitriy On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates ga...@yahoo-inc.com wrote: So far I haven't seen any feedback on this. Apache has asked the Hadoop PMC to submit input in April on whether some subprojects should be promoted to TLPs. We, the Pig community, need to give feedback to the Hadoop PMC on how we feel about this. Please make your voice heard. So now I'll heed my own call and give my thoughts on it. The biggest advantage I see to being a TLP is a direct connection to Apache. Right now all of the Pig team's interaction with Apache is through the Hadoop PMC. Being directly connected to Apache would benefit Pig team members, who would have a better view into Apache. It would also raise our profile in Apache and thus make other projects more aware of us. However, I am concerned about losing Pig's explicit connection to Hadoop. This concern has a couple of dimensions. One, Hadoop and MapReduce are the current flavor of the month in computing. Given that Pig shares a name with the common farm animal, it's hard to be sure based on search statistics. But Google trends shows that hadoop is searched on much more frequently than hadoop pig or apache pig (see http://www.google.com/trends?q=hadoop%2Chadoop+pig). I am guessing that most Pig users come from Hadoop users who discover Pig via Hadoop's website.
Losing that subproject tab on Hadoop's front page may radically lower the number of users coming to Pig to check out our project. I would argue that this benefits Hadoop as well, since high level languages like Pig Latin have the potential to greatly extend the user base and usability of Hadoop. Two, being explicitly connected to Hadoop keeps our two communities aware of each other's needs. There are features proposed for MR that would greatly help Pig. By staying in the Hadoop community, Pig is better positioned to advocate for and help implement and test those features. The response to this will be that Pig developers can still subscribe to Hadoop mailing lists, submit patches, etc. That is, they can still be part of the Hadoop community. Which reinforces my point that it makes more sense to leave Pig in the Hadoop community, since Pig developers will need to be part of that community anyway. Finally, philosophically it makes sense to me that projects that are tightly connected belong together. It strikes me as strange to have Pig as a TLP completely dependent on another TLP. Hadoop was originally a subproject of Lucene. It moved out to be a TLP when it became obvious that Hadoop had become independent of and useful apart from Lucene. Pig is not in that position relative to Hadoop. So, I'm -1 on Pig moving out. But this is a soft -1. I'm open to being persuaded that I'm wrong or that my concerns can be addressed while still having Pig as a TLP. Alan. On Mar 19, 2010, at 10:59 AM, Alan Gates wrote: You have probably heard by now that there is a discussion going on in the Hadoop PMC as to whether a number of the subprojects (Hbase, Avro, Zookeeper, Hive, and Pig) should move out from under the Hadoop umbrella and become top level Apache projects (TLP). This discussion has picked up recently since the Apache board has clearly communicated to the Hadoop PMC that it is concerned that Hadoop is acting as an umbrella project with many disjoint subprojects underneath it.
They are concerned that this gives Apache little insight into the health and happenings of the subproject communities, which in turn means Apache cannot properly mentor those communities. The purpose of this email is to start a discussion within the Pig community about this topic. Let me first cover what becoming a TLP would mean for Pig, and then I'll go into what options I think we as a community have. Becoming a TLP would mean that Pig would itself have a PMC that would report directly to the Apache board. Who would be on the PMC would be something we as a community would need to decide. Common options would be to say all active committers are on the PMC, or all active committers who have been a committer for at least a year. We would also need to elect a chair of the PMC. This lucky person would have no additional power, but would have the additional responsibility of writing quarterly reports on Pig's status for Apache board meetings, as well as coordinating with
Re: LoadFunc.skipNext() function for faster sampling ?
Yes, that should work. I will use InputFormat.getNext from the SampleLoader to skip the records. Thanks, Thejas On 11/3/09 6:39 PM, Alan Gates ga...@yahoo-inc.com wrote: We definitely want to avoid parsing every tuple when sampling. But do we need to implement a special function for it? Pig will have access to the InputFormat instance, correct? Can it not call InputFormat.getNext the desired number of times (which will not parse the tuple) and then call LoadFunc.getNext to get the next parsed tuple? Alan. On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote: In the new implementation of SampleLoader subclasses (used by order-by, skew-join ..) as part of the loader redesign, we are not only reading all the input records but also parsing them as pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the query. We can make things much faster by having a skipNext() function (or skipNext(int numSkip)) which will avoid parsing the record into a pig tuple. LoadFunc could optionally implement this (easy to implement) function (which will be part of an interface) to improve the speed of queries such as order-by. -Thejas
LoadFunc.skipNext() function for faster sampling ?
In the new implementation of SampleLoader subclasses (used by order-by, skew-join ..) as part of the loader redesign, we are not only reading all the input records but also parsing them as pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the query. We can make things much faster by having a skipNext() function (or skipNext(int numSkip)) which will avoid parsing the record into a pig tuple. LoadFunc could optionally implement this (easy to implement) function (which will be part of an interface) to improve the speed of queries such as order-by. -Thejas
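A rough sketch of what such an optional method might look like. Only the names skipNext/getNext come from the proposal above; the class names and the parse-and-discard fallback are assumptions for illustration:

```java
// Illustrative sketch of the proposed skipNext(int) hook. A loader that
// cannot skip cheaply falls back to parsing and discarding tuples; a
// loader over a record-oriented InputFormat would override skipNext to
// advance the underlying reader without constructing pig tuples at all.
public abstract class SkippingLoader {
    /** Parse and return the next record as a tuple (null at end of input). */
    public abstract Object getNext();

    /**
     * Skip up to numSkip records, returning how many were actually skipped.
     * Default: parse-and-discard; subclasses override to avoid parsing.
     */
    public int skipNext(int numSkip) {
        int skipped = 0;
        while (skipped < numSkip && getNext() != null) {
            skipped++;
        }
        return skipped;
    }

    // Toy loader over an int array, used to exercise the default skip.
    public static class ArrayLoader extends SkippingLoader {
        private final int[] data;
        private int pos = 0;
        public ArrayLoader(int[] data) { this.data = data; }
        public Object getNext() {
            return pos < data.length ? (Object) data[pos++] : null;
        }
    }

    public static void main(String[] args) {
        SkippingLoader l = new ArrayLoader(new int[]{10, 20, 30});
        System.out.println(l.skipNext(1)); // skips 10, prints 1
        System.out.println(l.getNext());   // prints 20
    }
}
```

A sample loader wrapping a real LoadFunc would call skipNext(n) between samples, so only the sampled records pay the cost of tuple construction.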
Definition of equality of bags
I could not find any documentation (in the Pig Latin manual) on what the definition of equality of bags is (or what it should be). Does the order of tuples in the bag matter? The definition of a bag does not imply any ordering. This has implications for the definition of join/cogroup/group on bags. Thanks, Thejas
Re: Definition of equality of bags
Looks like join/cogroup/group is not defined on bags. I assume this is because equality on bags is not defined. It gives an error in map-reduce mode, but not in local mode. Since pig is likely to get rid of the custom local mode implementation and use hadoop local mode, which should fix this, I am not filing a jira. -Thejas On 11/2/09 9:19 AM, Thejas Nair te...@yahoo-inc.com wrote: I could not find any documentation (in the Pig Latin manual) on what the definition of equality of bags is (or what it should be). Does the order of tuples in the bag matter? The definition of a bag does not imply any ordering. This has implications for the definition of join/cogroup/group on bags. Thanks, Thejas
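If bag equality were defined order-insensitively, it would amount to comparing element multiplicities. A minimal sketch of that definition (illustrative only; this is not Pig's DataBag implementation):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Order-insensitive bag equality: two bags are equal iff every element
// occurs the same number of times in both, regardless of position.
public class BagEquality {
    static <T> boolean bagsEqual(Collection<T> a, Collection<T> b) {
        if (a.size() != b.size()) return false;
        Map<T, Integer> counts = new HashMap<>();
        for (T t : a) counts.merge(t, 1, Integer::sum);
        for (T t : b) {
            Integer c = counts.get(t);
            if (c == null) return false;          // element missing from a
            if (c == 1) counts.remove(t);
            else counts.put(t, c - 1);
        }
        return counts.isEmpty();
    }

    public static void main(String[] args) {
        // {1, 2, 2} equals {2, 1, 2} as bags, but not {1, 1, 2}.
        System.out.println(bagsEqual(Arrays.asList(1, 2, 2), Arrays.asList(2, 1, 2)));
        System.out.println(bagsEqual(Arrays.asList(1, 2, 2), Arrays.asList(1, 1, 2)));
    }
}
```

Under this definition, hashing bags for group/cogroup would also have to be order-insensitive (e.g. a commutative combination of element hashes), which is part of why defining it is not free.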
Re: switching to different parser in Pig
Jflex is covered by GPL, but code generated by it is not. Only the code that is generated by Jflex goes into pig.jar. We can't check in Jflex.jar into svn; ivy will be set up to download it from the maven repository. -Thejas On 8/25/09 11:57 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Santosh, Am I missing something about Jflex licensing? I thought that it being GPL, we can't package it with apache-licensed software, which prevents it from being a viable option (regardless of technical merits) -Dmitriy On Tue, Aug 25, 2009 at 1:58 PM, Santhosh Srinivasans...@yahoo-inc.com wrote: It's been six months since this topic was discussed but we don't have closure on it. For SQL on top of Pig, we are using Jflex and CUP (https://issues.apache.org/jira/browse/PIG-824). If we have decided on the right parser, can we have a plan to move the other parsers in Pig to the same technology? Thanks, Santhosh PS: I am assuming we are not moving to Antlr. -Original Message- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Tuesday, February 24, 2009 10:17 AM To: pig-dev@hadoop.apache.org; pi.so...@gmail.com Subject: Re: switching to different parser in Pig Sorry, after I sent that email yesterday I realized I was not very clear. I did not mean to imply that antlr didn't have good documentation or good error handling. What I wanted to say was that we want all three of those things, and it didn't appear that antlr provided all three, since it doesn't separate out scanner and parser. Also, from my viewpoint, I prefer bottom-up LALR(1) parsers like yacc to top-down parsers like javacc. My understanding is that antlr is top-down like javacc. My reasoning for this preference is that parser books and classes have used those for decades, so there are a large number of engineers out there (including me :) ) who know how to work with them. But maybe antlr is close enough to what we need. I'll take a deeper look at it before I vote officially on which way we should go.
As for loops and branches, I'm not saying we need those in Pig Latin. We need them somehow. Whether it's better to put them in Pig Latin or embed pig in an existing scripting language is an ongoing debate. I don't want to make a decision now that effectively ends that debate without buy-in from those who feel strongly that Pig Latin should include those constructs. I agree with you that we should modify the logical plan to support this rather than add another layer. As for active development, the only thing I'm aware of is that we hope to start working on a more robust optimizer for pig soon, and that will require some additional functionality out of the logical operators, but it shouldn't cause any fundamental architectural changes. Alan. On Feb 24, 2009, at 1:27 AM, pi song wrote: (1) Lack of good documentation, which makes it hard and time-consuming to learn javacc and make changes to Pig grammar == ANTLR is very very well documented. http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference http://media.pragprog.com/titles/tpantlr/toc.pdf http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home (2) No easy way to customize error handling and error messages == ANTLR has very extensive error handling support http://media.pragprog.com/titles/tpantlr/errors.pdf (3) Single pass that performs both tokenizing and parsing == What is the advantage of decoupling tokenizing and parsing? In addition, Composite Grammar is very useful for keeping the parser modular. Things that can be treated as sub-languages, such as bag schema definition, can be done and unit tested separately. ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar development very efficient. Think about an IDE that helps you debug your code (which is grammar). One question: is there any use case for branching and loops? The current Pig is more like a query (declarative) language. I don't really see how loop constructs would fit.
I think what Ted mentioned is more about embedding Pig in other languages and using those languages to do loops. We should think about how the logical plan layer can be made simpler for external use so we don't have to introduce a new layer. Is there any major active development on it? Currently I have more spare time and should be able to help out. (BTW, I'm slow because this is just my hobby. I don't want to drag you guys.) Pi Song On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia niteshbhatia...@gmail.com wrote: Hi I got this info from the javacc mailing lists. This may prove helpful: -Original Message- From: Ken Beesley [mailto:ken@xrce.xerox.com] Sent: Wednesday, August 18, 2004 2:56 PM To: javacc Subject: [JavaCC]
Re: Proposal to create a branch for contrib project Zebra
I think we are creating unnecessary bureaucratic hurdles here by preventing a contrib project from having a branch. I don't see why zebra has to use the pig release branch, as the new pig release does not include it. The decisions are supposed to help keep things open, but this seems to be forcing Raghu to keep things in a private git. -Thejas On 8/18/09 10:56 AM, Raghu Angadi rang...@yahoo-inc.com wrote: Right. I just noticed the mails on Pig 0.4.0. I joined the pig-dev list just yesterday. Waiting for 0.4.0 might be good enough if it is just a couple of weeks. I will keep a watch on it. I think we will wait for a few days and attach any new feature patches to jiras. Those patches can certainly wait there. For interdependencies of the patches, we might maintain a private git. Raghu. Santhosh Srinivasan wrote: I would recommend that zebra wait for Pig 0.4.0 (a couple of weeks?). A branch will be created for the 0.4.0 release and zebra will automatically benefit. Santhosh
Re: [Pig Wiki] Update of ProposedProjects by AlanGates
This paper seems very relevant to the proposal - Compiled Query Execution Engine using JVM http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40 From the abstract - Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engine runs on average about twice as fast as the interpreted one, and significantly faster than an in-memory (I don't have access to the full paper though). -Thejas On 4/16/09 9:26 AM, Alan Gates ga...@yahoo-inc.com wrote: Your understanding of the proposal is correct. The goal would be to produce Java code rather than a pipeline configuration. But the reasoning is not so that users can then take that and modify themselves. There's nothing preventing them from doing it, but it has a couple of major drawbacks. 1) Code generators generally generate horrific looking code, because they are going for speed and compactness not human maintainability. Trying to work in that code would be very difficult. 2) If you start adding code to generated code, you can no longer use the original Pig Latin. You are from that point forward stuck in Java, since you can't backport your Java into the Pig Latin. The proposal is designed to test the performance of Pig based on generated Java (or for that matter any other language, it need not be Java). For the idea you suggest, the NATIVE keyword (proposed here https://issues.apache.org/jira/browse/PIG-506) is a better solution. Alan. On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote: Hi Can you briefly explain what is required in the first project? After reading the description my impression is, currently when we are executing commands on Pig Shell, Pig is first converting to map-reduce jobs and then feeding it to hadoop. In this project are we proposing that, the execution plan made by Pig will be first converted to a java file for map-reduce procedure and then feed onto hadoop network ? 
If this is the case then I am sure it will be a great help to users, as this functionality can be used to write complicated map-reduce jobs very easily. Initially a user can write the Pig scripts / commands required for his job and get the map-reduce java files. Then he can edit the map-reduce files to extend the functionality and add extra procedures that are not provided by Pig but can be executed over hadoop. --nitesh On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/ProposedProjects New page: = Proposed Pig Projects = This page describes projects that we (the committers) would like to see added to Pig. The scale of these projects varies, but they are larger projects, usually on the weeks or months scale. We have not yet filed [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these because they are still in the vague idea stage. As they become more concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed for them. We welcome contributors to take on one of these projects. If you would like to do so, please file a JIRA (if one does not already exist for the project) with a proposed solution. Pig's committers will work with you from there to help refine your solution. Once a solution is agreed upon, you can begin implementation. If you see a project here that you would like to see Pig implement but you are not in a position to implement the solution right now, feel free to vote for the project. Add your name to the list of supporters. This will help contributors looking for a project to select one that will benefit many users. If you would like to propose a project for Pig, feel free to add to this list.
If it is a smaller project, or something you plan to begin work on immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is a better route. || Category || Project || JIRA || Proposed By || Votes For || || Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs. We need to investigate instead having Pig generate java code specific to a job, and then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates || || Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH. All operators should be allowed in FOREACH. (Limit is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || || || Optimization || Speed up comparison of tuples during shuffle for ORDER BY || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || || || Optimization || Order by should be changed to not use POPackage to put all of the tuples in a
Re: scope string in OperatorKey
The id in OperatorKey helps distinguish between multiple operators of the same type. What I am proposing is just changing the toString() in OperatorKey to make the explain output more readable (we can change it back later or look at other options, if any future requirements make printing of the scope necessary). I.e., public String toString() { return scope + "-" + id; } changes to public String toString() { return Integer.toString(id); } Thanks, Thejas On 3/11/09 10:55 AM, Alan Gates ga...@yahoo-inc.com wrote: The purpose of the scope string is to allow us to have multiple sessions of pig running and distinguish the operators. It's one of those things that was put in before an actual requirement, so whether it will prove useful or not remains to be seen. As for removing it from explain, is it still reasonably easy to distinguish operators without it? IIRC the OperatorKey includes an operator number. When looking at the explain plans this is useful for cases where there is more than one of a given type of operator and you want to be able to distinguish between them. Alan. On Mar 6, 2009, at 3:14 PM, Thejas Nair wrote: What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it ok to stop printing the scope part in explain output? It does not seem to add value and makes the output more verbose. Thanks, Thejas
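The effect of the proposed toString() change can be illustrated with a toy version of the key (names here are illustrative; the real class is org.apache.pig.impl.plan.OperatorKey):

```java
// Toy operator key showing the proposed toString() change: dropping the
// scope prefix makes operator labels in explain output shorter, while
// the id alone still distinguishes operators of the same type.
public class KeySketch {
    final String scope;
    final int id;

    KeySketch(String scope, int id) {
        this.scope = scope;
        this.id = id;
    }

    String withScope()    { return scope + "-" + id; }   // current output
    String withoutScope() { return Integer.toString(id); } // proposed output

    public static void main(String[] args) {
        KeySketch k = new KeySketch("scope", 17);
        System.out.println(k.withScope());    // scope-17
        System.out.println(k.withoutScope()); // 17
    }
}
```

Every operator label in an explain plan carries this prefix, so dropping it shortens the output of large plans considerably without losing the distinguishing id.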
scope string in OperatorKey
What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it ok to stop printing the scope part in explain output? It does not seem to add value and makes the output more verbose. Thanks, Thejas