Re: switching to different parser in Pig
Jflex is covered by GPL, but code generated by it is not. Only the code that is generated by Jflex goes into pig.jar. We can't check Jflex.jar into svn; ivy will be set up to download it from the maven repository.
-Thejas

On 8/25/09 11:57 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Santosh, Am I missing something about Jflex licensing? I thought that it being GPL, we can't package it with apache-licensed software, which prevents it from being a viable option (regardless of technical merits)
-Dmitriy

On Tue, Aug 25, 2009 at 1:58 PM, Santhosh Srinivasan s...@yahoo-inc.com wrote:
It's been 6 months since this topic was discussed but we don't have closure on it. For SQL on top of Pig, we are using Jflex and CUP (https://issues.apache.org/jira/browse/PIG-824). If we have decided on the right parser, can we have a plan to move the other parsers in Pig to the same technology?
Thanks, Santhosh
PS: I am assuming we are not moving to Antlr.

-Original Message-
From: Alan Gates [mailto:ga...@yahoo-inc.com]
Sent: Tuesday, February 24, 2009 10:17 AM
To: pig-dev@hadoop.apache.org; pi.so...@gmail.com
Subject: Re: switching to different parser in Pig

Sorry, after I sent that email yesterday I realized I was not very clear. I did not mean to imply that antlr didn't have good documentation or good error handling. What I wanted to say was we want all three of those things, and it didn't appear that antlr provided all three, since it doesn't separate out scanner and parser.

Also, from my viewpoint, I prefer bottom up LALR(1) parsers like yacc to top down parsers like javacc. My understanding is that antlr is top down like javacc. My reasoning for this preference is that parser books and classes have used those for decades, so there are a large number of engineers out there (including me :) ) who know how to work with them. But maybe antlr is close enough to what we need. I'll take a deeper look at it before I vote officially on which way we should go.

As for loops and branches, I'm not saying we need those in Pig Latin. We need them somehow. Whether it's better to put them in Pig Latin or embed pig in an existing script language is an ongoing debate. I don't want to make a decision now that effectively ends that debate without buy-in from those who feel strongly that Pig Latin should include those constructs. I agree with you that we should modify the logical plan to support this rather than add another layer.

As for active development, the only thing I'm aware of is we hope to start working on a more robust optimizer for pig soon, and that will require some additional functionality out of the logical operators, but it shouldn't cause any fundamental architectural changes.
Alan.
RE: switching to different parser in Pig
To answer Santhosh's question: I think the plan is to move to Jflex and CUP, but when that happens is a matter of priorities and resources, which are not clear at this point. We do welcome contributions ;).
Olga

-Original Message-
From: Thejas Nair [mailto:te...@yahoo-inc.com]
Sent: Tuesday, August 25, 2009 12:52 PM
To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy
Cc: pi.so...@gmail.com
Subject: Re: switching to different parser in Pig

Jflex is covered by GPL, but code generated by it is not. Only the code that is generated by Jflex goes into pig.jar. We can't check Jflex.jar into svn; ivy will be set up to download it from the maven repository.
-Thejas
Re: switching to different parser in Pig
(1) Lack of good documentation which makes it hard and time-consuming to learn javacc and make changes to Pig grammar
== ANTLR is very, very well documented.
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
http://media.pragprog.com/titles/tpantlr/toc.pdf
http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home

(2) No easy way to customize error handling and error messages
== ANTLR has very extensive error handling support
http://media.pragprog.com/titles/tpantlr/errors.pdf

(3) Single path that performs both tokenizing and parsing
== What is the advantage of decoupling tokenizer and parsing?

In addition, Composite Grammar is very useful for keeping the parser modular. Things that can be treated as sub-languages, such as bag schema definition, can be done and unit tested separately.

ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar development very efficient. Think of an IDE that helps you debug your code (which is the grammar).

One question: is there any use case for branching and loops? The current Pig is more like a query (declarative) language. I don't really see how loop constructs would fit. I think what Ted mentioned is more embedding Pig in other languages and using those languages to do loops.

We should think about how the logical plan layer can be made simpler for external use so we don't have to introduce a new layer. Is there any major active development on it? Currently I have more spare time and should be able to help out. (BTW, I'm slow because this is just my hobby. I don't want to drag you guys)
Pi Song

On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia niteshbhatia...@gmail.com wrote:
Hi
I got this info from javacc mailing lists. This may prove helpful:

-Original Message-
From: Ken Beesley [mailto:ken@xrce.xerox.com]
Sent: Wednesday, August 18, 2004 2:56 PM
To: javacc
Subject: [JavaCC] Alternatives to JavaCC (was Hello All)

Vicas wrote:
Hello All
Kindly let me know other parsers available which does the same job as javacc. It would be very nice of you if you can send me some documentation related to this.
Thanks
Vikas

(Correction and clarifications to the following would be _very_ welcome. I'm very likely out of date.)

Of course, no two software tools are likely to do _exactly_ the same job. Someone already pointed you to ANTLR, which is probably the best-known alternative to JavaCC. Another possibility is SableCC.
http://sablecc.org

The criteria include stability, documentation, language of the parser generated, and abstract-syntax-tree building. When I last looked (a couple of years ago) at ANTLR, SableCC and JavaCC, I chose JavaCC for the following reasons:

1. ANTLR could not handle Unicode input. Things change, of course, so ANTLR might now be more Unicode-friendly. Unicode was important to me, so this was a big factor in my decision.

On the plus side for ANTLR, it has better abstract-syntax-tree building capabilities (in my opinion) than JJTree/JavaCC. You can learn to use JJTree commands, but it's not easy for most people. And ANTLR can generate either a Java or a C++ parser. JavaCC generates only Java parsers.

Another concern about ANTLR was that it was reputed to change a lot as the guru, Terence Parr, experimented with new syntax and functionality. JavaCC, at least at the time, was reputed to be more stable, perhaps stable to a fault. I wanted stability and reliability.

2. SableCC is much like JavaCC; it generates a Java parser from a grammar description; but it had, in my opinion, less flexible abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I looked at it), the AST it built was always a direct reflection of your grammar, generating one tree node for each grammar expansion involved in a parse, much like using JavaCC with Java Tree Builder (JTB http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the alternative to using JJTree. Using SableCC, or the combination JavaCC/JTB, should be _very_ similar indeed.

In my opinion, SableCC and JavaCC/JTB have made a conscious choice to simplify AST building--you get trees that reflect the expansions in your grammar. Period. But often these default trees will be big, full of extraneous nodes that reflect precedence hierarchies in the recursive-descent parsing. If you want to have more control over AST building, to get more compact and tailored ASTs, you need to pay the price of learning JJTree.

Assuming that you need to build ASTs, with JavaCC you have the choice between JJTree and JTB. With SableCC, when I last looked at it, you only get the JTB-like option.

***

(Again, corrections and expansions would be much appreciated.)
Ken Beesley
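Ken Beesley's point about default trees being "full of extraneous nodes that reflect precedence hierarchies" can be made concrete with a small sketch. The class and method names below (AstCollapseSketch, Node, collapse, sampleParseTree) are invented for illustration and belong to none of the tools discussed. The sketch builds, by hand, the kind of one-node-per-expansion tree he describes, then removes chains of single-child nodes in a generic post-pass. It is not JJTree syntax and makes no claim about how JJTree, JTB, or SableCC work internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: all names here are hypothetical, not part of any parser generator. */
class AstCollapseSketch {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Returns a copy of the tree in which every chain of single-child nodes is replaced by its last node. */
    static Node collapse(Node n) {
        while (n.children.size() == 1) {
            n = n.children.get(0);
        }
        Node out = new Node(n.label);
        for (Node c : n.children) {
            out.children.add(collapse(c));
        }
        return out;
    }

    /** Counts the nodes in a tree. */
    static int size(Node n) {
        int total = 1;
        for (Node c : n.children) {
            total += size(c);
        }
        return total;
    }

    /** Renders a tree as an s-expression, e.g. (Sum 1 2). */
    static String show(Node n) {
        if (n.children.isEmpty()) {
            return n.label;
        }
        StringBuilder sb = new StringBuilder("(" + n.label);
        for (Node c : n.children) {
            sb.append(' ').append(show(c));
        }
        return sb.append(')').toString();
    }

    /** Hand-built parse tree for "1 + 2" under an assumed grammar Expr -> Sum -> Product -> Unary -> number. */
    static Node sampleParseTree() {
        return new Node("Expr",
                new Node("Sum",
                        new Node("Product", new Node("Unary", new Node("1"))),
                        new Node("Product", new Node("Unary", new Node("2")))));
    }
}
```

Here the sample tree has 8 nodes, and show(collapse(sampleParseTree())) gives (Sum 1 2), a tree of 3 nodes. A blanket collapse like this would also discard single-child nodes that carry meaning (a unary minus, for example), which is one reason finer control over tree building is worth having.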
Re: switching to different parser in Pig
Sorry, after I sent that email yesterday I realized I was not very clear. I did not mean to imply that antlr didn't have good documentation or good error handling. What I wanted to say was we want all three of those things, and it didn't appear that antlr provided all three, since it doesn't separate out scanner and parser.

Also, from my viewpoint, I prefer bottom up LALR(1) parsers like yacc to top down parsers like javacc. My understanding is that antlr is top down like javacc. My reasoning for this preference is that parser books and classes have used those for decades, so there are a large number of engineers out there (including me :) ) who know how to work with them. But maybe antlr is close enough to what we need. I'll take a deeper look at it before I vote officially on which way we should go.

As for loops and branches, I'm not saying we need those in Pig Latin. We need them somehow. Whether it's better to put them in Pig Latin or embed pig in an existing script language is an ongoing debate. I don't want to make a decision now that effectively ends that debate without buy-in from those who feel strongly that Pig Latin should include those constructs. I agree with you that we should modify the logical plan to support this rather than add another layer.

As for active development, the only thing I'm aware of is we hope to start working on a more robust optimizer for pig soon, and that will require some additional functionality out of the logical operators, but it shouldn't cause any fundamental architectural changes.
Alan.
Re: switching to different parser in Pig
Yes. And one thing I should have mentioned was Chris W's thoughts along the lines that it would be very nice to expose the logical plan to something like Cascading so that a global restructuring could be done across more than just Pig programs. It works the other way as well, with it becoming possible for Pig to execute programs expressed (conceivably) in Cascading form.

On Tue, Feb 24, 2009 at 1:27 AM, pi song pi.so...@gmail.com wrote:
I think what Ted mentioned is more embedding Pig in other languages and use those languages to do loops.

--
Ted Dunning, CTO DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
Re: switching to different parser in Pig
Sounds good, but how about exposing the logical plan layer instead? Wouldn't that yield the same effect? From python, for example, you can still construct a logical plan and give it to Pig to execute.

On Wed, Feb 18, 2009 at 10:07 AM, Ted Dunning ted.dunn...@gmail.com wrote:

2009/2/17 Alan Gates ga...@yahoo-inc.com
[not commenting on the switch, only on the exposure of AST's] Is that correct?

Nearly so.

So whether we switch parsing technologies or not is not of interest to you, only the interfaces we expose?

I would think that switching parsing technologies would encourage creation of a better AST interface layer, which furthers my goal of getting to the AST's for other purposes. I also think that exposing the AST layer would further your goal of switching parser technology by allowing outsiders to contribute parsers that you might ultimately like better.

So I do see a linkage and do support switching.

+1 to switching parsers (and thus making switching easier)
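The idea in this exchange, building a plan from a host language without going through the parser, can be sketched roughly as follows. This is only a toy under loud assumptions: PlanSketch, Op, load, filter, group and render are invented names and are not Pig's logical plan classes or any real Pig API. The sketch builds a small chain of operators through plain method calls and then renders it as Pig Latin style statements, just to show that the plan, not the text, can be the starting point.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: a hypothetical plan structure, NOT Pig's actual logical plan API. */
class PlanSketch {

    /** A hypothetical operator: a kind, an argument string, and at most one input operator. */
    static final class Op {
        final String kind;
        final String arg;
        final Op input;

        Op(String kind, String arg, Op input) {
            this.kind = kind;
            this.arg = arg;
            this.input = input;
        }
    }

    static Op load(String path, String schema) {
        return new Op("LOAD", "'" + path + "' AS (" + schema + ")", null);
    }

    static Op filter(Op input, String condition) {
        return new Op("FILTER", condition, input);
    }

    static Op group(Op input, String key) {
        return new Op("GROUP", key, input);
    }

    /** Walks the chain from the leaf to the root and emits one statement per operator. */
    static List<String> render(Op root) {
        List<String> out = new ArrayList<>();
        emit(root, out);
        return out;
    }

    /** Emits the statements for op and its inputs; returns the alias assigned to op. */
    private static String emit(Op op, List<String> out) {
        String inputAlias = (op.input == null) ? null : emit(op.input, out);
        String alias = "r" + out.size();
        if (inputAlias == null) {
            out.add(alias + " = " + op.kind + " " + op.arg + ";");
        } else {
            out.add(alias + " = " + op.kind + " " + inputAlias + " BY " + op.arg + ";");
        }
        return alias;
    }
}
```

For example, render(group(filter(load("clicks", "user, n"), "n > 10"), "user")) yields three statements, a LOAD, a FILTER and a GROUP, with generated aliases r0 to r2. A host language with loops and branches could build such chains conditionally or repeatedly, which is the use case raised in the thread.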
Re: switching to different parser in Pig
Probably nearly the same effect as you suggest. Are the concepts at the logical plan layer similar to those expressed in pig latin? Or has a significant transformation occurred by then?

On Fri, Feb 20, 2009 at 1:59 AM, pi song pi.so...@gmail.com wrote:
Sounds good but how about exposing the logical plan layer instead? Wouldn't that yield the same effect? From python for example you still can construct a logical plan and give to Pig to execute.

--
Ted Dunning, CTO DeepDyve
switching to different parser in Pig
Pig Developers,

Pig currently uses javacc for parsing pig commands. We have found several shortcomings with using javacc. In particular,

(1) Lack of good documentation which makes it hard and time-consuming to learn javacc and make changes to Pig grammar
(2) No easy way to customize error handling and error messages
(3) Single path that performs both tokenizing and parsing

We are considering using JFlex and Cup, which are Java versions of Lex and Bison, instead. The main advantage of this transition is proven, well known and well understood technology and input format. In addition, it addresses the issues stated above.

One problem with the transition is that JFlex and Cup have GPL license that is not compatible with Apache license. The workaround could be that we don't commit the tools into SVN and instead developers who need to update grammar would install them on their own. Note that we can commit the input grammar as well as the output of the grammar into SVN, which means that for developers just compiling code or making non-parser changes, there will be no impact.

Please comment on whether you think this is a reasonable change.

Thanks,
Olga
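Point (3) above concerns keeping tokenizing and parsing as two separate stages, the way a Lex-style scanner feeds a Yacc-style parser. The sketch below shows that separation in miniature for a toy arithmetic language. It is hand-written and every name in it (TwoPhaseSketch, Token, scan, Parser, evaluate) is hypothetical; it is not JFlex or CUP output, and the parser stage is a simple recursive-descent one rather than an LALR(1) parser. The only thing it is meant to show is that the scanner sees characters, the parser sees tokens, and each stage can report errors in its own terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: hand-written stages, not generated by JFlex or CUP. */
class TwoPhaseSketch {

    enum Kind { NUMBER, PLUS, STAR, EOF }

    static final class Token {
        final Kind kind;
        final String text;
        final int pos;

        Token(Kind kind, String text, int pos) {
            this.kind = kind;
            this.text = text;
            this.pos = pos;
        }
    }

    /** Stage 1: the scanner deals with characters only and reports lexical errors itself. */
    static List<Token> scan(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) {
                    i++;
                }
                tokens.add(new Token(Kind.NUMBER, input.substring(start, i), start));
            } else if (c == '+') {
                tokens.add(new Token(Kind.PLUS, "+", i++));
            } else if (c == '*') {
                tokens.add(new Token(Kind.STAR, "*", i++));
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at position " + i);
            }
        }
        tokens.add(new Token(Kind.EOF, "", input.length()));
        return tokens;
    }

    /** Stage 2: the parser deals with tokens only, so its errors are phrased in terms of tokens. */
    static final class Parser {
        private final List<Token> tokens;
        private int next = 0;

        Parser(List<Token> tokens) {
            this.tokens = tokens;
        }

        /** sum := product ('+' product)* */
        int parseSum() {
            int value = parseProduct();
            while (tokens.get(next).kind == Kind.PLUS) {
                next++;
                value += parseProduct();
            }
            return value;
        }

        /** product := NUMBER ('*' NUMBER)* */
        int parseProduct() {
            int value = parseNumber();
            while (tokens.get(next).kind == Kind.STAR) {
                next++;
                value *= parseNumber();
            }
            return value;
        }

        int parseNumber() {
            Token t = tokens.get(next);
            if (t.kind != Kind.NUMBER) {
                throw new IllegalArgumentException(
                        "Expected a number at position " + t.pos + " but found " + t.kind);
            }
            next++;
            return Integer.parseInt(t.text);
        }

        void expectEnd() {
            Token t = tokens.get(next);
            if (t.kind != Kind.EOF) {
                throw new IllegalArgumentException("Unexpected " + t.kind + " at position " + t.pos);
            }
        }
    }

    /** Runs both stages in sequence. */
    static int evaluate(String input) {
        Parser parser = new Parser(scan(input));
        int value = parser.parseSum();
        parser.expectEnd();
        return value;
    }
}
```

With this sketch, evaluate("1 + 2 * 3") returns 7. Because the two stages meet only at the token list, the scanner can be unit tested on its own, and an error message such as "Expected a number at position 4 but found PLUS" can be customized in the parser without touching character handling.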
Re: switching to different parser in Pig
This sounds like a great idea! It would be great if other means of generating ASTs for pig were possible.
Regards,
Mridul

Ted Dunning wrote:
In general, it would be really, really nice if it were easy to build abstract Pig syntax trees outside of the normal parser. For instance, I find the fact that pig is not a full scale scripting language incredibly confining. I would love to be able to build a DSL in groovy that let me use groovy for scripting, but still execute pig jobs easily. If I could build Pig syntax trees easily, then I would be, as they say, in pig heaven.

That would also let the switch to a different parsing technology happen gradually rather than all at once. Two different grunt interpreters could coexist for a short time while the new one is proved out.

On Thu, Feb 12, 2009 at 3:58 PM, Olga Natkovich ol...@yahoo-inc.com wrote:
Pig Developers,

Pig currently uses javacc for parsing pig commands. We have found several shortcomings with using javacc. In particular,

(1) Lack of good documentation which makes it hard and time-consuming to learn javacc and make changes to Pig grammar
(2) No easy way to customize error handling and error messages
(3) Single path that performs both tokenizing and parsing

We are considering using JFlex and Cup, which are Java versions of Lex and Bison, instead. The main advantage of this transition is proven, well known and well understood technology and input format. In addition, it addresses the issues stated above.

One problem with the transition is that JFlex and Cup have GPL license that is not compatible with Apache license. The workaround could be that we don't commit the tools into SVN and instead developers who need to update grammar would install them on their own. Note that we can commit the input grammar as well as the output of the grammar into SVN, which means that for developers just compiling code or making non-parser changes, there will be no impact.

Please comment on whether you think this is a reasonable change.

Thanks,
Olga