Sorry, after I sent that email yesterday I realized I was not very clear. I did not mean to imply that antlr doesn't have good documentation or good error handling. What I wanted to say is that we want all three of those things, and it didn't appear that antlr provided all three, since it doesn't separate out the scanner and the parser. Also, from my viewpoint, I prefer bottom-up LALR(1) parsers like yacc to top-down parsers like javacc; my understanding is that antlr is top down, like javacc. My reasoning for this preference is that parser books and classes have used LALR(1) parsers for decades, so there are a large number of engineers out there (including me :) ) who know how to work with them. But maybe antlr is close enough to what we need. I'll take a deeper look at it before I vote officially on which way we should go.
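
To make the contrast concrete, here's a rough sketch with made-up rule names. An LALR(1) tool like yacc takes left-recursive rules directly, which is the natural way to express left-associative operators, while a top-down (LL) generator such as javacc or ANTLR 3 can't consume direct left recursion and needs the iterative form:

    /* yacc (LALR(1)): direct left recursion is fine */
    expr : expr '+' term
         | term
         ;

    /* top-down (LL) equivalent, in ANTLR-style EBNF: the same
       language, rewritten with repetition instead of left recursion */
    expr : term ( '+' term )* ;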

As for loops and branches, I'm not saying we need those in Pig Latin, but we do need them somehow. Whether it's better to put them in Pig Latin or to embed Pig in an existing scripting language is an ongoing debate. I don't want to make a decision now that effectively ends that debate without buy-in from those who feel strongly that Pig Latin should include those constructs.

I agree with you that we should modify the logical plan to support this rather than add another layer. As for active development, the only thing I'm aware of is that we hope to start working on a more robust optimizer for Pig soon. That will require some additional functionality from the logical operators, but it shouldn't cause any fundamental architectural changes.

Alan.


On Feb 24, 2009, at 1:27 AM, pi song wrote:

(1) Lack of good documentation, which makes it hard and time-consuming to learn javacc and make changes to the Pig grammar
<== ANTLR is very, very well documented.
http://www.pragprog.com/titles/tpantlr/the-definitive-antlr-reference
http://media.pragprog.com/titles/tpantlr/toc.pdf
http://www.antlr.org/wiki/display/ANTLR3/ANTLR+3+Wiki+Home

(2) No easy way to customize error handling and error messages
<== ANTLR has very extensive error handling support; see the sketch after this list.
http://media.pragprog.com/titles/tpantlr/errors.pdf

(3) Single path that performs both tokenizing and parsing
<== What is the advantage of decoupling the tokenizer and the parser?
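
On the error handling point, here's a minimal sketch (assuming ANTLR 3's BaseRecognizer hooks; the grammar name and messages are made up) of overriding the generated parser's error reporting right in the grammar:

    grammar PigLatin;   // hypothetical grammar name

    @parser::members {
      // Send syntax errors through our own formatting instead of
      // ANTLR's default stderr message.
      public void displayRecognitionError(String[] tokenNames,
                                           RecognitionException e) {
        String msg = getErrorMessage(e, tokenNames);
        emitErrorMessage("Parse error at line " + e.line + ", column "
                         + e.charPositionInLine + ": " + msg);
      }
    }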

In addition, "Composite Grammar" is very useful for keeping the parser
modular. Things that can be treated as sub-languages such as bag schema
definition can be done and unit tested separately.
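
A rough, untested sketch of how that could look for us (file, rule, and token names are all hypothetical):

    // File: BagSchema.g -- sub-grammar for bag schema definitions,
    // which can be developed and unit tested on its own
    parser grammar BagSchema;

    bagSchema : LBRACE fieldList RBRACE ;
    fieldList : field ( COMMA field )* ;
    field     : IDENTIFIER COLON type ;
    type      : INT | LONG | CHARARRAY | bagSchema ;

    // File: PigLatin.g -- top-level grammar that imports the pieces
    grammar PigLatin;
    import BagSchema;

    loadStmt : IDENTIFIER EQUALS LOAD QUOTEDSTRING ( AS bagSchema )? SEMI ;
    // ... lexer rules defining IDENTIFIER, LBRACE, COLON, etc.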

ANTLRWorks (http://www.antlr.org/works/index.html) also makes grammar development
very efficient. Think of an IDE that helps you debug your code (which, in this case,
is your grammar).

One question: is there any use case for branching and loops? The current Pig is more like a declarative query language, and I don't really see how loop constructs would fit. I think what Ted mentioned is more about embedding Pig in other languages and using those languages to do loops.

We should think about how the logical plan layer can be made simpler for external use, so we don't have to introduce a new layer. Is there any major active development on it? Currently I have more spare time and should be able to help out. (BTW, I'm slow because this is just my hobby; I don't want to drag you guys down.)

Pi Song

On Tue, Feb 24, 2009 at 6:23 AM, nitesh bhatia <niteshbhatia...@gmail.com> wrote:

Hi
I got this info from javacc mailing lists. This may prove helpful:


----------------------------------------------------------------------
-----Original Message-----
From: Ken Beesley [mailto:ken....@xrce.xerox.com]
Sent: Wednesday, August 18, 2004 2:56 PM
To: javacc
Subject: [JavaCC] Alternatives to JavaCC (was Hello All)

Vicas wrote:

Hello All

Kindly let me know other parsers available which do the same job as
javacc.

It would be very nice of you if you can send me some documentation
related to this.

Thanks Vikas

(Corrections and clarifications to the following would be _very_
welcome. I'm very likely out of date.)

Of course, no two software tools are likely to do _exactly_ the same
job. Someone already pointed you to ANTLR, which is probably the
best-known alternative to JavaCC. Another possibility is SableCC.
http://sablecc.org

The criteria include stability, documentation, language of the parser
generated, and abstract-syntax-tree building.

When I last looked (a couple of years ago) at ANTLR, SableCC and
JavaCC, I chose JavaCC for the following reasons:

1. ANTLR could not handle Unicode input. Things change, of course, so
ANTLR might now be more Unicode-friendly. Unicode was important to me,
so this was a big factor in my decision.

On the plus side for ANTLR, it has better abstract-syntax-tree
building capabilities (in my opinion) than JJTree/JavaCC. You can
learn to use JJTree commands, but it's not easy for most people.

And ANTLR can generate either a Java or a C++ parser. JavaCC generates
only Java parsers.

Another concern about ANTLR was that it was reputed to change a lot as
the guru, Terence Parr, experimented with new syntax and
functionality. JavaCC, at least at the time, was reputed to be more
stable, perhaps stable to a fault. I wanted stability and reliability.

2. SableCC is much like JavaCC; it generates a Java parser from a
grammar description; but it had, in my opinion, less flexible
abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I
looked at it), the AST it built was always a direct reflection of your
grammar, generating one tree node for each grammar expansion involved
in a parse, much like using JavaCC with Java Tree Builder (JTB
http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the
alternative to using JJTree.

Using SableCC, or the combination JavaCC/JTB, should be _very_ similar
indeed.

In my opinion, SableCC and JavaCC/JTB have made a conscious choice to
simplify AST building--you get trees that reflect the expansions in
your grammar. Period. But often these default trees will be big, full
of extraneous nodes that reflect precedence hierarchies in the
recursive-descent parsing. If you want to have more control over AST
building, to get more compact and tailored ASTs, you need to pay the
price of learning JJTree.

Assuming that you need to build ASTs, with JavaCC you have the choice
between JJTree and JTB. With SableCC, when I last looked at it, you
only get the JTB-like option.
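
For instance, a small JJTree fragment (hypothetical rule names) showing the node descriptors that buy you that control:

    /* #void suppresses the node for pass-through rules; #Add(2)
       builds a binary node only when a "+" is actually matched,
       which keeps the tree compact. */
    void Expression() #void : {}
    {
      AdditiveExpression()
    }

    void AdditiveExpression() #void : {}
    {
      MultiplicativeExpression() ( "+" MultiplicativeExpression() #Add(2) )*
    }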

*******

(Again, corrections and expansions would be much appreciated.)

Ken Beesley





----------------------------------------------------------------------





On Mon, Feb 23, 2009 at 10:06 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
We looked into antlr. It appears to be very similar to javacc, with the added feature
that the Java code it generates is human readable. That isn't why we want to switch
off of javacc, though. Olga listed the three things we want out of a parser that javacc
isn't giving us (lack of docs, no easy customization of error handling, no decoupling
of scanning and parsing). So antlr doesn't look viable.

In response to Pi's suggestion that we could use the logical plan, I hope we could use
something close to it. Whatever we choose, we want it to be flexible enough to represent
richer language constructs (like branch and loop). I'm not sure our current logical plan
can do that. At the same time, we don't need another layer of translation (we already
have logical -> physical -> mapreduce). I would like to find a representation that could
handle expressing the syntax and what is currently the logical plan.

Alan.

On Feb 20, 2009, at 5:15 PM, pi song wrote:

Should be pretty close, but we may need to clean up the interface a bit. Then the new
parser module can be switched in easily.
BTW, have we already got a solution for the new parser generator?

Pi


On Fri, Feb 20, 2009 at 9:03 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:


Probably nearly the same effect as you suggest. Are the concepts at the logical plan
layer similar to those expressed in Pig Latin? Or has a significant transformation
occurred by then?


On Fri, Feb 20, 2009 at 1:59 AM, pi song <pi.so...@gmail.com> wrote:

Sounds good, but how about exposing the logical plan layer instead? Wouldn't that yield
the same effect? From Python, for example, you could still construct a logical plan and
give it to Pig to execute.




--
Ted Dunning, CTO
DeepDyve







--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun

