Sure. I went back to our emails and to the CVS logs.
From that, I find that Terence really discovered only one ambiguity
in the grammar (Terence, please check this as well). The production
for DeclarationOrStatement reads:
transient GNode DeclarationOrStatement = /* Choice */
<Declaration> Declaration
/ <Statement> Statement
;
But a declaration may be a block (think block of code directly nested
inside a class declaration) and Statement may also be a block (think
nested block of code). As a result, either alternative matches
exactly the same syntax and, depending on the order of the two
alternatives, you either get a BlockDeclaration AST node or a Block
AST node (with the production as shown it's the BlockDeclaration).
Terence also found a few bugs: The grammar didn't support assertions,
empty declarations, and types as primary expressions (to
enable .class expressions).
Going through CVS logs for the C, Java, and Rats! grammars, I found
the following other problems:
Wide string literals in C are written as L"text" and the C grammar
originally tried to recognize identifiers before literals, with the
result that L"text" would be parsed as identifier "L" followed by
regular string literal "text". I discovered that very quickly once I
started regression testing xtc's C type checker, i.e. actually used
wide string literals.
An integer literal can be a prefix of a floating point literal in C &
Java and the Java grammar originally tried to recognize integer
literals before floating point literals. Again, I discovered that
very quickly once I tested code containing floating point operations.
I also avoided that bug when writing the C grammar a while later.
A more subtle and complicated problem occurred in the Rats! grammar:
Originally, Rats! didn't have a module system and didn't support the
addition/removal of alternatives in top-level choices. When I added
that feature, the parser suddenly didn't work anymore. The problem
was an unfortunate interaction between productions recognizing the
empty input and greediness. In more detail, sequences may be empty
and an ordered choice is a composed of one or more sequences
separated by a slash "/":
OrderedChoice Choice =
s:Sequence ss:( void:"/":Symbol Sequence )*
{ yyValue = new OrderedChoice(new Pair(s, ss).list()); }
;
Sequence Sequence =
n:SequenceName? l:Voided*
{ yyValue = new Sequence(n, l.list()); }
;
(I'm giving you the old definition).
Adding new alternatives before another, named alternative is
expressed as:
choice:Choice "/":Symbol s:SequenceName "...":Symbol
E.g. to inject Java's unsigned right-shift into the production for
symbols common to C and Java, I write:
String SymbolCharacters +=
<TripleGreaterEqual> ">>>="
/ <GreaterGreaterEqual> ...
;
But since sequences may be empty and choices are greedy the "/
<GreaterGreaterEqual>" is consumed by the choice in the above
expression, which makes the expression fail on the ellipsis "...".
The solution is to add a syntactic predicate to the definition of
sequence:
Sequence Sequence =
!Ellipsis n:SequenceName? l:Voided*
{ yyValue = new Sequence(n, l.list()); }
;
I have had at least one user report, where the interaction between
empty and greediness caused infinite recursions, b/c the repeated
expression recognized the empty input.
In summary: Yes, there can be subtle ordering problems, but they
usually show quickly during testing. And for the really hard ones,
you have to hope for Terence's sharp eyes and mind. :) Also, a more
subtle class of problems can be caused by interactions between
(legal) empty inputs and greediness. At the same time, once
discovered, the solutions are simple: reorder for the former case and
add a predicate in the latter case.
To put this in perspective: I have translated my Java grammar to
Scott McPeak's Elkhound. I got 18 shift-reduce and 3 reduce-reduce
conflicts. I then had to test across a range of source files to find
out that only 2 of these conflicts were real conflicts requiring
explicit disambiguation. Additionally, I had an extremely frustrating
day trying to translate the token-level productions into flex rules,
with the end result being considerably less precise than Rats!.
Don't get me wrong: I like Elkhound. It is a good tool with very nice
insights into GLR parsing behind it. But it too presents some subtle
debugging issues...
Robert
On Oct 18, 2006, at 4:22 PM, Sylvain Schmitz wrote:
Hi all,
Robert Grimm wrote:
I agree with Terence in that ambiguity vs. ordering is a trade-
off. With CFGs you may get unnecessary ambiguity and with PEGs you
may get subtle ordering errors. As Terence points out, I had a
(very) few in my Java grammar.
I'd love to know what the precise errors were. Would you mind
sharing them with the list? Did you ever find a really bad problem
where no ordering would work and you had to rewrite a larger part
of the grammar?
The key difference, however, is that CFGs are only closed under
composition if you use GLR parsing, while PEGs are always closed
under
composition. As a result, providing modularity for PEGs is simpler
and faster than for CFGs. That closes the deal for me...
--
Cheers,
Sylvain
_______________________________________________
PEG mailing list
[EMAIL PROTECTED]
https://lists.csail.mit.edu/mailman/listinfo/peg