Sure. I went back to our emails and to the CVS logs.

From that, I find that Terence really discovered only one ambiguity in the grammar (Terence, please check this as well). The production for DeclarationOrStatement reads:

transient GNode DeclarationOrStatement = /* Choice */
  <Declaration> Declaration
  / <Statement> Statement
  ;

But a declaration may be a block (think block of code directly nested inside a class declaration) and Statement may also be a block (think nested block of code). As a result, either alternative matches exactly the same syntax and, depending on the order of the two alternatives, you either get a BlockDeclaration AST node or a Block AST node (with the production as shown it's the BlockDeclaration).

Terence also found a few bugs: The grammar didn't support assertions, empty declarations, and types as primary expressions (to enable .class expressions).


Going through CVS logs for the C, Java, and Rats! grammars, I found the following other problems:

Wide string literals in C are written as L"text" and the C grammar originally tried to recognize identifiers before literals, with the result that L"text" would be parsed as identifier "L" followed by regular string literal "text". I discovered that very quickly once I started regression testing xtc's C type checker, i.e. actually used wide string literals.

An integer literal can be a prefix of a floating point literal in C & Java and the Java grammar originally tried to recognize integer literals before floating point literals. Again, I discovered that very quickly once I tested code containing floating point operations. I also avoided that bug when writing the C grammar a while later.

A more subtle and complicated problem occurred in the Rats! grammar: Originally, Rats! didn't have a module system and didn't support the addition/removal of alternatives in top-level choices. When I added that feature, the parser suddenly didn't work anymore. The problem was an unfortunate interaction between productions recognizing the empty input and greediness. In more detail, sequences may be empty and an ordered choice is a composed of one or more sequences separated by a slash "/":

OrderedChoice Choice =
   s:Sequence ss:( void:"/":Symbol Sequence )*
      { yyValue = new OrderedChoice(new Pair(s, ss).list()); }
   ;

Sequence Sequence =
    n:SequenceName? l:Voided*
      { yyValue = new Sequence(n, l.list()); }
   ;

(I'm giving you the old definition).

Adding new alternatives before another, named alternative is expressed as:

    choice:Choice "/":Symbol s:SequenceName "...":Symbol

E.g. to inject Java's unsigned right-shift into the production for symbols common to C and Java, I write:

String SymbolCharacters +=
    <TripleGreaterEqual>  ">>>="
  / <GreaterGreaterEqual> ...
  ;

But since sequences may be empty and choices are greedy the "/ <GreaterGreaterEqual>" is consumed by the choice in the above expression, which makes the expression fail on the ellipsis "...". The solution is to add a syntactic predicate to the definition of sequence:

Sequence Sequence =
   !Ellipsis n:SequenceName? l:Voided*
      { yyValue = new Sequence(n, l.list()); }
   ;

I have had at least one user report, where the interaction between empty and greediness caused infinite recursions, b/c the repeated expression recognized the empty input.


In summary: Yes, there can be subtle ordering problems, but they usually show quickly during testing. And for the really hard ones, you have to hope for Terence's sharp eyes and mind. :) Also, a more subtle class of problems can be caused by interactions between (legal) empty inputs and greediness. At the same time, once discovered, the solutions are simple: reorder for the former case and add a predicate in the latter case.

To put this in perspective: I have translated my Java grammar to Scott McPeak's Elkhound. I got 18 shift-reduce and 3 reduce-reduce conflicts. I then had to test across a range of source files to find out that only 2 of these conflicts were real conflicts requiring explicit disambiguation. Additionally, I had an extremely frustrating day trying to translate the token-level productions into flex rules, with the end result being considerably less precise than Rats!.

Don't get me wrong: I like Elkhound. It is a good tool with very nice insights into GLR parsing behind it. But it too presents some subtle debugging issues...

Robert


On Oct 18, 2006, at 4:22 PM, Sylvain Schmitz wrote:

Hi all,

Robert Grimm wrote:
I agree with Terence in that ambiguity vs. ordering is a trade- off. With CFGs you may get unnecessary ambiguity and with PEGs you may get subtle ordering errors. As Terence points out, I had a (very) few in my Java grammar.

I'd love to know what the precise errors were. Would you mind sharing them with the list? Did you ever find a really bad problem where no ordering would work and you had to rewrite a larger part of the grammar?

The key difference, however, is that CFGs are only closed under composition if you use GLR parsing, while PEGs are always closed under composition. As a result, providing modularity for PEGs is simpler and faster than for CFGs. That closes the deal for me...

--
Cheers,

  Sylvain


_______________________________________________
PEG mailing list
[EMAIL PROTECTED]
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to