> hi. That makes sense. It’s basically identifying a no-viable-alternative
> situation if the char occurs at the start of a rule.
>
> I think one could get better error messages and recovery using tokens
> instead of scannerless parsing, but one gives up composability/modularity.
> For token errors, it’s easy to just skip over stuff in the lexer. The
> parser need not even see those.
>
> When using tokens, it seems one could get a better error message like
> “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+”
> or whatever.

I’m not sure that tokenizing will always result in better error messages.
In fact, there are situations in which tokenizing appears to actually
*decrease* error message quality (for example when there are typos in
language keywords).

The approach I outlined for PEGs gives you a list of rule “traces” for an
error position, whereby each trace is the stack of rules that failed.
There are several conceivable ways to turn such a list of traces into a
proper error message. Going up the trace to the first parent rule that
failed at its first sub-rule (rather than somewhere in the middle) and
reporting its alternatives as “expected” is only one option. You could
also allow the user to explicitly mark the rules that are considered to be
terminals, which would traditionally be matched by the lexer. (This
marking could also be minimally invasive, like ALL-CAPS naming.) And if
you allow for special constructs in the grammar that give the user even
more control over what and how errors are reported (like in the paper that
Sérgio pointed to), the possibilities multiply.

However, the negative syntactic predicate available in PEGs can pose
somewhat of a problem for fully automatic error reporting. This starts
with the determination of the error position (do you count progress made
inside a negative syntactic predicate towards the overall progress the
parser has made, or not?) and also affects the display of “expected” input
(e.g.: “expected anything but FOO or BAR”).
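To make this concrete: here is a toy, self-contained Java sketch of the
trace-recording idea (something I put together for this mail as an
illustration, not parboiled’s actual implementation; all helper and rule
names are invented). It re-runs a tiny scannerless matcher against the
input “abcx” and records every rule stack that fails at the previously
determined error position, reproducing the four traces from the example
quoted below:

    import java.util.*;

    public class TraceDemo {
        static final String INPUT = "abcx";
        static final int ERROR_POS = 3; // determined by the failed first run
        static final Deque<String> stack = new ArrayDeque<>();
        static final Set<String> traces = new LinkedHashSet<>();

        interface Rule { int match(int pos); } // new pos, or -1 on mismatch

        // Named rules maintain the rule stack from which traces are built.
        static Rule named(String name, Rule body) {
            return pos -> {
                stack.addLast(name);
                int result = body.match(pos);
                stack.removeLast();
                return result;
            };
        }

        // Char-level matchers record a trace when failing at ERROR_POS.
        static Rule anyOf(String chars) {
            return pos -> {
                if (pos < INPUT.length() && chars.indexOf(INPUT.charAt(pos)) >= 0)
                    return pos + 1;
                if (pos == ERROR_POS) traces.add(String.join(" / ", stack));
                return -1;
            };
        }

        static Rule ch(char c) { return anyOf(String.valueOf(c)); }

        static Rule seq(Rule... rules) {
            return pos -> {
                for (Rule r : rules) { pos = r.match(pos); if (pos < 0) return -1; }
                return pos;
            };
        }

        static Rule firstOf(Rule... rules) {
            return pos -> {
                for (Rule r : rules) { int p = r.match(pos); if (p >= 0) return p; }
                return -1;
            };
        }

        public static void main(String[] args) {
            Rule target = named("targetRule", seq(ch('a'), ch('b'), ch('c'),
                named("|", firstOf(
                    named("\"fgh\"", seq(named("'f'", ch('f')), ch('g'), ch('h'))),
                    named("Digit", anyOf("0123456789")),
                    named("hex", anyOf("0123456789abcdef")),
                    named("UpperAlpha", anyOf("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))))));
            target.match(0);
            traces.forEach(System.out::println);
            // targetRule / | / "fgh" / 'f'
            // targetRule / | / Digit
            // targetRule / | / hex
            // targetRule / | / UpperAlpha
        }
    }

A real implementation would of course do more (filter the traces and map
them to the “expected” alternatives shown in the message), but the core
mechanism is just this rule-stack snapshotting.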
> Single token insertion and deletion is great, although I wonder how well
> it works with characters compared to tokens.

Actually it works well and can be superior to recovery on a token stream.
Again, consider a typo in a keyword (like `publc` rather than `public`).
Since recovery for PEGs runs on the same level that the user usually makes
mistakes on (i.e. the character level rather than the token level), single
insertion/deletion/replacement is quite effective in real-world
applications, in our experience. (I forgot to mention before that
parboiled doesn’t only try single-char deletion and insertion but also
replacement, which is essentially a deletion followed by an insertion.)

> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until
> something in what could follow appears. Do you consider the actual call
> stack or the complete follow set? If you have the call stack, you can
> limit the set of things to consume to just what is possible given the
> actual input, as opposed to all possible inputs.

parboiled 1.x determines the follower-set by working up through a rule
trace, so the follower-set is not determined through a static analysis but
takes the actual input into account. However, it’s not yet as smart as it
could be. Ideally we’d attempt resynchronisation on *all* rule traces for
the current error and choose the best one (i.e. the one that skips the
fewest input chars). IIRC parboiled currently always resynchronises on the
first trace only.
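For illustration, the whole recovery pipeline (single-char deletion,
insertion and replacement, then trace-based resynchronisation) can be
sketched roughly like this in Java. Everything here (`recover`, `helps`,
the parameters) is invented for the sketch and much simplified compared to
what parboiled actually does:

    import java.util.Set;
    import java.util.function.ToIntFunction;

    class RecoverySketch {
        // `parse` is assumed to return the index of the first unparseable
        // char, or -1 if the whole input matches.
        static String recover(String input, int errPos, Set<Character> expected,
                              Set<Character> followers, ToIntFunction<String> parse) {
            // 1. Single-char deletion: drop the char at the error position.
            String fix = input.substring(0, errPos) + input.substring(errPos + 1);
            if (helps(parse.applyAsInt(fix), errPos)) return fix;
            // 2. Single-char insertion: try each char the parser "expects"
            //    here (taken from the rule traces used for error reporting).
            for (char c : expected) {
                fix = input.substring(0, errPos) + c + input.substring(errPos);
                if (helps(parse.applyAsInt(fix), errPos)) return fix;
            }
            // 3. Single-char replacement, i.e. a deletion followed by an
            //    insertion.
            for (char c : expected) {
                fix = input.substring(0, errPos) + c + input.substring(errPos + 1);
                if (helps(parse.applyAsInt(fix), errPos)) return fix;
            }
            // 4. Resynchronisation: skip input until a char from the
            //    follower set (derived from the actual rule trace, not from
            //    a static analysis) shows up, then continue behind it.
            int p = errPos;
            while (p < input.length() && !followers.contains(input.charAt(p))) p++;
            return input.substring(0, errPos) + input.substring(p);
        }

        // Crude "did the fix help?" check: the parse now either succeeds or
        // at least gets past the original error position. The real logic is
        // more refined, of course.
        static boolean helps(int newErrPos, int oldErrPos) {
            return newErrPos < 0 || newErrPos > oldErrPos;
        }
    }

For the `publc` example above, step 2 finds the missing ‘i’ and parsing
can continue right behind the repaired keyword.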
Cheers,
Mathias

---
math...@parboiled.org
http://www.parboiled.org

On 26 May 2014, at 18:30, Terence Parr <pa...@cs.usfca.edu> wrote:

> On May 26, 2014, at 8:48 AM, Mathias Doenitz <math...@parboiled.org> wrote:
>
>> Terence,
>>
>> for error reporting parboiled applies this logic:
>>
>> 1. Determine the error position (the first input position that cannot
>> be parsed); if this is end-of-input we have no error and are done.
>>
>> 2. Restart the parser from the very beginning and “watch” the parser
>> try to match the error position. Record all the rules (rule stacks)
>> that are applied against the error position. From this list of rules we
>> can construct proper error messages like:
>>
>> Invalid input 'x', expected 'f', Digit, hex or UpperAlpha (line 1, column 4):
>> abcx
>>    ^
>>
>> 4 rules mismatched at error location:
>> targetRule / | / "fgh" / 'f'
>> targetRule / | / Digit
>> targetRule / | / hex
>> targetRule / | / UpperAlpha
>
> hi. That makes sense. It’s basically identifying a no-viable-alternative
> situation if the char occurs at the start of a rule.
>
> I think one could get better error messages and recovery using tokens
> instead of scannerless parsing, but one gives up composability/modularity.
> For token errors, it’s easy to just skip over stuff in the lexer. The
> parser need not even see those.
>
> When using tokens, it seems one could get a better error message like
> “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+”
> or whatever.
>
>> Restarting the parser for gathering error info is acceptable in almost
>> all use cases since errors are infrequent and the added overhead here
>> is not a problem.
>
> I have no problem restarting parsers upon syntax errors; it’s what I
> recommend people do in ANTLR 4 for maximum speed. ;) That way, we can use
> a faster mechanism for the common case and fail over to a more expensive
> one to determine whether the syntax error is genuine or a result of a
> weakness in the first strategy. BTW, I will be presenting ALL(*) at
> OOPSLA this year; here’s more on the strategy:
>
> http://www.antlr.org/papers/allstar-techreport.pdf
>
> It basically builds a proper parser for any CFG you give it that does
> not have indirect left recursion. (ANTLR rewrites direct left recursion
> under the hood.) No backtracking, and it does all analysis on the fly,
> memoizing lookahead-to-predicted-alternative mappings with DFAs.
>
> I’ve actually been using the supplied parser interpreter these days
> instead of generating parsers, for instant gratification :) Generating
> code and compiling is tedious during prototyping.
>
>> Error recovery:
>> Here we have the following logic:
>>
>> 1. Try to “remove” the error position from the input and see whether
>> the parser can continue.
>> 2. If not, try to insert a single character that the parser “expects”
>> (see reporting) at the error position. If this allows us to continue
>> parsing, great.
>> 3. If no single-char fix “repairs” the input, we resynchronise the
>> parser in a similar way to ANTLR (IIRC), i.e. we skip ahead in the
>> input until the parser can continue.
>
> Single token insertion and deletion is great, although I wonder how well
> it works with characters compared to tokens.
>
> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until
> something in what could follow appears. Do you consider the actual call
> stack or the complete follow set? If you have the call stack, you can
> limit the set of things to consume to just what is possible given the
> actual input, as opposed to all possible inputs.
>
> Thanks very much for the summary.
>
> Ter

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg