> I have considered the possibility of launching lots of threads, parsing the input back to front and from lots of different places looking for the correction that allows more of the input to be parsed. Perhaps we have finally found a way to use all those extra cores ;)
Yes, since parse errors normally happen only in a small minority of parser runs, spending time on finding the best recovery is probably worth it. Whether this needs to happen in a parallelized fashion or not is then another question.

In general I think error recovery in PEGs is a very interesting and still somewhat under-researched topic. As is another feature we are currently planning: “parser continuations”, i.e. the ability to start a parsing run on partial input (a prefix) and continue when the next input chunk becomes available. This means that, in addition to the normal parsing result when the input is complete, the parser can return a function that the user can use to continue the parsing run later, when more input has arrived. Ideally, since PEG parsers can potentially backtrack all the way to the very beginning of the input, in these cases the parser should also return information about what prefix (length) of the total input can now be dropped, because the grammar will never backtrack beyond it. This feature would allow PEGs to become more valuable in streaming scenarios, which is something we are currently dealing with a lot.

Cheers,
Mathias

---
math...@parboiled.org
http://www.parboiled.org
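To make the continuation idea a little more concrete, here is a minimal sketch, in Scala, of what such an API could look like. All of the names (ParseStep, NeedMoreInput, safeToDropChars, run) are hypothetical illustrations, not parboiled's actual or planned interface:

  // Hypothetical sketch only, not parboiled's real API.
  sealed trait ParseStep[+T]

  // The input was complete and parsed successfully.
  final case class Done[T](result: T) extends ParseStep[T]

  // A real parse error at `errorPos`, as opposed to simply running out of input.
  final case class Failed(errorPos: Int, expected: List[String]) extends ParseStep[Nothing]

  // The parser ran off the end of the current chunk: `resume` continues the run
  // once the next chunk is available, and `safeToDropChars` tells the caller how
  // many characters of the buffered input the grammar can never backtrack over
  // again, so that prefix may be released.
  final case class NeedMoreInput[T](resume: String => ParseStep[T],
                                    safeToDropChars: Int) extends ParseStep[T]

  // A caller feeding chunks from a stream could drive the parser like this:
  @annotation.tailrec
  def run[T](step: ParseStep[T], chunks: Iterator[String]): ParseStep[T] =
    step match {
      case NeedMoreInput(resume, _) if chunks.hasNext => run(resume(chunks.next()), chunks)
      case other                                      => other
    }

The interesting design question is how safeToDropChars is computed; one plausible, conservative answer is the input position of the earliest still-unresolved choice point or predicate on the rule stack, since a later failure can send the parser back to that point but never beyond it.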
On 27 May 2014, at 01:31, Terence Parr <pa...@cs.usfca.edu> wrote:

> Very interesting. Thanks for the clarifications. I can see how a misspelled keyword could be better handled at character level, unless of course you mean it to be an identifier. My intuition is that token-level error handling would provide more opportunities for clarity and natural recovery, but I have nothing to back this up. Indicating which rules are like tokens would be a good compromise, maybe.
>
> I have considered the possibility of launching lots of threads, parsing the input back to front and from lots of different places looking for the correction that allows more of the input to be parsed. Perhaps we have finally found a way to use all those extra cores ;)
>
> Ter
>
> On May 26, 2014, at 1:04 PM, Mathias Doenitz <math...@parboiled.org> wrote:
>
>>> Hi. That makes sense. It’s basically identifying a no-viable-alternative situation if the char occurs at the start of a rule.
>>>
>>> I think one could get better error messages and recovery using tokens instead of scannerless, but one gives up composability/modularity. For token errors, it’s easy to just skip over stuff in the lexer. The parser need not even see those.
>>>
>>> When using tokens, it seems one could get a better error message like “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+” or whatever.
>>
>> I’m not sure that tokenizing will always result in better error messages. In fact, there are situations in which tokenizing appears to actually *decrease* error message quality (for example when there are typos in language keywords).
>>
>> The approach I outlined for PEGs gives you a list of rule “traces” for an error position, whereby each trace is the stack of rules that failed. There are several conceivable ways to turn such a list of traces into a proper error message.
>>
>> Going up the trace to the first parent rule which failed at its first sub-rule (rather than somewhere in the middle) for reporting an expected alternative is only one option. You could also allow the user to somehow explicitly mark the rules that are considered to be terminals, which would traditionally be matched by the lexer. (This marking could also be minimally invasive, like ALL-CAPS naming.) And if you allow for special constructs in the grammar that give the user even more control over what and how errors are reported (like in the paper that Sérgio pointed to), the possibilities multiply.
>>
>> However, the negative syntactic predicate available in PEGs can pose somewhat of a problem for fully automatic error reporting. This starts with the determination of the error position (do you count progress made in a negative syntactic predicate towards the overall progress the parser has made or not?) and also affects the display of “expected” input (e.g. “expected anything but FOO or BAR”).
>>
>>> Single token insertion and deletion is great, although I wonder how well it works with characters compared to tokens.
>>
>> Actually it works well and can be superior to recovery on a token stream. Again, consider a typo in a keyword (like `publc` rather than `public`). Since recovery for PEGs runs on the same level that the user usually makes mistakes on (i.e. the character level rather than the token level), single insertion/deletion/replacement is quite effective in real-world applications, in our experience.
>>
>> (Forgot to mention before that parboiled doesn’t only try single-char deletion and insertion but also replacement, which is essentially a deletion followed by an insertion.)
>>
>>> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until something in what could follow appears. Do you consider the actual call stack or the complete follow set? If you have the call stack, you can limit the set of things to consume to just what is possible given the input as opposed to giving all input.
>>
>> parboiled 1.x determines the follower set by working up through a rule trace, so the follower set is not determined through a static analysis but does take the actual input into account. However, it’s not yet as smart as it could be. Ideally we’d attempt resynchronisation on *all* rule traces for the current error and choose the best one (i.e. the one that skips the fewest input chars). IIRC parboiled currently always resynchronises on the first trace only.
>>
>> Cheers,
>> Mathias
>>
>> ---
>> math...@parboiled.org
>> http://www.parboiled.org
>>
>> On 26 May 2014, at 18:30, Terence Parr <pa...@cs.usfca.edu> wrote:
>>
>>> On May 26, 2014, at 8:48 AM, Mathias Doenitz <math...@parboiled.org> wrote:
>>>
>>>> Terence,
>>>>
>>>> for error reporting parboiled applies this logic:
>>>>
>>>> 1. Determine the error position (the first input pos that cannot be parsed); if this is end-of-input we have no error and are done.
>>>>
>>>> 2. Restart the parser from the very beginning and “watch” the parser try to match the error position. Record all the rules (rule stacks) that are applied against the error position. From this list of rules we can construct proper error messages like:
>>>>
>>>> Invalid input 'x', expected 'f', Digit, hex or UpperAlpha (line 1, column 4):
>>>> abcx
>>>>    ^
>>>>
>>>> 4 rules mismatched at error location:
>>>> targetRule / | / "fgh" / 'f'
>>>> targetRule / | / Digit
>>>> targetRule / | / hex
>>>> targetRule / | / UpperAlpha
>>>
>>> Hi. That makes sense. It’s basically identifying a no-viable-alternative situation if the char occurs at the start of a rule.
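The two-step reporting scheme quoted above is straightforward to sketch. Assuming a hypothetical RuleTrace type for the rule stacks recorded in step 2 (an illustration, not parboiled source), the traces could be rendered into a message of the shape shown in the example roughly like this:

  // Sketch only: each trace is the stack of rules applied at the error
  // position, outermost rule first, innermost (the one that mismatched) last.
  final case class RuleTrace(rules: List[String])

  def formatError(input: String, errorPos: Int, traces: List[RuleTrace]): String = {
    val expected  = traces.map(_.rules.last).distinct.mkString(", ")
    val line      = input.substring(0, errorPos).count(_ == '\n') + 1
    val lineStart = input.lastIndexOf('\n', errorPos - 1) + 1
    val col       = errorPos - lineStart + 1
    val errorLine = input.drop(lineStart).takeWhile(_ != '\n')
    s"""Invalid input '${input(errorPos)}', expected $expected (line $line, column $col):
       |$errorLine
       |${" " * (col - 1)}^
       |
       |${traces.size} rules mismatched at error location:
       |${traces.map(_.rules.mkString(" / ")).mkString("\n")}""".stripMargin
  }

Called on the example input ("abcx", error position 3) with the four traces listed above, this yields essentially the message shown; picking the "expected" alternatives by simply taking the innermost rule of each trace is only one of the options discussed earlier.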
>>>
>>> I think one could get better error messages and recovery using tokens instead of scannerless, but one gives up composability/modularity. For token errors, it’s easy to just skip over stuff in the lexer. The parser need not even see those.
>>>
>>> When using tokens, it seems one could get a better error message like “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+” or whatever.
>>>
>>>> Restarting the parser for gathering error info is acceptable in almost all use cases since errors are infrequent and the added overhead here is not a problem.
>>>
>>> I have no problem restarting parsers upon syntax errors; it’s what I recommend people do in ANTLR 4 for maximum speed. ;) That way, we can use a faster mechanism for the common case and fail over to a more expensive one to determine if the syntax error is correct or a result of a weakness in the first strategy. BTW, I will be presenting the ALL(*) paper at OOPSLA this year; here’s more on the strategy:
>>>
>>> http://www.antlr.org/papers/allstar-techreport.pdf
>>>
>>> It basically builds a proper parser for any CFG you give it that does not have indirect left recursion. (ANTLR rewrites direct left recursion under the hood.) No backtracking, and it does all analysis on the fly, memoizing lookahead-to-predicted-alternative mappings with DFA.
>>>
>>> I’ve actually been using the supplied parser interpreter these days instead of generating grammars, for instant gratification :) Generating code and compiling is tedious during prototyping.
>>>
>>>> Error recovery:
>>>> Here we have the following logic:
>>>>
>>>> 1. Try to “remove” the error position from the input and see whether the parser can continue.
>>>>
>>>> 2. If not, try to insert a single character that the parser “expects” (see reporting) at the error position. If this allows us to continue parsing, great.
>>>>
>>>> 3. If no single-char fix “repairs” the input, we resynchronise the parser in a similar way to how ANTLR does it (IIRC), i.e. we skip ahead in the input until the parser can continue.
>>>
>>> Single token insertion and deletion is great, although I wonder how well it works with characters compared to tokens.
>>>
>>> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until something in what could follow appears. Do you consider the actual call stack or the complete follow set? If you have the call stack, you can limit the set of things to consume to just what is possible given the input as opposed to giving all input.
>>>
>>> Thanks very much for the summary.
>>>
>>> Ter
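The single-character repairs from steps 1 and 2 quoted above, together with the replacement variant mentioned earlier in the thread, amount to a small search over candidate inputs. A sketch under two assumptions: a hypothetical parsedLength helper that reports how far a candidate input parses before failing, and an expectedChars set taken from the error-reporting traces:

  // Sketch only: try single-char deletion, insertion and replacement at the
  // error position; if no candidate gets the parser past the original error
  // position, the caller falls back to resynchronisation (step 3), i.e.
  // skipping ahead in the input until the parser can continue.
  def splice(s: String, at: Int, insert: String, drop: Int): String =
    s.substring(0, at) + insert + s.substring(at + drop)

  def singleCharRepair(input: String, errorPos: Int, expectedChars: Seq[Char],
                       parsedLength: String => Int): Option[String] = {
    val deletion     = Seq(splice(input, errorPos, "", 1))
    val insertions   = expectedChars.map(c => splice(input, errorPos, c.toString, 0))
    val replacements = expectedChars.map(c => splice(input, errorPos, c.toString, 1))
    (deletion ++ insertions ++ replacements)
      .find(candidate => parsedLength(candidate) > errorPos)
  }

Whether a successful repair is then used to continue the parse or only to produce a better error message is a separate design decision.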