Inspired about the idea of parsing by recognition, I wrote a small script that applies the idea. It can't really parse anything but it runs an analysis over a language and a message that is unambiguous in that language.
It seems to expose the structure, I filtered some lines apart from the printout: index: 0 'for' -> 0 for ('for', identifier, 'in', expression, expression) index: 1 identifier -> 1 for ('for', identifier, 'in', expression, expression) index: 2 'in' -> 2 for ('for', identifier, 'in', expression, expression) index: 3 expression -> 0 call (expression, '(', expression, ')') expression -> 3 for ('for', identifier, 'in', expression, expression) index: 4 '(' -> 1 call (expression, '(', expression, ')') index: 5 expression -> 2 call (expression, '(', expression, ')') expression -> 0 call (expression, '(', expression, ')') index: 6 ')' -> 3 call (expression, '(', expression, ')') index: 7 expression -> 0 call (expression, '(', expression, ')') index: 8 '(' -> 1 call (expression, '(', expression, ')') index: 9 expression -> 2 call (expression, '(', expression, ')') expression -> 0 call (expression, '(', expression, ')') index: 10 ')' -> 3 call (expression, '(', expression, ')') The message parsed was: for i in range ( 200 ) print ( 5 ) Every non-keyword token in the message is seen as a potential expression or statement. The printout shows that there are two call -expressions you can fold out of the input. If you would fold it twice, the whole message would seem to be a for -statement. So I should probably introduce that folding there soon. The fold goes through that mapping, doing something similar to folding a paper. In the input you have: expression -> 0 call (expression, '(', expression, ')') '(' -> 1 call (expression, '(', expression, ')') expression -> 2 call (expression, '(', expression, ')') ')' -> 3 call (expression, '(', expression, ')') It's clear that this can be folded into call (expression, '(', expression, ')'). It seems you could go through the input and attempt to fold it several times like this. Some inputs would fold entirely, while others would not fold because they are invalid. Third kind of inputs would fold in many ways, because they are ambiguous. It's probably worthwhile to provide another message and more grammar so I end up with situation where there's ambiguity. Also some messages that are invalid. It's exciting to see whether this actually can work. On Tue, Dec 10, 2013 at 3:24 AM, Henri Tuhola <henri.tuh...@gmail.com>wrote: > While I was studying left recursion, I realized that it becomes complex > and unpredictable very fast. Allowing left recursion in peg parser is > almost like a permission for making it extremely complicated. Suddenly > right recursion starts affecting what ends up actually parsed and it's > neither nice that the solution is some wrap-around papered over the simple > ways to parse with parsing expression grammars. > > I noticed another property that isn't emphasized a lot about pegs. If you > scan from right to left instead of left to right, the same grammar may > match a different language. Regular expressions have same behavior though. > > It all gave me a reason to question the sanity in parsing expression > grammars. The idea to construct parser from recognition rules sounds > intuitive though. Perhaps there's a way to construct recognition based > parsing that doesn't depend on direction of evaluation as much. > > Human parses the text by looking at some small area at once. Perhaps it > would make sense imitating that. Here's some observations: > > Even if PEG parsing doesn't force that, it seems preferable to mark some > words, such as 'if', 'for' and 'else', as keywords. "term = identifier" > seems to be less specific than "statement = if ... else ..." or > "statement = for ... in" . > > Infix expressions alone are unambiguous, and if human doesn't know their > precedence then only then they become ambiguous. For example: "4 + 5*3" is > easy to understand because precedence is very much standardized at that > point. The "4 + 3 << 2" is more ambiguous. > > If some pattern is lacking, the pieces that formed parts of that pattern > are garbage. For example superfluous '(' and ')' parentheses. Some such > garbage patterns cause ambiguous situations as well. > > I don't know whether I explore this idea further. It feels more promising > because there's chance for better error reporting there. It could also work > in noisy settings. Haven't studied whether there's something similar > existing already. >
def main(): identifier = Label('identifier') expression = Label('expression') call = Label('call') for_ = Label('for') statement = Label('statement') message = "for i in range ( 200 ) print ( 5 )".split(' ') clauses = [ (identifier, lambda terminal: terminal.isalnum()), (call, (expression, '(', expression, ')')), (for_, ('for', identifier, 'in', expression, expression)), (expression, identifier), (expression, call), (statement, expression), (statement, for_), ] callables = [] derive_direct = {} cues = {} def push(table, key, value): if key in table: ob = table[key] else: ob = table[key] = set() ob.add(value) for lhs, rhs in clauses: if callable(rhs): callables.append((lhs, rhs)) elif isinstance(rhs, Label): push(derive_direct, rhs, lhs) else: for i, cue in enumerate(rhs): push(cues, cue, (i, lhs, rhs, len(rhs))) derive = {} def expand_derive(labels, rhs): if rhs not in labels: labels.add(rhs) for der in derive_direct.get(rhs, ()): expand_derive(labels, der) for lhs, records in derive_direct.items(): labels = set() for rhs in records: expand_derive(labels, rhs) derive[lhs] = labels print callables print derive for cue, reactions in cues.items(): print_cues(cue, reactions) print print 'try parse', message for index, lexeme in enumerate(message): print 'index: %2i' % index if lexeme in cues: reactions = cues[lexeme] print_cues(lexeme, reactions) else: for lhs, rhs in callables: if rhs(lexeme): print_cues(lhs, cues.get(lhs, ())) for label in derive.get(lhs, ()): print_cues(label, cues.get(label, ())) def print_cues(cue, reactions): for index, label, rule, count in reactions: print "%15r -> %i %15r %r" % (cue, index, label, rule) class Label(object): def __init__(self, name): self.name = name def __repr__(self): return self.name if __name__=='__main__': main()
stdout
Description: Binary data
_______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg