Inspired about the idea of parsing by recognition, I wrote a small script
that applies the idea. It can't really parse anything but it runs an
analysis over a language and a message that is unambiguous in that language.

It seems to expose the structure, I filtered some lines apart from the
printout:

index:  0
          'for' -> 0             for  ('for', identifier, 'in', expression,
expression)
index:  1
     identifier -> 1             for  ('for', identifier, 'in', expression,
expression)
index:  2
           'in' -> 2             for  ('for', identifier, 'in', expression,
expression)

index:  3
     expression -> 0            call  (expression, '(', expression, ')')
     expression -> 3             for  ('for', identifier, 'in', expression,
expression)
index:  4
            '(' -> 1            call  (expression, '(', expression, ')')
index:  5
     expression -> 2            call  (expression, '(', expression, ')')
     expression -> 0            call  (expression, '(', expression, ')')
index:  6
            ')' -> 3            call  (expression, '(', expression, ')')

index:  7
     expression -> 0            call  (expression, '(', expression, ')')
index:  8
            '(' -> 1            call  (expression, '(', expression, ')')
index:  9
     expression -> 2            call  (expression, '(', expression, ')')
     expression -> 0            call  (expression, '(', expression, ')')
index: 10
            ')' -> 3            call  (expression, '(', expression, ')')

The message parsed was: for i in range ( 200 ) print ( 5 )

Every non-keyword token in the message is seen as a potential expression or
statement. The printout shows that there are two call -expressions you can
fold out of the input. If you would fold it twice, the whole message would
seem to be a for -statement.

So I should probably introduce that folding there soon. The fold goes
through that mapping, doing something similar to folding a paper. In the
input you have:

     expression -> 0            call  (expression, '(', expression, ')')
            '(' -> 1            call  (expression, '(', expression, ')')
     expression -> 2            call  (expression, '(', expression, ')')
            ')' -> 3            call  (expression, '(', expression, ')')

It's clear that this can be folded into call  (expression, '(', expression,
')'). It seems you could go through the input and attempt to fold it
several times like this. Some inputs would fold entirely, while others
would not fold because they are invalid. Third kind of inputs would fold in
many ways, because they are ambiguous.

It's probably worthwhile to provide another message and more grammar so I
end up with situation where there's ambiguity. Also some messages that are
invalid. It's exciting to see whether this actually can work.


On Tue, Dec 10, 2013 at 3:24 AM, Henri Tuhola <henri.tuh...@gmail.com>wrote:

> While I was studying left recursion, I realized that it becomes complex
> and unpredictable very fast. Allowing left recursion in peg parser is
> almost like a permission for making it extremely complicated. Suddenly
> right recursion starts affecting what ends up actually parsed and it's
> neither nice that the solution is some wrap-around papered over the simple
> ways to parse with parsing expression grammars.
>
> I noticed another property that isn't emphasized a lot about pegs. If you
> scan from right to left instead of left to right, the same grammar may
> match a different language. Regular expressions have same behavior though.
>
> It all gave me a reason to question the sanity in parsing expression
> grammars. The idea to construct parser from recognition rules sounds
> intuitive though. Perhaps there's a way to construct recognition based
> parsing that doesn't depend on direction of evaluation as much.
>
> Human parses the text by looking at some small area at once. Perhaps it
> would make sense imitating that. Here's some observations:
>
> Even if PEG parsing doesn't force that, it seems preferable to mark some
> words, such as 'if', 'for' and 'else', as keywords.  "term = identifier"
>  seems to be less specific than  "statement = if ... else ..."  or
>  "statement = for ... in" .
>
> Infix expressions alone are unambiguous, and if human doesn't know their
> precedence then only then they become ambiguous. For example: "4 + 5*3" is
> easy to understand because precedence is very much standardized at that
> point. The "4 + 3 << 2" is more ambiguous.
>
> If some pattern is lacking, the pieces that formed parts of that pattern
> are garbage. For example superfluous '(' and ')' parentheses. Some such
> garbage patterns cause ambiguous situations as well.
>
> I don't know whether I explore this idea further. It feels more promising
> because there's chance for better error reporting there. It could also work
> in noisy settings. Haven't studied whether there's something similar
> existing already.
>
def main():
    identifier = Label('identifier')
    expression = Label('expression')
    call       = Label('call')
    for_       = Label('for')
    statement  = Label('statement')

    message = "for i in range ( 200 ) print ( 5 )".split(' ')

    clauses = [
        (identifier, lambda terminal: terminal.isalnum()),
        (call,       (expression, '(', expression, ')')),
        (for_,       ('for', identifier, 'in', expression, expression)),
        (expression, identifier),
        (expression, call),
        (statement,  expression),
        (statement,  for_),
    ]

    callables = []
    derive_direct = {}
    cues = {}

    def push(table, key, value):
        if key in table:
            ob = table[key]
        else:
            ob = table[key] = set()
        ob.add(value)

    for lhs, rhs in clauses:
        if callable(rhs):
            callables.append((lhs, rhs))
        elif isinstance(rhs, Label):
            push(derive_direct, rhs, lhs)
        else:
            for i, cue in enumerate(rhs):
                push(cues, cue, (i, lhs, rhs, len(rhs)))

    derive = {}
    def expand_derive(labels, rhs):
        if rhs not in labels:
            labels.add(rhs)
            for der in derive_direct.get(rhs, ()):
                expand_derive(labels, der)
    for lhs, records in derive_direct.items():
        labels = set()
        for rhs in records:
            expand_derive(labels, rhs)
        derive[lhs] = labels

    print callables
    print derive
    for cue, reactions in cues.items():
        print_cues(cue, reactions)
    print

    print 'try parse', message
    for index, lexeme in enumerate(message):
        print 'index: %2i' % index
        if lexeme in cues:
            reactions = cues[lexeme]
            print_cues(lexeme, reactions)
        else:
            for lhs, rhs in callables:
                if rhs(lexeme):
                    print_cues(lhs, cues.get(lhs, ()))
                    for label in derive.get(lhs, ()):
                        print_cues(label, cues.get(label, ()))

def print_cues(cue, reactions):
    for index, label, rule, count in reactions:
        print "%15r -> %i %15r  %r" % (cue, index, label, rule)


class Label(object):
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

if __name__=='__main__':
    main()

Attachment: stdout
Description: Binary data

_______________________________________________
PEG mailing list
PEG@lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg

Reply via email to