Re: Perl6 grammars -- Parsing english

Moritz Lenz Wed, 11 Jul 2012 03:19:52 -0700

Hi Lard,

sorry for the late and incomplete answer.


On 04.07.2012 15:09, Lard Farnwell wrote:

Hi Moritz,

Thanks, that was interesting. My investigation into grammars took a while but 
here are the results thus far:

Grammar rules and regexes are just methods…


I hadn't thought about what a grammar and rule actually were before. This 
inspired me to try:

---------------------------
grammar Gram{
     has $.x;

     rule TOP{
        {say $.x}
     }

     method test{
        say $.x
     }
}
my Gram $test .= new(:x("hello"));
$test.parse("ignore this");
$test.test;
say $test.TOP;
---------------------------
which outputs:
Any()                          #output of TOP in parse
hello                            #output of test.test
hello                            #outputted on direct call to rule
Gram.new(x => Any) #the return value of $test.TOP


So rules can't interpolate their grammar's attributes when being called by 
'parse' but can when called as a method. Also rules being called directly as 
methods return the parent grammar. I'm not sure whether either of these things 
are intended…


I'm not sure how it's intended to work either.

Notionally, grammar rules (and other components of regexes) communicate
by passing "cursors" around. A cursor is an immutable object that points
to a location in the string, and additionally keeps track of other
information like captures.

So when you write 'grammar gram { ... }', you are actually inheriting
from class Grammar, which in turn inherits from class Cursor.

When you call the .parse method, a cursor is instantiated automatically,
which explains why its attribute is empty -- it's not the same object as
you created in your code.

I'm not sure if there is a mechanism for passing around attributes -- so
far I always just assumed it would work, but it doesn't.
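
One possible workaround, sketched here under the assumption that dynamic
variables are visible inside a grammar's code blocks (as they are in
current Rakudo): keep the value in a dynamic variable instead of an
attribute, so it does not matter which cursor object ends up running the
rule. The name $*greeting is hypothetical and not part of the original
example.

```raku
grammar Gram {
    token TOP {
        .*
        { make $*greeting }   # attach the dynamic variable's value to the match
    }
}

my $*greeting = "hello";      # hypothetical name; visible to everything called from here
my $match = Gram.parse("ignore this");
say $match.made;              # prints the value stored by the code block above
```

This sidesteps the attribute question rather than answering it: the value
travels through the dynamic scope, not through the grammar object.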

=============================

Also I tried rules with arguments and it worked from grammar->parse but not 
from calling directly as a method.

---------------------------
grammar Gram{

     rule TOP{
        <test_rule('hello')>
     }

     rule test_rule($a){
       $a
     }
}

my Gram $test .= new();
$test.parse("hello");      #returns true
$test.test_rule("hello");  #error
---------------------------

The error is:

Invalid operation on null string
   in any !LITERAL at src/stage2/QRegex.nqp:653
   in method INTERPOLATE at src/gen/CORE.setting:9731
  (at the line where test_rule starts)

=============================

Ok now to try the things you mentioned:

First I tried using a parcel instead of an array as the role prototype (array 
resulted in error):
---------------------------
role roley [$foo]{
     token tokeny { $foo }
}

grammar gram {
     token TOP { <tokeny> }
}

my gram $gram .= new  does roley[('this','or', 'that')];
$gram.parse('this or that');  #returns true
---------------------------

So parcels get joined with spaces into one token


That's a known not-yet-implemented part of Rakudo :(

=============================

Now to try the roundabout way:

---------------------------
role roley [$foo]{
     token tokeny:sym<dynamic> { $foo }
}

grammar gram {
     token TOP { <tokeny>[\ <tokeny>]* }
     proto token tokeny {*}
}

my gram $gram .= new;
$gram does roley[$_] for <that this>;
$gram.parse('this'); #matches
$gram.parse('that'); #nope
---------------------------

Each iteration overwrites the previous one in terms of what 'tokeny' resolves 
to rather than adding it (symmetrically? is that what sym is short for?)

"sym" stands for "symbol", the thing that appears in the name of the
token inside <...>.

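For comparison, a minimal sketch of a proto token with two candidates
whose :sym<...> names differ, written directly in the grammar rather than
mixed in through a role. In the role version above both mixins presumably
define a candidate with the same name, tokeny:sym<dynamic>, so the later
one replaces the earlier one. The grammar and token names here (Words,
word) are hypothetical.

```raku
grammar Words {
    token TOP { <word> [ \s+ <word> ]* }
    proto token word {*}
    token word:sym<this> { <sym> }   # <sym> matches the literal 'this'
    token word:sym<that> { <sym> }   # <sym> matches the literal 'that'
}

my $m = Words.parse('this that');
say $m<word>.map(~*);                # the matched words, in order
```

Because the two candidates have distinct names, both remain available to
the proto token.
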
============================

One more thing I found which seems to be a bug. I defined my nouns/pronouns 
like:

---------------------------
token PN:sym<John> { <.sym> } #The dot should mean it doesn't get captured
token N:sym<ball> { <.sym> }
---------------------------

when my grammar parses this it ends up with a tree like this:
---------------------------
  sentence => q[John hit the ball]
   statement => q[John hit the ball]
    NP => q[John]
     PN => q[John]
       => q[John]
    VP => q[hit the ball]
     verb => q[hit]
       => q[hit]
     NP => q[the ball]
      D => q[the]
        => q[the]
      N => q[ball]
        => q[ball]
---------------------------

Notice the empty slots on the left. Rather than not capturing the <sym>, the 
<.sym> just means it doesn't capture its name :S

I've recently discovered the same bug (and tried to fix it instead of
submitting a bug report; I failed to fix it though :/). Basically
<sym> is special-cased in the compiler, and the . modifier at the start
simply doesn't harmonize with that special case.

============================

So after all this I have a much better understanding of what grammars really 
are but I'm still confused about a few things:

grammars are like classes. They are special because they have a method called 
'parse' which applies a rule/token definition (regex) called TOP (or whatever 
is set by the :rule argument to parse).
Q: Are grammars meant to be able to have attributes like classes and are they 
meant to be able to interpolate them into their rules/token?
rules and tokens are just special types of methods whose body is a regex rather 
than perl6 code.


See above. The answer is "I'm not quite sure".

Q: What is the meaning of the return values of tokens/rules when called as 
methods?

The specification says that a token/rule/regex returns a cursor if there
is one possible match, or a lazy list of cursors with possible matches
if there are multiple ways it can match (for backtracking).

I don't think Rakudo sticks to that calling convention though;
backtracking is somehow managed through a stack of integers (pointing to
positions of the string) inside a capture object. Or so. I'm really not
an expert when it comes to implementation details of the regex engine.


I also don't know what exact arguments a regex is passed at invocation.

Q: Is it possible to write a normal method that conforms to the same interface as 
rules/tokens (whatever that is), i.e. where we can use <normal_method> in 
rules/tokens which is passed arguments and somehow matches and sets position etc.?

See above. Tricky right now, because of the mismatched calling
convention between Rakudo and the specification.

Q: Are rules/tokens meant to be able to have arguments like methods and if so 
how do they fit in.


grammar A {
   token foo($x) { \' ~ \' $x };
   token TOP { <foo("bar")> }
};

say A.parse(q['bar']);  # matches
say A.parse(q['baz']);  # no match

grammars don't check whether the things in their tokens/rules like <foo> are 
actually defined until it comes time to call them

> Q: Is this the way it's meant to be?

Yes. Calls like <foo> are simply method calls, and method calls aren't
easily verifiable at compile time. It's perfectly fine to write


grammar Sentence {
   rule TOP { <subject> <verb> <object> }
}

and require that subclasses implement the rules subject, verb and object.
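
A minimal sketch of what such a subclass could look like. The name
English and the concrete token bodies are hypothetical, chosen only to
match the "John hit the ball" sentence from earlier in this thread.

```raku
grammar Sentence {
    rule TOP { <subject> <verb> <object> }
}

# The subclass supplies the rules that Sentence only refers to.
grammar English is Sentence {
    token subject { John }
    token verb    { hit }
    token object  { the \s+ ball }
}

say so English.parse('John hit the ball');
```

Calling Sentence.parse directly would only fail once <subject> is
actually invoked, since nothing checks for the method before then.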

I saw your post on doc.perl6.org docs. If I can get my head around all this I 
would be happy to help document grammars!


That would be very much appreciated.

I've also asked the author of
https://github.com/perlpilot/perl6-docs/blob/master/intro/p6-grammar-intro.pod
whether we can use that as a base for a regex/grammar tutorial on
doc.perl6.org, and knowing the author I don't think he'll object. (Just
want to make sure you don't duplicate effort in this area).


Cheers,
Moritz
