Luke Blanshard writes:
> Luke Palmer wrote:
> >This list is for people interested in building the Perl 6 compiler.  Now
> >you have your first real task!  
> >
> >We have to make a formal grammar for Perl 6.  Perl 6 is a huge language,
> >so the task seems better done incrementally by the community...
> >
> >Send patches to this list.
> 
> OK, I'll bite.  In contrast to Luke's 50-thousand-foot level, I'm
> diving down into the goriest of details.  At the end of this message
> is a rule for whitespace within Perl code, and supporting rules for
> comments and pod.

Excellent! 

> I'm not posting this as a diff, because I have the faint suspicion
> that others might have been hacking on this file offline.  But I
> gather these rules should go in the "TOKENS" section.

Well, you might think that, but it wouldn't be true.  I haven't touched
it since I sent it out.  I *should* have, so your suspicion is
well-founded. :-)

> [By the way, shouldn't this grammar be called "Perl" rather than
> "Perl6::Grammar"?  Also, is this file now available in some repository
> somewhere?]

Grammars and classes share a namespace, so I think Perl::Grammar is
correct.  That's what it's currently called in the repository, which is
at:

    https://svn.perl.org/perl6/grammar/trunk

> I'd like reviewers to pay special attention to the pod stuff.  It's
> not clear to me what the precise rules are or should be for blank
> lines preceding pod commands.  I got from S02 the idea that we should
> allow standalone =begin/=end sections (and that they should nest).
> But does the =end line have to be preceded by a blank line?

Nope.

> As far as I can tell, the =begin line does not.  In the interest of
> symmetry, I have written the rules to not require a blank line before
> the closing =end either.  Even though this appears to violate the
> usual rules for pod.

Yeah, we're changing the blank line requirement, mostly to make
bulleted lists take up less vertical space.

> (Another guy called) Luke
> 
> 
> ====================================
> 
> # Whitespace definition for Perl code.
> rule ws() {
>       # Case 1: Unicode space characters, comments, or POD blocks, or
>       # any combination thereof.
>     [ \s | «comment» | «pod» ]+

I changed your «comment» and «pod» to <comment> and <pod>.  We don't
have a policy yet on what we're capturing and how, so I'm just leaving
all the angle brackets single.  Once we decide how our resultant data
structure should look, we can go back and change them.

> 
>       # Case 2: We're looking at a non-word-constituent or EOF,
>       # meaning zero-width counts as whitespace.
>   | <before \W> | $
> 
>       # Case 3: We must be looking at a word constituent.  We match
>       # whitespace at BOF or after a non-word-constituent.
>   | ^ | <after \W>

I'm going to kill these last two cases.  The rules for where whitespace
is optional are more complex than whether you're on a word constituent
or not.  The user of the ws rule is going to know whether whitespace is
optional or required in a particular position, so he can put <ws> or
<ws>? as he needs to.  Also, if we're being good little boys, we'll be
putting backtracking colons after our identifier matches, so a <ws> rule
will never show up in the middle of an identifier.
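
As a rough sketch of that division of labour, here is a Python
approximation (not Perl 6 rule syntax) of a ws pattern that only matches
real whitespace and comments, with the caller choosing whether it is
required or optional.  The names WS, REQUIRED and OPTIONAL and the two
sample statements are made up for illustration, and pod is left out for
brevity.

```python
import re

# Hypothetical stand-in for the trimmed-down ws rule: one or more
# whitespace characters or "#" comments (pod omitted for brevity).
WS = r"(?:\s|#[^\n]*(?:\n|\Z))+"

# The caller, not the ws rule, decides where whitespace is required...
REQUIRED = re.compile(r"my" + WS + r"\$x")            # roughly: my <ws> $x

# ...and where it is optional.
OPTIONAL = re.compile(                                # roughly: $x <ws>? = <ws>? 1
    r"\$x(?:" + WS + r")?=(?:" + WS + r")?1"
)
```

With this split, "my$x" is rejected because the caller asked for
whitespace there, while "$x=1" and "$x = 1" are both accepted.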

> }
> 
> # Comment definition for Perl code.
> rule comment() {
>       # A hash ("#"), then everything through the next newline or EOF.
>     <'#'> .*? [ \n | $ ]
> }

I factored <'#'> out into <comment_introducer>.  We're putting all token
characters into their own rules so it's easy for extenders to change
them.
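
To illustrate why factoring the introducer out helps extenders, here is
a small Python approximation (again, not Perl 6 syntax).  The function
name make_comment_rule and the "--" variant are hypothetical; only the
"#" introducer comes from the rule above.

```python
import re

def make_comment_rule(introducer="#"):
    """Build a comment matcher: the introducer, then everything through
    the next newline or the end of input."""
    return re.compile(re.escape(introducer) + r"[^\n]*(?:\n|\Z)")

# The default behaves like the comment rule quoted above.
perl_comment = make_comment_rule()

# An extender only has to swap the introducer (hypothetical variant).
dashed_comment = make_comment_rule("--")
```

Only the introducer changes between the two matchers; the "through the
next newline or EOF" part is shared.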

> # A POD block, as extended for P6.  This is a =begin/=end pair, a =for
> # paragraph, or a standard =<anything>/=cut block.
> rule pod() {
>       # Case 1: a =begin/=end block, in its own rule so it can
>       # recurse.
>     «pod_begin_end_block»
> 
>       # Case 2: a =for paragraph.  "=for" at BOL, plus any space
>       # character, starts it, and the first blank line (or EOF) ends
>       # it.
>   | ^^=for \s :: .*? [ \n \h* \n | $ ]
> 
>       # Case 3: any arbitrary POD block.  Starts with "=" at BOL,
>       # followed by a letter, ends with "=cut" at BOL or at EOF.
>   | ^^=<+<alpha>> :: .*? [ \n =cut [ \s | $ ] | $ ]
> }

Factored = out into <pod_introducer>.

> # A (recursive) =begin/=end POD block.
> rule pod_begin_end_block() {
>       # Starts with "=begin" at BOL, followed by an optional name
>       # which we save to match with the corresponding "=end".
>     ^^=begin [ \h+ $<name> := (\S+) | \h* \n ]
> 
>       # Next comes any number of single characters or nested =begin/
>       # =end blocks -- but the smallest number that will match...
>     [ . | «pod_begin_end_block» ]*?

Reversed as you requested, and added an alternative between them that
speeds things up.  That's right, I'm preprematurely optimizing.

> 
>       # ...an "=end" at BOL followed by the name saved above, or
>       # followed by nothing if there wasn't one.  If we make it to EOF
>       # without finding the "=end" line, we blow up.
>     [
>       ^^=end [ <( $<name> )> :: \h+ $<name> | <null> ] \h* [ \n | $ ]
>     |
>       $ <commit> { fail "Unterminated =begin/=end block" }
>     ]
> }

Okay, it's in.  I can't say it's correct, since I've never been very
good at writing regexes, and this is certainly more like a regex than
like a grammar.  When I wasn't sure how something worked, I just assumed
you did it right.

However, we'd like to eventually make the POD rule less like a match and
more like a parse.  The POD sections are going to be stored as metadata
for the program to grab if it needs to.  Right now, it just pretends
it's all a comment.
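
For anyone checking the intended behaviour of the =begin/=end rule, the
nesting and name matching can be sketched line by line in Python.  This
is an approximation under assumptions, not a translation of the rule:
the function name skip_pod_block, the use of a stack, and working on a
list of lines without trailing newlines are all choices made for the
example.

```python
import re

# "=begin" followed by a name, or by nothing but horizontal space.
BEGIN = re.compile(r"^=begin(?:[ \t]+(\S+)|[ \t]*$)")
# "=end" with an optional name, then only horizontal space.
END = re.compile(r"^=end(?:[ \t]+(\S+))?[ \t]*$")

def skip_pod_block(lines, start):
    """Return the index just past the =begin/=end block opening at
    lines[start].

    Nested blocks are tracked on a stack of names; an =end closes the
    innermost block only if it carries the same name (or lack of one).
    """
    opener = BEGIN.match(lines[start])
    if opener is None:
        raise ValueError("not at a =begin line")
    stack = [opener.group(1)]
    for i in range(start + 1, len(lines)):
        begin = BEGIN.match(lines[i])
        end = END.match(lines[i])
        if begin:
            stack.append(begin.group(1))
        elif end and end.group(1) == stack[-1]:
            stack.pop()
            if not stack:
                return i + 1
    # Reached EOF without the matching =end, as in the rule's last branch.
    raise SyntaxError("Unterminated =begin/=end block")
```

An =end whose name does not match the innermost =begin is treated as
ordinary block content here, so a mismatched pair ends in the
"Unterminated" error at EOF.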

Luke
