Luke Blanshard writes:
> Luke Palmer wrote:
> >This list is for people interested in building the Perl 6 compiler. Now
> >you have your first real task!
> >
> >We have to make a formal grammar for Perl 6. Perl 6 is a huge language,
> >so the task seems better done incrementally by the community...
> >
> >Send patches to this list.
>
> OK, I'll bite. In contrast to Luke's 50-thousand-foot level, I'm
> diving down into the goriest of details. At the end of this message
> is a rule for whitespace within Perl code, and supporting rules for
> comments and pod.

Excellent!

> I'm not posting this as a diff, because I have the faint suspicion
> that others might have been hacking on this file offline. But I
> gather these rules should go in the "TOKENS" section.

Well, you might think that, but it wouldn't be true. I haven't touched
it since I sent it out. I *should* have, so your suspicion is
well-founded. :-)

> [By the way, shouldn't this grammar be called "Perl" rather than
> "Perl6::Grammar"? Also, is this file now available in some repository
> somewhere?]

Grammars and classes share a namespace, so I think Perl::Grammar is
correct. That's what it's currently called in the repository, which is
at:

    https://svn.perl.org/perl6/grammar/trunk

> I'd like reviewers to pay special attention to the pod stuff. It's
> not clear to me what the precise rules are or should be for blank
> lines preceding pod commands. I got from S02 the idea that we should
> allow standalone =begin/=end sections (and that they should nest).
> But does the =end line have to be preceded by a blank line?

Nope.

> As far as I can tell, the =begin line does not. In the interest of
> symmetry, I have written the rules to not require a blank line before
> the closing =end either. Even though this appears to violate the
> usual rules for pod.

Yeah, we're changing the blank line requirement, mostly to make
bulleted lists take up less vertical space.

> (Another guy called) Luke
>
>
> ====================================
>
> # Whitespace definition for Perl code.
> rule ws() {
>     # Case 1: Unicode space characters, comments, or POD blocks, or
>     # any combination thereof.
>     [ \s | «comment» | «pod» ]+

I changed your «comment» and «pod» to <comment> and <pod>. We don't
have a policy yet on what we're capturing and how, so I'm just leaving
all the angle brackets single. Once we decide how our resultant data
structure should look, we can go back and change them.

>
>     # Case 2: We're looking at a non-word-constituent or EOF,
>     # meaning zero-width counts as whitespace.
>     | <before \W> | $
>
>     # Case 3: We must be looking at a word constituent. We match
>     # whitespace at BOF or after a non-word-constituent.
>     | ^ | <after \W>

I'm going to kill these last two cases. The rules for where whitespace
is optional are more complex than whether you're on a word constituent
or not. The user of the ws rule is going to know whether whitespace is
optional or required in a particular position, so he can put <ws> or
<ws>? as he needs to.

Also, if we're being good little boys, we'll be putting backtracking
colons after our identifier matches, so a <ws> rule will never show up
in the middle of an identifier.

> }
>
> # Comment definition for Perl code.
> rule comment() {
>     # A hash ("#"), then everything through the next newline or EOF.
>     <'#'> .*? [ \n | $ ]
> }

I factored <'#'> out into <comment_introducer>. We're putting all
token characters into their own rules so it's easy for extenders to
change them.

> # A POD block, as extended for P6. This is a =begin/=end pair, a =for
> # paragraph, or a standard =<anything>/=cut block.
> rule pod() {
>     # Case 1: a =begin/=end block, in its own rule so it can
>     # recurse.
>     «pod_begin_end_block»
>
>     # Case 2: a =for paragraph. "=for" at BOL, plus any space
>     # character, starts it, and the first blank line (or EOF) ends
>     # it.
>     | ^^=for \s :: .*? [ \n \h* \n | $ ]
>
>     # Case 3: any arbitrary POD block. Starts with "=" at BOL,
>     # followed by a letter, ends with "=cut" at BOL or at EOF.
>     | ^^=<+<alpha>> :: .*? [ \n =cut [ \s | $ ] | $ ]
> }

Factored = out into <pod_introducer>.

> # A (recursive) =begin/=end POD block.
> rule pod_begin_end_block() {
>     # Starts with "=begin" at BOL, followed by an optional name
>     # which we save to match with the corresponding "=end".
>     ^^=begin [ \h+ $<name> := (\S+) | \h* \n ]
>
>     # Next comes any number of single characters or nested =begin/
>     # =end blocks -- but the smallest number that will match...
>     [ . | «pod_begin_end_block» ]*?

Reversed as you requested, and added an alternative between them that
speeds things up. That's right, I'm preprematurely optimizing.

>
>     # ...an "=end" at BOL followed by the name saved above, or
>     # followed by nothing if there wasn't one. If we make it to EOF
>     # without finding the "=end" line, we blow up.
>     [
>         ^^=end [ <( $<name> )> :: \h+ $<name> | <null> ] \h* [ \n | $ ]
>     |
>         $ <commit> { fail "Unterminated =begin/=end block" }
>     ]
> }

Okay, it's in. I can't say it's correct, since I've never been very
good at writing regexes, and this is certainly more like a regex than
like a grammar. When I wasn't sure how something worked, I just assumed
you did it right.

However, we'd like to eventually make the POD rule less like a match
and more like a parse. The POD sections are going to be stored as
metadata for the program to grab if it needs to. Right now, it just
pretends it's all a comment.

Luke
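
For anyone who wants to poke at the behaviour discussed in this thread,
here is a minimal sketch of the trimmed-down ws rule (whitespace,
comments, and pod only, with the two zero-width cases dropped). It is
written in Python rather than in Perl 6 rule syntax so that it can be
run as-is, and it is not the code in the repository. The names skip_ws,
skip_pod, skip_begin_end, and line_end are hypothetical, and the
=begin/=end handling is an assumption-laden simplification: a directive
must occupy a whole line, and an =end only closes a block when its name
matches the innermost open =begin.

```python
import re

# Hypothetical helper names; none of them come from the grammar file.
COMMENT = re.compile(r"#[^\n]*\n?")            # "#" through newline or EOF
DIRECTIVE = re.compile(r"=(begin|end)(?:[ \t]+(\S+))?[ \t]*")
BLANK_LINE = re.compile(r"\n[ \t]*\n")         # ends a =for paragraph
CUT = re.compile(r"\n=cut(?:\s|\Z)")           # ends a generic pod block


def line_end(src, pos):
    """Index just past the newline that ends the line containing pos."""
    eol = src.find("\n", pos)
    return len(src) if eol == -1 else eol + 1


def skip_begin_end(src, pos):
    """Consume a =begin/=end block starting at pos; blocks may nest."""
    stack = []
    while pos < len(src):
        end = line_end(src, pos)
        m = DIRECTIVE.fullmatch(src[pos:end].rstrip("\n"))
        pos = end
        if m and m.group(1) == "begin":
            stack.append(m.group(2))           # None when the block is unnamed
        elif m and m.group(1) == "end" and stack and m.group(2) == stack[-1]:
            stack.pop()
            if not stack:
                return pos
    raise SyntaxError("Unterminated =begin/=end block")


def skip_pod(src, pos):
    """Consume one pod block at pos (assumed to be at beginning of line)."""
    line = src[pos:line_end(src, pos)].rstrip("\n")
    m = DIRECTIVE.fullmatch(line)
    if m and m.group(1) == "begin":
        return skip_begin_end(src, pos)
    if re.match(r"=for\s", src[pos:pos + 5]):
        m = BLANK_LINE.search(src, pos)        # first blank line, else EOF
        return m.end() if m else len(src)
    if re.match(r"=[A-Za-z]", src[pos:pos + 2]):
        m = CUT.search(src, pos)               # "=cut" at BOL, else EOF
        return m.end() if m else len(src)
    return pos                                 # not pod after all


def skip_ws(src, pos=0):
    """Return the offset after any run of whitespace, comments, and pod."""
    while pos < len(src):
        at_bol = pos == 0 or src[pos - 1] == "\n"
        if src[pos].isspace():
            pos += 1
        elif src[pos] == "#":
            pos = COMMENT.match(src, pos).end()
        elif at_bol and src[pos] == "=":
            new_pos = skip_pod(src, pos)
            if new_pos == pos:
                break
            pos = new_pos
        else:
            break
    return pos
```

A caller that needs mandatory whitespace can compare the returned
offset with the one it passed in: equal offsets mean nothing was
consumed. That mirrors the choice between <ws> and <ws>? described in
the reply, where the user of the rule decides whether whitespace is
required at a given position.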