Author: larry
Date: Thu Sep  6 17:12:02 2007
New Revision: 14449

Modified:
   doc/trunk/design/syn/S05.pod

Log:
old <?foo> is now <+foo> to suppress capture
new <?foo> now is zero-width like <!foo>
clarifications on backtracking and longest-token semantics
minimal quantifiers are now considered to terminate a longest token
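
Reviewer's illustration (not part of the commit): a loose analogy for the
first two log lines, written with Python's re module rather than Perl 6.
The named group, non-capturing group, and lookaheads below are assumed
stand-ins for <alpha>, <+alpha>, <?alpha> and <!alpha>. They only show the
capture versus consume distinction, not Perl 6 semantics.

```python
import re

text = "x1"

# Like <alpha>: consume one letter and record it under a name.
capturing = re.match(r"(?P<alpha>[A-Za-z])", text)

# Like the new <+alpha>: consume one letter, record no named capture.
non_capturing = re.match(r"(?:[A-Za-z])", text)

# Like the new <?alpha>: succeed only before a letter, consume nothing.
zero_width = re.match(r"(?=[A-Za-z])", text)

# Like <!alpha>: succeed only when NOT before a letter, consume nothing.
negative = re.match(r"(?![A-Za-z])", text)

print(capturing.groupdict(), capturing.end())
print(non_capturing.groupdict(), non_capturing.end())
print(zero_width.groupdict(), zero_width.end())
print(negative)
```

The two consuming forms advance the match position by one character, while
the lookahead forms leave it where it was.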


Modified: doc/trunk/design/syn/S05.pod
==============================================================================
--- doc/trunk/design/syn/S05.pod        (original)
+++ doc/trunk/design/syn/S05.pod        Thu Sep  6 17:12:02 2007
@@ -14,9 +14,9 @@
    Maintainer: Patrick Michaud <[EMAIL PROTECTED]> and
                Larry Wall <[EMAIL PROTECTED]>
    Date: 24 Jun 2002
-   Last Modified: 17 Aug 2007
+   Last Modified: 6 Sep 2007
    Number: 5
-   Version: 63
+   Version: 64
 
 This document summarizes Apocalypse 5, which is about the new regex
 syntax.  We now try to call them I<regex> rather than "regular
@@ -247,13 +247,13 @@
 
 The new C<:s> (C<:sigspace>) modifier causes whitespace sequences
 to be considered "significant"; they are replaced by a whitespace
-matching rule, C<< <?ws> >>.  That is,
+matching rule, C<< <+ws> >>.  That is,
 
      m:s/ next cmd =   <condition>/
 
 is the same as:
 
-     m/ <?ws> next <?ws> cmd <?ws> = <?ws> <condition>/
+     m/ <+ws> next <+ws> cmd <+ws> = <+ws> <condition>/
 
 which is effectively the same as:
 
@@ -265,18 +265,18 @@
 
 or equivalently,
 
-     m { (a|\*) <?ws> (b|\+) }
+     m { (a|\*) <+ws> (b|\+) }
 
-C<< <?ws> >> can't decide what to do until it sees the data.
-It still does the right thing.  If not, define your own C<< <?ws> >>
+C<< <+ws> >> can't decide what to do until it sees the data.
+It still does the right thing.  If not, define your own C<< ws >>
 and C<:sigspace> will use that.
 
 In general you don't need to use C<:sigspace> within grammars because
 the parser rules automatically handle whitespace policy for you.
 In this context, whitespace often includes comments, depending on
 how the grammar chooses to define its whitespace rule.  Although the
-default C<< <?ws> >> subrule recognizes no comment construct, any
-grammar is free to override the rule.  The C<< <?ws> >> rule is not
+default C<< <+ws> >> subrule recognizes no comment construct, any
+grammar is free to override the rule.  The C<< <+ws> >> rule is not
 intended to mean the same thing everywhere.
 
 It's also possible to pass an argument to C<:sigspace> specifying
@@ -285,7 +285,7 @@
 important to distinguish the significant whitespace in the pattern from
 the "whitespace" being matched, so we'll call the pattern's whitespace
 I<sigspace>, and generally reserve I<whitespace> to indicate whatever
-C<< <?ws> >> matches in the current grammar. The correspondence
+C<< <+ws> >> matches in the current grammar. The correspondence
 between sigspace and whitespace is primarily metaphorical, which is
 why the correspondence is both useful and (potentially) confusing.
 
@@ -336,16 +336,16 @@
 If followed by an C<x>, it means repetition.  Use C<:x(4)> for the
 general form.  So
 
-     s:4x [ (<?ident>) = (\N+) $$] [$0 => $1];
+     s:4x [ (<+ident>) = (\N+) $$] [$0 => $1];
 
 is the same as:
 
-     s:x(4) [ (<?ident>) = (\N+) $$] [$0 => $1];
+     s:x(4) [ (<+ident>) = (\N+) $$] [$0 => $1];
 
 which is almost the same as:
 
      $_.pos = 0;
-     s:c[ (<?ident>) = (\N+) $$] = "$0 => $1" for 1..4;
+     s:c[ (<+ident>) = (\N+) $$] = "$0 => $1" for 1..4;
 
 except that the string is unchanged unless all four matches are found.
 However, ranges are allowed, so you can say C<:x(1..4)> to change anywhere
@@ -450,7 +450,7 @@
 and these are equivalent to
 
     $string ~~ m/^ \d+: $/;
-    $string ~~ m/^ <?ws> \d+: <?ws> $/;
+    $string ~~ m/^ <+ws> \d+: <+ws> $/;
 
 =item *
 
@@ -584,7 +584,8 @@
 erroneous to assume either order happens consistently.  The C<&&>
 form guarantees left-to-right order, and backtracking makes the right
 argument vary faster than the left.  In other words, C<&&> and C<||> establish
-sequence points.
+sequence points.  The left side may be backtracked into when backtracking
+is allowed into the construct as a whole.
 
 The C<&> operator is list associative like C<|>, but has slightly
 tighter precedence.  Likewise C<&&> has slightly tighter precedence
@@ -1008,10 +1009,14 @@
 To pass a string with leading whitespace, or to interpolate any values
 into the string, you must use the parenthesized form.
 
-If the first character is a plus or minus, the initial identifier
-is taken as a character class, so the first character after the
-identifier doesn't matter in this case, and you can use whitespace
-however you like.  Therefore
+If the first character is a plus or minus, the rest of the assertion
+is parsed as a set of character classes (though the definition of
+character class is intentionally vague, and may include any other rule
+whether it matches characters exclusively or not).
+
+An initial identifier is taken as a character class, so the first
+character after the identifier doesn't matter in this case, and you
+can use whitespace however you like.  Therefore
 
     <foo+bar-baz>
 
@@ -1054,15 +1059,21 @@
 
 =item *
 
-A leading C<?> causes the assertion not to capture what it matches (see
+A leading C<+> causes a named assertion not to capture what it matches (see
 L<Subrule captures>. For example:
 
      / <ident>  <ws>  /      # $/<ident> and $/<ws> both captured
-     / <?ident> <ws>  /      # only $/<ws> captured
-     / <?ident> <?ws> /      # nothing captured
+     / <+ident> <ws>  /      # only $/<ws> captured
+     / <+ident> <+ws> /      # nothing captured
 
 The non-capturing behavior may be overridden with a C<:keepall>.
 
+The rest of the assertion is reparsed as if the C<+> (and any following
+whitespace) weren't there, so it is legal (but redundant) to say:
+
+    <+++ws>
+    <+ + +ws>
+
 =item *
 
 A leading C<$> indicates an indirect subrule.  The variable must contain
@@ -1070,7 +1081,7 @@
 string is never matched literally.
 
 By default C<< <$foo> >> is captured into C<< $<foo> >>, but you can
-use the C<< <?$foo> >> form to suppress capture, and you can always say
+use the C<< <+$foo> >> form to suppress capture, and you can always say
 C<< $<$foo> := <$foo> >> if you prefer to include the sigil in the key.
 
 A subrule is considered declarative to the extent that the front of it
@@ -1098,7 +1109,7 @@
 matched literally.  (There is no difference for a C<Regex> object.)
 
 By default C<< <@foo> >> is captured into C<< $<foo> >>, but you can
-use the C<< <?@foo> >> form to suppress capture, and you can always say
+use the C<< <+@foo> >> form to suppress capture, and you can always say
 C<< $<@foo> := <@foo> >> if you prefer to include the sigil in the key.
 
 =item *
@@ -1109,7 +1120,7 @@
 and a closure may do whatever it likes.)
 
 By default C<< <%foo> >> is captured into C<< $<foo> >>, but you can
-use the C<< <?%foo> >> form to suppress capture, and you can always say
+use the C<< <+%foo> >> form to suppress capture, and you can always say
 C<< $<%foo> := <%foo> >> if you prefer to include the sigil in the key.
 
 As with bare hash, the longest key matches according to the venerable
@@ -1120,7 +1131,7 @@
 A leading C<{> indicates code that produces a regex to be interpolated
 into the pattern at that point as a subrule:
 
-     / (<?ident>)  <{ %cache{$0} //= get_body_for($0) }> /
+     / (<+ident>)  <{ %cache{$0} //= get_body_for($0) }> /
 
 The closure is guaranteed to be run at the canonical time; it declares
 a sequence point, and is considered to be procedural.
@@ -1185,16 +1196,15 @@
 
 =item *
 
-A leading C<[> or C<+> indicates an enumerated character class.  Ranges
+A leading C<[> indicates an enumerated character class.  Ranges
 in enumerated character classes are indicated with "C<..>" rather than "C<->".
 
      / <[a..z_]>* /
-     / <+[a..z_]>* /
-     / <+[ a..z _ ]>* /
-     / <+ [ a .. z _ ] >* /
 
 Whitespace is ignored within square brackets and after the initial C<+>.
 
+     / <[ a..z _ ]>* /
+
 =item *
 
 A leading C<-> indicates a complemented character class:
@@ -1209,6 +1219,15 @@
 
 =item *
 
+A leading C<+> may also be supplied to indicate that the following
+character class is to be matched in a positive sense:
+
+     / <+[a..z_]>* /
+     / <+[ a..z _ ]>* /
+     / <+[ a .. z _ ] >* /
+
+=item *
+
 Character classes can be combined (additively or subtractively) within
 a single set of angle brackets.  Whitespace is ignored. For example:
 
@@ -1220,7 +1239,7 @@
 
      / <+alpha-[Jj]> /              # J-less alpha
      / <alpha-[Jj]> /               # same thing
-     / <+ alpha - [ Jj ]> /         # still the same thing
+     / <+alpha - [ Jj ]> /          # still the same thing
 
 However, whitespace is not allowed after the first identifier if it
 immediately follows the left angle.
@@ -1254,6 +1273,17 @@
 
 =item *
 
+A leading C<?> indicates a positive zero-width assertion, and like C<!>
+merely reparses the rest of the assertion recursively as if the C<?>
+were not there.  In addition to forcing zero-width, it also suppresses
+any named capture:
+
+    <alpha>     # match a letter and capture in $<alpha>
+    <+alpha>    # match a letter, don't capture
+    <?alpha>    # match null before a letter, don't capture
+
+=item *
+
 A leading C<~~> indicates a recursive call back into some or all of
 the current rule.  An optional argument indicates which subpattern
 to re-use, and if provided must resolve to a single subpattern.
@@ -1326,7 +1356,7 @@
 
 A C<«> or C<<< << >>> token indicates a left word boundary.  A C<»> or
 C<<< >> >>> token indicates a right word boundary.  (As separate tokens,
-these need not be balanced.)  Perl 5's C<\b> is replaced by a C<< <?wb> >>
+these need not be balanced.)  Perl 5's C<\b> is replaced by a C<< <+wb> >>
 "word boundary" assertion, while C<\B> becomes C<< <!wb> >>.  (None of
 these are dependent on the definition of C<< <ws> >>, but only on the C<\w>
 definition of "word" characters.)
@@ -1672,6 +1702,11 @@
 (i.e. using a reserved word as a subroutine name is instantly fatal
 to the I<surrounding> match as well)
 
+If commit is given an argument, it's the name of a calling rule that
+should be committed:
+
+    <commit('infix')>
+
 =item *
 
 A C<< <cut> >> assertion always matches successfully, and has the
@@ -1749,7 +1784,7 @@
 
 For example:
 
-     split /<?null>/, $string
+     split /<+null>/, $string
 
 splits between characters.
 
@@ -1757,7 +1792,7 @@
 
 To match a null alternative, use:
 
-     /a|b|c|<?null>/
+     /a|b|c|<+null>/
 
 This makes it easier to catch errors like this:
 
@@ -1766,20 +1801,20 @@
 As a special case, however, the first null alternative in a match like
 
      mm/ [
-          | if :: <expr> <block>
-          | for :: <list> <block>
-          | loop :: <loop_controls>? <block>
-          ]
+         | if :: <expr> <block>
+         | for :: <list> <block>
+         | loop :: <loop_controls>? <block>
+         ]
      /
 
 is simply ignored.  Only the first alternative is special that way.
 If you write:
 
      mm/ [
-              if :: <expr> <block>              |
-              for :: <list> <block>             |
-              loop :: <loop_controls>? <block>  |
-          ]
+             if :: <expr> <block>              |
+             for :: <list> <block>             |
+             loop :: <loop_controls>? <block>  |
+         ]
      /
 
 
@@ -1793,6 +1828,8 @@
      $something = "";
      /a|b|c|$something/;
 
+In particular, C<< <?> >> also matches the null string, and C<< <!> >> always fails.
+
 =back
 
 =head1 Longest-token matching
@@ -1816,7 +1853,8 @@
 if at least the token processing proceeds deterministically.  So for
 regex matching purposes we define token patterns as those patterns
 containing no whitespace that can be matched without side effects
-or self-reference.
+or self-reference.  Basically, Perl automatically derives a lexer
+from the grammar without you having to write one yourself.
 
 To that end, every regex in Perl 6 is required to be able to
 distinguish its "pure" patterns from its actions, and return its
@@ -1849,8 +1887,13 @@
 
 =item *
 
-Any {...} action, but not an assertion containing a closure, nor a
-C<**{...}> quantifier if the closure returns an immutable selector.
+Any atom that is quantified with a minimal match (using the C<?> modifier).
+
+=item *
+
+Any C<{...}> action, but not an assertion containing a closure.
+The closure form of the general C<**{...}> quantifier terminates the
+longest token, but not the closureless forms.
 
 =item *
 
@@ -1872,7 +1915,7 @@
 are simulated in any of various ways, such as by Thompson NFA, it may
 be possible to know when to fire off the assertions without backchecks.)
 
-Ordinary quantifiers and characters classes do not terminate a token pattern.
+Greedy quantifiers and character classes do not terminate a token pattern.
 Zero-width assertions such as word boundaries are also okay.
 
 Oddly enough, the C<token> keyword specifically does not determine
@@ -1948,9 +1991,9 @@
 In string context it evaluates to the stringified value of its
 I<result object>, which is usually the entire matched string:
 
-     print %hash{ "{$text ~~ /<?ident>/}" };
+     print %hash{ "{$text ~~ /<+ident>/}" };
      # or equivalently:
-     $text ~~ /<?ident>/  &&  print %hash{~$/};
+     $text ~~ /<+ident>/  &&  print %hash{~$/};
 
 But generally you should say C<~$/> if you mean C<~$/>.
 
@@ -3142,10 +3185,10 @@
 the angles is used as part of the key.  Suppose the earlier example
 parsed whitespace:
 
-     / <key> <?ws> '=>' <?ws> <value> { %hash{$<key>} = $<value> } /
+     / <key> <+ws> '=>' <+ws> <value> { %hash{$<key>} = $<value> } /
 
-The two instances of C<< <?ws> >> above would store an array of two
-values accessible as C<< @<?ws> >>.  It would also store the literal
+The two instances of C<< <+ws> >> above would store an array of two
+values accessible as C<< @<+ws> >>.  It would also store the literal
 match into C<< $<'=\>'> >>.  Just to make sure nothing is forgotten,
 under C<:keepall> any text or whitespace not otherwise remembered is
 attached as an extra property on the subsequent node. (The name of

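Reviewer's illustration (not part of the commit): a minimal sketch of the
difference between first-match alternation and longest-token selection,
and between greedy and minimal quantifiers, again using Python's re module
as an assumed stand-in. The helper name longest_token is hypothetical, and
breaking ties in favour of the earliest alternative is an assumption of
this sketch. It does not model the rules for where a token pattern
terminates.

```python
import re

def longest_token(alternatives, text, pos=0):
    """Try every alternative at pos and return the longest match, or None.

    Ties go to the earliest alternative (an assumption of this sketch).
    """
    best = None
    for pattern in alternatives:
        m = re.compile(pattern).match(text, pos)
        if m and (best is None or m.end() > best.end()):
            best = m
    return best

# Python alternation stops at the first alternative that matches.
first = re.match(r"if|ifdef", "ifdef")

# Longest-token selection compares all alternatives and keeps the longest.
longest = longest_token([r"if", r"ifdef"], "ifdef")

# A greedy quantifier takes as much as it can; a minimal one as little.
greedy = re.match(r"a+", "aaab")
minimal = re.match(r"a+?", "aaab")

print(first.group(), longest.group(), greedy.group(), minimal.group())
```

A minimal quantifier's extent depends on what follows it, which gives some
intuition for why the revised text treats it as ending the longest token.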