Autrijus wrote:
/me eagerly awaits new revelation from Damian...
Be careful what you wish for. Here's draft zero. ;-)
Note that there may still be bugs in the examples, or even in the design.
@Larry has thrashed this through pretty carefully, and Patrick has implemented it for PGE, but it's 10.30 at night after a full day's teaching, so I may have transcribed the post-thrashing, post-implementation corrections incorrectly. %-)
Damian
-----cut----------cut----------cut----------cut----------cut----------cut-----
=head1 Perl 6 rules capturing semantics
=head2 Match objects
All match attempts--successful or not--against any rule, subrule, or subpattern (see below) return an object of (or derived from) class C<Match>. That is:
$match_obj = $str ~~ /pattern/; say "Matched" if $match_obj;
In any code that is not nested inside a rule, this returned object is also automagically assigned to the lexical C<$/> variable. That is:
$str ~~ /pattern/; say "Matched" if $/;
In any code that is nested inside a rule, the C<$/> variable holds the surrounding rule's nascent C<Match> object (which can be modified via the internal C<$/>). For example:
    $str ~~ / foo                            # Match 'foo'
              { $/ = new Match: :str<bar> }  # But pretend we matched 'bar'
            /;
C<Match> objects have methods that provide additional information about the match. For example:
    if m/ def <ident> <codeblock> / {
        say "Found sub def between index $/.from() and index { $/.to()-1 }";
    }
A C<Match> object can also be treated as a boolean, an integer, a string, an array, or a hash. See below.
=head2 Match results
A failed match returns a C<Match> object whose boolean value is false, whose integer value is zero, whose string value is C<"">, and whose array and hash components are empty. For example:
"bard" ~~ /food/; say "Poet inedible" unless $/;
A successful match returns a C<Match> object whose boolean value is true, whose integer value is typically 1 (except under the C<:g> or C<:x> flags; see L<Capturing from non-singular matches>), whose string value is the complete substring that was matched by the entire rule, whose array component contains all subpattern (unnamed) captures, and whose hash component contains all subrule (named) captures. For example:
    if ($/) {
        $count += $/;
        say "Matched the substring: $/";
        say "Parens captured: @{$/}";
        say 'Subrules captured:';
        for %{$/}.kv -> $subrule_name, $substr {
            say "\t$subrule_name: $substr";
        }
    }
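The coercions described in this section can be modelled outside Perl. The following is a minimal Python sketch, not an implementation of the real C<Match> class: the class layout and the attribute names C<ok>, C<text>, C<positional> and C<named> are assumptions made purely for illustration.

```python
class Match:
    """Toy model of the Match object described above (not a real Perl 6 API)."""

    def __init__(self, text="", ok=True, positional=None, named=None):
        self.ok = ok                                # did the match succeed?
        self.text = text if ok else ""              # complete matched substring
        self.positional = list(positional or [])    # subpattern (unnamed) captures
        self.named = dict(named or {})              # subrule (named) captures

    def __bool__(self):          # boolean value: true only on success
        return self.ok

    def __int__(self):           # integer value: 1 on success, 0 on failure
        return 1 if self.ok else 0

    def __str__(self):           # string value: the matched substring
        return self.text

    def __getitem__(self, key):
        # Integer keys index the array component, string keys the hash.
        if isinstance(key, int):
            return self.positional[key]
        return self.named[key]


failed = Match(ok=False)
date = Match("1999-12-31",
             positional=[Match("1999"), Match("12"), Match("31")])
```

Here C<failed> behaves as false, zero and the empty string, while C<date[0]> is itself a full match object, mirroring the way C<$/[0]> (i.e. C<$1>) is described above. The C<:g> and C<:x> cases, where the integer value may exceed 1, are deliberately left out.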
=head2 Subpattern captures
Any part of a rule enclosed in capturing parentheses is called a I<subpattern>. For example:
    #                  subpattern
    #     _________________/\____________________
    #    |                                       |
    #    |       subpattern subpattern           |
    #    |          __/\__    __/\__             |
    #    |         |      |  |      |            |
    m:w/ (I am the (walrus), ( khoo )**{2} kachoo) /;
Each subpattern in a rule produces a C<Match> object if it is successfully matched. This object is assigned into the array inside the C<Match> object belonging to the surrounding scope -- either the C<Match> object of the innermost surrounding subpattern (if the subpattern is nested) or else the C<Match> object of the rule itself. These assignments to the array are, of course, undone if the subpattern is backtracked out of.
For example, if the following pattern matched successfully:
    #                   subpat-A
    #     _________________/\____________________
    #    |                                       |
    #    |         subpat-B  subpat-C            |
    #    |          __/\__    __/\__             |
    #    |         |      |  |      |            |
    m:w/ (I am the (walrus), ( khoo )**{2} kachoo) /;
then the C<Match> objects representing the matches made by subpat-B and subpat-C would be successively assigned into the array inside subpat-A's C<Match> object. Then subpat-A's C<Match> object would be assigned into the array inside the C<Match> object for the entire rule (i.e. C<$/>'s array).
The array elements of a C<Match> object are referred to using either the standard array access notation (e.g. C<$/[0]>, C<$/[1]>, C<$/[2]>, etc.) or else via the corresponding lexically scoped numeric aliases (i.e. C<$1>, C<$2>, C<$3>, etc.)
So:
say "$/[1] found between $/[0] and $/[2]";
is the same as:
say "$2 found between $1 and $3";
Note that the standard array access notation uses zero-based indices (0,1,2...), whereas the corresponding numeric variables are numbered by ordinal position (1,2,3...)
Since the array elements of the rule's C<Match> object (i.e. C<$/>) store individual C<Match> objects representing the substrings that were matched and captured by the first, second, third, etc. I<outermost> (i.e. unnested) subpatterns, these elements can be treated like fully fledged match results. For example:
    if m/ (\d\d\d\d)-(\d\d)-(\d\d) (BCE?|AD|CE)?/ {
        ($yr, $mon, $day) = ($1, $2, $3);    # Or: ($yr, $mon, $day) = $/[0..2]
        $era = $4 if $4;                     # Tests if 4th parens matched
        @datepos = ($1.from() .. $3.to()-1); # $1, $2, etc. are full Match objs
    }
=head2 Nested subpattern captures
Nested subpatterns (i.e. nested capturing parens) are I<not> captured directly into the array of the rule's C<Match> object. Instead, the captures made by nested subpatterns appear in the array inside the C<Match> object belonging to the surrounding subpattern. This is quite different to Perl 5 semantics:
    # Perl 5...
    #
    #  $1-----------------------------  $5---------- $6---------------------
    #  |     $2--- $3--------------- | |          | |     $7--- $8------- |
    #  |     |   | |         $4--- | | |          | |     |   | |       | |
    #  |     |   | |         |   | | | |          | |     |   | |       | |
    m/ ( The (\S+) (guy|gal|g(\S+) ) ) (sees|calls) ( the (\S+) (gal|guy) ) /;
In Perl 6, nested parens produce properly nested captures:
    # Perl 6...
    #
    #  $1-----------------------------  $2---------- $3---------------------
    #  |     $1[0] $1[1]------------ | |          | |     $3[0] $3[1]---- |
    #  |     |   | |      $1[1][0] | | |          | |     |   | |       | |
    #  |     |   | |         |   | | | |          | |     |   | |       | |
    m/ ( The (\S+) (guy|gal|g(\S+) ) ) (sees|calls) ( the (\S+) (gal|guy) ) /;
This means that the internal structure of the arrays in a rule's final C<Match> object mirrors (and preserves!) both the nesting structure of subpatterns in the rule, and the dynamic structure of the hierarchical way in which those subpatterns matched. This "reconstructability" can be taken even further (see L<The C<:parsetree> flag> below).
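To make the contrast concrete, here is a small Python sketch. Python's standard C<re> module numbers groups the Perl 5 way (by opening parenthesis), so it shows the flat layout directly. The helper C<nest> is hypothetical: it rebuilds a Perl 6-style nested layout from group spans, which is only a heuristic (zero-width sibling captures could be mis-nested) and says nothing about how PGE does it.

```python
import re

# Flat, Perl 5-style numbering: groups are numbered by opening parenthesis.
pattern = re.compile(r"(The (\S+) (guy|gal|g(\S+)))")
m = pattern.match("The lucky girl")
flat = m.groups()


def nest(match):
    """Arrange the flat groups of a match into a tree by span containment."""
    root = {"text": match.group(0), "span": match.span(0), "subs": []}
    stack = [root]
    for i in range(1, match.re.groups + 1):
        if match.group(i) is None:      # group did not participate
            continue
        start, end = match.span(i)
        node = {"text": match.group(i), "span": (start, end), "subs": []}
        # Pop until the innermost open group encloses this one.
        while not (stack[-1]["span"][0] <= start
                   and end <= stack[-1]["span"][1]):
            stack.pop()
        stack[-1]["subs"].append(node)
        stack.append(node)
    return root


tree = nest(m)
# Perl 6 would call this capture $1[1][0]; Perl 5 calls it $4.
innermost = tree["subs"][0]["subs"][1]["subs"][0]["text"]
```

In the flat tuple the innermost capture is simply the fourth element, whereas in the tree it is reached by following the same path as the nesting of the parentheses.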
There may also be shortcuts for accessing nested components of a subpattern, specifically:
    # Perl 6...
    #
    #  $1-----------------------------  $2---------- $3---------------------
    #  |     $1.1  $1.2------------- | |          | |     $3.1  $3.2----- |
    #  |     |   | |        $1.2.1 | | |          | |     |   | |       | |
    #  |     |   | |         |   | | | |          | |     |   | |       | |
    m/ ( The (\S+) (guy|gal|g(\S+) ) ) (sees|calls) ( the (\S+) (gal|guy) ) /;
but this has not yet been decided.
=head2 Quantified subpattern captures
If a subpattern is directly quantified using any quantifier -- except C<?> or C<??> -- it no longer produces a single C<Match> object. Instead, it produces an array of C<Match> objects, which will have been collected from the sequence of individual matches made by the repeated subpattern.
Because a quantified subpattern returns an array of C<Match> objects, the corresponding array element for the quantified capture will store an array reference, rather than a single C<Match> object. For example:
    #       $1         $2
    if m/ (\w+) \: (\w+ \s+)* / {
        say "Key was: $1";         # Unquantified subpat produces single Match
        say "Values were: @{$2}";  # Quantified subpat produces array of Matches
    }
Note that whether a quantified subpattern returns a single C<Match> object or an array of C<Match> objects is determined statically (by the nature of the quantifier), not dynamically (by the actual number of repetitions that occur in the match).
If a subpattern is directly quantified using the C<?> or C<??> quantifier, it produces a single C<Match> object. That object is "successful" if the subpattern did match, and "unsuccessful" if it was skipped. That is:
    if m/ next (\w+)? if (.*) / {
        say "Found a 'next'";
        say "(targeted at $1)" if $1;
        say "Condition was: $2";
    }
Note that if a capture is quantified as optional in this way, a C<Match> object is I<always> generated and assigned into the array inside the surrounding scope's C<Match> object. This ensures that the index/ordinal of subsequent subpatterns can still be determined statically.
=head2 Indirectly quantified subpattern captures
A subpattern may sometimes be nested inside a quantified non-capturing structure:
    #       non-capturing    quantified
    #   __________/\_________  __/\__
    #  |                     ||      |
    #  |  $1         $2      ||      |
    #  |  _^_      ___^___   ||      |
    #  | |   |    |       |  ||      |
    m/ [ (\w+) \: (\w+ \s+)* ]**{2...} /
Non-capturing brackets I<don't> create a separate nested lexical scope, so the two subpatterns inside them are actually still in the rule's top-level scope. Hence their top-level designations: C<$1> and C<$2>. Such subpatterns are called "indirectly quantified" subpatterns. In Perl 5, any repeated captures of this kind:
    # Perl 5 equivalent...
    m/ (?: (\w+) \: (\w+ \s+)* ){2,} /x
would overwrite the previous captures to C<$1> and C<$2> each time the surrounding non-capturing parens iterated. So C<$1> and C<$2> would contain only the captures from the final repetition.
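The overwriting behaviour is easy to observe in any engine that follows the Perl 5 convention. As an illustration only, here it is in Python's standard C<re> module; the sample string and the simplified C<key=value;> pattern are made up for the demonstration.

```python
import re

text = "a=1;b=2;c=3;"

# Capture groups inside a repeated non-capturing group keep only the
# value from the final iteration (the Perl 5 convention).
m = re.fullmatch(r"(?:(\w+)=(\d+);)+", text)
last_only = m.groups()

# To see every iteration, the captures must be collected separately,
# for example with findall on the body of the repetition.
every_iteration = re.findall(r"(\w+)=(\d+);", text)
keys = [k for k, _ in every_iteration]
values = [v for _, v in every_iteration]
```

After the match, C<last_only> holds just the final key and value, while C<keys> and C<values> hold one entry per repetition.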
This does not happen in Perl 6. Any indirectly quantified subpattern is treated like a directly quantified subpattern. Specifically, an indirectly quantified subpattern also returns an array of C<Match> objects, so the corresponding array element for the indirectly quantified capture will store an array reference, rather than a single C<Match> object.
if m/ [ (\w+) \: (\w+ \s+)* ]**{2...} / { say "Keys were: @{$1}"; say "Values were: @{$2}"; }
Remember though that, if the outer quantified structure is a I<capturing> structure (i.e. a subpattern) then it I<will> introduce a nested lexical scope. That outer quantified structure will then return an array of C<Match> objects representing the captures of the inner parens for I<every> iteration (as described above).
Using non-capturing brackets for the outer quantifier causes all of the inner subpatterns to flatten their captures into C<$1> and C<$2>, whereas using capturing parentheses for the outer quantifier retains the internal match structure of each repetition. That is:
    #              $/[0]
    #      __________/\_________
    #     |                     |
    #     |$/[0][0]  $/[0][1]   |
    #     |  _^_      ___^___   |
    #     | |   |    |       |  |
    if m/ ( (\w+) \: (\w+ \s+)* )**{2...} / {
        # Outer subpattern ($/[0]) quantified, so $1 contains an array.
        # Let's iterate it...
        for @{$1}.kv -> $i, $inner_subpatterns {
            # First inner subpattern ($/[0][0]) is unquantified, so it
            # produces a single Match...
            say "Key $i was: $inner_subpatterns[0]";
            # Second inner subpattern ($/[0][1]) is quantified, so it
            # produces an array of Matches...
            say "Values $i were: @{$inner_subpatterns[1]}";
        }
    }
=head2 Subpattern numbering
As the previous sections explained, the index/ordinal of a given subpattern can always be statically determined. However, this does not mean that they have to be monotonically increasing. Indeed, the hierarchical nature of nested Perl 6 subpatterns already ensures that this is not the case.
But even when there is no nesting of subpatterns it can be much more useful not to number all top-level subpatterns sequentially, as Perl 5 does:
    # Perl 5...
    #                 $1     $2    $3   $4       $5         $6
    $tune_up5 = qr/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
    #                 $7      $8     $9      $10      $11
                  | (every) (green) (BEM) (devours) (faces)
                  /x;
Specifically, there are significant advantages to numbering the subpatterns in each branch of an alternation (i.e. on either side of a C<|>) independently, restarting the numbering at the beginning of each branch. And this is precisely what Perl 6 does:
    # Perl 6...
    #                 $1     $2    $3   $4       $5         $6
    $tune_up6 = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
    #                 $1      $2     $3      $4       $5
                  | (every) (green) (BEM) (devours) (faces)
                  /;
In other words, unlike in Perl 5, in Perl 6 C<$1> doesn't represent the capture made by the first subpattern that appears in the rule; it represents the capture made by the first subpattern of whichever alternative actually matched.
And that is extremely useful because it means that the array inside C<$/> will not contain large numbers of leading C<undef> values corresponding to unmatched subpatterns from failed alternatives:
    # Perl 5...
    @captures = $EGBDF =~ $tune_up5;
    # @captures is assigned: ( (undef)x6, qw(every green BEM devours faces) )
Instead, only the "meaningful" subpattern captures are returned:
    # Perl 6...
    @captures = $EGBDF ~~ $tune_up6;
    # @captures is assigned: <every green BEM devours faces>
    # (no leading undefs)
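The Perl 5 side of this comparison can be reproduced with Python's standard C<re> module, which numbers groups across alternation branches just as Perl 5 does. The second half of the sketch only approximates the Perl 6 result by discarding unset groups; that shortcut is an assumption of this example and would also drop an optional group that legitimately failed to match.

```python
import re

tune_up5 = re.compile(
    r"(don't) (ray) (me) (for) (solar tea), (d'oh!)"
    r"|(every) (green) (BEM) (devours) (faces)"
)

m = tune_up5.fullmatch("every green BEM devours faces")

# Perl 5-style: one slot per group in the whole pattern, so the six
# groups of the failed first branch show up as leading None values.
perl5_style = m.groups()

# Rough stand-in for per-branch numbering: keep only groups that were set.
perl6_style = [g for g in perl5_style if g is not None]
```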
A more common example is likely to be a series of alternative commands:
    $cmd ~~ m:w/ (put)  (\S+) in   (\S+)
               | (get)  (\S+) from (\S+)
               | (save) (\S+) to   (\S+)
               / or next;
    ($cmd, $item, $location) = ($1, $2, $3);
Of course, the leading C<undef>s that Perl 5 would produce do convey (albeit awkwardly) which alternative actually matched. If that information is important, Perl 6 has several far cleaner ways to preserve it. For example:
    rule alt (Str $n) { {$/ = $n} }
    m/ <alt tea> (don't) (ray) (me) (for) (solar tea), (d'oh!)
     | <alt BEM> (every) (green) (BEM) (devours) (faces)
     /;
    if ($/) {
        given $<alt> {
            when 'tea' { say "I hate solar tea" }
            when 'BEM' { say "I love bug-eyed monsters" }
        }
    }
It's even possible to mimic the monotonic Perl 5 semantics. See L<Numbered scalar aliasing> below for details.
=head2 Subrule captures
Any call to a named rule within a pattern is known as a I<subrule>.
Any bracketed construct that is aliased (see L<Aliasing>) to a named variable is also a subrule.
For example, this rule contains three subrules:
    #  subrule      subrule     subrule
    #   __^__   _______^______   __^__
    #  |     | |              | |     |
    m/ <ident> $<spaces>:=(\s*) <digit>+ /
Just like subpatterns, each successfully matched subrule within a rule produces a C<Match> object. But, unlike subpatterns, that C<Match> object is assigned to an entry of a hash. Specifically, to an entry of the hash inside the C<Match> object corresponding to the innermost surrounding rule or subpattern. For example:
    # .... $/ .........................................
    # :                                               :
    # :             ........... $/[0] .............   :
    # :             :                             :   :
    # : $/<ident>   :       $/[0]<ident>          :   :
    # :   __^__     :           __^__             :   :
    # :  |     |    :          |     |            :   :
    m:w/ <ident> \: ( known as <ident> previously )? /
The hash entries of a C<Match> object are referred to using any of the standard hash access notations (C<$/{'foo'}>, C<< $/<bar> >>, C<$/«baz»>, etc.), or else via corresponding lexically scoped aliases (C<< $<foo> >>, C<$«bar»>, C<< $<baz> >>, etc.) So the previous example also implies:
    #   $<ident>              $1<ident>
    #     __^__                 __^__
    #    |     |               |     |
    m:w/ <ident> \: ( known as <ident> previously )? /
In other words, the hash elements of a rule's C<Match> object store nested C<Match> objects, each of which represents a substring matched-and-captured by a named subrule call (or by a capture that was aliased to a name using the C<< $<name>:= >> syntax). For example:
    if m/ <YYYY>-<MM>-<DD> $<ERA>:=(BCE?|AD|CE)?/ {
        ($year, $month, $day) = ($<YYYY>, $<MM>, $<DD>);
        $era = $<ERA> if $<ERA>;
        @indices = ($<YYYY>.from() .. $<DD>.to()-1);
    }
Note that it makes no difference whether the subrule is angle-bracketed (like C<< <YYYY> >>) or aliased (like C<< $<ERA>:= >>). The name's the thing.
=head2 Repeated captures of the same subrule
If a subrule appears two (or more) times in the same lexical scope within a rule (i.e. within the same subpattern and alternation), or if the subrule is quantified anywhere within the rule (except with C<?> or C<??>), then its corresponding hash entry no longer stores a C<Match> object.
Instead, just like a quantified subpattern, a directly quantified, indirectly quantified, or explicitly repeated subrule results in an array of C<Match> objects. Successive matches of the subrule (whether from separate calls, or from a quantified repetition) append their individual C<Match> objects to this array. For example, with two or more subrules of the same name, the corresponding hash entry contains a reference to an array, which in turn contains the individual C<Match> objects from each subrule match:
if m:w/ mv <file> <file> / { $from = $<file>[0]; $to = $<file>[1]; }
Likewise, with an indirectly quantified subrule:
if m:w/ mv [ <file> ]**{2} / { $from = $<file>[0]; $to = $<file>[1]; }
Likewise, with both repetition and quantification:
if m:w/ mv [ <file> ]+ <file> / { $to = pop @{$<file>}; @from = @{$<file>}; }
Note that it is always possible to determine statically whether a particular hash entry in C<$/> will be a scalar, or an array reference, simply by counting the number of occurrences of the subrule in each lexical scope.
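The static decision described here amounts to a simple count per scope. The Python sketch below is one way to picture it; the function name C<capture_kinds> and the C<(name, quantified)> input format are invented for this illustration, and "quantified" is taken to mean any direct or indirect quantifier other than C<?> or C<??>.

```python
from collections import Counter


def capture_kinds(subrules):
    """Classify each subrule name in ONE lexical scope as scalar or array.

    subrules: list of (name, quantified) pairs, in order of appearance.
    """
    counts = Counter(name for name, _ in subrules)
    quantified_names = {name for name, quantified in subrules if quantified}
    return {
        name: "array" if counts[name] > 1 or name in quantified_names
        else "scalar"
        for name in counts
    }


# m:w/ mv <file> <file> /       -> $<file> is an array
two_files = capture_kinds([("file", False), ("file", False)])
# m:w/ mv [ <file> ]+ <file> /  -> still an array
mixed = capture_kinds([("file", True), ("file", False)])
# m:w/ mv <file> <dir> /        -> both scalars
distinct = capture_kinds([("file", False), ("dir", False)])
```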
However, if a subrule is explicitly renamed (or aliased -- see L<Aliasing>), then only the "final" name counts when deciding whether it is or isn't repeated. For example:
rule dir := rule file;
    if m:w/ mv <file> <dir> / {
        # Only one occurrence of <file>, so scalar
        $from = $<file>;
        $to = $<dir>;
    }
Likewise, I<none> of the following constructions cause C<< <file> >> to produce an array of C<Match> objects, since in none of them are there two or more C<< <file> >> subrules in the same lexical scope:
    if m:w/ (keep) <file> | (toss) <file> / {
        # Each <file> is in a separate alternation, hence not
        # repeated in any one scope
        $action = $1;
        $target = $<file>;
    }
    if m:w/ <file> \: (<file>|none)? / {
        # Second <file> nested in subpattern, which confers different scope
        $actual = $/<file>;
        $virtual = $/[0]<file> if $/[0]<file>;
    }
On the other hand, unaliased square brackets don't confer a separate scope (because they don't have an associated C<Match> object). So:
    if m:w/ <file> \: [<file>|none]? / {
        # Second <file> in same scope
        $actual = $/<file>[0];
        $virtual = $/<file>[1] if $/<file>[1];
    }
=head2 Aliasing
Aliases can be named or numbered; may be scalar-, array-, or hash-like; and may be applied to either capturing or non-capturing constructs. The following sections explain the semantics of each of those dozen combinations.
=head3 Named scalar aliases applied to non-capturing brackets
If a named scalar alias is applied to a set of non-capturing brackets:
    #             ___/non-capturing brackets\__
    #            |                             |
    #            |                             |
    m:w/ $<key>:=[ (<[A-E]>) (\d**{3..6}) (X?) ] /;
then the corresponding entry in the rule's hash is assigned a C<Match> object whose:
=over
=item *
Boolean value is true,
=item *
Integer value is 1,
=item *
String value is the complete substring matched by the contents of the square brackets,
=item *
Array and hash are both empty.
=back
This last outcome (the empty hash and array) might be surprising, but it's a natural consequence of the fact that square brackets do not create a nested lexical scope, so any subpattern or subrule captures within the square brackets are in the rule's lexical scope, not in that of the alias. Consequently, any subpatterns or subrules in the square brackets still I<do> set the appropriate hash or array entries, but they set the appropriate hash or array entries of the rule's C<Match> object, not the C<Match> object of the alias.
That means, if the above example matches successfully:
=over
=item *
C<< $/<key> >> will contain the complete substring matched by the square brackets (in a C<Match> object, as described above),
=item *
C<< $/[0] >> will contain the A-E letter,
=item *
C<< $/[1] >> will contain the digits,
=item *
C<< $/[2] >> will contain the optional X.
=back
=head3 Named scalar aliasing to subpatterns
On the other hand, if a named scalar alias is applied to a set of I<capturing> parens:
    #             ______/capturing parens\_____
    #            |                             |
    #            |                             |
    m:w/ $<key>:=( (<[A-E]>) (\d**{3..6}) (X?) ) /;
then the capturing parens no longer capture into the array of the rule's C<Match> object (like unadorned parens would). Instead the aliased parens capture into the hash of the C<Match> object; specifically into the hash element whose key is the alias name.
So, in the above example, a successful match sets C<< $<key> >> (i.e. C<< $/<key> >>), but I<not> C<$1> (i.e. not C<< $/[0] >>).
Another way to think about it is that aliased parens create a kind of lexically scoped named subrule; that the contents of the brackets are treated as if they were part of a separate subrule whose name is the alias. That is, the above example is exactly equivalent to:
    rule key { (<[A-E]>) (\d**{3..6}) (X?) }
    m:w/ <key> /;
Specifically, after either version matches:
=over
=item *
C<< $/<key>[0] >> will contain the A-E letter (in a C<Match> object, of course),
=item *
C<< $/<key>[1] >> will contain the digits,
=item *
C<< $/<key>[2] >> will contain the optional X.
=back
Note that only aliased parens have this "on-the-fly-subrule" effect. Aliased square brackets (as explained in L<Named scalar aliases applied to non-capturing brackets>) only capture the substring the square brackets matched; any internal captures proceed exactly as they would if the alias were not there.
This can provide a handy optimization when calling a subrule. If only the complete substring to be matched is of interest, rather than the full hierarchical capture information, then a pattern like:
m/ <XML_file> /
(which presumably does a large amount of hierarchical capturing and returns a very complex set of nested C<Match> objects), could be rewritten:
m/ $<XML_str>:=[«XML_file»] /
instead. Here the C<< <XML_file> >> subrule is called using double angle brackets (C<«XML_file»>), which calls it as a non-capturing subrule. It still matches the same substring, of course, which is then captured by the C<< $<XML_str>:= >> alias.
Note too that, because a subrule call like C<«XML_file»> is a bracketed non-capturing construct, it obeys the rules for C<[...]> (as described in L<Named scalar aliases applied to non-capturing brackets>), so the above optimization could just be written:
m/ $<XML_str>:=«XML_file» /
=head3 Named scalar aliasing to subrules
An unaliased capturing subrule assigns its C<Match> object to the hash entry whose key is the name of the subrule:
    if m:w/ ID\: <ident> / {
        say "Identified as $/<ident>";
    }
But if a subrule is aliased, it assigns its C<Match> object to the hash entry whose key is the name of the alias instead. And, more importantly, it I<doesn't> assign anything to the hash entry whose key is the subrule name. That is:
    if m:w/ ID\: $<id>:=<ident> / {
        say "Identified as $/<id>";    # and $/<ident> is undefined
    }
Hence aliasing a subrule I<changes> the destination of the subrule's C<Match> object. This is particularly useful for differentiating two or more calls to the same subrule in the same scope. For example:
if m:w/ mv <file> $<dir>:=<file> / { $from = $<file>; $to = $<dir>; }
In this example, the final match of the C<< <file> >> subrule is not appended onto an array in C<< $/<file> >>, but is assigned to the hash element corresponding to the alias name: C<< $/<dir> >>.
=head3 Numbered scalar aliasing
If a numbered alias is used instead of a named alias:
m/ $2:=(<-[:]>*) \: $1:=<ident> /
the behaviour is exactly the same as for a named alias, except that the resulting C<Match> object is assigned to the corresponding element of the appropriate array, rather than to an element of the hash.
For example:
    m:w/ $1:=[ (<[A-E]>) (\d**{3..6}) (X?) ] /;
    # $/[0] contains a match object storing the complete substring
    # matched by the square brackets
    m:w/ $2:=( (<[A-E]>) (\d**{3..6}) (X?) ) /;
    # $/[1] contains the match object returned by the outer subpattern
    if m:w/ ID\: $3:=<ident> / {
        say "Identified as $3";    # and $/<ident> is undefined
    }
The only additional behaviour is that, if any numbered alias is used, the numbering of subsequent unaliased subpatterns in the same scope automatically increments from that alias number (much like enum values increment from the last explicit value). That is:
    #   ---$2---   -$3-   ---$7---   -$8-
    #  |        | |    | |        | |    |
    m/ $2:=(food) (bard) $7:=(bazd) (quxd) /;
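The enum-like renumbering rule can be stated in a few lines. The following Python sketch is illustrative only: C<number_captures> is a hypothetical helper, and its input (one entry per subpattern in a single scope, with C<None> for unaliased subpatterns) is an assumed encoding rather than anything the rule engine exposes.

```python
def number_captures(aliases):
    """Return the ordinal ($1, $2, ...) assigned to each subpattern.

    aliases: one entry per subpattern in a scope, left to right.
    An int is an explicit numbered alias ($N:=...); None means unaliased.
    """
    numbers = []
    next_number = 1
    for alias in aliases:
        number = alias if alias is not None else next_number
        numbers.append(number)
        next_number = number + 1    # later unaliased captures count on from here
    return numbers


# m/ $2:=(food) (bard) $7:=(bazd) (quxd) /
example = number_captures([2, None, 7, None])
# With no aliases at all the numbering is the plain 1, 2, 3...
plain = number_captures([None, None, None])
```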
This behaviour is particularly useful for reinstituting Perl 5 semantics for consecutive subpattern numbering in alternations:
    $tune_up6 = rx/ (don't) (ray) (me) (for) (solar tea), (d'oh!)
                  | $7:=(every) (green) (BEM) (devours) (faces)
                  #               $8     $9      $10      $11
                  /;
It also provides an easy way in Perl 6 to reinstitute the unnested numbering semantics of nested Perl 5 subpatterns:
    # Perl 5...
    #                $1
    #   _____________/\______________
    #  |    $2          $3       $4  |
    #  |  __/\___   ____/\____   /\  |
    #  | |       | |          | |  | |
    m/ ( (<[A-E]>) (\d**{3..6}) (X?) ) /;
    # Perl 6...
    #                $1
    #   _____________/\______________
    #  |  $1[0]       $1[1]    $1[2] |
    #  |  __/\___   ____/\____   /\  |
    #  | |       | |          | |  | |
    m/ ( (<[A-E]>) (\d**{3..6}) (X?) ) /;
    # Perl 6 simulating Perl 5...
    #                  $1
    #   _______________/\________________
    #  |        $2          $3       $4  |
    #  |      __/\___   ____/\____   /\  |
    #  |     |       | |          | |  | |
    m/ $1:=[ (<[A-E]>) (\d**{3..6}) (X?) ] /;
The non-capturing brackets don't introduce a scope, so the subpatterns within them are at rule scope, and hence numbered at the top level. Aliasing the square brackets to C<$1> means that the next subpattern at the same level (i.e. the C<< (<[A-E]>) >>) is numbered sequentially (i.e. C<$2>), etc.
=head3 Scalar aliases applied to quantified constructs
All of the above semantics apply equally to aliases which are applied to quantified structures. The only difference is that, if the aliased construct is a subrule or subpattern, that quantified subrule or subpattern will have returned an array of C<Match> objects (as described in L<Quantified subpattern captures> and L<Repeated captures of the same subrule>). So the corresponding array element or hash entry for the alias will contain an array reference instead of a single C<Match> object. Hence aliasing and quantification are completely orthogonal.
For example:
    if m/ mv $<from>:=<file>+ / {
        # <file>+ returns an array of Match objects,
        # so $/<from> contains array of Match objects,
        # one for each successful call to <file>
        # $/<file> does not exist (pre-empted by the alias)
    }
    if m/ mv $<from>:=(\S+ \s+)+ / {
        # Quantified subpattern returns an array of Match objects, so
        # $/<from> contains array of Match objects,
        # one for each successful match of the subpattern
        # $/[0] does not exist (pre-empted by the alias)
    }
A set of quantified I<non-capturing> brackets always returns a single C<Match> object which contains only the complete substring that was matched by the full set of repetitions of the brackets (as described in L<Named scalar aliases applied to non-capturing brackets>).
So, if an alias is applied to a set of quantified I<non-capturing> brackets, the corresponding array element or hash entry for the alias will be assigned that single C<Match> object. For example:
"coffee fifo fumble" ~~ m/ .*? $<effs>:=[f <-[f]>**{1..2} \s*]+ /;
say $<effs>; # prints "fee fifo fum"
=head3 Array aliasing
An alias can also be specified using an array as the alias instead of a scalar. For example:
m/ mv @<from>:=[(\S+) \s+]* <dir> /;
Using the C<< @<alias>:= >> notation instead of C<< $<alias>:= >> has several effects. The first is that the corresponding hash entry or array element I<always> receives an array of C<Match> objects, even if the construct being aliased would normally return a single C<Match> object. That is:
    m/ $<names>:=<ident> /;    # $/<names> assigned a single Match object
    m/ @<names>:=<ident> /;    # $/<names> assigned an array which contains
                               # a single Match object
This is useful for creating consistent capture semantics across structurally different alternations (by enforcing array captures in all branches):
    m:w/ Mr?s? $<names>:=<ident> W\. $<names>:=<ident>
       | Mr?s? @<names>:=<ident>
       /;
    say "name: @{$<names>}";
If an array alias is applied to a quantified pair of non-capturing brackets, it captures the substrings matched by each repetition of the brackets into separate elements of the corresponding array. That is:
    m/ mv $<files>:=[ f.. \s* ]* /;    # $<files> assigned a single Match
                                       # object containing the
                                       # complete substring matched by
                                       # the full set of repetitions
                                       # of the non-capturing brackets
    m/ mv @<files>:=[ f.. \s* ]* /;    # $<files> assigned an array, each
                                       # element of which is a Match
                                       # object containing the substring
                                       # matched by Nth repetition of
                                       # the non-capturing bracket match
If an array alias is applied to a quantified pair of capturing parens (i.e. to a subpattern), then the corresponding hash or array element is assigned a list constructed by concatenating the array values of each C<Match> object returned by one repetition of the subpattern. That is, an array alias on a subpattern flattens and collects all nested subpattern captures within the aliased subpattern. For example:
    if m:w/ $<pairs>:=( (\w+) \: (\N+) )+ / {
        # Scalar alias, so $/<pairs> contains an array of Match objects,
        # each of which has its own array of two subcaptures...
        for @{$<pairs>} -> $pair {
            say "Key: $pair[0]";
            say "Val: $pair[1]";
        }
    }
    if m:w/ @<pairs>:=( (\w+) \: (\N+) )+ / {
        # Array alias, so $/<pairs> contains an array of Match objects,
        # each of which is one of the two subcaptures within the
        # subpattern, all flattened back into the outer array...
        for @{$<pairs>} -> $key, $val {
            say "Key: $key";
            say "Val: $val";
        }
    }
Likewise, if an array alias is applied to a quantified subrule, then the hash or array element corresponding to the alias is assigned a list containing the array values of each C<Match> object returned by each repetition of the subrule, all flattened into a single array. That is, an array alias on a subrule flattens and collects all the subpattern captures that occurred within the aliased subrule. For example:
rule pair :w { (\w+) \: (\N+) }
    if m:w/ $<pairs>:=<pair>+ / {
        # Scalar alias, so $/<pairs> contains an array of Match objects,
        # each of which is the result of the <pair> subrule call...
        for @{$<pairs>} -> $pair {
            say "Key: $pair[0]";
            say "Val: $pair[1]";
        }
    }
    if m:w/ @<pairs>:=<pair>+ / {
        # Array alias, so $/<pairs> contains an array of Match objects,
        # each of which is one of the captures that occurred within the
        # subrule, flattened back into the outer array...
        for @{$<pairs>} -> $key, $val {
            say "Key: $key";
            say "Val: $val";
        }
    }
In other words, an array alias is useful to flatten into a single array any nested captures that might occur within a repeated subpattern or subrule, whereas a scalar alias is useful to preserve (within a top-level array) the internal structure of each repetition.
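The difference between the two alias kinds is purely one of shape, which the following Python sketch tries to show with plain lists. The sample captures are invented, and the lists stand in for arrays of C<Match> objects; nothing here reflects how a rule engine would actually store them.

```python
# Captures from two repetitions of ( (\w+) \: (\N+) ): two subcaptures each.
repetitions = [["name", "Damian"], ["lang", "Perl 6"]]

# Scalar alias ($<pairs>:=...): one entry per repetition, structure preserved.
scalar_alias = [list(rep) for rep in repetitions]

# Array alias (@<pairs>:=...): all subcaptures flattened into one list.
array_alias = [capture for rep in repetitions for capture in rep]

# The flattened form is then walked two items at a time, in the spirit of
# a loop that takes $key and $val together on each pass.
pairs = list(zip(array_alias[0::2], array_alias[1::2]))
```

With the scalar form, the second key is reached as C<scalar_alias[1][0]>; with the array form it is simply the third element of the flat list.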
Note that, outside a rule, C<< @<foo> >> is simply a shorthand for C<< @{$<foo>} >>, so the above C<for> loop could also have been written:
    for @<pairs> -> $key, $val {
        say "Key: $key";
        say "Val: $val";
    }
It is also possible to use a numbered variable as an array alias. The semantics are exactly as described above, with the sole difference being that the resulting array of C<Match> objects is assigned into the appropriate element of the rule's match array, rather than to a key of its match hash. For example:
    if m/ mv \s+ @1:=((\w+) \s+)+ $2:=(\w+) / {
        #        |                |
        #        |                \___ Scalar alias, so $2 as normal
        #        |
        #        \___ Array alias, so $1 assigned a flattened array
        #             of just the (\w+) captures from each repetition
        @from = @{$1};
        $to = $2;
    }
Note that, outside a rule, C<@1> is simply a shorthand for C<@{$1}>, so the first assignment above could also have been written:
@from = @1;
=head3 Hash aliasing
An alias can also be specified using a hash as the alias variable, instead of a scalar or an array. For example:
m:w/ mv %<location>:=( (<ident>) \: (\N+) )+ /;
A hash alias causes the corresponding hash or array element in the current scope's C<Match> object to be assigned a hash (rather than an array or a single C<Match> object).
A hash alias cannot be applied to a quantified pair of non-capturing brackets. Attempting to do so is a compile-time detectable error.
If a hash alias is applied to a pair of capturing parens (i.e. to a subpattern), then the corresponding hash or array element is assigned a hash. Each entry in that hash is constructed as follows:
=over
=item 1.
If the subpattern was unquantified, take the single C<Match> object it returns and place it in an array. If the subpattern was quantified, take the array of C<Match> objects it returns. Then, for each C<Match> object in the array...
=over
=item 1a.
Evaluate that C<Match> object as an array to produce a list.
=item 1b.
Use the first element of the list as the next key.
=item 1c.
Use the remaining element(s) of the list as the corresponding value(s). If there are no remaining elements, the value is C<undef>. If there is one remaining element, the value is that element. If there are two or more remaining elements, the value is a reference to an array containing those elements.
=back
=back
In other words, if a hash alias is applied to a subpattern, the first pair of capturing parens within the subpattern provides the hash keys, and the remaining capturing parens (if any) provide the corresponding values. If the subpattern is unquantified then the resulting hash will have only a single key; if the subpattern is quantified, the hash may have multiple keys. For example:
    #                    key      val
    #                    _^_      _^_
    #                   |   |    |   |
    if m:w/ %<pairs>:=( (\w+) \: (\N+) )+ / {

        # Hash alias, so $/<pairs> contains a hash, in which each key is
        # provided by the first subcapture and each value is provided by
        # the second...

        for %{$/<pairs>} -> $pair {   # Hash in list context produces pairs
            say "Key: $pair.key";
            say "Val: $pair.value";
        }
    }
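The construction steps listed above can be modelled as follows. This is a minimal Python sketch rather than Perl 6; the name C<hash_alias> and the use of plain lists of strings in place of C<Match> objects are assumptions, and the sketch also assumes every match contains at least one nested capture.

```python
def hash_alias(matches):
    """Build the hash produced by a hash alias.

    `matches` is a hypothetical list with one entry per repetition of
    the aliased subpattern; each entry is the list of that repetition's
    nested captures.
    """
    result = {}
    for captures in matches:              # step 1: each Match object in turn
        items = list(captures)            # step 1a: evaluate it as a list
        key, rest = items[0], items[1:]   # step 1b: first element is the key
        if not rest:                      # step 1c: choose the value's shape
            value = None                  #   no remaining elements: undef
        elif len(rest) == 1:
            value = rest[0]               #   one remaining element: itself
        else:
            value = rest                  #   two or more: an array of them
        result[key] = value
    return result


pairs = hash_alias([["name", "Damian"], ["role", "designer"]])
synonyms = hash_alias([["big", "large", "huge", "vast"]])
```

An unquantified subpattern corresponds to a one-element C<matches> list, which is why it yields a hash with a single key.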
If there are three or more captures within the aliased subpattern, the second and subsequent captures are converted to an array:
    #                       key    val[0] val[1] val[2]
    #                       _^_      _^_    _^_    _^_
    #                      |   |    |   |  |   |  |   |
    if m:w/ %<synonyms>:=( (\w+) \: (\S+)  (\S+)  (\S+) )+ / {

        # $/<synonyms> contains a hash, in which each key is provided by
        # the first subcapture and each value is an array containing the
        # second, third, and fourth subcaptures...

        for %{$/<synonyms>} -> $syn {
            say "Key: $syn.key";
            say "Vals: @{$syn.value}";
        }
    }
Note that, outside a rule, C<< %<foo> >> is a shortcut for C<< %{$/<foo>} >>, so the previous C<for> loop could equally well have been written:
    for %<synonyms> -> $syn {
        say "Key: $syn.key";
        say "Vals: @{$syn.value}";
    }
If a hash alias is applied to a subrule, then the corresponding hash or array element is once again assigned a hash. Each entry in that hash is constructed in exactly the same way as for a hash-aliased subpattern.
That is, the first subpattern capture within the subrule is used as each key, and the remaining subpattern captures are used as the corresponding values. For example:
    rule one_to_one :w { (\w+) \: (\N+) }

    if m:w/ %<pairs>:=<one_to_one>+ / {

        # Hash alias, so $/<pairs> contains a hash, in which each key is
        # provided by the first subcapture in <one_to_one> and each
        # value is provided by the second subcapture within the
        # subrule...

        for %<pairs> -> $pair {
            say "One: $pair.key";
            say "One: $pair.value";
        }
    }
Likewise, if the subrule captures more than two subpatterns:
    rule one_to_many :w { (\w+) \: (\S+) (\S+) (\S+) }

    if m:w/ %<synonyms>:=<one_to_many>+ / {

        # Hash alias, so $/<synonyms> contains a hash, in which each key
        # is provided by the first subcapture within <one_to_many>, and
        # each value is an array containing the subrule's second, third,
        # and fourth subcaptures...

        for %<synonyms> -> $pair {
            say "One: $pair.key";
            say "Many: @{$pair.value}";
        }
    }
As with array aliases, it is also possible to use a numbered variable as a hash alias. Once again, the only difference is where the resulting hash is stored:
    rule one_to_many :w { (\w+) \: (\S+) (\S+) (\S+) }

    if m:w/ %1:=<one_to_many>+ / {

        # $/[0] contains a hash, in which each key is provided by the
        # first subcapture within <one_to_many>, and each value is an
        # array containing the subrule's second, third, and fourth
        # subcaptures...

        for %{$/[0]} -> $pair {
            say "One: $pair.key";
            say "Many: @{$pair.value}";
        }
    }
And, of course, outside the rule, C<%1> is a shortcut for C<%{$1}>:
    for %1 -> $pair {
        say "One: $pair.key";
        say "Many: @{$pair.value}";
    }
=head3 External aliasing
As a final alternative, instead of using internal aliases like:
m/ mv @<files>:=<ident>+ $<dir>:=<ident> /
the name of an ordinary variable can be used as an "external alias", like so:
m/ mv @files:=<ident>+ $dir:=<ident> /
In this case, the behaviour of each alias is exactly as described in the previous sections, except that the resulting capture(s) are assigned directly to the variables of the specified name that exist in the scope in which the rule is declared. For example:
if m/ mv @files:=[ <ident> ]+ $dir:=<ident> / { say "From: @files"; say " To: $dir"; }
Note that, because they bind statically to variables in the I<declaration> scope, not dynamically to variables in the I<calling> scope, external aliases are generally best used only in ad hoc pattern matches like the one shown above. It is generally a Very Bad Idea to use external aliases in a named rule. That's because, if that rule is subsequently used as a subrule within a pattern match, the external aliases will assign to variables in the scope where the rule was I<declared>, not the scope in which it was I<used> as a subrule. For example:
grammar Shell::Commands { rule mv { mv @files:=[ <ident> ]+ $dir:=<ident> } }
    if m/<Shell::Commands.mv>/ {
        say "From: @files";    # Bzzzt! @Shell::Commands::files was set
        say "  To: $dir";      # Bzzzt! $Shell::Commands::dir was set
    }
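The static-binding pitfall is the same one that any lexical closure exhibits. As a loose analogy only (Python, not Perl 6; C<make_mv_rule>, C<mv>, and C<caller> are hypothetical names invented for this sketch), a function that writes to a variable of its defining scope never touches a same-named variable in the scope that calls it:

```python
def make_mv_rule():
    files = []                      # variable in the *declaration* scope

    def mv(words):
        # Stand-in for the external alias @files:=[ <ident> ]+ ;
        # it writes to the declaring scope's `files`, not the caller's.
        files[:] = words[1:-1]
        return True

    return mv, files


def caller():
    files = ["untouched"]           # a different variable, same name
    mv, declared_files = make_mv_rule()
    mv("mv a b dest".split())
    return files, declared_files


visible_to_caller, set_in_declaring_scope = caller()
```

After the call, C<visible_to_caller> still holds its original contents, while the captured words have landed in the list owned by the declaring scope.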
Internal aliases are a far better choice in such cases, unless you truly want the subtle cross-scoping effect that is achieved:
grammar Shell::Commands {
my $lastcmd;
rule cmd { $/:=<mv> | $/:=<cp> }
rule mv { $lastcmd:=(mv) $<files>:=[ <ident> ]+ $<dir>:=<ident> } rule cp { $lastcmd:=(cp) $<files>:=[ <ident> ]+ $<dir>:=<ident> }
sub lastcmd { return $lastcmd } }
while shift ~~ m/<Shell::Commands.cmd>/ { say "From: @{$<files>}"; say " To: $<dir>"; }
say "Final command was { Shell::Commands::lastcmd() }";
=head2 The C<:parsetree> flag
Normally, subrule calls capture by name to a hash entry of the scope's C<Match> object, whilst subpatterns capture positionally to that object's array element. Usually that's sufficient, since most coders only want to access captures either sequentially (in which case they use subpatterns) or symbolically (in which case they use subrules).
But a small number of implementers (predominantly the writers of compilers, translators, code browsers, refactoring tools, etc.) need to know both the order in which parts of a rule match I<and> the symbolic names of those parts.
To support that, Perl 6 rules and matches can be specified with a special flag: C<:parsetree>. Under this flag the capture behaviour of both subpatterns and subrules alters from that described in the preceding sections.
Under C<:parsetree> the C<Match> objects generated by successful subpatterns are still captured into the array of the surrounding scope's C<Match> object, but now those objects are not direct instances of class C<Match>. Instead, they are blessed into a class derived from C<Match>: C<Match::Subpattern>. For example:
    if ( m:parsetree/ (Volume\:) (\d+) / ) {
        for @{$/}.kv -> $i, $cap {
            given $cap {
                when Match::Subpattern {
                    say "Node $i is a subpattern.";
                    say "It captured: '$cap'";
                }
            }
            say "";
        }
    }
which might print:
    Node 0 is a subpattern.
    It captured: 'Volume:'

    Node 1 is a subpattern.
    It captured: '11'
Under C<:parsetree>, the behaviour of subrules is changed even more drastically. The C<Match> objects generated by successful subrules are no longer assigned into the hash of the surrounding scope's C<Match> object. Instead, they are appended (like subpatterns) onto the array of the surrounding scope's C<Match> object.
Moreover, the C<:parsetree> flag overrides the exemption of C<< «name» >> subrule calls, so they act as if they were C<< <name> >> calls instead. They generate C<Match> objects, and those objects are also appended onto the surrounding scope's C<Match> array.
This is true even for automagically inserted non-capturing subrules, such as the C<«ws»> calls inserted by the C<:words> flag.
In addition, each C<Match> object returned by a subrule is now blessed into a class derived from the C<Match::Subrule> class (which itself is derived from the C<Match> class). The actual name of the class into which each subrule's C<Match::Subrule> object is blessed is the same as the name of the subrule call that generated it.
So, for example:
if ( m:w:parsetree/ <label> <ident>/ ) { for @{$/}.kv -> $i, $cap { given $cap { when Match::Subrule { say "Node $i is a subrule named '$cap.class()'."; say "It captured: '$cap'"; } } say ""; } }
might print something like:
    Node 0 is a subrule named 'ws'.
    It captured: ''

    Node 1 is a subrule named 'label'.
    It captured: 'From:'

    Node 2 is a subrule named 'ws'.
    It captured: ' '

    Node 3 is a subrule named 'ident'.
    It captured: 'postmaster'
Note that, if a rule contains both subpattern and subrule captures, they will be interleaved in the order in which they appear in the input, and can be dealt with polymorphically. For example:
if ( m:w:parsetree/ (From\:) <ident>([EMAIL PROTECTED])/ ) { for @{$/}.kv -> $i, $cap { given ($cap) { when Match::Subrule { say "Node $i is a subrule named '$cap.class()'."; say "It captured: '$cap'"; } when Match::Subpattern { say "Node $i is a subpattern."; say "It captured: '$cap'"; } } say ""; } }
which might print:
    Node 0 is a subrule named 'ws'.
    It captured: ''

    Node 1 is a subpattern.
    It captured: 'From:'

    Node 2 is a subrule named 'ws'.
    It captured: ' '

    Node 3 is a subrule named 'ident'.
    It captured: 'postmaster'

    Node 4 is a subpattern.
    It captured: '@perl.org'
Better still, because each C<Match>-derived object is blessed into a particular class related to the subpattern or rule that created it, it's easy to create handlers in those classes and make the processing fully polymorphic (and far more specific):
method Match::Subpattern::describe ($self: $index) { say "Node $index is a subpattern that matched: '$self'"; }
method ws::describe ($self: $index) { say "Node $index is the whitespace: '$self'"; }
method ident::describe ($self: $index) { say "Node $index is the identifier: '$self'"; }
if ( m:w:parsetree/ (From\:) <ident>([EMAIL PROTECTED])/ ) { my $i = 0; .describe($i++) for @{$/}; }
which might then print:
    Node 0 is the whitespace: ''
    Node 1 is a subpattern that matched: 'From:'
    Node 2 is the whitespace: ' '
    Node 3 is the identifier: 'postmaster'
    Node 4 is a subpattern that matched: '@perl.org'
One final feature of the C<:parsetree> flag is that it automatically propagates to every subrule that a C<:parsetree>'d rule calls. And, from there, recursively into any subrules that those subrules call. Et cetera. Note that this will almost certainly require a one-time recompilation of those subrules, unless they had originally been specified with C<:parsetree> themselves, but that will be entirely transparent to the user.
This propagation of the C<:parsetree> flag means that the C<Match> objects returned by subrules will contain arrays with the same linearized, objectified contents. Effectively, a C<:parsetree>'d rule will return an array of arrays of arrays etc. corresponding to the hierarchical structure of the data that the rule matched.
Which opens up the possibility of processing that data both polymorphically I<and> hierarchically. For example, if we added:
    # Factor out the ugly mail address matching...
    rule mailaddr { <ident> \@ (\S+) }

    # And specify how to describe the resulting data structure...
    method mailaddr::describe ($self: $index) {
        say "Node $index is a mail address, which consists of:";
        my $subindex = 0;
        temp wrap say { call "\t", @_ };   # Indent when describing the bits...
        .describe($index~'.'~$subindex++) for @{$self};
    }
then we could update our original pattern match:
if ( m:w:parsetree/ (From\:) <mailaddr>/ ) { my $i = 0; .describe($i++) for @{$/}; }
The resulting syntax tree would now describe itself hierarchically:
    Node 0 is the whitespace: ''
    Node 1 is a subpattern that matched: 'From:'
    Node 2 is the whitespace: ' '
    Node 3 is a mail address, which consists of:
        Node 3.0 is the identifier: 'postmaster'
        Node 3.1 is a subpattern that matched: 'perl.org'
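The combination of polymorphic and hierarchical processing can be mimicked in any object-oriented language. The sketch below is Python, not Perl 6, and is an illustration only: the class names C<Subpattern>, C<Ws>, C<Ident>, and C<MailAddr>, the C<describe_tree> helper, and the hand-built tree are all assumptions standing in for what a C<:parsetree> match might return. Sub-nodes are numbered from 0 here.

```python
class Match:
    """Base node: the matched text plus nested captures, in match order."""
    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)


class Subpattern(Match):
    def describe(self, index, indent=""):
        return [f"{indent}Node {index} is a subpattern that matched: '{self.text}'"]


class Ws(Match):
    def describe(self, index, indent=""):
        return [f"{indent}Node {index} is the whitespace: '{self.text}'"]


class Ident(Match):
    def describe(self, index, indent=""):
        return [f"{indent}Node {index} is the identifier: '{self.text}'"]


class MailAddr(Match):
    def describe(self, index, indent=""):
        lines = [f"{indent}Node {index} is a mail address, which consists of:"]
        # Recurse into the nested captures, indenting one level deeper.
        for subindex, child in enumerate(self.children):
            lines += child.describe(f"{index}.{subindex}", indent + "\t")
        return lines


def describe_tree(nodes):
    lines = []
    for index, node in enumerate(nodes):
        lines += node.describe(index)
    return lines


# Hand-built stand-in for the tree a :parsetree match might produce.
tree = [
    Ws(""),
    Subpattern("From:"),
    Ws(" "),
    MailAddr("postmaster@perl.org",
             [Ident("postmaster"), Subpattern("perl.org")]),
]
report = describe_tree(tree)
print("\n".join(report))
```

Each node type knows only how to describe itself; the container type is the only one that needs to know about nesting.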
=head2 Capturing from non-singular matches
=head3 Matching under the C<:x> and C<:g> flags
When an entire rule is successfully matched with repetitions (specified via the C<:x> and C<:g> flags), it often produces a series of distinct matches.
However, a successful match under these flags still returns a single C<Match> object in C<$/>. But the values of this match object are slightly different from those of a "one-ping-only" match:
=over
=item *
The boolean value of C<$/> after such matches is true or false, depending on whether the pattern matched at all.
=item *
The integer value is the number of times the pattern matched.
=item *
The string value is the substring from the start of the first match to the end of the last match (I<including> any intervening parts of the string that the rule skipped over in order to find later matches).
=item *
There are no array contents or hash entries.
=back
For example:
    if $text ~~ m:words:globally/ (\S+:) <rocks> / {
        say "Matched {+$/} times";

        say 'Full match context is:';
        say $/;
    }
The list of individual match objects corresponding to each separate match is also available via the C<.matches> method. For example:
if $text ~~ m:words:globally/ (\S+:) <rocks> / { for $/.matches -> $m { say "Match between $m.from() and { $m.to()-1 }"; say 'Right on, dude!' if $m[0] eq 'Perl'; say "Rocks like $m<rocks>"; } }
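The aggregate values of a repeated match can be modelled with an ordinary regex engine. The following Python sketch uses the standard C<re> module; the function name C<global_match> and the dictionary keys are assumptions chosen for illustration, not part of the proposed C<Match> interface.

```python
import re


def global_match(pattern, text):
    """Model the top-level values of a :g-style match as a plain dict."""
    matches = list(re.finditer(pattern, text))
    if not matches:
        # A failed match: false, zero, empty string, no sub-matches.
        return {"bool": False, "count": 0, "string": "", "matches": []}
    start = matches[0].start()      # start of the first match
    end = matches[-1].end()         # end of the last match
    return {
        "bool": True,
        "count": len(matches),      # integer value: number of matches
        "string": text[start:end],  # includes the skipped-over parts
        "matches": matches,         # analogue of the .matches method
    }


result = global_match(r"\d+", "a 12 b 345 c")
spans = [(m.start(), m.end() - 1) for m in result["matches"]]
```

Note that the string value spans the gap between the two matches, including the text that was skipped over to find the second one.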
=head3 Matching under the C<:overlap> and C<:exhaustive> flags
Unlike the multiple matches of the C<:x> and C<:g> flags, success under the C<:overlap> and C<:exhaustive> flags doesn't necessarily produce a sequence of disjoint matches, but rather a disjunction of alternative matches.
A successful match under the C<:overlap> or C<:exhaustive> flags still returns a single C<Match> object in C<$/> (all matches do) and the C<.matches> method of this object still returns all the distinct C<Match> objects for each alternative match (in the order the matches were found).
But the values of the top-level C<Match> object returned by an overlapping or exhaustive match are unusual:
=over
=item *
The boolean value of C<$/> after such matches is true or false, depending on whether the pattern matched at all.
=item *
The integer value is the number of distinct ways in which the pattern matched.
=item *
The string value is a disjunction of all the distinct matches.
=item *
The array contents are a list of disjunctions of all the corresponding unnamed captures from all the distinct matches. That is, C<$1> is a disjunction of the C<$1> value of each of the successful matches that sets a C<$1>.
=item *
The hash values are disjunctions of all the corresponding named captures from all the distinct matches. That is, C<< $<foo> >> is a disjunction of the C<< $<foo> >> value of each of the successful matches that sets a C<< $<foo> >>.
=back
For example:
    if $text ~~ m:words:exhaustive/ (\S+:) <rocks> / {
        say "Matched {+$/} different ways";

        say 'Right on, dude!' if $1 eq 'Perl';   # Disjunctive match against
                                                 # all possible $1's from
                                                 # any of the exhaustive matches

        say 'Found these variations on "rocks":';
        say for $<rocks>.values;     # List all possible substrings
                                     # successfully matched by <rocks>
                                     # in any of the exhaustive matches
    }
As mentioned above, the individual match objects for each alternative match are also available (in canonical order) via the C<.matches> method. For example:
    if $text ~~ m:words:exhaustive/ (\S+:) <rocks> / {
        for $/.matches -> $m {
            say 'Right on, dude!' if $m[0] eq 'Perl';   # Normal match against
                                                        # match $m's $1

            say "Rocks like $m<rocks>";     # Substring matched by <rocks>
                                            # in match $m
        }
    }
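To make the "disjunction of alternative matches" idea concrete, here is a brute-force Python model using the standard C<re> module. It is a sketch under stated assumptions: C<exhaustive_matches> and C<any_capture> are hypothetical names, a Python set stands in for the disjunction, zero-length matches are ignored, and the enumeration order (by start position, then by length) is a choice made for this sketch.

```python
import re


def exhaustive_matches(pattern, text):
    """Every non-empty substring of `text` that `pattern` matches in full."""
    compiled = re.compile(pattern)
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            m = compiled.fullmatch(text, start, end)
            if m:
                found.append(m)
    return found


def any_capture(matches, group):
    """Set of the values a capture took in any match that set it."""
    return {m.group(group) for m in matches if m.group(group) is not None}


ways = exhaustive_matches(r"(a+)b", "aab")
first_captures = any_capture(ways, 1)

# The disjunctive comparison  $1 eq 'a'  becomes a membership test:
matched_a = "a" in first_captures
```

The individual entries of C<ways> correspond to what C<.matches> returns, and comparing against one of them is an ordinary, non-disjunctive comparison.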
=head2 Executive summary of proposed changes
=over
=item *
Angles create subrules, which return a C<Match> object that is captured into the hash of their surrounding scope's C<Match> object.
=item *
Parens create subpatterns, which return a C<Match> object that is captured into the array of their surrounding scope's C<Match> object.
=item *
A subpattern is like an inlined subrule (except that it captures into an array, rather than a hash).
=item *
Subpatterns nest lexically, and the captures they return are likewise hierarchical.
=item *
The number associated with a subpattern reflects its ordinal position in its immediately surrounding scope, not its ordinal position in the overall rule. As a result, these numbers are hierarchical, rather than linear.
=item *
Quantifiers (except C<?> and C<??>) cause a matched subrule or subpattern to return an array of C<Match> objects, instead of just a single object.
=item *
Two or more calls to the same subrule or subpattern in the same lexical scope also cause the matched subrules/subpatterns to accumulate their C<Match> objects in an array.
=item *
Scalar aliases rename or renumber the construct they're applied to, changing the location in which the construct's C<Match> object is stored, but not its capturing semantics.
=item *
Array aliases rename or renumber the construct they're applied to, and also cause its corresponding C<Match> object(s) always to be returned in an array.
=item *
The elements of that array are a flattened list of the C<Match> objects returned by the subpatterns nested inside the aliased construct.
=item *
Hash aliases rename or renumber the construct they're applied to, and also cause its corresponding C<Match> object(s) always to be returned in a hash.
=item *
The keys of this hash are C<Match> objects returned by the first subpattern nested inside the aliased construct. The values are the C<Match> objects returned by the remaining nested subpatterns.
=item *
The C<:parsetree> flag modifies capture semantics to preserve the parse sequence, the identity information, and the hierarchical structure of captures, whilst also supporting object-oriented processing of the resulting parse tree.
=back