In perl.git, the branch blead has been updated <http://perl5.git.perl.org/perl.git/commitdiff/05a0cace6c9b2af166c73b35ba5391b2f731cdb4?hp=52b4b0e0361aa3b508faa5976f08af4626cd49ae>
- Log -----------------------------------------------------------------
commit 05a0cace6c9b2af166c73b35ba5391b2f731cdb4
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 14:01:05 2017 -0700

    perlrebackslash: Clarify

    "Character class for non vertical whitespace." wasn't meant to mean
    match whitespace that isn't vertical.

M pod/perlrebackslash.pod

commit c57a33501ea6e7d9081f2636a44c071cb0588d91
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 13:50:00 2017 -0700

    perlre: Revamp portions

    This commit folds in the after-thought section on Version 8 regexes
    into the rest of the document, making most of it part of a gentler
    "Basics" section. Some redundancies from the auxiliary pods have been
    removed (these being perlrebackslash and perlrecharclass, created, I
    presume, to allow this document to be shorter).

M pod/perlre.pod

commit 3644e48f9bee923ec5268c0178616fe146be013a
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 13:46:16 2017 -0700

    perlre: Some clarifications, small corrections

M pod/perlre.pod

commit 57fbbe9652a3493503bd4036187f53a42f121eb7
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 13:30:14 2017 -0700

    perlre: Nits involving C<>, I<>

    This standardizes the usage of single characters inside C<> to be
    C<"x">, which was the most common usage previously in this pod.

    It italicizes e.g., etc.

    It removes trailing blanks on a few lines

M pod/perlre.pod

commit 7e7dae317603f267b98276b60ba434c40271b4da
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 13:21:19 2017 -0700

    perlre: Don't name exact max non-consume depth

    In a couple of places, this pod says that 50 is the recursion limit in
    patterns without consuming any input, but that it is changeable by
    recompiling perl. Therefore, we shouldn't specify the quantity,
    because it might not be the correct value. Further, 50 is currently
    wrong.

M pod/perlre.pod

commit 4a88d526a65a66b761d11870fd1447cc39430c61
Author: Karl Williamson <[email protected]>
Date: Sat Feb 18 13:00:49 2017 -0700

    perlrecharclass: A few clarifications

M pod/perlrecharclass.pod

commit f716ba59778654ff1502b61022ea7adfc7c8b6d3
Author: Karl Williamson <[email protected]>
Date: Fri Feb 17 11:54:07 2017 -0700

    perlretut: "-" is sometimes a metacharacter

M pod/perlretut.pod

commit 15776bb0ab41a4f8dafef4c53c766ccc16f9efa5
Author: Karl Williamson <[email protected]>
Date: Thu Feb 16 19:36:11 2017 -0700

    perlretut: Cleanup, nits

    This adds some C<>, I<>, changes non-literal text from C<> to I<>.

    It changes some phrases that are enclosed in single quotes to the more
    idiomatic double quotes.

    It standardizes on single characters within C<> to be C<'x'>. This is
    not standardized in our documentation, and people change it back and
    forth. I prefer the extra quotes, as it otherwise blends in to the
    background on html displays.

    It converts the few 'regex' terms to 'regexp'.

    It fixes some numbered lists to display not so uglily

    It removes the cautions about the features that are no longer
    experimental

    It corrects some grammar

M pod/perlretut.pod

commit f1dc5bb2995ecccb5fa0346ca80f01f09aaa3e9d
Author: Karl Williamson <[email protected]>
Date: Thu Feb 16 19:30:08 2017 -0700

    Pods: Standardize on one pattern mod style

    There were about 40 cases in pods where //m is used to represent the
    pattern modifier 'm', but nearly 400 where /m is used. Convert to the
    most common representation.

M pod/perlfilter.pod
M pod/perlre.pod
M pod/perlreapi.pod
M pod/perlrequick.pod
M pod/perlretut.pod
-----------------------------------------------------------------------

Summary of changes:
 pod/perlfilter.pod      |   2 +-
 pod/perlre.pod          | 622 ++++++++++++++++++++++++++++++------------------
 pod/perlreapi.pod       |   2 +-
 pod/perlrebackslash.pod |  26 +-
 pod/perlrecharclass.pod |  21 +-
 pod/perlrequick.pod     |   8 +-
 pod/perlretut.pod       | 506 +++++++++++++++++++--------------------
 7 files changed, 669 insertions(+), 518 deletions(-)

diff --git a/pod/perlfilter.pod b/pod/perlfilter.pod
index 60d086401c..b61b6f97b0 100644
--- a/pod/perlfilter.pod
+++ b/pod/perlfilter.pod
@@ -562,7 +562,7 @@ or the byteloader, to translate binary code back to source code. See for example the limitations in L<Switch>, which uses source filters, and thus is does not work inside a string eval, the presence of regexes with embedded newlines that are specified with raw C</.../>
-delimiters and don't have a modifier C<//x> are indistinguishable from
+delimiters and don't have a modifier C</x> are indistinguishable from
 code chunks beginning with the division operator C</>. As a workaround you must use C<m/.../> or C<m?...?> for such patterns. Also, the presence of regexes specified with raw C<?...?> delimiters may cause mysterious
diff --git a/pod/perlre.pod b/pod/perlre.pod
index e3fc62d305..3c902523cd 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -11,24 +11,307 @@ If you haven't used regular expressions before, a tutorial introduction is available in L<perlretut>. If you know just a little about them, a quick-start introduction is available in L<perlrequick>.
-This page assumes you are familiar with regular expression basics, like
-what is a "pattern", what does it look like, and how is it basically used.
-For a reference on how they are used, plus various examples of the same,
-see discussions of C<m//>, C<s///>, C<qr//> and C<"??"> in
-L<perlop/"Regexp Quote-Like Operators">.
+Except for L</The Basics> section, this page assumes you are familiar +with regular expression basics, like what is a "pattern", what does it +look like, and how it is basically used. For a reference on how they +are used, plus various examples of the same, see discussions of C<m//>, +C<s///>, C<qr//> and C<"??"> in L<perlop/"Regexp Quote-Like Operators">. New in v5.22, L<C<use re 'strict'>|re/'strict' mode> applies stricter rules than otherwise when compiling regular expression patterns. It can find things that, while legal, may not be what you intended. +=head2 The Basics +X<regular expression, version 8> X<regex, version 8> X<regexp, version 8> + +Regular expressions are strings with the very particular syntax and +meaning described in this document and auxiliary documents referred to +by this one. The strings are called "patterns". Patterns are used to +determine if some other string, called the "target", has (or doesn't +have) the characteristics specified by the pattern. We call this +"matching" the target string against the pattern. Usually the match is +done by having the target be the first operand, and the pattern be the +second operand, of one of the two binary operators C<=~> and C<!~>, +listed in L<perlop/Binding Operators>; and the pattern will have been +converted from an ordinary string by one of the operators in +L<perlop/"Regexp Quote-Like Operators">, like so: + + $foo =~ m/abc/ + +This evaluates to true if and only if the string in the variable C<$foo> +contains somewhere in it, the sequence of characters "a", "b", then "c". +(The C<=~ m>, or match operator, is described in +L<perlop/m/PATTERN/msixpodualngc>.) + +Patterns that aren't already stored in some variable must be delimitted, +at both ends, by delimitter characters. These are often, as in the +example above, forward slashes, and the typical way a pattern is written +in documentation is with those slashes. 
In most cases, the delimitter +is the same character, fore and aft, but there are a few cases where a +character looks like it has a mirror-image mate, where the opening +version is the beginning delimiter, and the closing one is the ending +delimiter, like + + $foo =~ m<abc> + +Most times, the pattern is evaluated in double-quotish context, but it +is possible to choose delimiters to force single-quotish, like + + $foo =~ m'abc' + +If the pattern contains its delimiter within it, that delimiter must be +escaped. Prefixing it with a backslash (I<e.g.>, C<"/foo\/bar/">) +serves this purpose. + +Any single character in a pattern matches that same character in the +target string, unless the character is a I<metacharacter> with a special +meaning described in this document. A sequence of non-metacharacters +matches the same sequence in the target string, as we saw above with +C<m/abc/>. + +Only a few characters (all or them being ASCII punctuation characters) +are metacharacters. The most commonly used one is a dot C<".">, which +normally matches almost any character (including a dot itself). + +You can cause characters that normally function as metacharacters to be +interpreted literally by prefixing them with a C<"\">, just like the +pattern's delimiter must be escaped if it also occurs within the +pattern. Thus, C<"\."> matches just a literal dot, C<"."> instead of +its normal meaning. This means that the backslash is also a +metacharacter, so C<"\\"> matches a single C<"\">. And a sequence that +contains an escaped metacharacter matches the same sequence (but without +the escape) in the target string. So, the pattern C</blur\\fl/> would +match any target string that contains the sequence C<"blur\fl">. + +The metacharacter C<"|"> is used to match one thing or another. Thus + + $foo =~ m/this|that/ + +is TRUE if and only if C<$foo> contains either the sequence C<"this"> or +the sequence C<"that">. 
Like all metacharacters, prefixing the C<"|"> +with a backslash makes it match the plain punctuation character; in its +case, the VERTICAL LINE. + + $foo =~ m/this\|that/ + +is TRUE if and only if C<$foo> contains the sequence C<"this|that">. + +You aren't limited to just a single C<"|">. + + $foo =~ m/fee|fie|foe|fum/ + +is TRUE if and only if C<$foo> contains any of those 4 sequences from +the children's story "Jack and the Beanstalk". + +As you can see, the C<"|"> binds less tightly than a sequence of +ordinary characters. We can override this by using the grouping +metacharacters, the parentheses C<"("> and C<")">. + + $foo =~ m/th(is|at) thing/ + +is TRUE if and only if C<$foo> contains either the sequence S<C<"this +thing">> or the sequence S<C<"that thing">>. The portions of the string +that match the portions of the pattern enclosed in parentheses are +normally made available separately for use later in the pattern, +substitution, or program. This is called "capturing", and it can get +complicated. See L</Capture groups>. + +The first alternative includes everything from the last pattern +delimiter (C<"(">, C<"(?:"> (described later), I<etc>. or the beginning +of the pattern) up to the first C<"|">, and the last alternative +contains everything from the last C<"|"> to the next closing pattern +delimiter. That's why it's common practice to include alternatives in +parentheses: to minimize confusion about where they start and end. + +Alternatives are tried from left to right, so the first +alternative found for which the entire expression matches, is the one that +is chosen. This means that alternatives are not necessarily greedy. For +example: when matching C<foo|foot> against C<"barefoot">, only the C<"foo"> +part will match, as that is the first alternative tried, and it successfully +matches the target string. (This might not seem important, but it is +important when you are capturing matched text using parentheses.) 
+ +Besides taking away the special meaning of a metacharacter, a prefixed +backslash changes some letter and digit characters away from matching +just themselves to instead have special meaning. These are called +"escape sequences", and all such are described in L<perlrebackslash>. A +backslash sequence (of a letter or digit) that doesn't currently have +special meaning to Perl will raise a warning if warnings are enabled, +as those are reserved for potential future use. + +One such sequence is C<\b>, which matches a boundary of some sort. +C<\b{wb}> and a few others give specialized types of boundaries. +(They are all described in detail starting at +L<perlrebackslash/\b{}, \b, \B{}, \B>.) Note that these don't match +characters, but the zero-width spaces between characters. They are an +example of a L<zero-width assertion|/Assertions>. Consider again, + + $foo =~ m/fee|fie|foe|fum/ + +It evaluates to TRUE if, besides those 4 words, any of the sequences +"feed", "field", "Defoe", "fume", and many others are in C<$foo>. By +judicious use of C<\b> (or better (because it is designed to handle +natural language) C<\b{wb}>), we can make sure that only the Giant's +words are matched: + + $foo =~ m/\b(fee|fie|foe|fum)\b/ + $foo =~ m/\b{wb}(fee|fie|foe|fum)\b{wb}/ + +The final example shows that the characters C<"{"> and C<"}"> are +metacharacters. + +Another use for escape sequences is to specify characters that cannot +(or which you prefer not to) be written literally. These are described +in detail in L<perlrebackslash/Character Escapes>, but the next three +paragraphs briefly describe some of them. + +Various control characters can be written in C language style: C<"\n"> +matches a newline, C<"\t"> a tab, C<"\r"> a carriage return, C<"\f"> a +form feed, I<etc>. + +More generally, C<\I<nnn>>, where I<nnn> is a string of three octal +digits, matches the character whose native code point is I<nnn>. You +can easily run into trouble if you don't have exactly three digits. 
So +always use three, or since Perl 5.14, you can use C<\o{...}> to specify +any number of octal digits. + +Similarly, C<\xI<nn>>, where I<nn> are hexadecimal digits, matches the +character whose native ordinal is I<nn>. Again, not using exactly two +digits is a recipe for disaster, but you can use C<\x{...}> to specify +any number of hex digits. + +Besides being a metacharacter, the C<"."> is an example of a "character +class", something that can match any single character of a given set of +them. In its case, the set is just about all possible characters. Perl +predefines several character classes besides the C<".">; there is a +separate reference page about just these, L<perlrecharclass>. + +You can define your own custom character classes, by putting into your +pattern in the appropriate place(s), a list of all the characters you +want in the set. You do this by enclosing the list within C<[]> bracket +characters. These are called "bracketed character classes" when we are +being precise, but often the word "bracketed" is dropped. (Dropping it +usually doesn't cause confusion.) This means that the C<"["> character +is another metacharacter. It doesn't match anything just by itelf; it +is used only to tell Perl that what follows it is a bracketed character +class. If you want to match a literal left square bracket, you must +escape it, like C<"\[">. The matching C<"]"> is also a metacharacter; +again it doesn't match anything by itself, but just marks the end of +your custom class to Perl. It is an example of a "sometimes +metacharacter". It isn't a metacharacter if there is no corresponding +C<"[">, and matches its literal self: + + print "]" =~ /]/; # prints 1 + +The list of characters within the character class gives the set of +characters matched by the class. C<"[abc]"> matches a single "a" or "b" +or "c". But if the first character after the C<"["> is C<"^">, the +class matches any character not in the list. 
Within a list, the C<"-"> +character specifies a range of characters, so that C<a-z> represents all +characters between "a" and "z", inclusive. If you want either C<"-"> or +C<"]"> itself to be a member of a class, put it at the start of the list +(possibly after a C<"^">), or escape it with a backslash. C<"-"> is +also taken literally when it is at the end of the list, just before the +closing C<"]">. (The following all specify the same class of three +characters: C<[-az]>, C<[az-]>, and C<[a\-z]>. All are different from +C<[a-z]>, which specifies a class containing twenty-six characters, even +on EBCDIC-based character sets.) + +There is lots more to bracketed character classes; full details are in +L<perlrecharclass/Bracketed Character Classes>. + +=head3 Metacharacters +X<metacharacter> +X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]> + +L</The Basics> introduced some of the metacharacters. This section +gives them all. Most of them have the same meaning as in the I<egrep> +command. + +Only the C<"\"> is always a metacharacter. The others are metacharacters +just sometimes. The following tables lists all of them, summarizes +their use, and gives the contexts where they are metacharacters. +Outside those contexts or if prefixed by a C<"\">, they match their +corresponding punctuation character. In some cases, their meaning +varies depending on various pattern modifiers that alter the default +behaviors. See L</Modifiers>. + + + PURPOSE WHERE + \ Escape the next character Always, except when + escaped by another \ + ^ Match the beginning of the string Not in [] + (or line, if /m is used) + ^ Complement the [] class At the beginning of [] + . 
Match any single character except newline Not in [] + (under /s, includes newline) + $ Match the end of the string Not in [], but can + (or before newline at the end of the mean interpolate a + string; or before any newline if /m is scalar + used) + | Alternation Not in [] + () Grouping Not in [] + [ Start Bracketed Character class Not in [] + ] End Bracketed Character class Only in [], and + not first + * Matches the preceding element 0 or more Not in [] + times + + Matches the preceding element 1 or more Not in [] + times + ? Matches the preceding element 0 or 1 Not in [] + times + { Starts a sequence that gives number(s) Not in [] + of times the preceding element can be + matched + { when following certain escape sequences + starts a modifier to the meaning of the + sequence + } End sequence started by { + - Indicates a range Only in [] interior + +Notice that most of the metacharacters lose their special meaning when +they occur in a bracketed character class, except C<"^"> has a different +meaning when it is at the beginning of such a class. And C<"-"> and C<"]"> +are metacharacters only at restricted positions within bracketed +character classes; while C<"}"> is a metacharacter only when closing a +special construct started by C<"{">. + +In double-quotish context, as is usually the case, you need to be +careful about C<"$"> and the non-metacharacter C<"@">. Those could +interpolate variables, which may or may not be what you intended. + +These rules were designed for compactness of expression, rather than +legibility and maintainability. The L</E<sol>x and E<sol>xx> pattern +modifiers allow you to insert white space to improve readability. And +use of S<C<L<re 'strict'|re/'strict' mode>>> adds extra checking to +catch some typos that might silently compile into something unintended. 
+ +By default, the C<"^"> character is guaranteed to match only the +beginning of the string, the C<"$"> character only the end (or before the +newline at the end), and Perl does certain optimizations with the +assumption that the string contains only one line. Embedded newlines +will not be matched by C<"^"> or C<"$">. You may, however, wish to treat a +string as a multi-line buffer, such that the C<"^"> will match after any +newline within the string (except if the newline is the last character in +the string), and C<"$"> will match before any newline. At the +cost of a little more overhead, you can do this by using the +L</C<E<sol>m>> modifier on the pattern match operator. (Older programs +did this by setting C<$*>, but this option was removed in perl 5.10.) +X<^> X<$> X</m> + +To simplify multi-line substitutions, the C<"."> character never matches a +newline unless you use the L<C<E<sol>s>|/s> modifier, which in effect tells +Perl to pretend the string is a single line--even if it isn't. +X<.> X</s> + =head2 Modifiers =head3 Overview -Matching operations can have various modifiers. Modifiers -that relate to the interpretation of the regular expression inside -are listed below. Modifiers that alter the way a regular expression -is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and +The default behavior for matching can be changed, using various +modifiers. Modifiers that relate to the interpretation of the pattern +are listed just below. Modifiers that alter the way a pattern is used +by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and L<perlop/"Gory details of parsing quoted constructs">. =over 4 @@ -36,7 +319,7 @@ L<perlop/"Gory details of parsing quoted constructs">. =item B<C<m>> X</m> X<regex, multiline> X<regexp, multiline> X<regular expression, multiline> -Treat the string as multiple lines. That is, change C<"^"> and C<"$"> from matching +Treat the string being matched against as multiple lines. 
That is, change C<"^"> and C<"$"> from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string. @@ -106,7 +389,7 @@ after the match regardless of the modifier. X</a> X</d> X</l> X</u> These modifiers, all new in 5.14, affect which character-set rules -(Unicode, etc.) are used, as described below in +(Unicode, I<etc>.) are used, as described below in L</Character set modifiers>. =item B<C<n>> @@ -114,7 +397,7 @@ X</n> X<regex, non-capture> X<regexp, non-capture> X<regular expression, non-capture> Prevent the grouping metacharacters C<()> from capturing. This modifier, -new in 5.22, will stop C<$1>, C<$2>, etc... from being filled in. +new in 5.22, will stop C<$1>, C<$2>, I<etc>... from being filled in. "hello" =~ /(hi|hello)/; # $1 is "hello" "hello" =~ /(hi|hello)/n; # $1 is undef @@ -154,7 +437,7 @@ L<perlop/"s/PATTERN/REPLACEMENT/msixpodualngcer"> are: =back Regular expression modifiers are usually written in documentation -as e.g., "the C</x> modifier", even though the delimiter +as I<e.g.>, "the C</x> modifier", even though the delimiter in question might not really be a slash. The modifiers C</imnsxadlup> may also be embedded within the regular expression itself using the C<(?...)> construct, see L</Extended Patterns> below. @@ -189,11 +472,11 @@ You can use L</(?#text)> to create a comment that ends earlier than the end of the current line, but C<text> also can't contain the closing delimiter unless escaped with a backslash. -A common pitfall is to forget that C<#> characters begin a comment under +A common pitfall is to forget that C<"#"> characters begin a comment under C</x> and are not matched literally. Just keep that in mind when trying to puzzle out why a particular C</x> pattern isn't working as expected. 
-Starting in Perl v5.26, if the modifier has a second C<x> within it, +Starting in Perl v5.26, if the modifier has a second C<"x"> within it, it does everything that a single C</x> does, but additionally non-backslashed SPACE and TAB characters within bracketed character classes are also generally ignored, and hence can be added to make the @@ -451,8 +734,8 @@ compatibilities. =head4 /a (and /aa) -This modifier stands for ASCII-restrict (or ASCII-safe). This modifier, -unlike the others, may be doubled-up to increase its effect. +This modifier stands for ASCII-restrict (or ASCII-safe). This modifier +may be doubled-up to increase its effect. When it appears singly, it causes the sequences C<\d>, C<\s>, C<\w>, and the Posix character classes to match only in the ASCII range. They thus @@ -491,7 +774,7 @@ comes to case-insensitive matching. To forbid ASCII/non-ASCII matches (like "k" with C<\N{KELVIN SIGN}>), specify the C<"a"> twice, for example C</aai> or C</aia>. (The first -occurrence of C<"a"> restricts the C<\d>, etc., and the second occurrence +occurrence of C<"a"> restricts the C<\d>, I<etc>., and the second occurrence adds the C</i> restrictions.) But, note that code points outside the ASCII range will use Unicode rules for C</i> matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the @@ -538,7 +821,7 @@ sets the default to C</u>, overriding any plain C<use locale>.) Unlike the mechanisms mentioned above, these affect operations besides regular expressions pattern matching, and so give more consistent results with other operators, including using -C<\U>, C<\l>, etc. in substitution replacements. +C<\U>, C<\l>, I<etc>. in substitution replacements. If none of the above apply, for backwards compatibility reasons, the C</d> modifier is the one in effect by default. As this can lead to @@ -558,49 +841,12 @@ Unicode rules, and neither did all occurrences of C<\N{}>, until 5.12. 
=head2 Regular Expressions -=head3 Metacharacters - -The patterns used in Perl pattern matching evolved from those supplied in -the Version 8 regex routines. (The routines are derived -(distantly) from Henry Spencer's freely redistributable reimplementation -of the V8 routines.) See L<Version 8 Regular Expressions> for -details. - -In particular the following metacharacters have their standard I<egrep>-ish -meanings: -X<metacharacter> -X<\> X<^> X<.> X<$> X<|> X<(> X<()> X<[> X<[]> - - \ Quote the next metacharacter - ^ Match the beginning of the line - . Match any character (except newline) - $ Match the end of the string (or before newline at the end - of the string) - | Alternation - () Grouping - [] Bracketed Character class - -By default, the C<"^"> character is guaranteed to match only the -beginning of the string, the C<"$"> character only the end (or before the -newline at the end), and Perl does certain optimizations with the -assumption that the string contains only one line. Embedded newlines -will not be matched by C<"^"> or C<"$">. You may, however, wish to treat a -string as a multi-line buffer, such that the C<"^"> will match after any -newline within the string (except if the newline is the last character in -the string), and C<"$"> will match before any newline. At the -cost of a little more overhead, you can do this by using the /m modifier -on the pattern match operator. (Older programs did this by setting C<$*>, -but this option was removed in perl 5.10.) -X<^> X<$> X</m> - -To simplify multi-line substitutions, the C<"."> character never matches a -newline unless you use the C</s> modifier, which in effect tells Perl to pretend -the string is a single line--even if it isn't. -X<.> X</s> - =head3 Quantifiers -The following standard quantifiers are recognized: +Quantifiers are used when a particular portion of a pattern needs to +match a certain number (or numbers) of times. 
If there isn't a +quantifier the number of times to match is exactly one. The following +standard quantifiers are recognized: X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}> * Match 0 or more times @@ -610,15 +856,15 @@ X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}> {n,} Match at least n times {n,m} Match at least n but not more than m times -(If a curly bracket occurs in a context other than one of the -quantifiers listed above, where it does not form part of a backslashed -sequence like C<\x{...}>, it is treated as a regular character. -However, a deprecation warning is raised for these -occurrences, and in Perl v5.26, literal uses of a curly bracket will be -required to be escaped, say by preceding them with a backslash (C<"\{">) -or enclosing them within square brackets (C<"[{]">). This change will -allow for future syntax extensions (like making the lower bound of a -quantifier optional), and better error checking of quantifiers.) +(If a non-escaped curly bracket occurs in a context other than one of +the quantifiers listed above, where it does not form part of a +backslashed sequence like C<\x{...}>, it is either a fatal syntax error, +or treated as a regular character, generally with a deprecation warning +raised. To escape it, you can precede it with a backslash (C<"\{">) or +enclose it within square brackets (C<"[{]">). +This change will allow for future syntax extensions (like making the +lower bound of a quantifier optional), and better error checking of +quantifiers). The C<"*"> quantifier is equivalent to C<{0,}>, the C<"+"> quantifier to C<{1,}>, and the C<"?"> quantifier to C<{0,1}>. I<n> and I<m> are limited @@ -659,7 +905,7 @@ For instance, 'aaaa' =~ /a++a/ -will never match, as the C<a++> will gobble up all the C<a>'s in the +will never match, as the C<a++> will gobble up all the C<"a">'s in the string and won't leave any for the remaining part of the pattern. 
This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted @@ -792,20 +1038,21 @@ See L<perlrecharclass/Extended Bracketed Character Classes> for details. =head3 Assertions -Perl defines the following zero-width assertions: +Besides L<C<"^"> and C<"$">|/Metacharacters>, Perl defines the following +zero-width assertions: X<zero-width assertion> X<assertion> X<regex, zero-width assertion> X<regexp, zero-width assertion> X<regular expression, zero-width assertion> X<\b> X<\B> X<\A> X<\Z> X<\z> X<\G> - \b{} Match at Unicode boundary of specified type - \B{} Match where corresponding \b{} doesn't match - \b Match a word boundary - \B Match except at a word boundary - \A Match only at beginning of string - \Z Match only at end of string, or before newline at the end - \z Match only at end of string - \G Match only at pos() (e.g. at the end-of-match position + \b{} Match at Unicode boundary of specified type + \B{} Match where corresponding \b{} doesn't match + \b Match a \w\W or \W\w boundary + \B Match except at a \w\W or \W\w boundary + \A Match only at beginning of string + \Z Match only at end of string, or before newline at the end + \z Match only at end of string + \G Match only at pos() (e.g. at the end-of-match position of prior m//g) A Unicode boundary (C<\b{}>), available starting in v5.22, is a spot @@ -866,7 +1113,7 @@ string: =head3 Capture groups -The bracketing construct C<( ... )> creates capture groups (also referred to as +The grouping construct C<( ... )> creates capture groups (also referred to as capture buffers). To refer to the current contents of a group later on, within the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>) for the second, and so on. 
@@ -880,11 +1127,11 @@ X<named capture buffer> X<regular expression, named capture buffer> X<named capture group> X<regular expression, named capture group> X<%+> X<$+{name}> X<< \k<name> >> There is no limit to the number of captured substrings that you may use. -Groups are numbered with the leftmost open parenthesis being number 1, etc. If +Groups are numbered with the leftmost open parenthesis being number 1, I<etc>. If a group did not match, the associated backreference won't match either. (This can happen if the group is optional, or in a different branch of an alternation.) -You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with +You can omit the C<"g">, and write C<"\1">, I<etc>, but there are some issues with this form, described below. You can also refer to capture groups relatively, by using a negative number, so @@ -921,7 +1168,7 @@ Capture group contents are dynamically scoped and available to you outside the pattern until the end of the enclosing block or until the next successful match, whichever comes first. (See L<perlsyn/"Compound Statements">.) You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">, -etc); or by name via the C<%+> hash, using C<"$+{I<name>}">. +I<etc>); or by name via the C<%+> hash, using C<"$+{I<name>}">. Braces are required in referring to named capture groups, but are optional for absolute or relative numbered ones. Braces are safer when creating a regex by @@ -932,7 +1179,7 @@ is probably not what you intended. The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that there were no named nor relative numbered capture groups. Absolute numbered groups were referred to using C<\1>, -C<\2>, etc., and this notation is still +C<\2>, I<etc>., and this notation is still accepted (and likely always will be). 
But it leads to some ambiguities if there are more than 9 capture groups, as C<\10> could mean either the tenth capture group, or the character whose ordinal in octal is 010 (a backspace in @@ -994,7 +1241,7 @@ variable. X<$+> X<$^N> X<$&> X<$`> X<$'> These special variables, like the C<%+> hash and the numbered match variables -(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped +(C<$1>, C<$2>, C<$3>, I<etc>.) are dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See L<perlsyn/"Compound Statements">.) X<$+> X<$^N> X<$&> X<$`> X<$'> @@ -1009,7 +1256,7 @@ beware that once Perl sees that you need one of C<$&>, C<$`>, or C<$'> anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. -Perl uses the same mechanism to produce C<$1>, C<$2>, etc, so you also +Perl uses the same mechanism to produce C<$1>, C<$2>, I<etc>, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression C<(?: ... )> instead.) But if you never @@ -1150,8 +1397,8 @@ same scope containing the same modifier, so that /((?im)foo(?-m)bar)/ matches all of C<foobar> case insensitively, but uses C</m> rules for -only the C<foo> portion. The C<a> flag overrides C<aa> as well; -likewise C<aa> overrides C<a>. The same goes for C<x> and C<xx>. +only the C<foo> portion. The C<"a"> flag overrides C<aa> as well; +likewise C<aa> overrides C<"a">. The same goes for C<"x"> and C<xx>. Hence, in /(?-x)foo/xx @@ -1164,8 +1411,8 @@ C</x> but NOT C</xx> is turned on for matching C<foo>. (One might mistakenly think that since the inner C<(?x)> is already in the scope of C</x>, that the result would effectively be the sum of them, yielding C</xx>. It doesn't work that way.) 
Similarly, doing something like -C<(?xx-x)foo> turns off all C<x> behavior for matching C<foo>, it is not -that you subtract 1 C<x> from 2 to get 1 C<x> remaining. +C<(?xx-x)foo> turns off all C<"x"> behavior for matching C<foo>, it is not +that you subtract 1 C<"x"> from 2 to get 1 C<"x"> remaining. Any of these modifiers can be set to apply globally to all regular expressions compiled within the scope of a C<use re>. See @@ -1176,15 +1423,15 @@ after the C<"?"> is a shorthand equivalent to C<d-imnsx>. Flags (except C<"d">) may follow the caret to override it. But a minus sign is not legal with it. -Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in -that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and -C<u> modifiers are mutually exclusive: specifying one de-specifies the -others, and a maximum of one (or two C<a>'s) may appear in the +Note that the C<"a">, C<"d">, C<"l">, C<"p">, and C<"u"> modifiers are special in +that they can only be enabled, not disabled, and the C<"a">, C<"d">, C<"l">, and +C<"u"> modifiers are mutually exclusive: specifying one de-specifies the +others, and a maximum of one (or two C<"a">'s) may appear in the construct. Thus, for example, C<(?-p)> will warn when compiled under C<use warnings>; C<(?-d:...)> and C<(?dl:...)> are fatal errors. -Note also that the C<p> modifier is special in that its presence +Note also that the C<"p"> modifier is special in that its presence anywhere in a pattern has a global effect. =item C<(?:pattern)> @@ -1221,9 +1468,9 @@ is equivalent to the more verbose Note that any C<()> constructs enclosed within this one will still capture unless the C</n> modifier is in effect. -Like the L</(?adlupimnsx-imnsx)> construct, C<aa> and C<a> override each -other, as do C<xx> and C<x>. They are not additive. 
So, doing -something like C<(?xx-x:foo)> turns off all C<x> behavior for matching +Like the L</(?adlupimnsx-imnsx)> construct, C<aa> and C<"a"> override each +other, as do C<xx> and C<"x">. They are not additive. So, doing +something like C<(?xx-x:foo)> turns off all C<"x"> behavior for matching C<foo>. Starting in Perl 5.14, a C<"^"> (caret or circumflex accent) immediately @@ -1279,12 +1526,12 @@ Consider the following pattern. The numbers underneath show in which group the captured content will be stored. - # before ---------------branch-reset----------- after + # before ---------------branch-reset----------- after / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x - # 1 2 2 3 2 3 4 + # 1 2 2 3 2 3 4 -Be careful when using the branch reset pattern in combination with -named captures. Named captures are implemented as being aliases to +Be careful when using the branch reset pattern in combination with +named captures. Named captures are implemented as being aliases to numbered groups holding the captures, and that interferes with the implementation of the branch reset pattern. If you are using named captures in a branch reset pattern, it's best to use the same names, @@ -1573,7 +1820,7 @@ pattern captures "A"; Note that this means that there is no way for the inner pattern to refer to a capture group defined outside. (The code block itself can use C<$1>, -etc., to refer to the enclosing pattern's capture groups.) Thus, although +I<etc>., to refer to the enclosing pattern's capture groups.) Thus, although ('a' x 100)=~/(??{'(.)' x 100})/ @@ -1596,9 +1843,10 @@ L<C<(?I<PARNO>)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)> for a different, more efficient way to accomplish the same task. -Executing a postponed regular expression 50 times without consuming any -input string will result in a fatal error. The maximum depth is compiled -into perl, so changing it requires a custom build. 
+Executing a postponed regular expression too many times without +consuming any input string will also result in a fatal error. The depth +at which that happens is compiled into perl, so it can be changed with a +custom build. =item C<(?I<PARNO>)> C<(?-I<PARNO>)> C<(?+I<PARNO>)> C<(?R)> C<(?0)> X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)> @@ -1663,9 +1911,9 @@ the output produced should be the following: $3 = bar(baz)+baz(bop) If there is no corresponding capture group defined, then it is a -fatal error. Recursing deeper than 50 times without consuming any input -string will also result in a fatal error. The maximum depth is compiled -into perl, so changing it requires a custom build. +fatal error. Recursing deeply without consuming any input string will +also result in a fatal error. The depth at which that happens is +compiled into perl, so it can be changed with a custom build. The following shows how using negative indexing can make it easier to embed recursive patterns inside of a C<qr//> construct @@ -1724,7 +1972,7 @@ matched); =item the special symbol C<(R)> (true when evaluated inside of recursion or eval). Additionally the -C<R> may be +C<"R"> may be followed by a number, (which will be true when evaluated when recursing inside of the appropriate group), or by C<&NAME>, in which case it will be true only when evaluated during recursion in the named group. @@ -1851,7 +2099,7 @@ give anything back" semantic is desirable. For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >> (anchored at the beginning of string, as above) will match I<all> -characters C<a> at the beginning of string, leaving no C<a> for +characters C<"a"> at the beginning of string, leaving no C<"a"> for C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>, since the match of the subgroup C<a*> is influenced by the following group C<ab> (see L</"Backtracking">). In particular, C<a*> inside @@ -1903,7 +2151,7 @@ hung. 
However, a tiny change to this pattern which uses C<< (?>...) >> matches exactly when the one above does (verifying this yourself would be a productive exercise), but finishes in a fourth -the time when used on a similar string with 1000000 C<a>s. Be aware, +the time when used on a similar string with 1000000 C<"a">s. Be aware, however, that, when this construct is followed by a quantifier, it currently triggers a warning message under the C<use warnings> pragma or B<-w> switch saying it @@ -1911,7 +2159,7 @@ C<"matches null string many times in regex">. On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>. -This was only 4 times slower on a string with 1000000 C<a>s. +This was only 4 times slower on a string with 1000000 C<"a">s. The "grab all you can, and do not give anything back" semantic is desirable in many situations where on the first sight a simple C<()*> looks like @@ -1966,7 +2214,7 @@ see L</Combining RE Pieces>. A fundamental feature of regular expression matching involves the notion called I<backtracking>, which is currently used (when needed) -by all regular non-possessive expression quantifiers, namely C<*>, C<*?>, C<+>, +by all regular non-possessive expression quantifiers, namely C<"*">, C<*?>, C<"+">, C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized internally, but the general principle outlined here is valid. @@ -2401,9 +2649,9 @@ For instance: 'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x; -will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not +will match, and C<$1> will be C<AB> and C<$2> will be C<"B">, C<$3> will not be set. If another branch in the inner parentheses was matched, such as in the -string 'ACDE', then the C<D> and C<E> would have to be matched as well. +string 'ACDE', then the C<"D"> and C<"E"> would have to be matched as well. 
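The C<(*ACCEPT)> behaviour described in the paragraph above can be checked with a short sketch. It reuses the pattern quoted in the diff; the helper name accept_captures is hypothetical and not part of the commit.

```perl
use strict;
use warnings;

# Hypothetical helper: return ($1, $2, $3) for the pattern quoted in the
# diff, or an empty list when the string does not match.
sub accept_captures {
    my ($string) = @_;
    return $string =~ /(A (A|B(*ACCEPT)|C) D)(E)/x ? ($1, $2, $3) : ();
}

for my $string ('AB', 'ACDE') {
    # Show unset groups explicitly instead of interpolating undef.
    my @groups = map { defined $_ ? "'$_'" : 'unset' } accept_captures($string);
    print "$string: @groups\n";
}
```

For 'AB' the match ends as soon as (*ACCEPT) is reached, so the third group is never set; for 'ACDE' the other branch is taken and the trailing D and E have to match.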
 You can provide an argument, which will be available in the var
 C<$REGMARK> after the match completes.
@@ -2412,101 +2660,6 @@ C<$REGMARK> after the match completes.

 =back

-=head2 Version 8 Regular Expressions
-X<regular expression, version 8> X<regex, version 8> X<regexp, version 8>
-
-In case you're not familiar with the "regular" Version 8 regex
-routines, here are the pattern-matching rules not described above.
-
-Any single character matches itself, unless it is a I<metacharacter>
-with a special meaning described here or above. You can cause
-characters that normally function as metacharacters to be interpreted
-literally by prefixing them with a C<"\"> (e.g., C<"\."> matches a C<".">, not any
-character; "\\" matches a C<"\">). This escape mechanism is also required
-for the character used as the pattern delimiter.
-
-A series of characters matches that series of characters in the target
-string, so the pattern C<blurfl> would match "blurfl" in the target
-string.
-
-You can specify a character class, by enclosing a list of characters
-in C<[]>, which will match any character from the list. If the
-first character after the C<"["> is C<"^">, the class matches any character not
-in the list. Within a list, the C<"-"> character specifies a
-range, so that C<a-z> represents all characters between "a" and "z",
-inclusive. If you want either C<"-"> or C<"]"> itself to be a member of a
-class, put it at the start of the list (possibly after a C<"^">), or
-escape it with a backslash. C<"-"> is also taken literally when it is
-at the end of the list, just before the closing C<"]">. (The
-following all specify the same class of three characters: C<[-az]>,
-C<[az-]>, and C<[a\-z]>. All are different from C<[a-z]>, which
-specifies a class containing twenty-six characters, even on EBCDIC-based
-character sets.) Also, if you try to use the character
-classes C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, or C<\D> as endpoints of
-a range, the C<"-"> is understood literally.
-
-Note also that the whole range idea is rather unportable between
-character sets, except for four situations that Perl handles specially.
-Any subset of the ranges C<[A-Z]>, C<[a-z]>, and C<[0-9]> are guaranteed
-to match the expected subset of ASCII characters, no matter what
-character set the platform is running. The fourth portable way to
-specify ranges is to use the C<\N{...}> syntax to specify either end
-point of the range. For example, C<[\N{U+04}-\N{U+07}]> means to match
-the Unicode code points C<\N{U+04}>, C<\N{U+05}>, C<\N{U+06}>, and
-C<\N{U+07}>, whatever their native values may be on the platform. Under
-L<use re 'strict'|re/'strict' mode> or within a L</C<(?[ ])>>, a warning
-is raised, if enabled, and the other end point of a range which has a
-C<\N{...}> endpoint is not portably specified. For example,
-
- [\N{U+00}-\x06] # Warning under "use re 'strict'".
-
-It is hard to understand without digging what exactly matches ranges
-other than subsets of C<[A-Z]>, C<[a-z]>, and C<[0-9]>. A sound
-principle is to use only ranges that begin from and end at either
-alphabetics of equal case ([a-e], [A-E]), or digits ([0-9]). Anything
-else is unsafe or unclear. If in doubt, spell out the range in full.
-
-Characters may be specified using a metacharacter syntax much like that
-used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return,
-"\f" a form feed, etc. More generally, \I<nnn>, where I<nnn> is a string
-of three octal digits, matches the character whose coded character set value
-is I<nnn>. Similarly, \xI<nn>, where I<nn> are hexadecimal digits,
-matches the character whose ordinal is I<nn>. The expression \cI<x>
-matches the character control-I<x>. Finally, the C<"."> metacharacter
-matches any character except "\n" (unless you use C</s>).
-
-You can specify a series of alternatives for a pattern using C<"|"> to
-separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
-or "foe" in the target string (as would C<f(e|i|o)e>). The
-first alternative includes everything from the last pattern delimiter
-(C<"(">, "(?:", etc. or the beginning of the pattern) up to the first C<"|">, and
-the last alternative contains everything from the last C<"|"> to the next
-closing pattern delimiter. That's why it's common practice to include
-alternatives in parentheses: to minimize confusion about where they
-start and end.
-
-Alternatives are tried from left to right, so the first
-alternative found for which the entire expression matches, is the one that
-is chosen. This means that alternatives are not necessarily greedy. For
-example: when matching C<foo|foot> against "barefoot", only the "foo"
-part will match, as that is the first alternative tried, and it successfully
-matches the target string. (This might not seem important, but it is
-important when you are capturing matched text using parentheses.)
-
-Also remember that C<"|"> is interpreted as a literal within square brackets,
-so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
-
-Within a pattern, you may designate subpatterns for later reference
-by enclosing them in parentheses, and you may refer back to the
-I<n>th subpattern later in the pattern using the metacharacter
-\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
-of their opening parenthesis. A backreference matches whatever
-actually matched the subpattern in the string being examined, not
-the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
-match "0x1234 0x4321", but not "0x1234 01234", because subpattern
-1 matched "0x", even though the rule C<0|0x> could potentially match
-the leading 0 in the second number.
- =head2 Warning on C<\1> Instead of C<$1> Some people get too used to writing things like: @@ -2546,10 +2699,10 @@ loops using regular expressions, with something as innocuous as: 'foo' =~ m{ ( o? )* }x; -The C<o?> matches at the beginning of C<'foo'>, and since the position +The C<o?> matches at the beginning of "C<foo>", and since the position in the string is not moved by the match, C<o?> would match again and again because of the C<"*"> quantifier. Another common way to create a similar cycle -is with the looping modifier C<//g>: +is with the looping modifier C</g>: @matches = ( 'foo' =~ m{ o? }xg ); @@ -2637,8 +2790,8 @@ Each of the elementary pieces of regular expressions which were described before (such as C<ab> or C<\Z>) could match at most one substring at the given position of the input string. However, in a typical regular expression these elementary pieces are combined into more complicated -patterns using combining operators C<ST>, C<S|T>, C<S*> etc. -(in these examples C<S> and C<T> are regular subexpressions). +patterns using combining operators C<ST>, C<S|T>, C<S*> I<etc>. +(in these examples C<"S"> and C<"T"> are regular subexpressions). Such combinations can include alternatives, leading to a problem of choice: if we match a regular expression C<a|ab> against C<"abc">, will it match @@ -2656,28 +2809,28 @@ by the question of "which matches are better, and which are worse?". Again, for elementary pieces there is no such question, since at most one match at a given position is possible. This section describes the notion of better/worse for combining operators. In the description -below C<S> and C<T> are regular subexpressions. +below C<"S"> and C<"T"> are regular subexpressions. =over 4 =item C<ST> -Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are -substrings which can be matched by C<S>, C<B> and C<B'> are substrings -which can be matched by C<T>. 
+Consider two possible matches, C<AB> and C<A'B'>, C<"A"> and C<A'> are +substrings which can be matched by C<"S">, C<"B"> and C<B'> are substrings +which can be matched by C<"T">. -If C<A> is a better match for C<S> than C<A'>, C<AB> is a better +If C<"A"> is a better match for C<"S"> than C<A'>, C<AB> is a better match than C<A'B'>. -If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if -C<B> is a better match for C<T> than C<B'>. +If C<"A"> and C<A'> coincide: C<AB> is a better match than C<AB'> if +C<"B"> is a better match for C<"T"> than C<B'>. =item C<S|T> -When C<S> can match, it is a better match than when only C<T> can match. +When C<"S"> can match, it is a better match than when only C<"T"> can match. -Ordering of two matches for C<S> is the same as for C<S>. Similar for -two matches for C<T>. +Ordering of two matches for C<"S"> is the same as for C<"S">. Similar for +two matches for C<"T">. =item C<S{REPEAT_COUNT}> @@ -2701,18 +2854,18 @@ Same as C<S{0,1}?>, C<S{0,BIG_NUMBER}?>, C<S{1,BIG_NUMBER}?> respectively. =item C<< (?>S) >> -Matches the best match for C<S> and only that. +Matches the best match for C<"S"> and only that. =item C<(?=S)>, C<(?<=S)> -Only the best match for C<S> is considered. (This is important only if -C<S> has capturing parentheses, and backreferences are used somewhere +Only the best match for C<"S"> is considered. (This is important only if +C<"S"> has capturing parentheses, and backreferences are used somewhere else in the whole regular expression.) =item C<(?!S)>, C<(?<!S)> For this grouping operator there is no need to describe the ordering, since -only whether or not C<S> can match is important. +only whether or not C<"S"> can match is important. =item C<(??{ EXPR })>, C<(?I<PARNO>)> @@ -2774,7 +2927,7 @@ this: } Now C<use customre> enables the new escape in constant regular -expressions, i.e., those without any runtime variable interpolations. 
+expressions, I<i.e.>, those without any runtime variable interpolations. As documented in L<overload>, this conversion will work only over literal parts of regular expressions. For C<\Y|$re\Y|> the variable part of this regular expression needs to be converted explicitly @@ -2788,7 +2941,7 @@ part of this regular expression needs to be converted explicitly =head2 Embedded Code Execution Frequency -The exact rules for how often (??{}) and (?{}) are executed in a pattern +The exact rules for how often C<(??{})> and C<(?{})> are executed in a pattern are unspecified. In the case of a successful match you can assume that they DWIM and will be executed in left to right order the appropriate number of times in the accepting path of the pattern as would any other @@ -2846,7 +2999,7 @@ Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>. =head1 BUGS There are a number of issues with regard to case-insensitive matching -in Unicode rules. See C<i> under L</Modifiers> above. +in Unicode rules. See C<"i"> under L</Modifiers> above. This document varies from difficult to understand to completely and utterly opaque. The wandering prose riddled with jargon is @@ -2857,6 +3010,11 @@ from the reference content. =head1 SEE ALSO +The syntax of patterns used in Perl pattern matching evolved from those +supplied in the Bell Labs Research Unix 8th Edition (Version 8) regex +routines. (The code is actually derived (distantly) from Henry +Spencer's freely redistributable reimplementation of those V8 routines.) + L<perlrequick>. L<perlretut>. diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod index c11ff9e52b..52e6b0f87b 100644 --- a/pod/perlreapi.pod +++ b/pod/perlreapi.pod @@ -745,7 +745,7 @@ C<regexp_paren_pair> struct is defined as follows: If C<< ->offs[num].start >> or C<< ->offs[num].end >> is C<-1> then that capture group did not match. 
C<< ->offs[0].start/end >> represents C<$&> (or -C<${^MATCH}> under C<//p>) and C<< ->offs[paren].end >> matches C<$$paren> where +C<${^MATCH}> under C</p>) and C<< ->offs[paren].end >> matches C<$$paren> where C<$paren >= 1>. =head2 C<precomp> C<prelen> diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 52d44ae604..72c7b489ec 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -69,8 +69,8 @@ as C<Not in [].> \b{}, \b Boundary. (\b is a backspace in []). \B{}, \B Not a boundary. Not in []. \cX Control-X. - \d Character class for digits. - \D Character class for non-digits. + \d Match any digit character. + \D Match any character that isn't a digit. \e Escape character. \E Turn off \Q, \L and \U processing. Not in []. \f Form feed. @@ -78,31 +78,31 @@ as C<Not in [].> \g{}, \g1 Named, absolute or relative backreference. Not in []. \G Pos assertion. Not in []. - \h Character class for horizontal whitespace. - \H Character class for non horizontal whitespace. + \h Match any horizontal whitespace character. + \H Match any character that isn't horizontal whitespace. \k{}, \k<>, \k'' Named backreference. Not in []. \K Keep the stuff left of \K. Not in []. \l Lowercase next character. Not in []. \L Lowercase till \E. Not in []. \n (Logical) newline character. - \N Any character but newline. Not in []. + \N Match any character but newline. Not in []. \N{} Named or numbered (Unicode) character or sequence. \o{} Octal escape sequence. - \p{}, \pP Character with the given Unicode property. - \P{}, \PP Character without the given Unicode property. + \p{}, \pP Match any character with the given Unicode property. + \P{}, \PP Match any character without the given property. \Q Quote (disable) pattern metacharacters till \E. Not in []. \r Return character. \R Generic new line. Not in []. - \s Character class for whitespace. - \S Character class for non whitespace. + \s Match any whitespace character. 
+ \S Match any character that isn't a whitespace. \t Tab character. \u Titlecase next character. Not in []. \U Uppercase till \E. Not in []. - \v Character class for vertical whitespace. - \V Character class for non vertical whitespace. - \w Character class for word characters. - \W Character class for non-word characters. + \v Match any vertical whitespace character. + \V Match any character that isn't vertical whitespace + \w Match any word character. + \W Match any character that isn't a word character. \x{}, \x00 Hexadecimal escape sequence. \X Unicode "extended grapheme cluster". Not in []. \z End of string. Not in []. diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod index 22f71ab211..ab01142397 100644 --- a/pod/perlrecharclass.pod +++ b/pod/perlrecharclass.pod @@ -27,9 +27,11 @@ to mean just the bracketed form. Certainly, most Perl documentation does that. The dot (or period), C<.> is probably the most used, and certainly the most well-known character class. By default, a dot matches any character, except for the newline. That default can be changed to -add matching the newline by using the I<single line> modifier: either +add matching the newline by using the I<single line> modifier: for the entire regular expression with the C</s> modifier, or -locally with C<(?s)>. (The C<L</\N>> backslash sequence, described +locally with C<(?s)> (and even globally within the scope of +L<C<use re '/s'>|re/'E<sol>flags' mode>). (The C<L</\N>> backslash +sequence, described below, matches any character except newline without regard to the I<single line> modifier.) @@ -176,7 +178,7 @@ are generally used to add auxiliary markings to letters. C<\w> matches the platform's native underscore character plus whatever the locale considers to be alphanumeric. -=item if Unicode rules are in effect ... +=item if instead, Unicode rules are in effect ... C<\w> matches exactly what C<\p{Word}> matches. @@ -234,7 +236,7 @@ in the table below. 
C<\s> matches whatever the locale considers to be whitespace. -=item if Unicode rules are in effect ... +=item if instead, Unicode rules are in effect ... C<\s> matches exactly the characters shown with an "s" column in the table below. @@ -498,10 +500,11 @@ consisting of the two characters matched against. Like the other instance where a bracketed class can match multiple characters, and for similar reasons, the class must not be inverted, and the named sequence may not appear in a range, even one where it is both endpoints. If -these happen, it is a fatal error if the character class is within an -extended L<C<(?[...])>|/Extended Bracketed Character Classes> -class; and only the first code point is used (with -a C<regexp>-type warning raised) otherwise. +these happen, it is a fatal error if the character class is within the +scope of L<C<use re 'strict>|re/'strict' mode>, or within an extended +L<C<(?[...])>|/Extended Bracketed Character Classes> class; otherwise +only the first code point is used (with a C<regexp>-type warning +raised). =back @@ -946,7 +949,7 @@ just the platform's native tab and space characters. =back -=item if Unicode rules are in effect ... +=item if instead, Unicode rules are in effect ... The POSIX class matches the same as the Full-range counterpart. diff --git a/pod/perlrequick.pod b/pod/perlrequick.pod index 3cda44ab2d..5832cfa359 100644 --- a/pod/perlrequick.pod +++ b/pod/perlrequick.pod @@ -381,9 +381,9 @@ no string left to it, so it matches 0 times. There are a few more things you might want to know about matching operators. -The global modifier C<//g> allows the matching operator to match +The global modifier C</g> allows the matching operator to match within a string as many times as possible. In scalar context, -successive matches against a string will have C<//g> jump from match +successive matches against a string will have C</g> jump from match to match, keeping track of position in the string as it goes along. 
You can get or set the position with the C<pos()> function. For example, @@ -401,9 +401,9 @@ prints A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the -C<//c>, as in C</regex/gc>. +C</c>, as in C</regex/gc>. -In list context, C<//g> returns a list of matched groupings, or if +In list context, C</g> returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex. So @words = ($x =~ /(\w+)/g); # matches, diff --git a/pod/perlretut.pod b/pod/perlretut.pod index 87ef42b145..9c1671edfe 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -23,30 +23,30 @@ characteristics. The string is most often some text, such as a line, sentence, web page, or even a whole book, but less commonly it could be some binary data as well. Suppose we want to determine if the text in variable, C<$var> contains -the sequence of characters C<m> C<u> C<s> C<h> C<r> C<o> C<o> C<m> +the sequence of characters S<C<m u s h r o o m>> (blanks added for legibility). We can write in Perl $var =~ m/mushroom/ The value of this expression will be TRUE if C<$var> contains that sequence of characters, and FALSE otherwise. The portion enclosed in -C<"E<sol>"> characters denotes the characteristic we are looking for. +C<'E<sol>'> characters denotes the characteristic we are looking for. We use the term I<pattern> for it. The process of looking to see if the pattern occurs in the string is called I<matching>, and the C<"=~"> -operator along with the C<"m//"> tell Perl to try to match the pattern +operator along with the C<m//> tell Perl to try to match the pattern against the string. Note that the pattern is also a string, but a very special kind of one, as we will see. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages -and the patterns used to list files in a directory, e.g., C<ls *.txt> -or C<dir *.*>. 
In Perl, the patterns described by regular expressions +and the patterns used to list files in a directory, I<e.g.>, "C<ls *.txt>" +or "C<dir *.*>". In Perl, the patterns described by regular expressions are used not only to search strings, but to also extract desired parts of strings, and to do search and replace operations. Regular expressions have the undeserved reputation of being abstract and difficult to understand. This really stems simply because the notation used to express them tends to be terse and dense, and not -because of inherent complexity. We recommend using the C<"/x"> regular +because of inherent complexity. We recommend using the C</x> regular expression modifier (described below) along with plenty of white space to make them less dense, and easier to read. Regular expressions are constructed using @@ -64,7 +64,7 @@ comfortable with the basics and hungry for more power tools. It discusses the more advanced regular expression operators and introduces the latest cutting-edge innovations. -A note: to save time, 'regular expression' is often abbreviated as +A note: to save time, "regular expression" is often abbreviated as regexp or regex. Regexp is a more natural abbreviation than regex, but is harder to pronounce. The Perl pod documentation is evenly split on regexp vs regex; in Perl, there is more than one way to abbreviate it. @@ -112,7 +112,7 @@ be reversed by using the C<!~> operator: The literal string in the regexp can be replaced by a variable: - $greeting = "World"; + my $greeting = "World"; if ("Hello World" =~ /$greeting/) { print "It matches\n"; } @@ -140,7 +140,7 @@ to arbitrary delimiters by putting an C<'m'> out front: # '/' becomes an ordinary char C</World/>, C<m!World!>, and C<m{World}> all represent the -same thing. When, e.g., the quote (C<">) is used as a delimiter, the forward +same thing. 
When, I<e.g.>, the quote (C<'"'>) is used as a delimiter, the forward slash C<'/'> becomes an ordinary character and can be used in this regexp without trouble. @@ -154,10 +154,10 @@ Let's consider how different regexps would match C<"Hello World">: The first regexp C<world> doesn't match because regexps are case-sensitive. The second regexp matches because the substring S<C<'o W'>> occurs in the string S<C<"Hello World">>. The space -character ' ' is treated like any other character in a regexp and is +character C<' '> is treated like any other character in a regexp and is needed to match in this case. The lack of a space character is the reason the third regexp C<'oW'> doesn't match. The fourth regexp -C<'World '> doesn't match because there is a space at the end of the +"C<World >" doesn't match because there is a space at the end of the regexp, but not at the end of the string. The lesson here is that regexps must match a part of the string I<exactly> in order for the statement to be true. @@ -169,15 +169,16 @@ always match at the earliest possible point in the string: "That hat is red" =~ /hat/; # matches 'hat' in 'That' With respect to character matching, there are a few more points you -need to know about. First of all, not all characters can be used 'as -is' in a match. Some characters, called I<metacharacters>, are reserved +need to know about. First of all, not all characters can be used "as +is" in a match. Some characters, called I<metacharacters>, are reserved for use in regexp notation. 
The metacharacters are
- {}[]()^$.|*+?\
+ {}[]()^$.|*+?-\
 The significance of each of these will be explained in the rest of the tutorial, but for now, it is important only to know
-that a metacharacter can be matched by putting a backslash before it:
+that a metacharacter can be matched as-is by putting a backslash before
+it:
 "2+2=4" =~ /2+2/; # doesn't match, + is a metacharacter "2+2=4" =~ /2\+2/; # matches, \+ is treated like an ordinary +
@@ -210,8 +211,8 @@ which don't have printable character equivalents and are instead represented by I<escape sequences>. Common examples are C<\t> for a tab, C<\n> for a newline, C<\r> for a carriage return and C<\a> for a bell (or alert). If your string is better thought of as a sequence of arbitrary
-bytes, the octal escape sequence, e.g., C<\033>, or hexadecimal escape
-sequence, e.g., C<\x1B> may be a more natural representation for your
+bytes, the octal escape sequence, I<e.g.>, C<\033>, or hexadecimal escape
+sequence, I<e.g.>, C<\x1B> may be a more natural representation for your
 bytes. Here are some examples of escapes: "1000\t2000" =~ m(0\t2) # matches
@@ -270,9 +271,9 @@ C</$regexp/> use the default variable C<$_> implicitly. With all of the regexps above, if the regexp matched anywhere in the string, it was considered a match. Sometimes, however, we'd like to specify I<where> in the string the regexp should try to match. To do
-this, we would use the I<anchor> metacharacters C<^> and C<$>. The
-anchor C<^> means match at the beginning of the string and the anchor
-C<$> means match at the end of the string, or before a newline at the
+this, we would use the I<anchor> metacharacters C<'^'> and C<'$'>. The
+anchor C<'^'> means match at the beginning of the string and the anchor
+C<'$'> means match at the end of the string, or before a newline at the
 end of the string. Here is how they are used: "housekeeper" =~ /keeper/; # matches
@@ -280,13 +281,13 @@ end of the string.
Here is how they are used: "housekeeper" =~ /keeper$/; # matches "housekeeper\n" =~ /keeper$/; # matches
-The second regexp doesn't match because C<^> constrains C<keeper> to
+The second regexp doesn't match because C<'^'> constrains C<keeper> to
 match only at the beginning of the string, but C<"housekeeper"> has keeper starting in the middle. The third regexp does match, since the
-C<$> constrains C<keeper> to match only at the end of the string.
+C<'$'> constrains C<keeper> to match only at the end of the string.
-When both C<^> and C<$> are used at the same time, the regexp has to
-match both the beginning and the end of the string, i.e., the regexp
+When both C<'^'> and C<'$'> are used at the same time, the regexp has to
+match both the beginning and the end of the string, I<i.e.>, the regexp
 matches the whole string. Consider "keeper" =~ /^keep$/; # doesn't match
@@ -295,7 +296,7 @@ matches the whole string. Consider The first regexp doesn't match because the string has more to it than C<keep>. Since the second regexp is exactly the string, it
-matches. Using both C<^> and C<$> in a regexp forces the complete
+matches. Using both C<'^'> and C<'$'> in a regexp forces the complete
 string to match, so it gives you complete control over which strings match and which don't. Suppose you are looking for a fellow named bert, off in a string by himself:
@@ -351,13 +352,13 @@ operation. We will meet other modifiers later in the tutorial. We saw in the section above that there were ordinary characters, which represented themselves, and special characters, which needed a
-backslash C<\> to represent themselves. The same is true in a
+backslash C<'\'> to represent themselves. The same is true in a
 character class, but the sets of ordinary and special characters inside a character class are different than those outside a character class. The special characters for a character class are C<-]\^$> (and the pattern delimiter, whatever it is).
-C<]> is special because it denotes the end of a character class. C<$> is
-special because it denotes a scalar variable. C<\> is special because
+C<']'> is special because it denotes the end of a character class. C<'$'> is
+special because it denotes a scalar variable. C<'\'> is special because
 it is used in escape sequences, just like above. Here is how the special characters C<]$\> are handled:
@@ -368,7 +369,7 @@ special characters C<]$\> are handled: /[\\$x]at/; # matches '\at', 'bat, 'cat', or 'rat' The last two are a little tricky. In C<[\$x]>, the backslash protects
-the dollar sign, so the character class has two members C<$> and C<x>.
+the dollar sign, so the character class has two members C<'$'> and C<'x'>.
 In C<[\\$x]>, the backslash is protected, so C<$x> is treated as a variable and substituted in double quote fashion.
@@ -388,7 +389,7 @@ If C<'-'> is the first or last character in a character class, it is treated as an ordinary character; C<[-ab]>, C<[ab-]> and C<[a\-b]> are all equivalent.
-The special character C<^> in the first position of a character class
+The special character C<'^'> in the first position of a character class
 denotes a I<negated character class>, which matches any character but those in the brackets. Both C<[...]> and C<[^...]> must match a character, or the match fails. Then
@@ -401,7 +402,7 @@ character, or the match fails. Then Now, even C<[0-9]> can be a bother to write multiple times, so in the interest of saving keystrokes and making regexps more readable, Perl has several abbreviations for common character classes, as shown below.
-Since the introduction of Unicode, unless the C<//a> modifier is in
+Since the introduction of Unicode, unless the C</a> modifier is in
 effect, these character classes match more than just a few characters in the ASCII range.
@@ -409,46 +410,46 @@ the ASCII range.
=item *
-\d matches a digit, not just [0-9] but also digits from non-roman scripts
+C<\d> matches a digit, not just C<[0-9]> but also digits from non-roman scripts
 =item *
-\s matches a whitespace character, the set [\ \t\r\n\f] and others
+C<\s> matches a whitespace character, the set C<[\ \t\r\n\f]> and others
 =item *
-\w matches a word character (alphanumeric or _), not just [0-9a-zA-Z_]
+C<\w> matches a word character (alphanumeric or C<'_'>), not just C<[0-9a-zA-Z_]>
 but also digits and characters from non-roman scripts
 =item *
-\D is a negated \d; it represents any other character than a digit, or [^\d]
+C<\D> is a negated C<\d>; it represents any other character than a digit, or C<[^\d]>
 =item *
-\S is a negated \s; it represents any non-whitespace character [^\s]
+C<\S> is a negated C<\s>; it represents any non-whitespace character C<[^\s]>
 =item *
-\W is a negated \w; it represents any non-word character [^\w]
+C<\W> is a negated C<\w>; it represents any non-word character C<[^\w]>
 =item *
-The period '.' matches any character but "\n" (unless the modifier C<//s> is
+The period C<'.'> matches any character but C<"\n"> (unless the modifier C</s> is
 in effect, as explained below).
 =item *
-\N, like the period, matches any character but "\n", but it does so
-regardless of whether the modifier C<//s> is in effect.
+C<\N>, like the period, matches any character but C<"\n">, but it does so
+regardless of whether the modifier C</s> is in effect.
 =back
-The C<//a> modifier, available starting in Perl 5.14, is used to
-restrict the matches of \d, \s, and \w to just those in the ASCII range.
+The C</a> modifier, available starting in Perl 5.14, is used to
+restrict the matches of C<\d>, C<\s>, and C<\w> to just those in the ASCII range.
 It is useful to keep your program from being needlessly exposed to full Unicode (and its accompanying security considerations) when all you want
-is to process English-like text. (The "a" may be doubled, C<//aa>, to
+is to process English-like text. (The "a" may be doubled, C</aa>, to
 provide even more restrictions, preventing case-insensitive matching of ASCII with non-ASCII characters; otherwise a Unicode "Kelvin Sign" would caselessly match a "k" or "K".)
@@ -510,48 +511,48 @@ of it as empty. Then This behavior is convenient, because we usually want to ignore newlines when we count and match characters in a line. Sometimes,
-however, we want to keep track of newlines. We might even want C<^>
-and C<$> to anchor at the beginning and end of lines within the
+however, we want to keep track of newlines. We might even want C<'^'>
+and C<'$'> to anchor at the beginning and end of lines within the
 string, rather than just the beginning and end of the string. Perl allows us to choose between ignoring and paying attention to newlines
-by using the C<//s> and C<//m> modifiers. C<//s> and C<//m> stand for
+by using the C</s> and C</m> modifiers. C</s> and C</m> stand for
 single line and multi-line and they determine whether a string is to be treated as one continuous string, or as a set of lines. The two modifiers affect two aspects of how the regexp is interpreted: 1) how
-the C<'.'> character class is defined, and 2) where the anchors C<^>
-and C<$> are able to match. Here are the four possible combinations:
+the C<'.'> character class is defined, and 2) where the anchors C<'^'>
+and C<'$'> are able to match. Here are the four possible combinations:
 =over 4
 =item *
-no modifiers (//): Default behavior. C<'.'> matches any character
-except C<"\n">. C<^> matches only at the beginning of the string and
-C<$> matches only at the end or before a newline at the end.
+no modifiers: Default behavior. C<'.'> matches any character
+except C<"\n">. C<'^'> matches only at the beginning of the string and
+C<'$'> matches only at the end or before a newline at the end.
 =item *
-s modifier (//s): Treat string as a single long line. C<'.'> matches
-any character, even C<"\n">. C<^> matches only at the beginning of
-the string and C<$> matches only at the end or before a newline at the
+s modifier (C</s>): Treat string as a single long line. C<'.'> matches
+any character, even C<"\n">. C<'^'> matches only at the beginning of
+the string and C<'$'> matches only at the end or before a newline at the
 end.
 =item *
-m modifier (//m): Treat string as a set of multiple lines. C<'.'>
-matches any character except C<"\n">. C<^> and C<$> are able to match
+m modifier (C</m>): Treat string as a set of multiple lines. C<'.'>
+matches any character except C<"\n">. C<'^'> and C<'$'> are able to match
 at the start or end of I<any> line within the string.
 =item *
-both s and m modifiers (//sm): Treat string as a single long line, but
+both s and m modifiers (C</sm>): Treat string as a single long line, but
 detect multiple lines. C<'.'> matches any character, even
-C<"\n">. C<^> and C<$>, however, are able to match at the start or end
+C<"\n">. C<'^'> and C<'$'>, however, are able to match at the start or end
 of I<any> line within the string.
 =back
-Here are examples of C<//s> and C<//m> in action:
+Here are examples of C</s> and C</m> in action:
 $x = "There once was a girl\nWho programmed in Perl\n";
@@ -565,11 +566,11 @@ Here are examples of C<//s> and C<//m> in action: $x =~ /girl.Who/m; # doesn't match, "." doesn't match "\n" $x =~ /girl.Who/sm; # matches, "." matches "\n"
-Most of the time, the default behavior is what is wanted, but C<//s> and
-C<//m> are occasionally very useful. If C<//m> is being used, the start
+Most of the time, the default behavior is what is wanted, but C</s> and
+C</m> are occasionally very useful. If C</m> is being used, the start
 of the string can still be matched with C<\A> and the end of the string can still be matched with the anchors C<\Z> (matches both the end and
-the newline before, like C<$>), and C<\z> (matches only the end):
+the newline before, like C<'$'>), and C<\z> (matches only the end):
 $x =~ /^Who/m; # matches, "Who" at start of second line $x =~ /\AWho/m; # doesn't match, "Who" is not at start of string
@@ -588,7 +589,7 @@ choices are described in the next section. Sometimes we would like our regexp to be able to match different possible words or character strings. This is accomplished by using
-the I<alternation> metacharacter C<|>. To match C<dog> or C<cat>, we
+the I<alternation> metacharacter C<'|'>. To match C<dog> or C<cat>, we
 form the regexp C<dog|cat>. As before, Perl will try to match the regexp at the earliest possible point in the string. At each character position, Perl will first try to match the first
@@ -662,7 +663,7 @@ C<"20"> is two digits. The process of trying one alternative, seeing if it matches, and moving on to the next alternative, while going back in the string from where the previous alternative was tried, if it doesn't, is called
-I<backtracking>. The term 'backtracking' comes from the idea that
+I<backtracking>. The term "backtracking" comes from the idea that
 matching a regexp is like a walk in the woods. Successfully matching a regexp is like arriving at a destination. There are many possible trailheads, one for each string position, and each one is tried in
@@ -680,62 +681,59 @@ of what Perl does when it tries to match the regexp
 =over 4
-=item Z<>0
+=item Z<>0. Start with the first letter in the string C<'a'>.
-Start with the first letter in the string 'a'.
+E<nbsp>
-=item Z<>1
+=item Z<>1. Try the first alternative in the first group C<'abd'>.
-Try the first alternative in the first group 'abd'.
+E<nbsp>
-=item Z<>2
+=item Z<>2. Match C<'a'> followed by C<'b'>. So far so good.
-Match 'a' followed by 'b'. So far so good.
+E<nbsp>
-=item Z<>3
+=item Z<>3. C<'d'> in the regexp doesn't match C<'c'> in the string - a
+dead end. So backtrack two characters and pick the second alternative
+in the first group C<'abc'>.
-'d' in the regexp doesn't match 'c' in the string - a dead
-end. So backtrack two characters and pick the second alternative in
-the first group 'abc'.
+E<nbsp>
-=item Z<>4
+=item Z<>4. Match C<'a'> followed by C<'b'> followed by C<'c'>. We are on a roll
+and have satisfied the first group. Set C<$1> to C<'abc'>.
-Match 'a' followed by 'b' followed by 'c'. We are on a roll
-and have satisfied the first group. Set $1 to 'abc'.
+E<nbsp>
-=item Z<>5
+=item Z<>5 Move on to the second group and pick the first alternative C<'df'>.
-Move on to the second group and pick the first alternative
-'df'.
+E<nbsp>
-=item Z<>6
+=item Z<>6 Match the C<'d'>.
-Match the 'd'.
+E<nbsp>
-=item Z<>7
-
-'f' in the regexp doesn't match 'e' in the string, so a dead
+=item Z<>7. C<'f'> in the regexp doesn't match C<'e'> in the string, so a dead
 end. Backtrack one character and pick the second alternative in the
-second group 'd'.
+second group C<'d'>.
-=item Z<>8
+E<nbsp>
-'d' matches. The second grouping is satisfied, so set $2 to
-'d'.
+=item Z<>8. C<'d'> matches. The second grouping is satisfied, so set
+C<$2> to C<'d'>.
-=item Z<>9
+E<nbsp>
-We are at the end of the regexp, so we are done! We have
-matched 'abcd' out of the string "abcde".
+=item Z<>9. We are at the end of the regexp, so we are done! We have
+matched C<'abcd'> out of the string C<"abcde">.
 =back
 There are a couple of things to note about this analysis. First, the
-third alternative in the second group 'de' also allows a match, but we
+third alternative in the second group C<'de'> also allows a match, but we
 stopped before we got to it - at a given character position, leftmost wins. Second, we were able to get a match at the first character
-position of the string 'a'. If there were no matches at the first
-position, Perl would move to the second character position 'b' and
+position of the string C<'a'>. If there were no matches at the first
+position, Perl would move to the second character position C<'b'> and
 attempt the match all over again. Only when all possible paths at all possible character positions have been exhausted does Perl give up and declare S<C<$string =~ /(abd|abc)(df|d|de)/;>> to be false.
@@ -752,7 +750,7 @@ The grouping metacharacters C<()> also serve another completely different function: they allow the extraction of the parts of a string that matched. This is very useful to find out what matched and for text processing in general. For each grouping, the part that matched
-inside goes into the special variables C<$1>, C<$2>, etc. They can be
+inside goes into the special variables C<$1>, C<$2>, I<etc>. They can be
 used just as ordinary variables: # extract hours, minutes, seconds
@@ -772,7 +770,7 @@ C<($1,$2,$3)>. So we could write the code more compactly as If the groupings in a regexp are nested, C<$1> gets the group with the leftmost opening parenthesis, C<$2> the next opening parenthesis,
-etc. Here is a regexp with nested groups:
+I<etc>. Here is a regexp with nested groups:
 /(ab(cd|ef)((gi)|j))/; 1 2 34
@@ -784,7 +782,7 @@ or it remains undefined. For convenience, Perl sets C<$+> to the string held by the highest numbered C<$1>, C<$2>,... that got assigned (and, somewhat related, C<$^N> to the
-value of the C<$1>, C<$2>,... most-recently assigned; i.e. the C<$1>,
+value of the C<$1>, C<$2>,... most-recently assigned; I<i.e.> the C<$1>,
 C<$2>,... associated with the rightmost closing parenthesis used in the match).
@@ -796,12 +794,12 @@ the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply matching variables that can be used I<inside> a regexp. This is a really nice feature; what matches later in a regexp is made to depend on what matched earlier in the regexp.
Suppose we wanted to look
-for doubled words in a text, like 'the the'. The following regexp finds
+for doubled words in a text, like "the the". The following regexp finds
 all 3-letter doubles with a space in between: /\b(\w\w\w)\s\g1\b/;
-The grouping assigns a value to \g1, so that the same 3-letter sequence
+The grouping assigns a value to C<\g1>, so that the same 3-letter sequence
 is used for both parts. A similar task is to find words consisting of two identical parts:
@@ -815,7 +813,7 @@ A similar task is to find words consisting of two identical parts: papa The regexp has a single grouping which considers 4-letter
-combinations, then 3-letter combinations, etc., and uses C<\g1> to look for
+combinations, then 3-letter combinations, I<etc>., and uses C<\g1> to look for
 a repeat. Although C<$1> and C<\g1> represent the same thing, care should be taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
@@ -869,7 +867,7 @@ capture group is accessible through the C<%+> hash. Assuming that we have to match calendar dates which may be given in one of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write
-three suitable patterns where we use 'd', 'm' and 'y' respectively as the
+three suitable patterns where we use C<'d'>, C<'m'> and C<'y'> respectively as the
 names of the groups capturing the pertaining components of a date. The matching operation combines the three patterns as alternatives:
@@ -935,7 +933,7 @@ prints Even if there are no groupings in a regexp, it is still possible to find out what exactly matched in a string. If you use them, Perl will set C<$`> to the part of the string before the match, will set C<$&>
-to the part of the string that matched, and will set C<$'> to the part
+to the part of the string that matched, and will set C<'$'> to the part
 of the string after the match.
An example: $x = "the cat caught the mouse";
@@ -944,10 +942,10 @@ of the string after the match. An example: In the second match, C<$`> equals C<''> because the regexp matched at the first character position in the string and stopped; it never saw the
-second 'the'.
+second "the".
 If your code is to run on Perl versions earlier than
-5.20, it is worthwhile to note that using C<$`> and C<$'>
+5.20, it is worthwhile to note that using C<$`> and C<'$'>
 slows down regexp matching quite a bit, while C<$&> slows it down to a lesser extent, because if they are used in one regexp in a program, they are generated for I<all> regexps in the program. So if raw
@@ -964,7 +962,7 @@ variables may be used. These are only set if the C</p> modifier is present. Consequently they do not penalize the rest of the program. In Perl 5.20, C<${^PREMATCH}>, C<${^MATCH}> and C<${^POSTMATCH}> are available whether the C</p> has been used or not (the modifier is ignored), and
-C<$`>, C<$'> and C<$&> do not cause any speed difference.
+C<$`>, C<'$'> and C<$&> do not cause any speed difference.
 =head2 Non-capturing groupings
@@ -1011,8 +1009,8 @@ less. We'd like to be able to match words or, more generally, strings of any length, without writing out tedious alternatives like C<\w\w\w\w|\w\w\w|\w\w|\w>.
-This is exactly the problem the I<quantifier> metacharacters C<?>,
-C<*>, C<+>, and C<{}> were created for. They allow us to delimit the
+This is exactly the problem the I<quantifier> metacharacters C<'?'>,
+C<'*'>, C<'+'>, and C<{}> were created for. They allow us to delimit the
 number of repeats for a portion of a regexp we consider to be a match. Quantifiers are put immediately after the character, character class, or grouping that we want to specify.
They have the following
@@ -1022,15 +1020,15 @@ meanings:
 =item *
-C<a?> means: match 'a' 1 or 0 times
+C<a?> means: match C<'a'> 1 or 0 times
 =item *
-C<a*> means: match 'a' 0 or more times, i.e., any number of times
+C<a*> means: match C<'a'> 0 or more times, I<i.e.>, any number of times
 =item *
-C<a+> means: match 'a' 1 or more times, i.e., at least once
+C<a+> means: match C<'a'> 1 or more times, I<i.e.>, at least once
 =item *
@@ -1070,9 +1068,9 @@ Here are some examples: For all of these quantifiers, Perl will try to match as much of the string as possible, while still allowing the regexp to succeed. Thus
-with C</a?.../>, Perl will first try to match the regexp with the C<a>
+with C</a?.../>, Perl will first try to match the regexp with the C<'a'>
 present; if that fails, Perl will try to match the regexp without the
-C<a> present. For the quantifier C<*>, we get the following:
+C<'a'> present. For the quantifier C<'*'>, we get the following:
 $x = "the cat in the hat"; $x =~ /^(.*)(cat)(.*)$/; # matches,
@@ -1119,7 +1117,7 @@ that allows a match for the whole regexp will be the one used.
 =item *
-Principle 2: The maximal matching quantifiers C<?>, C<*>, C<+> and
+Principle 2: The maximal matching quantifiers C<'?'>, C<'*'>, C<'+'> and
 C<{n,m}> will in general match as much of the string as possible while still allowing the whole regexp to match.
@@ -1149,8 +1147,8 @@ Here is an example of these principles in action: # $3 = 'l' This regexp matches at the earliest string position, C<'T'>. One
-might think that C<e>, being leftmost in the alternation, would be
-matched, but C<r> produces the longest string in the first quantifier.
+might think that C<'e'>, being leftmost in the alternation, would be
+matched, but C<'r'> produces the longest string in the first quantifier.
 $x =~ /(m{1,2})(.*)$/; # matches, # $1 = 'mm'
@@ -1175,7 +1173,7 @@ C<'m'> for the second quantifier C<m{1,2}>.
Here, C<.?> eats its maximal one character at the earliest possible position in the string, C<'a'> in C<programming>, leaving C<m{1,2}>
-the opportunity to match both C<m>'s. Finally,
+the opportunity to match both C<'m'>'s. Finally,
 "aXXXb" =~ /(X*)/; # matches with $1 = ''
@@ -1187,23 +1185,23 @@ Sometimes greed is not good. At times, we would like quantifiers to match a I<minimal> piece of string, rather than a maximal piece. For this purpose, Larry Wall created the I<minimal match> or I<non-greedy> quantifiers C<??>, C<*?>, C<+?>, and C<{}?>. These are
-the usual quantifiers with a C<?> appended to them. They have the
+the usual quantifiers with a C<'?'> appended to them. They have the
 following meanings:
 =over 4
 =item *
-C<a??> means: match 'a' 0 or 1 times. Try 0 first, then 1.
+C<a??> means: match C<'a'> 0 or 1 times. Try 0 first, then 1.
 =item *
-C<a*?> means: match 'a' 0 or more times, i.e., any number of times,
+C<a*?> means: match C<'a'> 0 or more times, I<i.e.>, any number of times,
 but as few times as possible
 =item *
-C<a+?> means: match 'a' 1 or more times, i.e., at least once, but
+C<a+?> means: match C<'a'> 1 or more times, I<i.e.>, at least once, but
 as few times as possible
 =item *
@@ -1232,9 +1230,9 @@ Let's look at the example above, but with minimal quantifiers: # $2 = 'e' # $3 = ' programming republic of Perl'
-The minimal string that will allow both the start of the string C<^>
+The minimal string that will allow both the start of the string C<'^'>
 and the alternation to match is C<Th>, with the alternation C<e|r>
-matching C<e>. The second quantifier C<.*> is free to gobble up the
+matching C<'e'>. The second quantifier C<.*> is free to gobble up the
 rest of the string. $x =~ /(m{1,2}?)(.*?)$/; # matches,
@@ -1245,7 +1243,7 @@ The first string position that this regexp can match is at the first C<'m'> in C<programming>. At this position, the minimal C<m{1,2}?> matches just one C<'m'>.
Although the second quantifier C<.*?> would prefer to match no characters, it is constrained by the end-of-string
-anchor C<$> to match the rest of the string.
+anchor C<'$'> to match the rest of the string.
 $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches, # $1 = 'The progra'
@@ -1253,12 +1251,12 @@ anchor C<$> to match the rest of the string. # $3 = 'ming republic of Perl' In this regexp, you might expect the first minimal quantifier C<.*?>
-to match the empty string, because it is not constrained by a C<^>
+to match the empty string, because it is not constrained by a C<'^'>
 anchor to match the beginning of the word. Principle 0 applies here, however. Because it is possible for the whole regexp to match at the start of the string, it I<will> match at the start of the string. Thus
-the first quantifier has to match everything up to the first C<m>. The
-second minimal quantifier matches just one C<m> and the third
+the first quantifier has to match everything up to the first C<'m'>. The
+second minimal quantifier matches just one C<'m'> and the third
 quantifier matches the rest of the string. $x =~ /(.??)(m{1,2})(.*)$/; # matches,
@@ -1299,37 +1297,36 @@ backtracking. Here is a step-by-step analysis of the example
 =over 4
-=item Z<>0
-
-Start with the first letter in the string 't'.
+=item Z<>0. Start with the first letter in the string C<'t'>.
-=item Z<>1
+E<nbsp>
-The first quantifier '.*' starts out by matching the whole
-string 'the cat in the hat'.
+=item Z<>1. The first quantifier C<'.*'> starts out by matching the whole
+string "C<the cat in the hat>".
-=item Z<>2
+E<nbsp>
-'a' in the regexp element 'at' doesn't match the end of the
-string. Backtrack one character.
+=item Z<>2. C<'a'> in the regexp element C<'at'> doesn't match the end
+of the string. Backtrack one character.
-=item Z<>3
+E<nbsp>
-'a' in the regexp element 'at' still doesn't match the last
-letter of the string 't', so backtrack one more character.
+=item Z<>3. C<'a'> in the regexp element C<'at'> still doesn't match
+the last letter of the string C<'t'>, so backtrack one more character.
-=item Z<>4
+E<nbsp>
-Now we can match the 'a' and the 't'.
+=item Z<>4. Now we can match the C<'a'> and the C<'t'>.
-=item Z<>5
+E<nbsp>
-Move on to the third element '.*'. Since we are at the end of
-the string and '.*' can match 0 times, assign it the empty string.
+=item Z<>5. Move on to the third element C<'.*'>. Since we are at the
+end of the string and C<'.*'> can match 0 times, assign it the empty
+string.
-=item Z<>6
+E<nbsp>
-We are done!
+=item Z<>6. We are done!
 =back
@@ -1341,14 +1338,14 @@ string. A typical structure that blows up in your face is of the form /(a|b+)*/; The problem is the nested indeterminate quantifiers. There are many
-different ways of partitioning a string of length n between the C<+>
-and C<*>: one repetition with C<b+> of length n, two repetitions with
+different ways of partitioning a string of length n between the C<'+'>
+and C<'*'>: one repetition with C<b+> of length n, two repetitions with
 the first C<b+> length k and the second with length n-k, m repetitions
-whose bits add up to length n, etc. In fact there are an exponential
+whose bits add up to length n, I<etc>. In fact there are an exponential
 number of ways to partition a string as a function of its length. A regexp may get lucky and match early in the process, but if there is no match, Perl will try I<every> possibility before giving up. So be
-careful with nested C<*>'s, C<{n,m}>'s, and C<+>'s. The book
+careful with nested C<'*'>'s, C<{n,m}>'s, and C<'+'>'s. The book
 I<Mastering Regular Expressions> by Jeffrey Friedl gives a wonderful discussion of this and other efficiency issues.
@@ -1363,15 +1360,15 @@ the simple pattern Whenever this is applied to a string which doesn't quite meet the pattern's expectations such as S<C<"abc ">> or S<C<"abc def ">>,
-the regex engine will backtrack, approximately once for each character
+the regexp engine will backtrack, approximately once for each character
 in the string. But we know that there is no way around taking I<all> of the initial word characters to match the first repetition, that I<all> spaces must be eaten by the middle part, and the same goes for the second word. With the introduction of the I<possessive quantifiers> in Perl 5.10, we
-have a way of instructing the regex engine not to backtrack, with the
-usual quantifiers with a C<+> appended to them. This makes them greedy as
+have a way of instructing the regexp engine not to backtrack, with the
+usual quantifiers with a C<'+'> appended to them. This makes them greedy as
 well as stingy; once they succeed they won't give anything back to permit another solution. They have the following meanings:
@@ -1459,12 +1456,12 @@ Now consider floating point numbers with exponents. The key observation here is that I<both> integers and numbers with decimal points are allowed in front of an exponent. Then exponents, like the overall sign, are independent of whether we are matching numbers with
-or without decimal points, and can be 'decoupled' from the
+or without decimal points, and can be "decoupled" from the
 mantissa. The overall form of the regexp now becomes clear: /^(optional sign)(integer | f.p. mantissa)(optional exponent)$/;
-The exponent is an C<e> or C<E>, followed by an integer. So the
+The exponent is an C<'e'> or C<'E'>, followed by an integer. So the
 exponent regexp is /[eE][+-]?\d+/; # exponent
@@ -1474,10 +1471,10 @@ Putting all the parts together, we get a regexp that matches numbers: /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/; # Ta da! Long regexps like this may impress your friends, but can be hard to
-decipher. In complex situations like this, the C<//x> modifier for a
+decipher. In complex situations like this, the C</x> modifier for a
 match is invaluable. It allows one to put nearly arbitrary whitespace and comments into a regexp without affecting their meaning. Using it,
-we can rewrite our 'extended' regexp in the more pleasing form
+we can rewrite our "extended" regexp in the more pleasing form
 /^ [+-]? # first, match an optional sign
@@ -1586,8 +1583,8 @@ We have already introduced the matching operator in its default C</regexp/> and arbitrary delimiter C<m!regexp!> forms. We have used the binding operator C<=~> and its negation C<!~> to test for string matches. Associated with the matching operator, we have discussed the
-single line C<//s>, multi-line C<//m>, case-insensitive C<//i> and
-extended C<//x> modifiers. There are a few more things you might
+single line C</s>, multi-line C</m>, case-insensitive C</i> and
+extended C</x> modifiers. There are a few more things you might
 want to know about matching operators.
 =head3 Prohibiting substitution
@@ -1602,7 +1599,7 @@ special delimiter C<m''>: } Similar to strings, C<m''> acts like apostrophes on a regexp; all other
-C<m> delimiters act like quotes. If the regexp evaluates to the empty string,
+C<'m'> delimiters act like quotes. If the regexp evaluates to the empty string,
 the regexp in the I<last successful match> is used instead. So we have "dog" =~ /d/; # 'd' matches
@@ -1612,15 +1609,15 @@ the regexp in the I<last successful match> is used instead. So we have
 =head3 Global matching
 The final two modifiers we will discuss here,
-C<//g> and C<//c>, concern multiple matches.
-The modifier C<//g> stands for global matching and allows the
+C</g> and C</c>, concern multiple matches.
+The modifier C</g> stands for global matching and allows the
 matching operator to match within a string as many times as possible.
In scalar context, successive invocations against a string will have
-C<//g> jump from match to match, keeping track of position in the
+C</g> jump from match to match, keeping track of position in the
 string as it goes along. You can get or set the position with the C<pos()> function.
-The use of C<//g> is shown in the following example. Suppose we have
+The use of C</g> is shown in the following example. Suppose we have
 a string that consists of words separated by spaces. If we know how many words there are in advance, we could extract the words using groupings:
@@ -1632,7 +1629,7 @@ groupings: # $3 = 'house' But what if we had an indeterminate number of words? This is the sort
-of task C<//g> was made for. To extract all words, form the simple
+of task C</g> was made for. To extract all words, form the simple
 regexp C<(\w+)> and loop over all matches with C</(\w+)/g>: while ($x =~ /(\w+)/g) {
@@ -1647,12 +1644,12 @@ prints A failed match or changing the target string resets the position. If you don't want the position reset after failure to match, add the
-C<//c>, as in C</regexp/gc>. The current position in the string is
+C</c>, as in C</regexp/gc>. The current position in the string is
 associated with the string, not the regexp. This means that different strings have different positions and their respective positions can be set or read independently.
-In list context, C<//g> returns a list of matched groupings, or if
+In list context, C</g> returns a list of matched groupings, or if
 there are no groupings, a list of matches to the whole regexp. So if

**** PATCH TRUNCATED AT 2000 LINES -- 527 NOT SHOWN ****

--
Perl5 Master Repository
