RFC 331 (v2) Consolidate the $1 and C<\1> notations
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs [EMAIL PROTECTED]
  Date: 28 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 331
  Version: 2
  Status: Frozen

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a regex. It is possible to consolidate them without losing any functionality and, in the process, we gain intuitiveness.

=head1 CHANGES

v1-v2: A major rewrite:

=over 4

=item * Reformatted the argument into "The Problem" and "The Solution" sections

=item * Added "Some Examples" section

=item * Added "Why do this?" section

=item * Added "P526 migration" section

=item * Proposed the @/ variable

=item * Various trivial edits and typo fixes

=back

=head1 DESCRIPTION

Note: For convenience, I am going to talk about C<\1> and $1 in this RFC. In actuality, these notations extend indefinitely: C<\1..\n> and C<$1..$n>. Take it as read that anything which applies to $1 also applies to C<$2, $3>, etc.

=head2 The Problem

In current versions of Perl, C<\1> and C<$1> mean different things. Specifically, C<\1> means "whatever was matched by the first set of grouping parens I<in this regex match>." $1 means "whatever was matched by the first set of grouping parens I<in the previously-run regex match>." For example:

=over 4

=item * C</(foo)_$1_bar/>

=item * C</(foo)_\1_bar/>

=back

The second will match 'foo_foo_bar', while the first will match 'foo_[SOMETHING]_bar' where [SOMETHING] is whatever was captured in the B<previous> match...which could be a long, long way away, possibly even in some module that you didn't even realize you were including (because it was included by a module that was included by a module that was included by a...).

The primary reason for this distinction is s///, in which the left hand side is a pattern while the right hand side is a string (assuming no 'e' modifier). Therefore:

=over 4

=item * C<s/(foo)$1/$1bar/> # changes "foo???" to "foobar" where ??? is from the last match

=item * C<s/(foo)\1/$1bar/> # changes "foofoo" to "foobar"

=back

Note that, in the first example, the two $1s refer to different things, whereas in the second example, $1 and C<\1> refer to the same thing. This is counterintuitive and non-Perlish; Perl should be intuitive and DWIMish.

A separate, though less important, problem with the way backreferences are currently implemented is that it is difficult for a human to tell at a glance whether \10 means "escape character 10" or "backreference 10"...the only way to tell is to count the number of captured elements and see if there actually are ten of them, in which case \10 is a backreference and otherwise it is an escape character. In general, this isn't a problem because most patterns don't have ten sets of capturing parens.

=head2 The Solution

Ok, so the problem is that $1 and C<\1> are counterintuitive. How do we make them intuitive without losing any functionality?

First, let's get rid of the C<\1> form for backreferences.

Second, let's say that $n refers to the nth captured subelement of the pattern match which occurred in this B<statement>--note that this is distinct from "in this pattern match." That means that, in C<s/(foo)$1/$1bar/>, both $1s refer to the same thing (the string 'foo'), even though one of them occurred inside a pattern and one occurred inside a string. (See note [1] in the IMPLEMENTATION section.)

Third, let's create a new special variable, @/ (mnemonic: the / is the default delimiter for a pattern match; if the English module remains extant, then @/ could have the long name of @LAST_MATCH, but there are currently several threads concerning removal of the English module). Much like the current C<$1, $2>... variables, this array will only be created (and hence, the speed price will only be paid) if you access its members. The 0th element of @/ will contain the qr()d form of the last pattern match, while successive elements refer to the captured subelements.

Fourth, let's change when we update the variables which store the captures (the current C<$1, $2>, etc). @/ will only be updated when the entire statement which contains a pattern match has finished running (e.g., when the entire s/// is completed), rather than as soon as the pattern match is done (and therefore before the substitution happens).

=head2 Some Examples

=over 4

=item 1

If you did the following:

    "Bilbo Baggins" =~ /((\w+)\s+(\w+))/

Then @/ would contain the following:

    $/[0]  the compiled equivalent of /((\w+)\s+(\w+))/
    $/[1]  the string "Bilbo Baggins"
    $/[2]  the string "Bilbo"
    $/[3]  the string "Baggins"

Note that after the match, C<$/[1]>, C<$/[2]>, and C<$/[3]> contain exactly what C<$1>, C<$2>, and C<$3> would contain with present-day syntax. Furthermore, the compiled form of the match is available so if you want to repeat the match later (or insert it into a larger regex), you can
RFC 347 (v2) Remove long-deprecated $* (aka $MULTILINE_MATCHING)
=head1 TITLE

Remove long-deprecated $* (aka $MULTILINE_MATCHING)

=head1 VERSION

  Maintainer: Hugo van der Sanden [EMAIL PROTECTED]
  Date: 29 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 347
  Version: 2
  Status: Frozen

=head1 ABSTRACT

The magic $* variable (known in English as $MULTILINE_MATCHING) has been deprecated for years. It is time to kill it.

=head1 DESCRIPTION

In days of yore, you would set $* to 1 to achieve in all regexps the same as you can now achieve on a per-regexp basis with the /m flag. Nowadays, when most perl programmers have never heard of it, it is an accident waiting to happen and requires ugly additional cruft for the defensive programmer to avoid.

The particular danger of $* is its 'action at a distance' effect: as a global variable, its effect reaches into and out of scopes that we normally expect to protect us.

=head1 MIGRATION

The long deprecation cycle helps here. p52p6 should complain and die if it sees any attempt to set $* or $MULTILINE_MATCHING to a non-zero value, or any attempt to alias it other than in English. It should silently (or maybe with a warning) ignore any attempt to set it to a zero value, and silently (or maybe with a warning) replace any attempt to read it with a constant undef.

=head1 IMPLEMENTATION

This only simplifies the regexp engine, and should help fix some longstanding bugs in the scope of /m. There is a bit of work to do to extricate it, but nothing seriously difficult.

=head1 REFERENCES

perlvar manpage for discussion of $*
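As an illustration of the per-regexp /m flag that the DESCRIPTION above names as the replacement for $*, here is a minimal sketch in present-day Perl 5. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# With /m, ^ matches at the start of every line, for this regexp only.
my @with_m = $text =~ /^(\w+)/mg;

# Without /m, ^ matches only at the very start of the string.
my @without_m = $text =~ /^(\w+)/g;

print "with /m:    @with_m\n";
print "without /m: @without_m\n";
```

Because the flag is attached to the individual regexp, no other pattern in the program is affected, which is exactly the scoping that a global $* cannot give.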
RFC 360 (v1) Allow multiply matched groups in regexes to return a listref of all matches
=head1 TITLE

Allow multiply matched groups in regexes to return a listref of all matches

=head1 VERSION

  Maintainer: Kevin Walker [EMAIL PROTECTED]
  Date: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 360
  Version: 1
  Status: Developing

=head1 DESCRIPTION

Since the October 1 RFC deadline is nigh, this will be pretty informal.

Suppose you want to parse text which looks like:

    name: John Abajace
    children: Tom, Dick, Harry
    favorite colors: red, green, blue

    name: I. J. Reilly
    children: Jane, Gertrude
    favorite colors: black, white

    ...

Currently, this takes two passes:

    while ($text =~ /name:\s*(.*?)\n\s*
                     children:\s*(.*?)\n\s*
                     favorite\ colors:\s*(.*?)\n/sigx) {
        # now second pass for $2 ( = "Tom, Dick, Harry") and $3, yielding
        # list of children and favorite colors
    }

If we introduce a new construction, (?@ ... ), which means "spit out a list ref of all matches, not just the last match", then this could be done in one pass:

    while ($text =~ /name:\s*(.*?)\n\s*
                     children:\s*(?:(?@\S+)[, ]*)*\n\s*
                     favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {
        # now we have:
        #  $1 = "John Abajace";
        #  $2 = ["Tom", "Dick", "Harry"]
        #  $3 = ["red", "green", "blue"]
    }

Although the above example is contrived, I have very often felt the need for this feature in real-world projects.

=head1 IMPLEMENTATION

Unknown.

=head1 REFERENCES

None.
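For comparison, the two-pass approach described in the DESCRIPTION above can be written out in full in present-day Perl 5 roughly as follows. This is only a sketch: the @records structure and the use of split for the second pass are assumptions made for the example, not part of the RFC.

```perl
use strict;
use warnings;

my $text = <<'END';
name: John Abajace
children: Tom, Dick, Harry
favorite colors: red, green, blue

name: I. J. Reilly
children: Jane, Gertrude
favorite colors: black, white
END

my @records;
while ($text =~ /name:\s*(.*?)\n\s*
                 children:\s*(.*?)\n\s*
                 favorite\ colors:\s*(.*?)\n/sigx) {
    # First pass: copy the captures before running any other match.
    my ($name, $children, $colors) = ($1, $2, $3);

    # Second pass: break the comma-separated fields into lists.
    push @records, {
        name     => $name,
        children => [ split /\s*,\s*/, $children ],
        colors   => [ split /\s*,\s*/, $colors ],
    };
}

for my $r (@records) {
    print "$r->{name}: @{ $r->{children} } / @{ $r->{colors} }\n";
}
```

The proposed (?@ ... ) construction would fold the split step into the pattern itself, so that $2 and $3 arrive as list references.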
RFC 112 (v4) Assignment within a regex
=head1 TITLE

Assignment within a regex

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 16 Aug 2000
  Last Modified: 1 Oct 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 112
  Version: 4
  Status: Frozen

=head1 ABSTRACT

Provide a simple way of naming and picking out information from a regex without having to count the brackets.

=head1 DESCRIPTION

If a regex is complex, counting the bracketed sub-expressions to find the ones you wish to pick out can be messy. It is also prone to maintainability problems if and when you wish to add to the expression. Using (?:) to suppress capturing by brackets helps, but it still gets "complex". I would sometimes rather just pick out the bits I want within the regex itself.

Suggested syntax: (?$foo= ... ) would assign the string that is matched by the pattern ... to $foo when the pattern matches. These assignments would be made left to right after the match has succeeded but before processing a replacement or other results (or prior to some (?{...}) or (??{...}) code). There may be whitespace between the $foo and the "=".

Potentially the $foo could be any scalar LHS, as in (?$foo{$bar}= ... ); likewise the '=' could be any assignment operator.

The camel and the docs include this example:

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

This then becomes:

    /Time: (?$hours=..):(?$minutes=..):(?$seconds=..)/

This is more maintainable than counting the brackets and easier to understand for a complex regex. And one does not have to worry about the scope of $1 etc.

=head2 When does the assignment actually happen?

In general all assignments should wait to the very end, and then be made together. However, before code callouts (?{...}) and friends, the named assignments that are currently defined should be made so that the code can refer to them by name. It may be appropriate for any assignments made before a code callout to be localised so they can be unrolled should the expression finally fail.

=head2 Named Backrefs

The first versions of this RFC did not allow for backrefs. I now think this was a shortcoming. It can be done with (??{quotemeta $foo}), but I find this clumsy; a better way of using a named backref might be (?\$foo).

=head2 Scoping

The question of scoping for these assignments has been raised, but I don't currently have a feel for the "best" way to handle this. Input welcome.

Hugo: I think it should be defined to act the same as in (??{...}), whenever we get around to defining that.

=head2 Brackets

Using this method for capturing wanted content, it might be desirable to stop ordinary brackets capturing, and so avoid needing to use (?:...). I therefore suggest, as an enhancement to regexes, a /b (bracket?) flag under which ordinary brackets just group, without capture - in effect they all behave as (?:...).

=head1 CHANGES

V3 - added bit about backrefs, and brackets.

V4 - Clarified a few things and froze

=head1 IMPLEMENTATION

Currently all $scalars in regexes are expanded before the main regex compiler gets to analyse the syntax. This problem also affects several other RFCs (166 for example). The expansion of variables in regexes needs, for these (and other RFCs), to be driven from within the regex compiler so that the regex can expand as and where appropriate. Changing this should not affect any existing behaviour.

=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the noise...

RFC 166

Perlstorm #0040
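For readers who want to try the idea of naming captures rather than counting brackets, here is a hedged sketch using the named-capture syntax that Perl 5.10 and later provide. Note that this is B<not> the (?$foo= ... ) syntax proposed above: it stores the matched text in the %+ hash rather than assigning to a variable, and the named backreference is written \k<name> rather than (?\$foo).

```perl
use strict;
use warnings;
use 5.010;    # named captures need Perl 5.10 or later

my $line = "Time: 12:34:56";
my ($hours, $minutes, $seconds);

# (?<name>...) captures into $+{name}; no bracket counting is needed.
if ($line =~ /Time: (?<hours>..):(?<minutes>..):(?<seconds>..)/) {
    ($hours, $minutes, $seconds) = ($+{hours}, $+{minutes}, $+{seconds});
}
print "$hours h, $minutes m, $seconds s\n";

# \k<name> is a named backreference to an earlier named group.
my $repeated = "abcabc" =~ /^(?<word>\w+)\k<word>$/ ? $+{word} : undef;
print "repeated word: $repeated\n";
```

Adding another group to the pattern does not renumber anything, which is the maintainability point made in the DESCRIPTION.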
RFC 166 (v4) Alternative lists and quoting of things
=head1 TITLE

Alternative lists and quoting of things

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Last Modified: 1 Oct 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 166
  Version: 4
  Status: Frozen

=head1 ABSTRACT

Expand Alternate Lists from Arrays and Quote the contents of things inside regexes.

=head1 DESCRIPTION

These are a couple of constructs to make it easy to build up regexes from other things.

=head2 Alternative Lists from arrays

The basic idea is to expand an array as a list of alternatives. There are two possible syntaxes: (?@foo) and just plain @foo. @foo might just have existing uses (just), therefore I prefer the (?@foo) syntax.

(?@foo) is just syntactic sugar for (?:(??{ join('|',@foo) })), a bracketed list of alternatives. If it is built at regex compile time, maybe it is @{[ join('|',@foo) ]}.

=head2 Quoting the contents of things

If a regex uses $foo or @bar there are problems if the contents of the variables include special characters. What is needed is a way of \Quoting the content of scalars $foo or arrays (?@foo).

Suggested syntax:

(?Q$foo) Quotes the contents of the scalar $foo - equivalent to (??{ quotemeta $foo }).

(?Q@foo) Quotes each item in a list (as above); this is equivalent to (?:(??{ join ('|', map quotemeta, @foo)})).

In this syntax the Q is used as it represents a more intelligent \Quot\E.

It is recognised that (?Q$foo) is equivalent to \Q$foo\E, but that does not make it a bad idea to add it at the same time as (?Q@foo), for reasons of symmetry and perl DWIM.

It is recognised that (?Q might be reserved for control of a hypothetical Q flag, but this does feel "appropriate" as it is about \Quoting.

=head2 Comments

Hugo: (?@foo) and (?Q@foo) are both things I've wanted before now. I'm not sure if this is the right syntax, particularly if RFC 112 is adopted: it would be confusing to have (?@foo) to have so different a meaning from (?$foo=...), and even more so if the latter is ever extended to allow (?@foo=...). I see no reason that implementation should cause any problems since this is purely a regexp-compile time issue.

Me: I can't see any reasonable meaning for (?@foo=...); this seems an appropriate syntax, but I am open to others being suggested.

=head1 CHANGES

V1 of this RFC had three ideas; one has been dropped, the other is now part of RFC 198.

V2 Expands the list expansion and quoting with quoting of scalars and implementation issues.

V3 In error, what should have been 165 V2 was issued as 166 V2, so this is V3 with a change in (?Q$foo). This is in a pre-frozen state.

V4 Added a couple of minor changes from Hugo and frozen.

=head1 MIGRATION

(?@foo) and (?Q...) are additions without any compatibility issues.

The option of just @foo for list expansion might represent a small problem if people already use the construct.

=head1 IMPLEMENTATION

Both of these changes are regex compile-time issues.

Generating lists from arrays almost works by localising $" as '|' for the regex and just using @foo.

MJD has demonstrated implementing (?@foo) as (?\@foo) by means of an overload of regexes; this slight change was necessary because of the expansion of @foo - see below.

Both of these changes are currently affected by the expansion of variables in the regex before the regex compiler gets to work on the regex. This problem also affects several other RFCs. The expansion of variables in regexes needs, for these (and other RFCs), to be driven from within the regex compiler so that the regex can expand as and where appropriate. Changing this should not affect any existing behaviour.

=head1 REFERENCES

RFC 198: Boolean Regexes
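The "syntactic sugar" equivalences given in the DESCRIPTION above can be approximated in present-day Perl 5 by building the alternation string before the regex is compiled. The following is a minimal sketch under that assumption; @foo and the other variable names are illustrative only, and the proposed (?@foo) and (?Q@foo) constructs themselves are not used.

```perl
use strict;
use warnings;

my @foo = ('a.b', 'c+d', 'x|y');

# Rough stand-in for the proposed (?Q@foo): quote each item, then join.
my $quoted = join '|', map { quotemeta($_) } @foo;
my $re_q   = qr/(?:$quoted)/;

# Rough stand-in for the proposed (?@foo): items are used as patterns.
my $raw    = join '|', @foo;
my $re_raw = qr/(?:$raw)/;

for my $candidate ('c+d', 'aXb', 'x') {
    printf "%-4s quoted=%-3s raw=%s\n", $candidate,
        ($candidate =~ /\A$re_q\z/   ? 'yes' : 'no'),
        ($candidate =~ /\A$re_raw\z/ ? 'yes' : 'no');
}
```

The unquoted form treats '.', '+' and '|' inside the array elements as regex metacharacters, which is the problem the quoting proposal is meant to address.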
RFC 308 (v1) Ban Perl hooks into regexes
=head1 TITLE

Ban Perl hooks into regexes

=head1 VERSION

  Maintainer: Simon Cozens [EMAIL PROTECTED]
  Date: 25 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 308
  Version: 1
  Status: Developing

=head1 ABSTRACT

Remove C<?{ code }>, C<??{ code }> and friends.

=head1 DESCRIPTION

The regular expression engine may well be rewritten from scratch or borrowed from somewhere else. One of the scarier things we've seen recently is that Perl's engine casts back its Kraken tentacles into Perl and executes Perl code. This is spooky, tangled, and incestuous. (Although admittedly fun.)

It would be preferable to keep the regular expression engine as self-contained as possible, if nothing else to enable it to be used either outside Perl or inside standalone translated Perl programs without a Perl runtime.

To do this, we'll have to remove the bits of the engine that call Perl code. In short: C<?{ code }> and C<??{ code }> must die.

=head1 IMPLEMENTATION

It's more of an unimplementation really.

=head1 REFERENCES

None.
RFC 317 (v1) Access to optimisation information for regular expressions
=head1 TITLE

Access to optimisation information for regular expressions

=head1 VERSION

  Maintainer: Hugo van der Sanden ([EMAIL PROTECTED])
  Date: 25 September 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 317
  Version: 1
  Status: Developing

=head1 ABSTRACT

Currently you can see optimisation information for a regexp only by running with -Dr in a debugging perl and looking at STDERR. There should be an interface that allows us to read this information programmatically and possibly to alter it.

=head1 DESCRIPTION

At its core, the regular expression matcher knows how to check whether a pattern matches a string starting at a particular location. When the regular expression is compiled, perl may also look for optimisation information that can be used to rule out some or all of the possible starting locations in advance.

Currently you can find out about the optimisation information captured for a particular regexp only in a perl built with DEBUGGING, by turning on -Dr:

  % perl -Dr -e 'qr{test.*pattern}'
  Compiling REx `test.*pattern'
  size 8 first at 1
  rarest char p at 0
  rarest char s at 2
     1: EXACT <test>(3)
     3: STAR(5)
     4:   REG_ANY(0)
     5: EXACT <pattern>(8)
     8: END(0)
  anchored `test' at 0 floating `pattern' at 4..2147483647 (checking floating) minlen 11
  Omitting $` $& $' support.

  EXECUTING...

  Freeing REx: `test.*pattern'
  %

For some purposes it would help to be able to get at this information programmatically: the test suite could take advantage of this (to test that optimisations occur as expected), and it could also be useful for enhanced development tools, such as a graphical regexp debugger.

Additionally there are times that the programmer is able to supply optimisation that the regexp engine cannot discover for itself. While we could consider making it possible to modify these values, it is important to remember that these are only hints: the regexp engine is free to ignore them. So there is a danger that people will misuse writable optimisation information to move part of the logic out of the regexp, and then blame us when it breaks.

Suggested example usage:

  % perl -wl
  use re;
  $a = qr{test.*pattern};
  print join ':', $a->fixed_string, $a->floating_string, $a->minlen;
  __END__
  test:pattern:11
  %

.. but perhaps a single new method returning a hashref would be cleaner and more extensible:

  $opt = $a->optimisation;
  print join ':', @$opt{qw/ fixed_string floating_string minlen /};

=head1 IMPLEMENTATION

Straightforward: add interface functions within the perl core to give access to read and/or write the optimisation values; add methods in re.pm that use XS code to reach the internal functions.

=head1 REFERENCES

Prompted by discussion of RFC 72:

  RFC 72: Variable-length lookbehind: the regexp engine should also go backward.
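The methods in the suggested usage above (fixed_string, floating_string, minlen, optimisation) are proposals and, as far as I know, do not exist under those names. Later versions of Perl 5 do expose a small, read-only part of this information through the regmust function in the re module, which I believe has been available since Perl 5.10. A hedged sketch of reading the anchored and floating strings that way:

```perl
use strict;
use warnings;
use re qw(regmust);

my $re = qr{test.*pattern};

# regmust returns the longest anchored and floating constant strings
# recorded by the optimiser; either one may be undef.
my ($anchored, $floating) = regmust($re);

printf "anchored=%s floating=%s\n",
    map { defined $_ ? "'$_'" : 'undef' } $anchored, $floating;
```

This covers only the read side and only two of the values discussed; the minimum length and any writable hints are not available through this call.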
RFC 276 (v1) Localising Paren Counts in qr()s.
=head1 TITLE

Localising Paren Counts in qr()s.

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 24 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 276
  Version: 1
  Status: Developing

=head1 ABSTRACT

The Paren Counts and backreferences should be localised in each qr(), to prevent surprises when qr()s are used in combination.

=head1 DESCRIPTION

TomC's Perlstorm #0040 has:

  Figure out way to do /$e1 $e2/ safely, where $e1 might have '(foo) \1'
  in it. and $e2 might have '(bar) \1' in it. Those won't work.

=head2 DISCUSSION

Me: If e1 and e2 are qr// type things the answer might be to localise the backref numbers in each qr// expression.

Use of assignment in a regex and named backrefs (RFC 112) would make this a lot safer.

Hugo: I think it is reasonable to ask whether the current handling of qr{} subpatterns is correct:

  perl -wle '$a=qr/(a)\1/; $b=qr/(b).*\1/; /$a($b)/g and print join ":", $1, pos for "aabbac"'
  a:5

I'm tempted to suggest it isn't; that the paren count should be local to each qr{}, so that the above prints 'bb:4'. I think that most people currently construct their qr{} patterns as if they are going to be handled in isolation, without regard to the context in which they are embedded - why else do they override the embedder's flags if not to achieve that?

The problem then becomes: do we provide a mechanism to access the nested backreferences outside of the qr{} in which they were referenced, and if so what syntax do we offer to achieve that? I don't have an answer to the latter, which tempts me to answer 'no' to the former for all the wrong reasons. I suspect (and suggest) that complication is the only reason we don't currently have the behaviour I suggest the rest of the semantics warrant - that backreferences are localised within a qr().

I lie: the other reason qr{} currently doesn't behave like that is that when we interpolate a compiled regexp into a context that requires it be recompiled, we currently ignore the compiled form and act only on the original string. Perhaps this is also an insufficiently intelligent thing to do.

MJD: Interpolated qr() items shouldn't be recompiled anyway. They should be treated as subroutine calls. Unfortunately, this requires a reentrant regex engine, which Perl doesn't have. But I think it's the right way to go, and it would solve the backreference problem, as well as many other related problems.

Me: You can access the nested backreferences outside of the qr{} in which they were referenced by use of the named backref; see RFC 112.

=head2 AGREEMENTS

The paren count in each qr() is localised to each qr().

There is no way to access the nested backreferences outside of the qr() by number; they may be accessed by name (see RFC 112).

The regex engine must be made re-entrant.

The regex compiler should not need to recompile qr()s when used as part of another regex.

=head1 IMPLEMENTATION

The regex engine must be made re-entrant.

The expansion of variables in regexes must be driven by the regex compiler (same problem as for RFCs 112, 166 ...)

=head1 REFERENCES

Perlstorm #0040 from TomC.

RFC 112: Assignment within a regex
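Hugo's one-liner from the DISCUSSION above can be expanded into a small script that shows the current Perl 5 behaviour this RFC wants to change. The lexical names $e1 and $e2 follow the Perlstorm wording; everything else is illustrative.

```perl
use strict;
use warnings;

my $e1 = qr/(a)\1/;
my $e2 = qr/(b).*\1/;

my $str = "aabbac";
my ($first, $end);

if ($str =~ /$e1($e2)/g) {
    # Groups are numbered across the whole combined pattern, so the \1
    # written inside $e2 refers to the (a) captured by $e1, not to (b).
    ($first, $end) = ($1, pos $str);
}

print "$first:$end\n";    # prints "a:5", as reported in the discussion
```

Under the localised paren counts proposed here, the \1 inside $e2 would instead refer to its own (b) group.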
RFC 112 (v3) Assignment within a regex
=head1 TITLE

Assignment within a regex

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 16 Aug 2000
  Last Modified: 23 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 112
  Version: 3
  Status: Developing

=head1 ABSTRACT

Provide a simple way of naming and picking out information from a regex without having to count the brackets.

=head1 DESCRIPTION

If a regex is complex, counting the bracketed sub-expressions to find the ones you wish to pick out can be messy. It is also prone to maintainability problems if and when you wish to add to the expression. Using (?:) to suppress capturing by brackets helps, but it still gets "complex". I would sometimes rather just pick out the bits I want within the regex itself.

Suggested syntax: (?$foo= ... ) would assign the string that is matched by the pattern ... to $foo when the pattern matches. These assignments would be made left to right after the match has succeeded but before processing a replacement or other results (or prior to some (?{...}) or (??{...}) code). There may be whitespace between the $foo and the "=".

Potentially the $foo could be any scalar LHS, as in (?$foo{$bar}= ... ); likewise the '=' could be any assignment operator.

The camel and the docs include this example:

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

This then becomes:

    /Time: (?$hours=..):(?$minutes=..):(?$seconds=..)/

This is more maintainable than counting the brackets and easier to understand for a complex regex. And one does not have to worry about the scope of $1 etc.

=head2 Named Backrefs

The first versions of this RFC did not allow for backrefs. I now think this was a shortcoming. It can be done with (??{quotemeta $foo}), but I find this clumsy; a better way of using a named backref might be (?\$foo).

=head2 Scoping

The question of scoping for these assignments has been raised, but I don't currently have a feel for the "best" way to handle this. Input welcome.

=head2 Brackets

Using this method for capturing wanted content, it might be desirable to stop ordinary brackets capturing, and so avoid needing to use (?:...). I therefore suggest, as an enhancement to regexes, a /b (bracket?) flag under which ordinary brackets just group, without capture - in effect they all behave as (?:...).

=head1 CHANGES

V3 - added bit about backrefs, and brackets.

=head1 IMPLEMENTATION

Currently all $scalars in regexes are expanded before the main regex compiler gets to analyse the syntax. This problem also affects several other RFCs (166 for example). The expansion of variables in regexes needs, for these (and other RFCs), to be driven from within the regex compiler so that the regex can expand as and where appropriate. Changing this should not affect any existing behaviour.

=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the noise...

RFC 166: Alternative lists and quoting of things

Perlstorm #0040
RFC 158 (v3) Regular Expression Special Variables
=head1 TITLE

Regular Expression Special Variables

=head1 VERSION

  Maintainer: Uri Guttman [EMAIL PROTECTED]
  Date: 25 Aug 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 158
  Version: 3
  Status: Frozen
  Frozen since: v2

=head1 ABSTRACT

This RFC addresses ways to make the regex special variables $`, $& and $' not be such pariahs like they are now.

=head1 CHANGES

I dropped the local scoping of $`, $& and $' as they are already localized now.

=head1 DESCRIPTION

$`, $& and $' are useful variables which are never used by any experienced Perl hacker since they have well known problems with efficiency. Since they are globals, any use of them anywhere in your code forces all regexes to copy their data for potential later referencing by one of them. I will describe some ideas to make this issue go away and return these variables back into the toolbox where they belong.

=head1 IMPLEMENTATION

The copy all regex data problem is solved by a new modifier k (for keep). This tells the regex to do the copy so the 3 vars will work properly.

So you would use code like this:

    $str = 'prefoopost' ;

    if ( $str =~ /foo/k ) {
        print "pre is [$`]\n" ;
        print "match is [$&]\n" ;
        print "post is [$']\n" ;
    }

=head1 IMPACT

None

=head1 UNKNOWNS

None

=head1 REFERENCES

None.
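The /k modifier above is a proposal and is not available in Perl 5. Perl 5.10 and later do offer a similar per-regex opt-in under a different spelling: the /p modifier together with ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}. A minimal sketch of the example above rewritten that way:

```perl
use strict;
use warnings;
use 5.010;    # /p and the ${^...} match variables need Perl 5.10 or later

my $str = 'prefoopost';
my ($pre, $match, $post);

# /p asks this one regex to preserve the matched string's pieces.
if ($str =~ /foo/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    print "pre is [$pre]\n";
    print "match is [$match]\n";
    print "post is [$post]\n";
}
```

As with the proposed /k, only regexes that carry the modifier are asked to keep the data, so code elsewhere in the program does not pay for it.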
RFC 165 (v3) Allow Variables in tr///
=head1 TITLE

Allow Variables in tr///

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 165
  Version: 3
  Status: Frozen

=head1 ABSTRACT

Allow variables in a tr///. At present the only way to do a tr/$foo/$bar/ is to wrap it up in an eval. I don't like using evals for this sort of thing.

=head1 DESCRIPTION

Suggested syntax: tr/$foo/$bar/e

With a /e, tr will expand both the LHS and RHS of the translate function. Either or both could be variables. I am suggesting /e as it is sort of like /e for s///e.

These words from MJD:

The way tr/// works is that a 256-byte table is constructed at compile time that says, for each input character, what output character is produced. Then when it's time to apply the tr/// to a string, Perl iterates over the string one character at a time, looks up each character in the table, and replaces it with the corresponding character from the table.

With tr///e, you would have to generate the table at run-time.

This would suggest that you want the same sorts of optimizations that Perl applies when it encounters a regex that contains variables:

1. Perl should examine the strings to see if they have changed since the last time it executed the code

2. It should rebuild the tables only if the strings changed

3. There should be a /o modifier that promises Perl that the variables will never change.

The implementation could be analogous to the way m/.../o is implemented, with two separate op nodes: one that tells Perl 'construct the tables' and one that tells Perl 'transform the string'. The 'construct the tables' node would remove itself from the op tree if it saw that the tr//o modifier was used.

Hugo wrote:

Definitely. Should be easy to implement. There is a potential for confusion, since it makes the tr/ lists look even more like m/ and s/ patterns, but I think it can only be less confusion than the current state of affairs.

It is tempting to make it the default, and have a flag to turn it off (or just backwhack the dagnabbed dollar), and auto-translation of existing scripts would be pretty easy, except that it would presumably fail exactly where people are using the current workaround, by way of eval.

Comments by me:

Therefore tr///o might be a good idea as well.

If Hugo's idea of making this the normal behaviour is adopted, the problem of existing evals is avoided by p52p6 changing the eval to a perl5_eval which acts accordingly. (One of MJD's ideas).

=head1 IMPLEMENTATION

Hugo: Should be easy to implement.

Me: Should not be too complicated; this is just a case of doing existing things in a different context.

=head1 CHANGES

V2 - Added words from MJD and Hugo - This hopefully in a pre freeze state.

V3 - reissued due to an error in posting V2 and now frozen

=head1 REFERENCES

None yet.
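The eval workaround mentioned in the ABSTRACT above can be sketched as follows. To echo MJD's point about building the table only once, this version compiles the tr/// a single time and returns a closure; make_tr is a hypothetical helper name invented for this example, and the sketch assumes the two strings contain no "/" characters.

```perl
use strict;
use warnings;

# Build a translator for the given character lists. The string eval
# compiles the tr/// (and so its table) once; the returned closure can
# then be applied many times without another eval.
sub make_tr {
    my ($from, $to) = @_;
    # Assumption: neither $from nor $to contains the "/" delimiter.
    my $code = eval "sub { my \$s = shift; \$s =~ tr/$from/$to/; return \$s }";
    die "could not compile tr/$from/$to/: $@" unless $code;
    return $code;
}

my $upper_abc = make_tr('a-c', 'A-C');
my $out = $upper_abc->('abcabc xyz');
print "$out\n";
```

The proposed tr/$foo/$bar/e would remove the need for the string eval and for the caveat about delimiters in the variables.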
RFC 166 (v3) Alternative lists and quoting of things
This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE Alternative lists and quoting of things =head1 VERSION Maintainer: Richard Proctor [EMAIL PROTECTED] Date: 27 Aug 2000 Last Modifiedj: 22 Sep 2000 Mailing List: [EMAIL PROTECTED] Number: 166 Version: 3 Status: Developing =head1 ABSTRACT Expand Alternate Lists from Arrays and Quote the contents of things inside regexes. =head1 DESCRIPTION These are a couple of constructs to make it easy to build up regexes from other things. =head2 Alternative Lists from arrays The basic idea is to expand an array as a list of alternatives. There are two possible syntaxs (?@foo) and just plain @foo. @foo might just have existing uses (just), therefore I prefer the (?@foo) syntax. (?@foo) is just syntactic sugar for (?:(??{ join('|',@foo) })) A bracketed list of alternatives. =head2 Quoting the contents of things If a regex uses $foo or @bar there are problems if the content of the variables contain special characters. What is needed is a way of \Quoting the content of scalars $foo or arrays (?@foo). Suggested syntax: (?Q$foo) Quotes the contents of the scalar $foo - equivalent to (??{ quotemeta $foo }). (?Q@foo) Quotes each item in a list (as above) this is equivalent to (?:(??{ join ('|', map quotemeta, @foo)})). In this syntax the Q is used as it represents a more inteligent \Quot\E. It is recognised that (?Q$foo) is equivalent to \Q$foo\E, but it does not mean that this is a bad idea to add this at the same time as (?Q@foo) for reasons of symetry and perl DWIM. =head2 Comments Hugo: (?@foo) and (?Q@foo) are both things I've wanted before now. I'm not sure if this is the right syntax, particularly if RFC 112 is adopted: it would be confusing to have (?@foo) to have so different a meaning from (?$foo=...), and even more so if the latter is ever extended to allow (?@foo=...). I see no reason that implementation should cause any problems since this is purely a regexp-compile time issue. 
Me: I can't see any reasonable meaning for (?@foo=...), so this seems an appropriate syntax, but I am open to other suggestions.

=head1 CHANGES

V1 of this RFC had three ideas; one has been dropped, another is now part of RFC 198.

V2 expands the list expansion and quoting with quoting of scalars, and adds implementation issues.

V3: In error, what should have been RFC 165 V2 was issued as RFC 166 V2, so this is V3, with a change in (?Q$foo). This is in a pre-frozen state.

=head1 MIGRATION

As (?@foo) and (?Q...), these are additions without any compatibility issues.

The option of just @foo for list expansion might represent a small problem if people already use the construct.

=head1 IMPLEMENTATION

Both of these changes are regex compile-time issues.

Generating lists from arrays almost works by localising $" as '|' for the regex and just using @foo.

MJD has demonstrated implementing (?@foo) as (?\@foo) by means of an overload of regexes; this slight change was necessary because of the expansion of @foo - see below.

Both of these changes are currently affected by the expansion of variables in the regex before the regex compiler gets to work on the regex. This problem also affects several other RFCs. For these (and other) RFCs, the expansion of variables in regexes needs to be driven from within the regex compiler, so that the regex can expand as and where appropriate. Changing this should not affect any existing behaviour.

=head1 REFERENCES

RFC 198
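For comparison, the effect proposed for (?@foo) and (?Q@foo) can be approximated in Perl 5 by building the alternation by hand before the regex is compiled. The following is only a sketch of that workaround, not the proposed feature; the helper name alternation_of and the sample data are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the RFC): build a non-capturing group of
# alternatives from a list, optionally quoting each item with quotemeta,
# roughly what (?@foo) and (?Q@foo) are proposed to do.
sub alternation_of {
    my ($quote, @items) = @_;
    @items = map { quotemeta } @items if $quote;
    my $alt = join '|', @items;
    return qr/(?:$alt)/;
}

my @foo = ('a.b', 'c+d');

my $plain  = alternation_of(0, @foo);   # roughly (?@foo)
my $quoted = alternation_of(1, @foo);   # roughly (?Q@foo)

# Unquoted, the "." in 'a.b' is a metacharacter and matches any character.
print "plain matches 'axb'\n"  if 'axb' =~ /^$plain$/;

# Quoted, only the literal strings match.
print "quoted matches 'a.b'\n" if 'a.b' =~ /^$quoted$/;
print "quoted rejects 'axb'\n" unless 'axb' =~ /^$quoted$/;
```

The difference between the two results is the problem that the quoting proposal is meant to address: without quoting, special characters in the array contents change the meaning of the pattern.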
RFC 198 (v2) Boolean Regexes
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Boolean Regexes

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 6 Sep 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 198
  Version: 2
  Status: Developing

=head1 ABSTRACT

This is a development of the proposal for the "not a pattern" concept in RFC 166 V1. Looking deeper into the handling of advanced regexes, there are potential needs for many other concepts, to allow a regex to extract information directly from a complex file in one go, rather than by a mixture of splits and nested regexes as is typically needed today. With these, parsing data should become easier (in some cases).

=head1 CHANGES

V2 - Changed the "Fail Pattern", enhanced the wording for many things.

=head1 DESCRIPTION

It would be nice (in my opinion) to be able to build more elaborate regexes, allowing data to be mined out of a string in one go. These ideas allow you to apply several patterns to one substring (each must match), to fail a match from within, to look for patterns that do not contain other patterns, and to handle looking for cases such as (foo.*bar)|(bar.*foo) in a more general way of saying "A substring that contains both foo and bar".

These are ideas, at present with some proposed syntax. The ideas are more important than the exact syntax at this stage. This is very much work in progress.

I have called these boolean regexes as they bring the concepts of and (&&), or (||) and not (!) into the realm of regexes.

Within a boolean regex (or the boolean part of a regex), several new symbols have meanings, and some have enhanced meanings.

=head2 The Ideas

Are these part of a boolean (?...) construct within an existing regex, or is the advanced syntax (and meaning of |!^$) invoked by a new flag such as /B?

These can look like line noise, so white space with /x is used throughout, and it might be appropriate to enforce (or assume) /x within (...).
=head3 Boolean construct

(?...) grabs a substring, and applies one or more tests to the substring.

=head3 Substring matching multiple patterns (&&)

(? pattern1 && pattern2 && pattern3 )

A substring is defined that matches each pattern. For example, the first pattern may specify a substring of at least 30 chars, and the next two that it has a foo and a bar.

=head3 Substring matching alternative patterns (||)

(? pattern1 || pattern2 || pattern3)

This is similar to the existing alternative syntax "|", but the alternatives behave as /^pattern$/ rather than /pattern/ (^ and $ are taken as referring to the substring in this case - see below).

(pattern1 || pattern2 || pattern3) can be mixed in with the case above to build up more advanced cases. The && and || operators can be nested with brackets in the normal ways.

=head3 Brackets within boolean regexes

Within a complex boolean regex there are likely to be lots and lots of brackets to nest and control the behaviour of the regex. Rather than having to sprinkle the regex with (?:) line noise, it would be nicer to just use ordinary brackets () and only support capturing of elements by using one of the (?$=) or (?%=) constructs that have been proposed elsewhere (RFC 112 and RFC 150). There might be some case for this as a general capability using some flag /b = brackets?

=head3 Substring not matching a pattern

In RFC 166 I originally proposed (?^ pattern ). This proposal replaces that, though it could also be used outside of the (?) construct.

!pattern matches anything that does not match the pattern. On its own it is not terribly useful, but in conjunction with && and || one can do things such as /(? img && ! alt=)/, i.e. does it have an image that does not have an alt.

! is chosen as it has the same basic meaning outside of regexes.

!pattern is a non-greedy construct that matches any string/substring that does not match the pattern.
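For readers who want to experiment with the "and" and "not" ideas today, lookahead assertions in Perl 5 give a rough approximation. This is only a sketch using existing syntax, not the syntax proposed in this RFC, and the img pattern is a deliberate simplification (real HTML needs a parser).

```perl
use strict;
use warnings;

# "A string that contains both foo and bar", in either order:
# both lookaheads are tested from the same starting position.
my $both = qr/^(?=.*foo)(?=.*bar)/s;

# "Has an img tag that does not have an alt attribute":
# the negative lookahead rejects tags with alt= before the closing >.
my $img_no_alt = qr/<img\b(?![^>]*\balt=)[^>]*>/i;

for my $s ('foo then bar', 'bar then foo', 'only foo') {
    printf "%-14s %s\n", $s, ($s =~ $both ? 'both' : 'not both');
}

for my $tag ('<img src="a.png">', '<img src="a.png" alt="A">') {
    printf "%-28s %s\n", $tag, ($tag =~ $img_no_alt ? 'missing alt' : 'ok');
}
```

The lookahead version works, but each condition has to be anchored and padded with .* by hand, which is the kind of noise the boolean construct aims to remove.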
=head3 Meaning of $ and ^ inside a boolean regex

Within a boolean regex, ^ and $ are taken to mean the beginning and end of the substring, not the beginning and end of the line/string.

=head3 Greediness

Should the (?...) construct be greedy or non-greedy? To some extent this depends on the elements it contains. If all of the matching set of patterns are greedy then it will be greedy; if they are not greedy then it will not be. This might or might not be sufficient. If the situation is ambiguous (or might be), the boolean can be expressed as (?? ...) to force non-greediness.

=head3 Delivering a substring to some code that generates a pass/fail

(?*{code}) delivers a substring to the code, which returns with success or failure. The code sees the substring as $_.

This is not dependent on the Boolean regex concept and could be used for other things, though it is most useful in this context.

This is sort of equivalent to (?: (.*)(??{$_ = $1; code})), i.e. it matches an arbitrarily long substring and delivers it to the code. But not dependent on how many brackets have been
RFC 274 (v1) Generalised Additions to Regexs
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Generalised Additions to Regexs

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 274
  Version: 1
  Status: Developing

=head1 ABSTRACT

This proposes a way for generalised additions to regex capabilities.

=head1 DESCRIPTION

Given that expansion of regexes could include (+...) and (*...), I have been thinking about providing a general-purpose way of adding functionality. Hence I propose that the entire (+...) syntax is kept free from formal specification for this. (+ = addition)

A module or anything that wants to support some enhanced syntax registers something that handles "regex enhancements".

At regex compile time, if and when (+foo) is found, perl calls each of the registered regex enhancements in turn. These:

1) Are passed the foo string as a parameter exactly as is. (There is an issue of actually finding the end of the generic foo.)

2) The regex enhancement can either recognise the content or not.

3) If not, the enhancement returns undef and perl goes to the next regex enhancement. (Does it handle the enhancements as a stack (last checked first) or a list (first checked first)? How are they scoped? Job here for the OO/scoping fanatics.)

4) If perl runs out of registered regex enhancements, it reports an error.

5) If an enhancement recognises the content, it could do either of:

a) return a replacement expanded regex using existing capabilities; perl will then pass this back through the regex compiler.

b) return a coderef that is called at run time when the regex gets to this point. The referenced code needs to have enough access to the regex internals to be able to see the current sub-expression, request more characters, access relevant flags and have visibility of greediness. It may also need a coderef that is similarly called when the regex is being unwound when it backtracks.
These features would also be of interest to the existing code inside regexes as well.

Thinking from that - the last case should be generalised (it is sort of like my (?*{...}) from RFC 198, or an enhancement to (??{...})). If so, cases (a) and (b) are the same, as case (b) is just a case of returning (?*{...}) with the appropriate code.

Following on, if (?{...}) etc. code is evaluated in a forward match, it would be a good idea to likewise support some code block that is ignored on a forward match but is executed when the code is unwound due to backtracking. Thus (?{ foo })(?\{ bar }) executes foo on the forward case and bar if it unwinds. I don't care at the moment what the syntax is - what about the concepts? Think, for example, about foo putting something on a stack (e.g. the bracket to match [RFC 145]) and bar taking it off.

Note: I don't consider this RFC complete, but after posting this on the regex list to no effect I am making it an RFC to see if it gets a little more feedback...

=head1 MIGRATION

This is a new feature - no compatibility problems.

=head1 IMPLEMENTATION

This has not been looked at in detail, but the description above provides some views as to how it may operate.

=head1 REFERENCES

RFC 145 - Bracket matching

RFC 198 - Boolean Regexes
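To make the compile-time part of the proposal (steps 1 to 4 and case 5a) more concrete, here is a minimal sketch in Perl 5 that expands (+foo) in a pattern string before compiling it. Everything in it is an assumption made for illustration: the function names register_enhancement and expand_enhancements, the "first registered, first checked" order, and the sample ident and ipv4 enhancements. It also assumes foo contains no parentheses, which sidesteps the question of finding the end of foo. Case 5b (run-time coderefs) is not covered.

```perl
use strict;
use warnings;

# Hypothetical registry of "regex enhancements". Each handler receives the
# text inside (+...) and returns replacement regex source, or undef if it
# does not recognise the text.
my @enhancements;

sub register_enhancement { push @enhancements, shift }

# Expand every (+foo) in a pattern string. Handlers are tried as a list,
# first registered first; this order is an arbitrary choice for the sketch.
sub expand_enhancements {
    my ($pattern) = @_;
    $pattern =~ s{\(\+([^()]*)\)}{
        my $text = $1;
        my $replacement;
        for my $handler (@enhancements) {
            $replacement = $handler->($text);
            last if defined $replacement;
        }
        # Step 4: no handler recognised the text.
        die "Unrecognised regex enhancement (+$text)\n"
            unless defined $replacement;
        $replacement;
    }ge;
    return $pattern;
}

# Two made-up enhancements, for illustration only.
register_enhancement(sub { $_[0] eq 'ipv4'  ? '(?:\d{1,3}\.){3}\d{1,3}' : undef });
register_enhancement(sub { $_[0] eq 'ident' ? '[A-Za-z_]\w*'            : undef });

my $source = expand_enhancements('^(+ident)=(+ipv4)$');
my $regex  = qr/$source/;
print "matched\n" if 'host=10.0.0.1' =~ $regex;
```

Because the expansion happens on the pattern text, the result goes through the ordinary regex compiler, as case 5a describes. Scoping of the registry, which the RFC leaves open, is not addressed here.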
RFC 110 (v6) counting matches
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

counting matches

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 16 Aug 2000
  Last Modified: 20 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 110
  Version: 6
  Status: Frozen

=head1 ABSTRACT

Provide a simple way of giving a count of matches of a pattern.

=head1 DESCRIPTION

Have you ever wanted to count the number of matches of a pattern? s///g returns the number of matches it finds. m//g just returns 1 for matching. Counts can be made using s//$&/g, but this is wasteful, or by putting some counting loop round a m//g. But this all seems rather messy.

TomC (and a couple of others) have said that it can also be done as:

    $count = () = $string =~ /pattern/g;

However, many people do not like this construct. Here are a couple of quotes:

jhi: Which I find cute as a demonstration of the Perl's context concept, but ugly as hell from usability viewpoint.

Bart Lateur: '()=' is not perfect. It is also butt ugly. It is a "dirty hack".

This construct is also likely to be inefficient, as perl will have to build up a list of all the matches, store them somewhere, count them, then throw them away.

Therefore I would like a way of counting matches.

=head2 Proposal

m//gt (or m//t, see below) would be defined to do the match and return the count of matches. This leaves all existing uses consistent and unaffected. /t is suggested for "counT", as /c is already taken.

Relationship of m//t and m//g - there are three possibilities:

My original: m//gt, where /t adds counting to a group match (/t without /g would just return 0 or 1). However, \G loses its meaning.

The alternative by Uri: m//t and m//g are mutually exclusive and m//gt should be regarded as an error.

Hugo: I like this too. I'd suggest /t should mean a) return a scalar of the number of matches and b) don't set any special variables.
Then /t without /g would return 0 or 1, but be faster since no extra information need be captured (except internally for (.)\1 type matching - compile time checks could determine if these are needed, though (?{..}) and (??{..}) patterns would require disabling of that optimisation). /tg would give a scalar count of the total number of matches. \G would retain its meaning.

I think Hugo's wording about the relationship makes the best sense, and this is the suggested way forward.

=head1 CHANGES

RFC110 V1 - Original posting to perl6-language

RFC110 V2 - Reposted to perl6-language-regex

RFC110 V3 - Added Uri's alternative m//t

RFC110 V4 - Added notes about $count = () = $string =~ /pattern/g

RFC110 V5 - Added Hugo's wording about /g and /t relationship, suggested this is the way forward.

RFC110 V6 - Frozen

=head1 IMPLEMENTATION

Hugo: Implementation should be fairly straightforward, though ensuring that optimisations occurred precisely when they are safe would probably involve a few bug-chasing cycles.

=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the noise...
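For reference, the two Perl 5 approaches that the description calls messy look like this in practice. This is a small sketch with made-up sample data, showing existing behaviour only; the proposed /t modifier does not exist in Perl 5.

```perl
use strict;
use warnings;

my $string = 'the cat sat on the mat with the hat';

# List assignment evaluated in scalar context yields the number of items
# on its right-hand side, i.e. the number of matches; the list itself is
# built and then thrown away.
my $count_list = () = $string =~ /the/g;

# Explicit counting loop round m//g in scalar context: no intermediate
# list, at the cost of an extra variable and statement.
my $count_loop = 0;
$count_loop++ while $string =~ /the/g;

print "list idiom: $count_list, loop: $count_loop\n";
```

Both forms give the same count. What the proposal adds is a way to ask for that count directly, without building a list or writing a loop.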
RFC 93 (v3) Regex: Support for incremental pattern matching
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Regex: Support for incremental pattern matching

=head1 VERSION

  Maintainer: Damian Conway [EMAIL PROTECTED]
  Date: 11 Aug 2000
  Last Modified: 18 Sep 2000
  Number: 93
  Version: 3
  Mailing List: [EMAIL PROTECTED]
  Status: Frozen

=head1 ABSTRACT

This RFC proposes that, in addition to strings, subroutine references may be bound (with =~ or !~ or implicitly) to a regular expression.

=head1 DESCRIPTION

It is proposed that the Perl 6 regular expression engine be extended to allow a regex to match against an incremental character source, rather than only against a fixed string.

Specifically, it is proposed that a subroutine reference could be bound to a regular expression:

    sub {...} =~ /pattern/;

As the regular expression is matched, it would make calls to the subroutine to request additional characters to match, or (after it has matched) to return any unused characters.

When the regex engine requires additional characters to match, the subroutine would be called with a single argument, and would be expected to return a character string containing the extra characters. The single argument would specify how many characters should be returned (typically this would be 1, unless internal analysis by the regex engine can deduce that more than one character will be required). Returning fewer than the requested number of characters would typically indicate a premature end-of-string and would probably trigger backtracking and/or failure to match.

When the match is finished, the subroutine would be called one final time, and passed two arguments: a string containing the "unused" characters (what would be $' for a fixed string), and a flag set to 1. The subroutine could use this call to push-back (or cache) unused data. In the case of a failure to match (or success of the !~ operator), every character requested during the match would be sent back.
A typical structure for a subroutine against which a regex was matched would therefore be:

    sub s {
        if ($_[1]) {    # "putback unused data" request
            recache($_[0]);
        }
        else {          # "send more data" request
            return get_chars(max=>$_[0])
        }
    }

=head2 Examples

The most obvious example would be matching against an input stream:

    sub { $_[1] ? $fh->pushback($_[0]) : $fh->getn($_[0]) } =~ /pat/;

which could also be written:

    ^1 ? $fh->pushback(^0) : $fh->getn(^0) =~ /pat/;

Of course, it would often be useful to have a subroutine that returns a closure on a particular filehandle:

    sub fhmatch { ^1 ? $_[0]->pushback(^0) : $_[0]->getn(^0) }

    fhmatch($fh) =~ /pat/
    fhmatch(\*STDIN) =~ /pat/
    # etc.

In fact, this might be so commonly useful that matching against a file handle should be made to work directly. That is:

    $fh =~ /pat/
    \*STDIN =~ /pat/

One could then do interactive lexing cleanly:

    until (eof $fh) {
        switch ($fh) {
            /^\s*/;    # skip leading whitespace
            case /^(lexeme1)/ { push @tokens, $1=>LEX1 }
            case /^(lexeme2)/ { interact_somehow }
            case /^(lexeme3)/ { push @tokens, $1=>LEX3 }
            # etc.
        }
    }

Note the use of the proposed PAIR data structure to store tokens in the above example.

Because the character source is a subroutine, one could also match against data coming out of a socket:

    my $cache = "";
    sub matching_socks {
        if ($_[1]) { $cache .= $_[0]; return }    # putback
        if (length($cache) < $_[0]) {             # not enough cached
            my $extra;                            # so get some more
            recv(SOCKET, $extra, $_[0]-length($cache));
            $cache .= $extra;
        }
        return substr($cache,0,$_[0],"");
    }

    switch (\&matching_socks) {
        case /pat1/ { action1() }
        case /pat2/ { action1() }
        case /pat3/ { action1() }
        #etc.
    }

or any other source:

    sub mega_ape {
        return join "", map {['a'..'z',(' ')x6]->[rand 32]} (1..$_[0])
            unless $_[1]
    }

    \&mega_ape =~ /Now is the Winter of our discontent.../i;
    print "Art after ", length($`), "chars\n";

=head1 IMPLEMENTATION

Dammit, Jim, I'm a doctor, not a magician!

Probably needs to be integrated with IO disciplines too.
=head1 REFERENCES

RFC 22: Builtin switch statement

RFC 23: Higher order functions

RFC 84: Replace => (stringifying comma) with => (pair constructor)
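The calling convention described in this RFC can be tried out in Perl 5, even though a subroutine cannot actually be bound to a regex there. The sketch below is an assumption-laden emulation: make_source builds a character source that follows the one-argument ("send more data") and two-argument ("putback") protocol, and match_incrementally is a hypothetical stand-in for the engine. Both names are invented here. The stand-in is naive: it pulls one character at a time and stops at the first successful match at the start of the buffer, so it does not reproduce greedy matching the way a real engine would.

```perl
use strict;
use warnings;

# Build a character source that follows the protocol described above:
# one argument asks for that many characters; two arguments (with a true
# flag) hand back unused characters.
sub make_source {
    my ($data) = @_;
    return sub {
        my ($arg, $putback) = @_;
        if ($putback) {                       # "putback unused data" request
            $data = $arg . $data;
            return;
        }
        return substr($data, 0, $arg, '');    # "send more data" request
    };
}

# Hypothetical stand-in for the proposed engine behaviour: request one
# character at a time until the pattern matches at the start of the
# buffer or the source runs dry, then return whatever was not consumed.
sub match_incrementally {
    my ($source, $regex) = @_;
    my $buffer = '';
    while (1) {
        if ($buffer =~ /^($regex)/) {
            my $matched = $1;
            $source->(substr($buffer, length $matched), 1);
            return $matched;
        }
        my $more = $source->(1);
        last if !length $more;                # premature end of input
        $buffer .= $more;
    }
    $source->($buffer, 1);                    # failure: send everything back
    return;
}

my $source = make_source('let x = 42;');
my $word   = match_incrementally($source, qr/[a-z]+\b(?=\s)/);
print "matched '$word', rest '", $source->(100), "'\n";
```

Note that the lookahead in the sample pattern forces the driver to read one character past the word, which is then handed back through the putback call; this is the situation the final two-argument call is designed for.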
RFC 170 (v2) Generalize =~ to a special apply-to assignment operator
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Generalize =~ to a special "apply-to" assignment operator

=head1 VERSION

  Maintainer: Nathan Wiger [EMAIL PROTECTED]
  Date: 29 Aug 2000
  Last-Modified: 16 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 170
  Version: 2
  Status: Frozen

=head1 ABSTRACT

Currently, C<=~> is only available for use in specific builtin pattern matches. This is too bad, because it's really a neat operator.

This RFC proposes a simple way to make it more general-purpose.

=head1 NOTES ON FREEZE

Probably the only way this could be implemented is if B<RFC 164> was also implemented, freeing C<=~> for use as a more general-purpose operator. Indeed, a main point of this RFC is to provide a means for a backwards-compatible syntax for regexes.

Unlike B<RFC 164>, most people I heard from liked this. Some criticized it as being too sugary, since this:

    $string =~ quotemeta;    # $string = quotemeta $string;

Is not as clear as the original. However, there is fairly similar precedent in:

    $x += 5;                 # $x = $x + 5;

And to me it seems to be quite clear that C<quotemeta> is acting on C<$string> in the above example, even when you take into account C<=~>'s current binding meaning (perhaps more so, in fact).

=head1 DESCRIPTION

First off, this assumes RFC 164. Second, it requires you drop any knowledge of how C<=~> currently works. Finally, it runs directly counter to RFC 139, which proposes another application for C<=~>.

This RFC proposes a simple use for C<=~> as a generic "apply-to" operator. When used, any values on the left side of the expression are implicitly passed to the end of the right-side expression. What this means is that an expression such as this:

    $value = dostuff($arg1, $arg2, $value);

Could now be rewritten as:

    $value =~ dostuff($arg1, $arg2);

And C<$value> would be implicitly transferred over to the right side as the last argument. It's simple, but it makes what is being operated on quite obvious.
This enables us to rewrite the following constructs:

    $string = quotemeta($string);
    @array = reverse @array;
    ($name) = split /\s+/, $name;
    @vals = sort { $a <=> $b } @vals;
    @file = grep !/^#/, @file;

    $string = s/\s+/SPACE/, $string;    # RFC 164
    $matches = m/\w+/, $string;         # RFC 164
    @strs = s/foo/bar/gi, @strs;        # RFC 164

As the shorter and more readable:

    $string =~ quotemeta;
    @array =~ reverse;
    ($name) =~ split /\s+/;
    @vals =~ sort { $a <=> $b };
    @file =~ grep !/^#/;

    $string =~ s/\s+/SPACE/;    # looks familiar
    $string =~ m/\w+/;          # this too [1]
    @strs =~ s/foo/bar/gi;      # cool extension

It's a simple solution, true, but it has a good amount of flexibility and brevity. It could also be the case that multiple values could be called and returned, so that:

    ($name, $email) = special_parsing($name, $email);

Becomes:

    ($name, $email) =~ special_parsing;

Again, it's simple, but seems to have useful applications. One nice thing is that in many (most?) situations it appears to be working very much like C<=~> currently works with regexes (from a user perspective).

Finally, note this can only work with functions and function-like constructs. An attempt to do something like this:

    $x =~ 5 +;

Should I<definitely> remain a syntax error.

=head2 Possible addition of C<~=> operator

A symmetric operator, C<~=>, was proposed informally on the list which would left-pad the argument list:

    $stuff =~ dojunk(@args);    # $stuff = dojunk(@args, $stuff);
    $stuff ~= dojunk(@args);    # $stuff = dojunk($stuff, @args);

but the consensus that I received was about 50/50: half liked it, half thought it was too confusing. Even though we don't have a C<bitnot=> operator currently, creating something that looks like one but acts completely differently is probably not a good idea.

If something like this was included, it would probably be best to go with another operator, like C<=^>:

    $stuff =~ dojunk(@args);    # $stuff = dojunk(@args, $stuff);
    $stuff =^ dojunk(@args);    # $stuff = dojunk($stuff, @args);

But that's awfully close to C<^=>. Hmmm.
Regardless, this operator is unlikely to be used nearly as widely since Perl functions usually take the argument to act on in the last position.

=head1 IMPLEMENTATION

Simplistic (hopefully). Should just involve stacking values onto a function's argument list.

=head1 MIGRATION

This introduces new functionality, which allows backwards compatibility for regular expressions. As such, it should require no special translation of code. This RFC assumes RFC 164 will be adopted (which it may not be) for changes to regular expressions.

=head1 NOTES

[1] That m// one doesn't quite work right, but that's a special case that I would suggest should be caught by some other part of the grammar to maintain backwards compatibility (like bare //).

=head1 REFERENCES

RFC 164: Replace =~, !~, m//, s///,
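The "stack the left-hand value onto the end of the argument list" idea can be imitated in Perl 5 with an ordinary function, which may help in judging how the proposed operator would read. This is only a sketch under stated assumptions: apply_to is a hypothetical helper invented here, it handles a single scalar target only, and dostuff is a made-up example function.

```perl
use strict;
use warnings;

# Hypothetical emulation of "$value =~ func(@args)": call func with the
# current value appended as the last argument and store the result back.
sub apply_to {
    my ($target, $func, @args) = @_;
    $$target = $func->(@args, $$target);
    return $$target;
}

# Made-up function that takes the value to act on in the last position.
sub dostuff {
    my ($prefix, $suffix, $value) = @_;
    return "$prefix$value$suffix";
}

my $value = 'core';
apply_to(\$value, \&dostuff, '<', '>');         # $value = dostuff('<', '>', $value)
print "$value\n";

my $string = 'a.b*c';
apply_to(\$string, sub { quotemeta $_[0] });    # $string = quotemeta($string)
print "$string\n";
```

The explicit reference and coderef make the emulation noisier than the plain assignment it replaces, which is the gap the operator form is meant to close.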
RFC 166 (v2) Alternative lists and quoting of things
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Alternative lists and quoting of things

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Last Modified: 15 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 166
  Version: 2
  Status: Developing

=head1 ABSTRACT

Expand Alternate Lists from Arrays and Quote the contents of things inside regexes.

=head1 DESCRIPTION

These are a couple of constructs to make it easy to build up regexes from other things.

=head2 Alternative Lists from arrays

The basic idea is to expand an array as a list of alternatives. There are two possible syntaxes: (?@foo) and just plain @foo. @foo might just have existing uses (just), therefore I prefer the (?@foo) syntax.

(?@foo) is just syntactic sugar for (?:(??{ join('|',@foo) })), a bracketed list of alternatives.

=head2 Quoting the contents of things

If a regex uses $foo or @bar, there are problems if the contents of the variables contain special characters. What is needed is a way of \Quoting the content of scalars $foo or arrays (?@foo).

Suggested syntax:

(?Q$foo) quotes the contents of the scalar $foo - equivalent to (??{ quotemeta $foo }).

(?Q@foo) quotes each item in a list (as above); this is equivalent to (?:(??{ join ('|', map quotemeta, @foo)})).

In this syntax the Q is used as it represents a more intelligent \Quot\E.

=head2 Comments

Hugo: (?@foo) and (?Q@foo) are both things I've wanted before now. I'm not sure if this is the right syntax, particularly if RFC 112 is adopted: it would be confusing to have (?@foo) to have so different a meaning from (?$foo=...), and even more so if the latter is ever extended to allow (?@foo=...). I see no reason that implementation should cause any problems since this is purely a regexp-compile time issue.

Me: I can't see any reasonable meaning for (?@foo=...), so this seems an appropriate syntax, but I am open to other suggestions.

=head1 CHANGES

RFC 166, v1 was entitled "Additions to regexes".
V1 of this RFC had three ideas; one has been dropped, another is now part of RFC 198.

V2 expands the list expansion and quoting with quoting of scalars, and adds implementation issues.

=head1 MIGRATION

As (?@foo) and (?Q...), these are additions without any compatibility issues.

The option of just @foo for list expansion might represent a small problem if people already use the construct.

=head1 IMPLEMENTATION

Both of these changes are regex compile-time issues.

Generating lists from arrays almost works by localising $" as '|' for the regex and just using @foo.

MJD has demonstrated implementing (?@foo) as (?\@foo) by means of an overload of regexes; this slight change was necessary because of the expansion of @foo - see below.

Both of these changes are currently affected by the expansion of variables in the regex before the regex compiler gets to work on the regex. This problem also affects several other RFCs. For these (and other) RFCs, the expansion of variables in regexes needs to be driven from within the regex compiler, so that the regex can expand as and where appropriate. Changing this should not affect any existing behaviour.

=head1 REFERENCES

RFC 198: Boolean Regexes
RFC 110 (v5) counting matches
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

counting matches

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 16 Aug 2000
  Last Modified: 12 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 110
  Version: 5
  Status: Developing

=head1 ABSTRACT

Provide a simple way of giving a count of matches of a pattern.

=head1 DESCRIPTION

Have you ever wanted to count the number of matches of a pattern? s///g returns the number of matches it finds. m//g just returns 1 for matching. Counts can be made using s//$&/g, but this is wasteful, or by putting some counting loop round a m//g. But this all seems rather messy.

TomC (and a couple of others) have said that it can also be done as:

    $count = () = $string =~ /pattern/g;

However, many people do not like this construct. Here are a couple of quotes:

jhi: Which I find cute as a demonstration of the Perl's context concept, but ugly as hell from usability viewpoint.

Bart Lateur: '()=' is not perfect. It is also butt ugly. It is a "dirty hack".

This construct is also likely to be inefficient, as perl will have to build up a list of all the matches, store them somewhere, count them, then throw them away.

Therefore I would like a way of counting matches.

=head2 Proposal

m//gt (or m//t, see below) would be defined to do the match and return the count of matches. This leaves all existing uses consistent and unaffected. /t is suggested for "counT", as /c is already taken.

Relationship of m//t and m//g - there are three possibilities:

My original: m//gt, where /t adds counting to a group match (/t without /g would just return 0 or 1). However, \G loses its meaning.

The alternative by Uri: m//t and m//g are mutually exclusive and m//gt should be regarded as an error.

Hugo: I like this too. I'd suggest /t should mean a) return a scalar of the number of matches and b) don't set any special variables.
Then /t without /g would return 0 or 1, but be faster since no extra information need be captured (except internally for (.)\1 type matching - compile time checks could determine if these are needed, though (?{..}) and (??{..}) patterns would require disabling of that optimisation). /tg would give a scalar count of the total number of matches. \G would retain its meaning.

I think Hugo's wording about the relationship makes the best sense, and this is the suggested way forward.

=head1 CHANGES

RFC110 V1 - Original posting to perl6-language

RFC110 V2 - Reposted to perl6-language-regex

RFC110 V3 - Added Uri's alternative m//t

RFC110 V4 - Added notes about $count = () = $string =~ /pattern/g

RFC110 V5 - Added Hugo's wording about /g and /t relationship, suggested this is the way forward.

Unless any significant discussion takes place, this RFC will move to frozen within a week.

=head1 IMPLEMENTATION

Hugo: Implementation should be fairly straightforward, though ensuring that optimisations occurred precisely when they are safe would probably involve a few bug-chasing cycles.

=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the noise...
RFC 197 (v1) Numeric Value Ranges In Regular Expressions
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

Numeric Value Ranges In Regular Expressions

=head1 VERSION

  Maintainer: David Nicol [EMAIL PROTECTED]
  Date: 5 September 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 1
  Number: 197
  Status: Developing

=head1 ABSTRACT

Round and square brackets paired around two optional, comma-separated numbers match if and only if a gobbled number is within the described range.

=head1 DESCRIPTION

=head2 the syntax of the numeric range regex element

Given a passage of regex text matching

    ($B1,$N1,$N2,$B2) = /(\[|\()(\-?\d*\.?\d*),(\-?\d*\.?\d*)(\]|\))/

and

    ($N1 <= $N2 or $N1 eq '' or $N2 eq '')

we've got something we hereinafter call a "range."

=head2 what the range matches

A range matches, in the target string, a passage C<(\-?\d*\.?\d*)>, also known as a "number", if and only if the number is within the range, in the normal algebraic sense.

=head2 "within the range"

A square bracket means that end of the range includes the range-specifying number; a round parenthesis means that end of the range includes numbers of value up to (or down to) the number but not equal to it.

=head2 infinity

In the event that one or the other of the range-specifying numbers is the empty string, that end of the range is unbounded. In the further event that we have defined infinity and negative infinity on our numbers, the square/round distinction will come into play.

=head1 COMPATIBILITY

To disambiguate ranges from character sets including digits, commas, and parentheses, either put a backslash on the right parenthesis or on the comma, or arrange things so the left hand side of the comma is greater than the right hand side; that way this special case will not apply:

    /(37.3,200)/;    # matches any number x, 37.3 < x < 200
    /([37,))/;       # matches and saves any number >= 37.
    /(37\,200)/;     # matches and saves the literal text '37,200'
    /[-35,9)]/;      # matches any number x, -35 <= x < 9; followed by a ]
    /[3-5,9)]/;      # matches a string containing any of 3,4,5,,,9 or )

=head1 IMPLEMENTATION

When applying regular expressions to numeric data, ranges may optimize away all of the digit lookahead we must currently indulge in to implement them in perl5.

If we have infinity defined, we'll have to recognize it in strings.

=head1 BUT WAIT THERE'S MORE

It is possible that the syntax described in this document may help slice multidimensional containers. (RFC 191)

=head1 REFERENCES

high school algebra
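The range semantics described in this RFC (square bracket inclusive, round parenthesis exclusive, empty bound unbounded) can be checked with a small Perl 5 function. This is a sketch for illustration only: in_range is a hypothetical helper invented here that tests an already-extracted number against a range specification, and it is not the in-regex matching the RFC proposes. It assumes both the bounds and the number are well-formed numeric strings.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether $number lies within a range written
# in the notation above, e.g. "(37.3,200)" or "[37,)".
sub in_range {
    my ($spec, $number) = @_;
    my ($open, $low, $high, $close) =
        $spec =~ /^([\[(])(-?\d*\.?\d*),(-?\d*\.?\d*)([\])])$/
        or die "Not a range: $spec\n";

    # An empty bound means that end of the range is unbounded.
    if ($low ne '') {
        return 0 if $open eq '[' ? $number < $low : $number <= $low;
    }
    if ($high ne '') {
        return 0 if $close eq ']' ? $number > $high : $number >= $high;
    }
    return 1;
}

for my $case (['(37.3,200)', 100], ['(37.3,200)', 200],
              ['[37,)', 37],       ['[-35,9)', 9]) {
    my ($spec, $number) = @$case;
    printf "%-12s %-5s %s\n", $spec, $number,
        in_range($spec, $number) ? 'inside' : 'outside';
}
```

Doing the comparison numerically after extraction is the kind of two-step work that a built-in range element would fold into the match itself.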
RFC 110 (v4) counting matches
This and other RFCs are available on the web at
http://dev.perl.org/rfc/

=head1 TITLE

counting matches

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 16 Aug 2000
  Last Modified: 2 Sep 2000
  Version: 4
  Mailing List: [EMAIL PROTECTED]
  Number: 110
  Status: Developing

=head1 ABSTRACT

Provide a simple way of giving a count of matches of a pattern.

=head1 DESCRIPTION

Have you ever wanted to count the number of matches of a pattern? s///g returns the number of matches it finds. m//g just returns 1 for matching. Counts can be made using s//$&/g, but this is wasteful, or by putting some counting loop round a m//g. But this all seems rather messy.

TomC (and a couple of others) have said that it can also be done as:

    $count = () = $string =~ /pattern/g;

However, many people do not like this construct. Here are a couple of quotes:

jhi: Which I find cute as a demonstration of the Perl's context concept, but ugly as hell from usability viewpoint.

Bart Lateur: '()=' is not perfect. It is also butt ugly. It is a "dirty hack".

This construct is also likely to be inefficient, as perl will have to build up a list of all the matches, store them somewhere, count them, then throw them away.

Therefore I would like a way of counting matches.

=head2 Proposal

m//gt (or m//t, see below) would be defined to do the match and return the count of matches. This leaves all existing uses consistent and unaffected. /t is suggested for "counT", as /c is already taken.

Relationship of m//t and m//g - there are two possibilities:

My original: m//gt, where /t adds counting to a group match (/t without /g would just return 0 or 1). However, \G loses its meaning.

The alternative by Uri: m//t and m//g are mutually exclusive and m//gt should be regarded as an error.

I have no preference.
=head1 CHANGES

RFC110 V1 - Original posting to perl6-language

RFC110 V2 - Reposted to perl6-language-regex

RFC110 V3 - Added Uri's alternative m//t

RFC110 V4 - Added notes about $count = () = $string =~ /pattern/g

=head1 IMPLEMENTATION

No idea

=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the
noise...
RFC 170 (v1) Generalize =~ to a special-purpose assignment operator
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Generalize =~ to a special-purpose assignment operator

=head1 VERSION

  Maintainer: Nathan Wiger [EMAIL PROTECTED]
  Date: 29 Aug 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 1
  Number: 170
  Status: Developing
  Requires: RFC 164

=head1 ABSTRACT

Currently, C<=~> is only available for use in specific builtin pattern
matches. This is too bad, because it's really a neat operator.

This RFC proposes a simple way to make it more general-purpose.

=head1 DESCRIPTION

First off, this assumes RFC 164. Second, it requires that you drop any
knowledge of how C<=~> currently works. Finally, it runs directly counter
to RFC 139, which proposes another application for C<=~>.

This RFC proposes a simple use for C<=~>: as a last-argument rvalue
duplicator. What this means is that an expression such as this:

   $value = dostuff($arg1, $arg2, $value);

Could now be rewritten as:

   $value =~ dostuff($arg1, $arg2);

And C<$value> would be implicitly transferred over to the right side as
the last argument. It's simple, but it makes what is being operated on
very obvious.

This enables us to rewrite the following constructs:

   ($name) = split /\s+/, $name;
   $string = quotemeta($string);
   @array = reverse @array;
   @vals = sort { $a <=> $b } @vals;

   $string = s/\s+/SPACE/, $string;    # RFC 164
   $matches = m/\w+/, $string;         # RFC 164
   @strs = s/foo/bar/gi, @strs;        # RFC 164

As the shorter and more readable:

   ($name) =~ split /\s+/;
   $string =~ quotemeta;
   @array =~ reverse;
   @vals =~ sort { $a <=> $b };

   $string =~ s/\s+/SPACE/;    # looks familiar
   $string =~ m/\w+/;          # this too [1]
   @strs =~ s/foo/bar/gi;      # cool extension

It's a simple solution, true, but it has a good amount of flexibility and
brevity. It could also be the case that multiple values could be called
and returned, so that:

   ($name, $email) = special_parsing($name, $email);

Becomes:

   ($name, $email) =~ special_parsing;

Again, it's simple, but seems to have useful applications.
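To make the "last-argument rvalue duplicator" semantics concrete, here is a rough Perl 5 emulation for the scalar case. It is a sketch only: C<update_with> is a hypothetical helper invented for this illustration, and the body of C<dostuff> is a stand-in, since the RFC does not define it.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the RFC): call $func with the current
# value of $$ref appended as the last argument, then store the result back.
sub update_with {
    my ($ref, $func, @args) = @_;
    $$ref = $func->(@args, $$ref);
    return $$ref;
}

# Stand-in for the RFC's dostuff(); it just joins its arguments.
sub dostuff { return join '-', @_ }

my $value = 'v';
update_with(\$value, \&dostuff, 'a', 'b');       # like: $value =~ dostuff('a', 'b');
print "$value\n";

my $string = 'a.b';
update_with(\$string, sub { quotemeta $_[0] });  # like: $string =~ quotemeta;
print "$string\n";
```

The list forms shown above, such as C<@array =~ reverse>, would need an array-reference variant of the helper, which this sketch leaves out.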
=head1 IMPLEMENTATION

Simplistic (hopefully).

=head1 MIGRATION

This introduces new functionality, which allows backwards compatibility
for regular expressions. As such, it should require no special translation
of code. This RFC assumes RFC 164 will be adopted (which it may not be)
for changes to regular expressions.

True void contexts may also render some parts of this moot, in which case
coming up with a more advanced use for C<=~> may be desirable.

=head1 NOTES

[1] That m// one doesn't quite work right, but that's a special case that
I would suggest should be caught by some other part of the grammar to
maintain backwards compatibility (like bare //).

=head1 REFERENCES

RFC 164: Replace =~, !~, m//, and s/// with match() and subst()

RFC 139: Allow Calling Any Function With A Syntax Like s///
RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Replace =~, !~, m//, and s/// with match() and subst()

=head1 VERSION

  Maintainer: Nathan Wiger [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Version: 1
  Mailing List: [EMAIL PROTECTED]
  Number: 164

=head1 ABSTRACT

Several people (including Larry) have expressed a desire to get rid of
C<=~> and C<!~>. This RFC proposes a way to replace C<m//> and C<s///>
with two new builtins, C<match()> and C<subst()>.

=head1 DESCRIPTION

=head2 Overview

Everyone knows how C<=~> and C<!~> work. Several proposals, such as RFCs
135 and 138, attempt to fix some stuff with the current pattern-matching
syntax. Most proposals center around minor modifications to C<m//> and
C<s///>.

This RFC proposes that C<m//> and C<s///> be dropped from the language
altogether, and instead be replaced with new C<match> and C<subst>
builtins, with the following syntaxes:

   $res = match /pattern/flags, $string
   $new = subst /pattern/newpattern/flags, $string

These subs are designed to mirror the format of C<split>, making them more
consistent. Unlike the current forms, these return the modified string,
leaving C<$string> alone. (Unless they are called in a void context, in
which case they act on and modify C<$_>, consistent with current
behavior.)

Extra arguments can be dropped, consistent with C<split> and many other
builtins:

   match;                     # all defaults (pattern is /\w+/?)
   match /pat/;               # match $_
   match /pat/, $str;         # match $str
   match /pat/, @strs;        # match any of @strs

   subst;                     # like s///, pretty useless :-)
   subst /pat/new/;           # sub on $_
   subst /pat/new/, $str;     # sub on $str
   subst /pat/new/, @strs;    # return array of modified strings

These new builtins eliminate the need for C<=~> and C<!~> altogether,
since they are functions just like C<split>, C<join>, C<splice>, and so
on.

Sometimes examples are easiest, so here are some examples of the new
syntax:

   Perl 5                           Perl 6
   -------------------------------- ----------------------------------
   if ( /\w+/ ) { }                 if ( match ) { }
   die "Bad!" if ( $_ !~ /\w+/ );   die "Bad!" if ( ! match );
   ($res) = m#^(.*)$#g;             ($res) = match #^(.*)$#g;

   next if /\s+/ || /\w+/;          next if match /\s+/ or match /\w+/;
   next if ($str =~ /\s+/) ||       next if match /\s+/, $str or
           ($str =~ /\w+/)                  match /\w+/, $str;
   next unless $str =~ /^N/;        next unless match /^N/, $str;

   $str =~ s/\w+/$bob/gi;           $str = subst /\w+/$bob/gi, $str;
   ($str = $_) =~ s/\d+/func/ge;    $str = subst /\d+/func/ge;
   s/\w+/this/;                     subst /\w+/this/;

   # These are pretty cool...
   foreach (@old) {                 @new = subst /hello/X/gi, @old;
      s/hello/X/gi;
      push @new, $_;
   }

   foreach (@str) {                 print "Got it" if match /\w+/, @str;
      print "Got it" if (/\w+/);
   }

This gives us a cleaner, more consistent syntax. In addition, it makes
several things easier, is more easily extensible:

   callsomesub(subst(/old/new/gi, $mystr));
   $str = subst /old/new/i, $r->getsomeval;

and is easier to read English-wise. However, it requires a little too much
typing. See below.

=head2 Concerns

This should be carefully considered. It's good because it replaces "yet
another oddity" with a more standard syntax that I would argue is more
powerful and consistent. However, it also causes everyone to relearn how
to match and substitute patterns. This must be a careful, conscious
decision, lest we really screw stuff up.

That being said, since my initial post I have received several personal
emails endorsing this, which is why I decided to RFC it. So it's an
option; it just has to be powerful enough for people to see the "big win".

Finally, it still requires a little too much typing for my tastes. Perhaps
we should make "m" and "s" at least shortcuts to the names, possibly
allowing users to bind them to the front of the pattern (similar to some
of RFC 138's suggestions). Maybe these two could be equivalent:

   $new = subst /old/new/i, $old;    ==    $new = s/old/new/i, $old;

And then it doesn't look that radical anymore. This is similar to RFC 138,
only C<$old> is not modified.

=head1 IMPLEMENTATION

Hold your horses

=head1 MIGRATION

This would be huge.
Every pattern match would have to be translated, every Perl hacker would
have to relearn patterns, and every Perl 5 book's regexp section would be
instantly out of date.

Like I said, this is not a simple decision. But if there are obvious
increases in power, I think people will appreciate the change, not dread
it. At the very least it makes Perl much more consistent.

=head1 REFERENCES

This is a synthesis of several ideas from myself, Ed Mills, and Tom C

RFC 138: Eliminate =~ operator.

RFC 135: Require explicit m on matches, even with ?? and // as delimiters.
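As a rough illustration of the semantics RFC 164 describes (return values instead of in-place modification, and list arguments), the two builtins can be approximated in Perl 5 with ordinary subs. This is a sketch under stated assumptions, not the proposed implementation: patterns are passed as qr// objects because Perl 5 cannot parse C<subst /old/new/, $str>, the replacement is a plain string, substitution is always global, and the C<$_> defaults and void-context behavior are left out.

```perl
use strict;
use warnings;

# Hypothetical Perl 5 stand-ins for the proposed builtins.

# True if any of the given strings matches the pattern.
sub match {
    my ($re, @strs) = @_;
    for my $s (@strs) {
        return 1 if $s =~ $re;
    }
    return 0;
}

# Return modified copies of the strings; the originals are left alone.
# In scalar context only the first modified string is returned.
sub subst {
    my ($re, $new, @strs) = @_;
    my @out = map { (my $copy = $_) =~ s/$re/$new/g; $copy } @strs;
    return wantarray ? @out : $out[0];
}

my @old = ('hello world', 'Hello again');
my @new = subst(qr/hello/i, 'X', @old);   # like: @new = subst /hello/X/gi, @old;
print "@new\n";
print "Got it\n" if match(qr/\w+/, @old); # like: match /\w+/, @old
```

Because C<subst> works on copies, C<@old> is unchanged after the call, which is the main behavioral difference from s///.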
RFC 165 (v1) Allow Variables in tr///
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Allow Variables in tr///

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Mailing List: [EMAIL PROTECTED]
  Version: 1
  Number: 165

=head1 ABSTRACT

Allow variables in a tr///. At present the only way to do a
tr/$foo/$bar/ is to wrap it up in an eval. I don't like using evals for
this sort of thing.

=head1 DESCRIPTION

Suggested syntax:

    tr/$foo/$bar/e

With a /e, tr will expand both the LHS and RHS of the translate function.
Either or both could be variables.

I am suggesting /e as it is sort of like /e for s///e.

=head1 IMPLEMENTATION

No idea, but should be straightforward.

=head1 REFERENCES

None yet.
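For context, the eval workaround that the abstract of RFC 165 refers to looks roughly like this in Perl 5. The variable names and sample data are assumptions made for the illustration.

```perl
use strict;
use warnings;

my ($from, $to) = ('a-c', 'A-C');   # example search and replacement lists
my $text = 'aabbccdd';

# tr/// builds its translation table at compile time, so the variables
# must be interpolated into source text and compiled with a string eval.
# Caution: $from and $to become part of the code, so they must be trusted.
my $changed = eval "\$text =~ tr/$from/$to/";
die "tr failed to compile: $@" if $@;

print "$text ($changed characters translated)\n";
```

The string eval recompiles the statement on every call and exposes the program to whatever is in the two variables, which is the kind of cost the proposed tr///e would avoid.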