Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
I can see the motivation for wanting this, but there's a cost I haven't seen anyone mention yet: this abandons backward compatibility with a regex notation that has remained pretty consistent in ed(1) and grep(1), and in things inspired by them, since I guess the early '70s, when they were born. It may be that we want to pay that price, but please do comment on it in the RFC; it needs to be on the table.

Perl is a prodigiously rich language, and in large part it avoids the problem made famous by PL/1, of being nearly impossible to learn completely, by the amount that it borrows from other familiar tools. Up until now, a wizard with regexps can carry all their knowledge forward; everything they've done in the past keeps working identically, and regexps composed for most other engines work fine in perl.

There are always some differences between engines; some require (...) for grouping and/or tagging, others require \(...\); some support "|" for alternation, some support "+" for "one or more", some support "?" for "zero or one", and so on. And of course perl has some heroic extensions. But using \1 for backrefs within a pattern match has been standard; if an engine can match the likes of beriberi and murmur, it might demand ^\(.*\)\1$ or it might prefer ^(.*)\1$, but the \1 at least has been consistent.

The further we wander from supporting the same regexp languages (as a subset of ours) as the other popular engines, the more support work we're making for ourselves.

-Bennett
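To make the portability point concrete, here is a minimal sketch of the Perl spelling of the pattern mentioned above. The word list is an assumption chosen only for illustration; it is not part of the original message.

```perl
use strict;
use warnings;

# Perl spelling of the doubled-word pattern. An ed/grep-style engine
# would spell the same thing as ^\(.*\)\1$ instead.
my $doubled = qr/^(.*)\1$/;

# Hypothetical sample words, chosen only for illustration.
my @words   = qw(beriberi murmur banana);

# Keep the words that consist of some string followed by itself.
my @matched = grep { $_ =~ $doubled } @words;

print "@matched\n";
```

Only the grouping syntax differs between the two spellings; the \1 backreference is the part that carries over unchanged.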
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On Sat, 30 Sep 2000, Bart Lateur wrote:

> I wrote this before, but apparently you didn't hear it. Let me repeat:

You're right, I missed your email when I was incorporating things into the new version. Apologies.

> $foo on the LHS allows metacharacter matching, for example "a.*b" can
> match "a foo b". But \1 only allows literal strings. If $1 captured

I don't believe it matters...my version of $1 works exactly like the current \1 and my $/[1] works exactly like the current $1.

Dave
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On 28 Sep 2000 20:57:39 -, Perl6 RFC Librarian wrote:

>Currently, C<\1> and $1 have only slightly different meanings within a
>regex. Let's consolidate them together, eliminate the differences, and
>settle on $1 as the standard.

I wrote this before, but apparently you didn't hear it. Let me repeat:

$foo on the LHS allows metacharacter matching, for example "a.*b" can match "a foo b". But \1 only allows literal strings. If $1 captured "a.*b", then \1 will only match the literal string "a.*b", as if the regex contained "a\.\*b".

I don't see how you can possibly consider this a "tiny difference".

-- 
Bart.
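The difference described above can be checked with a short sketch in current Perl. The variable names and sample strings are assumptions made for the example; only the "a.*b" and "a foo b" strings come from the message.

```perl
use strict;
use warnings;

my $pat = 'a.*b';

# An interpolated variable is compiled as regex syntax, so ".*" is a
# wildcard here and the pattern matches "a foo b".
my $interp_matches = ('a foo b' =~ /^$pat$/) ? 1 : 0;

# A backreference matches only the literal text that was captured.
# Here group 1 captures the five characters "a.*b".
my $backref_literal = ('a.*b a.*b'   =~ /^(a\.\*b) \1$/) ? 1 : 0;
my $backref_meta    = ('a.*b a foo b' =~ /^(a\.\*b) \1$/) ? 1 : 0;

# \Q...\E gives an interpolated variable the same literal behaviour.
my $quoted_literal = ('a.*b'    =~ /^\Q$pat\E$/) ? 1 : 0;
my $quoted_meta    = ('a foo b' =~ /^\Q$pat\E$/) ? 1 : 0;

print "$interp_matches $backref_literal $backref_meta ",
      "$quoted_literal $quoted_meta\n";
```

In other words, \1 behaves like \Q$1\E would, not like a bare interpolated $1.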
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On Fri, 29 Sep 2000, Hildo Biersma wrote:

> > Currently, C<\1> and $1 have only slightly different meanings within a
> > regex. Let's consolidate them together, eliminate the differences, and
> > settle on $1 as the standard.
>
> Sigh. That would remove functionality from the language.
>
> The reason why you need \1 in a regular expression is that $1, $2, ...
> are interpolated from the previous regular expression. This allows me
> to do a pattern match that captures variables, then use the results of
> that to create a second regular expression. (Remember: A regexp
> interpolates first, then compiles the pattern).

Umm...with all due respect, did you read the RFC? Because what I proposed does not eliminate any functionality.

Dave
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On Thu, 28 Sep 2000, Hugo wrote:

> :=item *
> :/(foo)_C<\1>_bar/
>
> Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
> don't insert C<> in the middle: that makes it much more difficult to
> read.

Sorry; that was a global-replace error that I missed on proofreading.

> :mean different things: the second will match 'foo_foo_bar', while the
> :first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was
>
> should be: foo_[SOMETHING]_bar

Um, yeah, it should...(jeez...I proofed this like three times, honest!) *blush*

> :captured in the B<previous> match...which could be a long, long way away,
>
> This seems a bit unfair. It is just another variable. Any variable
> you include in a pattern, you are assumed to know that it contains
> the intended value - there is nothing special about $1 in this regard.

Fair enough; the point I was trying to make was that \1 was captured right here, while $1 was captured long, long ago in a pattern match far, far away. The visual/cognitive difference is small, but the programming difference is huge.

> :=item *
> :${P1} means what $1 currently means (first match in last regex)
>
> Do you understand that this is the same variable as $P1? Traditionally,
> perl very rarely coopts variable names that start with alphanumerics,
> and (off the top of my head) all the ones it does so coopt are letters
> only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
> extend that to all $P1-style variables.

I do understand that, and I agree with your concern. Actually, I didn't think that ${P1} was a particularly good notation even as I was suggesting it...I just wanted to get the RFC up there before the deadline so that people could discuss it. Having now thought about it more, I think that (?P1) is better...in other words, make references to the previous pattern match be a regex _extension_, not a core feature (if that's a valid way to phrase the distinction).

> What is the migration path for existing uses of $P1-style variables?

Wherever p526 sees a pattern that contains a $1, it should replace it with (?P1).

> :=item *
> :s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"
>
> Note that in the current regexp engine, ${P1} has disappeared by the
> time matching starts. Can you explain why we need to change this?
> Note also that if you are sticking with ${P1} either we need to
> rename all existing user variables of this form, or we can no longer
> use the existing 'interpolate this string' (or eval, double-eval etc)
> routines, and have to roll our own for this (these) as well.

I'm a bit confused by the way this came out but, if I understand what you're asking, then I believe your concerns are solved by the new proposed syntax. Am I right?

> :This may require significant changes to the regex engine, which is a topic
> :on which I am not qualified to speak. Could someone with more
> :knowledge/experience please chime in?
>
> Currently the regexp compiler is handed a string in which $variables
> have already interpolated. [...]

I know there are certain exceptions to this...my Camel III says (something to the effect of--I don't have it in front of me) "if there is any doubt as to whether something should be interpolated or left for the Engine, it will be left for the Engine."

In any case, I don't think this needs to change. I'm simply changing what the names of the variables and backreferences are...\1 becomes (the new) $1, and (the current) $1 becomes (?P1).

> Changing the lifetime of backreferences feels likely to be difficult,
> but it isn't clear to me what you are trying to achieve here. I think
> you at least need to add an example of how it would act under s///g
> and s///ge.

Good point. I'll do that.

> :RFC 276: Localising Paren Counts in qr()s.
>
> I didn't see a mention of these in the body of the proposal.

276 is rather tangentially related, I grant. However, I felt that if my proposal went forward, it could impact on how 276 was implemented, so I crossreferenced to it.

Dave
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
> =head1 ABSTRACT
>
> Currently, C<\1> and $1 have only slightly different meanings within a
> regex. Let's consolidate them together, eliminate the differences, and
> settle on $1 as the standard.

Sigh. That would remove functionality from the language.

The reason why you need \1 in a regular expression is that $1, $2, ... are interpolated from the previous regular expression. This allows me to do a pattern match that captures variables, then use the results of that to create a second regular expression. (Remember: A regexp interpolates first, then compiles the pattern).

To come up with a silly example:

    if ($line =~ //i) {
        if ($line =~ /<(P|DIV|SPAN) class='$1'>.*?<\/\1>/i) {
                                           ^^ The class from the previous regexp
            ...
        }
    }

If we implement this RFC, this would no longer be possible without the use of an extra variable to store the first $1. Interpolation for regular expressions would no longer work the same as it is for double-quoted strings.

Hildo
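For readers who want to run the example, here is a minimal sketch under stated assumptions. The first pattern did not survive in the archived message, so the /<body class='(\w+)'>/i pattern and the sample $line below are hypothetical stand-ins, not the original code.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $line = "<body class='note'><div class='note'>text</div></body>";
my $tag;

# Assumed first pattern: capture a class name into $1.
if ($line =~ /<body class='(\w+)'>/i) {
    # $1 is interpolated from the match above before this pattern is
    # compiled, while \1 refers to group 1 of this pattern itself.
    if ($line =~ /<(P|DIV|SPAN) class='$1'>.*?<\/\1>/i) {
        $tag = $1;    # now the tag name captured by the second pattern
    }
}

print defined $tag ? "$tag\n" : "no match\n";
```

The inner pattern uses both notations at once: $1 carries the class name in from the outer match, and \1 ties the closing tag to the opening one.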
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
Jonathan Scott Duff <[EMAIL PROTECTED]> writes:

> On Thu, Sep 28, 2000 at 08:57:39PM -, Perl6 RFC Librarian wrote:
> > ${P1} means what $1 currently means (first match in last regex)
>
> I'm sorry that I don't have anything more constructive to say than
> "ick", but ... Ick.

I'm with the 'Ick' camp too. And possibly with the 'Leave it the hell alone! If you're that bloody stupid you deserve to lose' camp too.

-- 
Piers
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
> =item *
> C<\1> goes away as a special form
>
> =item *
> $1 means what C<\1> currently means (first match in this regex)
>
> =item *
> ${1} is the same as $1 (first match in this regex)
>
> =item *
> ${P1} means what $1 currently means (first match in last regex)

Here's the big problem with this, and I think others have said it similarly: If we need the functionality of both \1 and $1, then there is no reason to redo the syntax. Period.

If \1 is unneeded, then let's ditch it and just use $1 everywhere. However, this is not the case, as Randal, Bart, and others have shown.

If we need \1, then we should leave it as-is. There's no reason to force literally millions of people to relearn this. Renaming something just to rename it does not add value.

-Nate
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
:=item *
:/(foo)_$1_bar/
:
:=item *
:/(foo)_C<\1>_bar/

Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but don't insert C<> in the middle: that makes it much more difficult to read.

:mean different things: the second will match 'foo_foo_bar', while the
:first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was

should be: foo_[SOMETHING]_bar

:captured in the B<previous> match...which could be a long, long way away,
:possibly even in some module that you didn't even realize you were
:including (because it was included by a module that was included by a
:module that was included by a...).

This seems a bit unfair. It is just another variable. Any variable you include in a pattern, you are assumed to know that it contains the intended value - there is nothing special about $1 in this regard.

:The key fact here is that, in the first section of a s/// you are supposed
:to use C<\1>, but in the second portion you are supposed to use $1. If
:you understand the whole logical structure behind it and understand how an
:s/// works (i.e., the right hand side of an s/// is a double-quoted
:string, not a regex), you will understand the distinction. For newbies,
:however, it is apt to be quite confusing.

I think the whole idea that the LHS of s/// is a pattern, but the RHS is a string (modulo /e, of course) is apt to be confusing when you first encounter it. You won't be able to make sense of any but the simplest use of s/// until you understand it, I think, and the documentation expresses it quite clearly.

:=item *
:${P1} means what $1 currently means (first match in last regex)

Do you understand that this is the same variable as $P1? Traditionally, perl very rarely coopts variable names that start with alphanumerics, and (off the top of my head) all the ones it does so coopt are letters only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to extend that to all $P1-style variables.

If you are suggesting that they should have a special meaning only in regexps, and only if braced, then I'd find it even more confusing. The use of braces is usually the easiest (and only?) way to split out a variable from following alphanumerics: /foo${P1}bar/

:These changes eliminate a potential source of confusion, retain all
:functionality, provide an easy migration path for P526, and the last
:notation (${P1}) serves as a clear indicator that you are talking about
:something from outside the current regex.

What is the migration path for existing uses of $P1-style variables?

:=item *
:s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

Note that in the current regexp engine, ${P1} has disappeared by the time matching starts. Can you explain why we need to change this? Note also that if you are sticking with ${P1} either we need to rename all existing user variables of this form, or we can no longer use the existing 'interpolate this string' (or eval, double-eval etc) routines, and have to roll our own for this (these) as well.

:=head1 IMPLEMENTATION
:
:This may require significant changes to the regex engine, which is a topic
:on which I am not qualified to speak. Could someone with more
:knowledge/experience please chime in?

Currently the regexp compiler is handed a string in which $variables have already been interpolated. We'd need to avoid that and get either the raw data for the string or some list that has undergone a minimum of preparation. It is possible we need that anyway - it is a prerequisite for some of the other proposed enhancements (such as the meta-referred-to RFC 112) and would certainly make the regexp engine more flexible - but it is certainly substantial work. I don't know what gotchas may arise. In general it seems a shame to recreate large parts of the existing string parsing/interpolation code, but it may not be possible to avoid it.

Changing the lifetime of backreferences feels likely to be difficult, but it isn't clear to me what you are trying to achieve here. I think you at least need to add an example of how it would act under s///g and s///ge.

:=head1 REFERENCES
:
:RFC 112: Assignment within a regex
:
:RFC 276: Localising Paren Counts in qr()s.

I didn't see a mention of these in the body of the proposal.

To me, the prime issue is with \1. The backslash is heavily overloaded in perl, and that makes it difficult to suggest a consistent and legible extension that would allow us to refer back to either variables (RFC 112) or hash keys (RFC 150). I don't think switching to $1 is any help for those, though.

Hugo
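As a rough illustration of the current behaviour discussed in this message, the sketch below shows two things in today's Perl: a variable is interpolated before the pattern is compiled, and $1 is set afresh for each iteration of s///g and s///ge. It does not show how the proposed syntax would behave; the variable names and sample strings are assumptions for the example.

```perl
use strict;
use warnings;

# Interpolation happens first, then compilation: $re is fixed once
# qr// has run, whatever happens to $var afterwards.
my $var = 'x+';
my $re  = qr/^$var$/;
$var    = 'y+';

my $matches_x = ('xxx' =~ $re) ? 1 : 0;
my $matches_y = ('yyy' =~ $re) ? 1 : 0;

# Under /g the replacement is re-evaluated for every match, so $1
# holds the capture from the current iteration each time.
(my $wrapped = 'a1b2') =~ s/(\d)/<$1>/g;

# Under /ge the replacement is evaluated as code for every match.
(my $doubled = 'a1b2') =~ s/(\d)/$1 * 2/ge;

print "$matches_x $matches_y $wrapped $doubled\n";
```

Any change to the lifetime of backreferences would have to preserve, or deliberately redefine, this per-iteration behaviour of $1 on the right-hand side.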
Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations
On Thu, Sep 28, 2000 at 08:57:39PM -, Perl6 RFC Librarian wrote:

> ${P1} means what $1 currently means (first match in last regex)

I'm sorry that I don't have anything more constructive to say than "ick", but ... Ick.

Well, maybe I do. Forget $P1. If the user wanted $1 from the previous RE, then they should have saved it somewhere. This would eliminate the "major" RE-engine changes to make $P1 work. But it would require that the p52p6 translator make some really smart modifications.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]
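The "save it somewhere" approach can be sketched in current Perl as follows. The sample $line, the first pattern, and the name $class are assumptions made for illustration; this is one possible way to do it, not a description of what a translator would emit.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $line = "<body class='note'><div class='note'>text</div></body>";

# A match in list context returns its captures, so the value can be
# copied into a named lexical instead of relying on $1 surviving.
my ($class) = $line =~ /<body class='(\w+)'>/i;

my $tag;
if (defined $class
    && $line =~ /<(P|DIV|SPAN) class='\Q$class\E'>.*?<\/\1>/i) {
    $tag = $1;    # group 1 of the second pattern
}

print defined $tag ? "$tag\n" : "no match\n";
```

With the earlier capture held in $class, the second pattern no longer needs any notation for "a capture from a previous regex".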
RFC 331 (v1) Consolidate the $1 and C<\1> notations
This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 331
  Version: 1
  Status: Developing

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a
regex. Let's consolidate them together, eliminate the differences, and
settle on $1 as the standard.

=head1 DESCRIPTION

Note: For convenience, I am going to talk about C<\1> and $1 in this RFC.
In actuality, these notations extend indefinitely: C<\1..\n> and
C<$1..$n>. Take it as read that anything which applies to $1 also applies
to C<$2, $3>, etc.

In current versions of Perl, C<\1> means "whatever was matched by the
first set of grouping parens I<in this regex>." $1 means "whatever was
matched by the first set of grouping parens I<in the previous regex>."
For example:

=over 4

=item *

/(foo)_$1_bar/

=item *

/(foo)_C<\1>_bar/

=back

mean different things: the second will match 'foo_foo_bar', while the
first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was
captured in the B<previous> match...which could be a long, long way away,
possibly even in some module that you didn't even realize you were
including (because it was included by a module that was included by a
module that was included by a...).

Probably the primary reason for this distinction is the following:

=over 4

=item *

s/(foo)C<\1>/$1bar/ # changes "foofoo" to "foobar"

=back

The key fact here is that, in the first section of a s/// you are supposed
to use C<\1>, but in the second portion you are supposed to use $1. If
you understand the whole logical structure behind it and understand how an
s/// works (i.e., the right hand side of an s/// is a double-quoted
string, not a regex), you will understand the distinction. For newbies,
however, it is apt to be quite confusing.

Aside from this confusion is the fact that, in general, when you use a
backreference you want it to refer to something that you just
matched...i.e., something from this regex.

To resolve all these issues, let's remove the C<\1> notation and
consolidate meanings as follows:

=over 4

=item *

C<\1> goes away as a special form

=item *

$1 means what C<\1> currently means (first match in this regex)

=item *

${1} is the same as $1 (first match in this regex)

=item *

${P1} means what $1 currently means (first match in last regex)

=back

These changes eliminate a potential source of confusion, retain all
functionality, provide an easy migration path for P526, and the last
notation (${P1}) serves as a clear indicator that you are talking about
something from outside the current regex.

Using this new syntax, you could then write:

=over 4

=item *

s/(foo)$1/$1bar/ # changes "foofoo" to "foobar"

=item *

s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

=back

=head2 Updating $1...When should it happen?

After a regex is finished, it must update the ${Pn} variables so that the
next match can access them if desired (if we wanted to get really
pathological, we could have multidimensional access such as: ${P2,2} which
is the second capture from the second-to-most-recent regex. This would
seem to be a Bad Idea, however). This should not happen until after the
statement containing the regex is finished, in order that the $1 variables
on the right hand side of an s/// will still refer to the correct things.

=head1 IMPLEMENTATION

This may require significant changes to the regex engine, which is a topic
on which I am not qualified to speak. Could someone with more
knowledge/experience please chime in?

=head1 REFERENCES

RFC 112: Assignment within a regex

RFC 276: Localising Paren Counts in qr()s.

perlre manpage
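The current (Perl 5) behaviour that the RFC describes can be sketched as follows. The earlier match against 'XYZ' and all variable names are assumptions made for the example; the sketch shows existing semantics only, not the proposed syntax.

```perl
use strict;
use warnings;

my ($dollar_hit, $dollar_miss);

{
    # An earlier, unrelated match leaves "XYZ" in $1 (hypothetical data).
    'XYZ' =~ /(\w+)/ or die "setup match failed";
    # $1 is interpolated first, so the pattern compiled is (foo)_XYZ_bar.
    $dollar_hit = ('foo_XYZ_bar' =~ /(foo)_$1_bar/) ? 1 : 0;
}

{
    'XYZ' =~ /(\w+)/ or die "setup match failed";
    # Same compiled pattern, so 'foo_foo_bar' does not match.
    $dollar_miss = ('foo_foo_bar' =~ /(foo)_$1_bar/) ? 1 : 0;
}

# \1 refers to group 1 of the pattern it appears in.
my $slash_hit = ('foo_foo_bar' =~ /(foo)_\1_bar/) ? 1 : 0;

# In s///, \1 is used on the pattern side and $1 on the replacement side.
(my $s = 'foofoo') =~ s/(foo)\1/$1bar/;

print "$dollar_hit $dollar_miss $slash_hit $s\n";
```

Each setup match sits in its own block so that the $1 seen by the following pattern is the one from 'XYZ' rather than one left over from a previous test.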