Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-10-06 Thread Bennett Todd

I can see the motivation for wanting this, but there's a cost I
haven't read anyone mentioning yet: this is abandoning backward
compatibility with a regex notation that has remained pretty
consistent in ed(1) and grep(1) and things inspired by them since I
guess the early '70s, when they were born.

It may be we want to pay that price, but do please comment on it in
the RFC, it needs to be on the table. Perl is a prodigiously rich
language, and in large part it avoids the problem made famous by
PL/1, of being nearly impossible to learn completely, by the amount
that it borrows from other familiar tools.

Up until now, a wizard with regexps can carry all their knowledge
forward; everything they've done in the past keeps working
identically, and regexps composed for most other engines work fine
in perl. There are always some differences between engines; some
require (...) for grouping and/or tagging, others require \(...\);
some support "|" for alternation, some support "+" for "one or
more", some support "?" for "zero or one", and like that. And of
course perl has some heroic extensions. But using \1 for backrefs
within a pattern match has been standard; if an engine can match the
likes of beriberi and murmur, it might demand ^\(.*\)\1$ or it might
prefer ^(.*)\1$, but the \1 at least has been consistent. The
further we wander from supporting the same regexp languages (as a
subset of ours) as the other popular engines, the more support work
we're making for ourselves.
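
As a concrete check of the repeated-word pattern mentioned above, here is a minimal Perl 5 sketch using the ^(.*)\1$ form. The word list is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical word list; only the first two consist of a repeated half.
my @words = qw(beriberi murmur banana);

# \1 must match exactly the text that group 1 just captured.
my @doubled = grep { /^(.*)\1$/ } @words;

print "@doubled\n";
```

In an engine that uses \(...\) for grouping, the same test is written ^\(.*\)\1$, as noted above; only the grouping syntax differs, not the \1.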

-Bennett



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Dave Storrs



On Sat, 30 Sep 2000, Bart Lateur wrote:

> I wrote this before, but apparently you didn't hear it. Let me repeat:

You're right, I missed your email when I was incorporating things
into the new version.  Apologies.


> $foo on the LHS allows metacharacter matching, for example "a.*b" can
> match "a foo b". But \1 only allows literal strings. If $1 captured

I don't believe it matters...my version of $1 works exactly like
the current \1 and my $/[1] works exactly like the current $1.  

Dave




Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Bart Lateur

On 28 Sep 2000 20:57:39 -, Perl6 RFC Librarian wrote:

>Currently, C<\1> and $1 have only slightly different meanings within a
>regex.  Let's consolidate them together, eliminate the differences, and
>settle on $1 as the standard.

I wrote this before, but apparently you didn't hear it. Let me repeat:
$foo on the LHS allows metacharacter matching, for example "a.*b" can
match "a foo b". But \1 only allows literal strings. If $1 captured
"a.*b", then \1 will only match the literal string "a.*b", as if the
regex contained "a\.\*b".

I don't see how you can possibly consider this a "tiny difference".
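
The difference described here can be checked in current Perl 5 with a short sketch. The variable names and test strings are assumptions chosen for illustration; $captured stands in for text that an earlier match left in $1.

```perl
use strict;
use warnings;

my $captured = 'a.*b';   # stand-in for text captured by an earlier match

# Interpolated into a new pattern, the text is compiled as regex syntax,
# so ".*" acts as a wildcard.
my $as_pattern = ('a foo b' =~ /^$captured$/) ? 1 : 0;

# As a backreference, the captured text must reappear character for character.
my $backref_vs_other   = ('a.*b a foo b' =~ /^(a\.\*b) \1$/) ? 1 : 0;
my $backref_vs_literal = ('a.*b a.*b'    =~ /^(a\.\*b) \1$/) ? 1 : 0;

# \Q...\E makes an interpolated variable behave like the backreference does.
my $quoted = ('a foo b' =~ /^\Q$captured\E$/) ? 1 : 0;

print "$as_pattern $backref_vs_other $backref_vs_literal $quoted\n";
```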

-- 
Bart.



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-29 Thread Dave Storrs



On Fri, 29 Sep 2000, Hildo Biersma wrote:

> > Currently, C<\1> and $1 have only slightly different meanings within a
> > regex.  Let's consolidate them together, eliminate the differences, and
> > settle on $1 as the standard.
> 
> Sigh.  That would remove functionality from the language.
> 
> The reason why you need \1 in a regular expression is that $1, $2, ...
> are interpolated from the previous regular expression.  This allows me
> to do a pattern match that captures variables, then use the results of
> that to create a second regular expression. (Remember: A regexp
> interpolates first, then compiles the pattern).


Umm...with all due respect, did you read the RFC?  Because what I
proposed does not eliminate any functionality.  

Dave




Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-29 Thread Dave Storrs



On Thu, 28 Sep 2000, Hugo wrote:

> :=item *
> :/(foo)_C<\1>_bar/
> 
> Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
> don't insert C<> in the middle: that makes it much more difficult to
> read.

Sorry; that was a global-replace error that I missed on
proofreading.

 
> :mean different things:  the second will match 'foo_foo_bar', while the
> :first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was
> 
> should be: foo_[SOMETHING]_bar

Um, yeah, it should...(jeez...I proofed this like three times,
honest!)  *blush*

 
> :captured in the B<previous> match...which could be a long, long way away,
> 
> This seems a bit unfair. It is just another variable. Any variable
> you include in a pattern, you are assumed to know that it contains
> the intended value - there is nothing special about $1 in this regard.

Fair enough; the point I was trying to make was that \1 was
captured right here, while $1 was captured long, long ago in a pattern
match far, far away. The visual/cognitive difference is small, but the
programming difference is huge.


> :=item *
> :${P1} means what $1 currently means (first match in last regex)
> 
> Do you understand that this is the same variable as $P1? Traditionally,
> perl very rarely coopts variable names that start with alphanumerics,
> and (off the top of my head) all the ones it does so coopt are letters
> only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
> extend that to all $P1-style variables.

I do understand that, and I agree with your concern.  Actually, I
didn't think that ${P1} was a particularly good notation even as I was
suggesting it...I just wanted to get the RFC up there before the deadline
so that people could discuss it.

Having now thought about it more, I think that (?P1) is
better...in other words, make references to the previous pattern match be
a regex _extension_, not a core feature (if that's a valid way to phrase
the distinction).


> What is the migration path for existing uses of $P1-style variables?

Wherever p526 sees a pattern that contains a $1, it should replace
it with (?P1).

 

> :=item *
> :s/(bar)(bell)/${P1}$2/   # changes "barbell" to "foobell"
> 
> Note that in the current regexp engine, ${P1} has disappeared by the
> time matching starts. Can you explain why we need to change this?
> Note also that if you are sticking with ${P1} either we need to
> rename all existing user variables of this form, or we can no longer
> use the existing 'interpolate this string' (or eval, double-eval etc)
> routines, and have to roll our own for this (these) as well.

I'm a bit confused by the way this came out but, if I understand
what you're asking, then I believe your concerns are solved by the new
proposed syntax.  Am I right?


> :This may require significant changes to the regex engine, which is a topic
> :on which I am not qualified to speak.  Could someone with more
> :knowledge/experience please chime in?
> 
> Currently the regexp compiler is handed a string in which $variables
> have already interpolated. [...]

I know there are certain exceptions to this...my Camel III says
(something to the effect of--I don't have it in front of me) "if there is
any doubt as to whether something should be interpolated or left for the
Engine, it will be left for the Engine."

In any case, I don't think this needs to change.  I'm simply
changing what the names of the variables and backreferences are...\1
becomes (the new) $1, and (the current) $1 becomes (?P1)

> Changing the lifetime of backreferences feels likely to be difficult,
> but it isn't clear to me what you are trying to achieve here. I think
> you at least need to add an example of how it would act under s///g
> and s///ge.

Good point.  I'll do that.

> :RFC 276: Localising Paren Counts in qr()s.
> 
> I didn't see a mention of these in the body of the proposal.

276 is rather tangentially related, I grant.  However, I felt that
if my proposal went forward, it could impact on how 276 was implemented,
so I crossreferenced to it.

Dave 




Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-29 Thread Hildo Biersma

 
> =head1 ABSTRACT
> 
> Currently, C<\1> and $1 have only slightly different meanings within a
> regex.  Let's consolidate them together, eliminate the differences, and
> settle on $1 as the standard.

Sigh.  That would remove functionality from the language.

The reason why you need \1 in a regular expression is that $1, $2, ...
are interpolated from the previous regular expression.  This allows me
to do a pattern match that captures variables, then use the results of
that to create a second regular expression. (Remember: A regexp
interpolates first, then compiles the pattern).

To come up with a silly example:

if ($line =~ //i) {
  if ($line =~ /<(P|DIV|SPAN) class='$1'>.*?<\/\1>/i) {
                                     ^^
                   The class from the previous regexp
    ...
  }
}

If we implement this RFC, this would no longer be possible without the
use of an extra variable to store the first $1.  Interpolation for
regular expressions would no longer work the same as it is for
double-quoted strings.
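
A self-contained sketch of this two-step use in current Perl 5 might look as follows. The input line, the body tag, and the class name are invented for illustration and are not taken from the original example.

```perl
use strict;
use warnings;

# Hypothetical input; the markup and the class name are made up.
my $line  = q{<body class='note'><DIV class='note'>text</DIV></body>};
my $found = '';

if ($line =~ /<body class='(\w+)'>/i) {
    # $1 is interpolated before the next pattern is compiled, so it still
    # holds 'note' from the match above. \1 refers to the new pattern's
    # own first group.
    if ($line =~ /<(P|DIV|SPAN) class='$1'>(.*?)<\/\1>/i) {
        $found = "$1:$2";   # now $1 and $2 come from the inner match
    }
}

print "$found\n";
```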

Hildo



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-29 Thread Piers Cawley

Jonathan Scott Duff <[EMAIL PROTECTED]> writes:

> On Thu, Sep 28, 2000 at 08:57:39PM -, Perl6 RFC Librarian wrote:
> > ${P1} means what $1 currently means (first match in last regex)
> 
> I'm sorry that I don't have anything more constructive to say than
> "ick", but ... Ick.

I'm with the 'Ick' camp too. And possibly with the 'Leave it the hell
alone! If you're that bloody stupid you deserve to lose' camp too.

-- 
Piers




Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Nathan Wiger

> =item *
> C<\1> goes away as a special form
> 
> =item *
> $1 means what C<\1> currently means (first match in this regex)
> 
> =item *
> ${1} is the same as $1 (first match in this regex)
> 
> =item *
> ${P1} means what $1 currently means (first match in last regex)

Here's the big problem with this, and I think others have said it
similarly: If we need the functionality of both \1 and $1, then there is
no reason to redo the syntax. Period.

If \1 is unneeded, then let's ditch it and just use $1 everywhere.
However, this is not the case, as Randal, Bart, and others have shown.

If we need \1, then we should leave as-is. There's no reason to force
literally millions of people to relearn this. Renaming something just to
rename it does not add value.

-Nate



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Hugo

:=item *
:/(foo)_$1_bar/
:
:=item *
:/(foo)_C<\1>_bar/

Please don't do this: write C</(foo)_\1_bar/> or /(foo)_\1_bar/, but
don't insert C<> in the middle: that makes it much more difficult to
read.

:mean different things:  the second will match 'foo_foo_bar', while the
:first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was

should be: foo_[SOMETHING]_bar

:captured in the B<previous> match...which could be a long, long way away,
:possibly even in some module that you didn't even realize you were
:including (because it was included by a module that was included by a
:module that was included by a...). 

This seems a bit unfair. It is just another variable. Any variable
you include in a pattern, you are assumed to know that it contains
the intended value - there is nothing special about $1 in this regard.

:The key fact here is that, in the first section of a s/// you are supposed
:to use C<\1>, but in the second portion you are supposed to use $1.  If
:you understand the whole logical structure behind it and understand how an
:s/// works (i.e., the right hand side of an s/// is a double-quoted
:string, not a regex), you will understand the distinction.  For newbies,
:however, it is apt to be quite confusing.

I think the whole idea that the LHS of s/// is a pattern, but the
RHS is a string (modulo /e, of course) is apt to be confusing when
you first encounter it. You won't be able to make sense of any but
the simplest use of s/// until you understand it, I think, and the
documentation expresses it quite clearly.

:=item *
:${P1} means what $1 currently means (first match in last regex)

Do you understand that this is the same variable as $P1? Traditionally,
perl very rarely coopts variable names that start with alphanumerics,
and (off the top of my head) all the ones it does so coopt are letters
only (ARGV, AUTOLOAD, STDOUT etc). I think we need better reasons to
extend that to all $P1-style variables.

If you are suggesting that they should have a special meaning only
in regexps, and only if braced, then I'd find it even more confusing.
The use of braces is usually the easiest (and only?) way to split
out a variable from following alphanumerics:
  /foo${P1}bar/

:These changes eliminate a potential source of confusion, retain all
:functionality, provide an easy migration path for P526, and the last
:notation (${P1}) serves as a clear indicator that you are talking about
:something from outside the current regex.

What is the migration path for existing uses of $P1-style variables?

:=item *
:s/(bar)(bell)/${P1}$2/ # changes "barbell" to "foobell"

Note that in the current regexp engine, ${P1} has disappeared by the
time matching starts. Can you explain why we need to change this?
Note also that if you are sticking with ${P1} either we need to
rename all existing user variables of this form, or we can no longer
use the existing 'interpolate this string' (or eval, double-eval etc)
routines, and have to roll our own for this (these) as well.

:=head1 IMPLEMENTATION
:
:This may require significant changes to the regex engine, which is a topic
:on which I am not qualified to speak.  Could someone with more
:knowledge/experience please chime in?

Currently the regexp compiler is handed a string in which $variables
have already interpolated. We'd need to avoid that and get either
the raw data for the string or some list that has undergone a
minimum of preparation. It is possible we need that anyway - it is
a prerequisite for some of the other proposed enhancements (such as
the meta-referred-to RFC 112) and would certainly make the regexp
engine more flexible - but it is certainly substantial work. I don't
know what gotchas may arise. In general it seems a shame to recreate
large parts of the existing string parsing/interpolation code, but
it may not be possible to avoid it.

Changing the lifetime of backreferences feels likely to be difficult,
but it isn't clear to me what you are trying to achieve here. I think
you at least need to add an example of how it would act under s///g
and s///ge.

:=head1 REFERENCES
:
:RFC 112: Assignment within a regex
:
:RFC 276: Localising Paren Counts in qr()s.

I didn't see a mention of these in the body of the proposal.

To me, the prime issue is with \1. The backslash is heavily overloaded
in perl, and that makes it difficult to suggest a consistent and
legible extension that would allow us to refer back to either variables
(RFC 112) or hash keys (RFC 150). I don't think switching to $1 is any
help for those, though.

Hugo



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Jonathan Scott Duff

On Thu, Sep 28, 2000 at 08:57:39PM -, Perl6 RFC Librarian wrote:
> ${P1} means what $1 currently means (first match in last regex)

I'm sorry that I don't have anything more constructive to say than
"ick", but ... Ick.

Well, maybe I do.   Forget $P1.  If the user wanted $1 from the
previous RE, then they should have saved it somewhere.  This would
eliminate the "major" RE-engine changes to make $P1 work.  But it
would require that the p52p6 translator make some really smart
modifications.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-28 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number:  331
  Version: 1
  Status: Developing

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a
regex.  Let's consolidate them together, eliminate the differences, and
settle on $1 as the standard.

=head1 DESCRIPTION

Note:  For convenience, I am going to talk about C<\1> and $1 in this RFC.
In actuality, these notations extend indefinitely:  C<\1..\n> and
C<$1..$n>.  Take it as read that anything which applies to $1 also applies
to C<$2, $3>, etc.


In current versions of Perl, C<\1> means "whatever was matched by the
first set of grouping parens I<in this regex>."  $1 means "whatever
was matched by the first set of grouping parens I<in the previous
regex>."  For example:

=over 4

=item *
/(foo)_$1_bar/

=item *
/(foo)_C<\1>_bar/

=back

mean different things:  the second will match 'foo_foo_bar', while the
first will match 'foo[SOMETHING]bar' where [SOMETHING] is whatever was
captured in the B<previous> match...which could be a long, long way away,
possibly even in some module that you didn't even realize you were
including (because it was included by a module that was included by a
module that was included by a...). 
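
To make the current behaviour concrete, here is a minimal Perl 5 sketch. The setup match and the test strings are assumptions added for illustration, and ${1} is written with braces only to keep the variable name visually separate from the following text.

```perl
use strict;
use warnings;

# Setup: an earlier, unrelated match that leaves $1 equal to 'y'.
'xyz' =~ /(y)/ or die "setup match failed";

# $1 is interpolated from the earlier match before this pattern is
# compiled, so the pattern that actually runs is /(foo)_y_bar/.
my $via_dollar = ('foo_y_bar' =~ /(foo)_${1}_bar/) ? 1 : 0;

# \1 refers to group 1 of the pattern it appears in.
my $via_backslash      = ('foo_foo_bar' =~ /(foo)_\1_bar/) ? 1 : 0;
my $backslash_on_other = ('foo_y_bar'   =~ /(foo)_\1_bar/) ? 1 : 0;

print "$via_dollar $via_backslash $backslash_on_other\n";
```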

Probably the primary reason for this distinction is the following:

=over 4

=item *
s/(foo)C<\1>/$1bar/ # changes "foofoo" to "foobar"

=back

The key fact here is that, in the first section of a s/// you are supposed
to use C<\1>, but in the second portion you are supposed to use $1.  If
you understand the whole logical structure behind it and understand how an
s/// works (i.e., the right hand side of an s/// is a double-quoted
string, not a regex), you will understand the distinction.  For newbies,
however, it is apt to be quite confusing.
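
The split between the two halves of an s/// can be seen in a short Perl 5 sketch of the current behaviour. The $suffix variable is hypothetical and is there only to show that the right-hand side interpolates like any double-quoted string.

```perl
use strict;
use warnings;

my $text   = 'foofoo';
my $suffix = 'bar';   # hypothetical variable, to show string interpolation

# Left side: a pattern, so the repeat of group 1 is written \1.
# Right side: a double-quoted string, so group 1 is written $1 (here ${1}),
# and ordinary variables such as $suffix interpolate as well.
$text =~ s/(foo)\1/${1}$suffix/;

print "$text\n";
```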

Aside from this confusion is the fact that, in general, when you use a
backreference you want it to refer to something that you just
matched...i.e., something from this regex.

To resolve all these issues, let's remove the C<\1> notation and
consolidate meanings as follows:

=over 4

=item *
C<\1> goes away as a special form 

=item *
$1 means what C<\1> currently means (first match in this regex)

=item *
${1} is the same as $1 (first match in this regex)

=item *
${P1} means what $1 currently means (first match in last regex)

=back

These changes eliminate a potential source of confusion, retain all
functionality, provide an easy migration path for P526, and the last
notation (${P1}) serves as a clear indicator that you are talking about
something from outside the current regex.

Using this new syntax, you could then write:

=over 4

=item *
s/(foo)$1/$1bar/        # changes "foofoo" to "foobar"

=item *
s/(bar)(bell)/${P1}$2/  # changes "barbell" to "foobell"

=back

=head2 Updating $1...When should it happen?

After a regex is finished, it must update the ${Pn} variables so that the
next match can access them if desired (if we wanted to get really
pathological, we could have multidimensional access such as:  ${P2,2}
which is the second capture from the second-to-most-recent regex.  This
would seem to be a Bad Idea, however).  This should not happen until after
the statement containing the regex is finished, in order that the $1
variables on the right hand side of an s/// will still refer to the
correct things.

=head1 IMPLEMENTATION

This may require significant changes to the regex engine, which is a topic
on which I am not qualified to speak.  Could someone with more
knowledge/experience please chime in?

=head1 REFERENCES

RFC 112: Assignment within a regex

RFC 276: Localising Paren Counts in qr()s.

perlre manpage