Re: RFC 331 (v2) Consolidate the $1 and C<\1> notations

2000-10-03 Thread Dave Storrs



On Mon, 2 Oct 2000, Bart Lateur wrote:

 On Mon, 2 Oct 2000 12:46:06 -0700 (PDT), Dave Storrs wrote:
 
  Well, the main reason is that @/ worked best for my particular
 brain.
 
 But you cannot use it in an ordinary regex, can you? There's no way you
 can put $/[1] between slashes in s/.../.../. Backslashing it doesn't
 work.

True...which means that either perl does Deep Magic to allow it (a
solution I don't like) or (the solution I DO like) the programmer uses
different delimiters on pattern matches that will contain the @/
variable...which is a good hint that something unusual is happening, which
is a good thing.

 @&
 wouldn't be quite the right match...after all, $& contains the _string_
 
 No, but it's closer. $& is closer in meaning to $1 than is, for example,
 $/. *Much* closer.

Hmmm...I see your point.  I've frozen the RFC as per
deadline...Nate, is it too late for me to make a minor semantic change and
rename a proposed variable?

Dave




RFC 331 (v2) Consolidate the $1 and C<\1> notations

2000-10-01 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs [EMAIL PROTECTED]
  Date: 28 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 331
  Version: 2
  Status: Frozen

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a
regex.  It is possible to consolidate them without losing any
functionality and, in the process, we gain intuitiveness.

=head1 CHANGES

v1-v2:  
A major rewrite:

=over 4

=item *
Reformatted the argument into "The Problem" and "The Solution" sections

=item *
Added "Some Examples" section

=item *
Added "Why do this?" section

=item *
Added "P526 migration" section

=item *
Proposed the @/ variable

=item *
Various trivial edits and typo-fixes

=back


=head1 DESCRIPTION

Note:  For convenience, I am going to talk about C<\1> and $1 in this RFC.
In actuality, these notations extend indefinitely:  C<\1..\n> and
C<$1..$n>.  Take it as read that anything which applies to $1 also applies
to C<$2, $3>, etc.


=head2 The Problem

In current versions of Perl, C<\1> and C<$1> mean different things.
Specifically, C<\1> means "whatever was matched by the first set of
grouping parens I<in this regex match>."  $1 means "whatever was matched
by the first set of grouping parens I<in the previously-run regex match>."
For example:

=over 4

=item *
C</(foo)_$1_bar/>

=item *
C</(foo)_\1_bar/>

=back

the second will match 'foo_foo_bar', while the first will match
'foo_[SOMETHING]_bar' where [SOMETHING] is whatever was captured in the
B<previous> match...which could be a long, long way away, possibly even in
some module that you didn't even realize you were including (because it
was included by a module that was included by a module that was included
by a...).
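The difference can be observed in present-day Perl 5 with a minimal sketch like the one below. The helper name C<try_both> and the stand-in "earlier match" against "xyz" are illustrative assumptions of this sketch, not part of the proposal:

```perl
use strict;
use warnings;

# Returns (matched_with_dollar_form, matched_with_backslash_form) for $text.
# try_both is a hypothetical helper used only for this illustration.
sub try_both {
    my ($text) = @_;

    # Stand-in for some earlier, unrelated match; it leaves "xyz" in $1.
    "xyz" =~ /(\w+)/ or die "setup match failed";

    # $1 is interpolated before matching, so this is really /(foo)_xyz_bar/.
    my $with_dollar = $text =~ /(foo)_$1_bar/ ? 1 : 0;

    # \1 refers to the first group of this same match.
    my $with_backslash = $text =~ /(foo)_\1_bar/ ? 1 : 0;

    return ($with_dollar, $with_backslash);
}

printf "%-12s dollar=%d backslash=%d\n", $_, try_both($_)
    for "foo_foo_bar", "foo_xyz_bar";
```

The backslash form is tried second on purpose: a successful match with the dollar form overwrites $1, which would not affect C<\1> but would affect any later interpolation.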

The primary reason for this distinction is s///, in which the left hand
side is a pattern while the right hand side is a string (assuming no 'e'
modifier).  Therefore:

=over 4

=item *
C<s/(foo)$1/$1bar/> # changes "foo???" to "foobar" where ??? is from the
last match

=item *
C<s/(foo)\1/$1bar/> # changes "foofoo" to "foobar"

=back

Note that, in the first example, the two $1s refer to different things,
whereas in the second example, $1 and C<\1> refer to the same thing.  This
is counterintuitive and non-Perlish; Perl should be intuitive and DWIMish.
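As a rough Perl 5 illustration of the two substitutions, assuming an earlier match has left "baz" in $1 (the helper name C<substitute_both> and the string "baz" are made up for this sketch):

```perl
use strict;
use warnings;

# Applies each substitution to its own copy of $text and returns both results.
# substitute_both is a hypothetical helper used only for this illustration.
sub substitute_both {
    my ($text) = @_;

    # Stand-in for an earlier match; it leaves "baz" in $1.
    "baz" =~ /(\w+)/ or die "setup match failed";

    # Pattern side: $1 is the old capture ("baz").
    # Replacement side: $1 is the new capture ("foo").
    (my $with_dollar = $text) =~ s/(foo)$1/$1bar/;

    # Both sides refer to the capture made by this substitution.
    (my $with_backslash = $text) =~ s/(foo)\1/$1bar/;

    return ($with_dollar, $with_backslash);
}

for my $input ("foobaz", "foofoo") {
    my ($dollar, $backslash) = substitute_both($input);
    print "$input: dollar form gives $dollar, backslash form gives $backslash\n";
}
```

With "foobaz" only the dollar form rewrites the string, and with "foofoo" only the backslash form does, even though the replacement side is written identically in both.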

A separate, though less important, problem with the way backreferences are
currently implemented is that it is difficult for a human to tell at a
glance whether \10 means "escape character 10" or "backreference 10"...the
only way to tell is to count the number of captured elements and see if
there actually are ten of them, in which case \10 is a backreference and
otherwise it is an escape character.  In general, this isn't a problem
because most patterns don't have ten sets of capturing parens.
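A small sketch of the \10 ambiguity in Perl 5 follows. The test strings are arbitrary choices for this illustration; the "escape character" reading is the octal escape for chr(8), a backspace:

```perl
use strict;
use warnings;

# One capturing group: \10 is read as the octal escape for chr(8) (backspace).
my $octal_hit = "a\x08" =~ /(a)\10/ ? 1 : 0;

# Ten capturing groups: \10 is read as a backreference to the tenth group,
# which captured "j", so the string must end in "jj".
my $ten_groups  = qr/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/;
my $backref_hit = "abcdefghijj" =~ $ten_groups ? 1 : 0;

# With ten groups, a trailing backspace no longer matches \10.
my $backspace_hit = "abcdefghij\x08" =~ $ten_groups ? 1 : 0;

print "octal=$octal_hit backref=$backref_hit backspace=$backspace_hit\n";
```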


=head2 The Solution

Ok, so the problem is that $1 and C<\1> are counterintuitive.  How do we
make them intuitive without losing any functionality?

First, let's get rid of the C<\1> form for backreferences.

Second, let's say that $n refers to the nth captured subelement of the
pattern match which occurred in this B<statement>--note that this is
distinct from "in this pattern match."  That means that, in
C<s/(foo)$1/$1bar/>, both $1s refer to the same thing (the string 'foo'),
even though one of them occurred inside a pattern and one occurred inside a
string.  (See note [1] in the IMPLEMENTATION section.)

Third, let's create a new special variable, @/ (mnemonic: the / is the
default delimiter for a pattern match; if the English module remains
extant, then @/ could have the long name of @LAST_MATCH, but there are
currently several threads concerning removal of the English module). Much
like the current C<$1, $2...> variables, this array will only be created
(and hence the speed price will only be paid) if you access its members.
The 0th element of @/ will contain the qr()d form of the last pattern
match, while successive elements refer to the captured subelements.

Fourth, let's change when we update the variables which store the captures
(the current C<$1, $2>, etc).  @/ will only be updated when the entire
statement which contains a pattern match has finished running (e.g., when
the entire s/// is completed), rather than as soon as the pattern match is
done (and therefore before the substitution happens).  
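The proposed layout of @/ (compiled pattern in element 0, captures after it) can be approximated in Perl 5 for experimentation. This is only a sketch: the helper name C<last_match> is hypothetical, it returns an ordinary list rather than setting a special variable, and it does not model the lazy creation or the end-of-statement update described above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list shaped like the proposed @/,
# or an empty list if the match fails. Assumes $pattern is a qr// object
# containing at least one capturing group (without groups, a successful
# list-context match returns (1) rather than captures).
sub last_match {
    my ($string, $pattern) = @_;
    my @captures = $string =~ $pattern
        or return;
    return ($pattern, @captures);
}

my @m = last_match("Samwise Gamgee", qr/((\w+)\s+(\w+))/);
print "whole=$m[1] first=$m[2] second=$m[3]\n";

# Element 0 is the compiled pattern, so it can be reused later.
print "reused pattern matches\n" if "Frodo Baggins" =~ $m[0];
```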


=head2 Some Examples

=over 4

=item 1
If you did the following:

C<"Bilbo Baggins" =~ /((\w+)\s+(\w+))/>

Then @/ would contain the following:

C<$/[0]> the compiled equivalent of C</((\w+)\s+(\w+))/>,

C<$/[1]> the string "Bilbo Baggins"

C<$/[2]> the string "Bilbo"

C<$/[3]> the string "Baggins"

Note that after the match, C<$/[1]>, C<$/[2]>, and C<$/[3]> contain
exactly what C<$1>, C<$2>, and C<$3> would contain with present-day syntax.
Furthermore, the compiled form of the match is available so if you want to
repeat the match later (or insert it into a larger regex), you can