RFC 360 (v1) Allow multiply matched groups in regexes to return a listref of all matches

2000-09-30 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Allow multiply matched groups in regexes to return a listref of all matches

=head1 VERSION

  Maintainer: Kevin Walker <[EMAIL PROTECTED]>
  Date: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 360
  Version: 1
  Status: Developing

=head1 DESCRIPTION

Since the October 1 RFC deadline is nigh, this will be pretty informal.

Suppose you want to parse text which looks like:

 name: John Abajace
 children: Tom, Dick, Harry
 favorite colors: red, green, blue

 name: I. J. Reilly
 children: Jane, Gertrude
 favorite colors: black, white
 
 ...

Currently, this takes two passes:

 while ($text =~ /name:\s*(.*?)\n\s*
                  children:\s*(.*?)\n\s*
                  favorite\ colors:\s*(.*?)\n/sigx) {
     # now second pass for $2 ( = "Tom, Dick, Harry") and $3, yielding
     # list of children and favorite colors
 }
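As a rough illustration of the two-pass approach in present-day Perl 5, the sketch below runs the match and then splits the captured strings in a second step. The variable names and the C<split> pattern are assumptions made for this example, not part of the RFC.

```perl
use strict;
use warnings;

# Sample input in the format described above.
my $text = <<'END';
name: John Abajace
children: Tom, Dick, Harry
favorite colors: red, green, blue

name: I. J. Reilly
children: Jane, Gertrude
favorite colors: black, white
END

my @records;
while ($text =~ /name:\s*(.*?)\n\s*
                 children:\s*(.*?)\n\s*
                 favorite\ colors:\s*(.*?)\n/sigx) {
    # First pass: $2 and $3 are still comma-separated strings.
    my ($name, $children, $colors) = ($1, $2, $3);

    # Second pass: break each string into a list.
    push @records, {
        name     => $name,
        children => [ split /\s*,\s*/, $children ],
        colors   => [ split /\s*,\s*/, $colors ],
    };
}

for my $rec (@records) {
    print "$rec->{name}: @{ $rec->{children} } / @{ $rec->{colors} }\n";
}
```

The second pass is the part the proposed (?@ ... ) construction is meant to make unnecessary.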

If we introduce a new construction, (?@ ... ), which means "spit out a
list ref of all matches, not just the last match", then this could be
done in one pass:

 while ($text =~ /name:\s*(.*?)\n\s*
                  children:\s*(?:(?@[^,\s]+)[, ]*)*\n\s*
                  favorite\ colors:\s*(?:(?@[^,\s]+)[, ]*)*\n/sigx) {
     # now we have:
     #  $1 = "John Abajace";
     #  $2 = ["Tom", "Dick", "Harry"]
     #  $3 = ["red", "green", "blue"]
 }

Although the above example is contrived, I have very often felt the need
for this feature in real-world projects.

=head1 IMPLEMENTATION

Unknown.

=head1 REFERENCES

None.




RFC 347 (v2) Remove long-deprecated $* (aka $MULTILINE_MATCHING)

2000-09-30 Thread Perl6 RFC Librarian

=head1 TITLE

Remove long-deprecated $* (aka $MULTILINE_MATCHING)

=head1 VERSION

  Maintainer: Hugo van der Sanden <[EMAIL PROTECTED]>
  Date: 29 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 347
  Version: 2
  Status: Frozen

=head1 ABSTRACT

The magic $* variable (known in English as $MULTILINE_MATCHING)
has been deprecated for years. It is time to kill it.

=head1 DESCRIPTION

In days of yore, you would set $* to 1 to achieve in all regexps
the same as you can now achieve on a per-regexp basis with the
/m flag. Nowadays, when most perl programmers have never heard
of it, it is an accident waiting to happen and requires ugly
additional cruft for the defensive programmer to avoid.
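For readers who have never met $*, the per-regexp equivalent is easy to show. The following is a small sketch of what /m changes; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```

Setting $* to 1 used to apply the second behaviour to every regexp in the program, which is the action at a distance described below.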

The particular danger of $* is its 'action at a distance' effect:
as a global variable, its effect reaches into and out of scopes
that we normally expect to protect us.

=head1 MIGRATION

The long deprecation cycle helps here. p52p6 should complain and
die if it sees any attempt to set $* or $MULTILINE_MATCHING to a
non-zero value, or any attempt to alias it other than in English.
It should silently (or maybe with a warning) ignore any attempt to
set it to a zero value, and silently (or maybe with a warning)
replace any attempt to read it with a constant undef.

=head1 IMPLEMENTATION

This only simplifies the regexp engine, and should help fix some
longstanding bugs in the scope of /m. There is a bit of work to
do to extricate it, but nothing seriously difficult.

=head1 REFERENCES

perlvar manpage for discussion of $*




RFC 331 (v2) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Perl6 RFC Librarian

=head1 TITLE

Consolidate the $1 and C<\1> notations

=head1 VERSION

  Maintainer: David Storrs <[EMAIL PROTECTED]>
  Date: 28 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 331
  Version: 2
  Status: Frozen

=head1 ABSTRACT

Currently, C<\1> and $1 have only slightly different meanings within a
regex.  It is possible to consolidate them without losing any
functionality and, in the process, we gain intuitiveness.

=head1 CHANGES

v1->v2:  
A major rewrite:

=over 4

=item *
Reformatted the argument into "The Problem" and "The Solution" sections

=item *
Added "Some Examples" section

=item *
Added "Why do this?" section

=item *
Added "P526 migration" section

=item *
Proposed the @/ variable

=item *
Various trivial edits and typo-fixes

=back


=head1 DESCRIPTION

Note:  For convenience, I am going to talk about C<\1> and $1 in this RFC.
In actuality, these notations extend indefinitely:  C<\1..\n> and
C<$1..$n>.  Take it as read that anything which applies to $1 also applies
to C<$2, $3>, etc.


=head2 The Problem

In current versions of Perl, C<\1> and C<$1> mean different things.
Specifically, C<\1> means "whatever was matched by the first set of
grouping parens I<in this pattern match>."  $1 means "whatever was matched
by the first set of grouping parens I<in the previous pattern match>."
For example:

=over 4

=item *
C</(foo)_$1_bar/>

=item *
C</(foo)_\1_bar/>

=back

the second will match 'foo_foo_bar', while the first will match
'foo_[SOMETHING]_bar' where [SOMETHING] is whatever was captured in the
B<previous> match...which could be a long, long way away, possibly even in
some module that you didn't even realize you were including (because it
was included by a module that was included by a module that was included
by a...).
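The difference can be observed directly in Perl 5. The sketch below primes $1 with an unrelated match and then tries both notations; the strings and variable names are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;

my ($dollar_vs_previous, $dollar_vs_self, $backref_vs_self);

{
    "xyz" =~ /(xyz)/ or die "setup match failed";   # $1 is now 'xyz'
    # $1 is interpolated before matching, so the pattern is /(foo)_xyz_bar/.
    $dollar_vs_previous = ("foo_xyz_bar" =~ /(foo)_$1_bar/) ? 1 : 0;
}
{
    "xyz" =~ /(xyz)/ or die "setup match failed";
    # Same interpolated pattern, so 'foo_foo_bar' does not match.
    $dollar_vs_self = ("foo_foo_bar" =~ /(foo)_$1_bar/) ? 1 : 0;
}
{
    "xyz" =~ /(xyz)/ or die "setup match failed";
    # \1 refers to the group captured in this very match.
    $backref_vs_self = ("foo_foo_bar" =~ /(foo)_\1_bar/) ? 1 : 0;
}

print "\$1 against previous capture: $dollar_vs_previous\n";
print "\$1 against own capture:      $dollar_vs_self\n";
print "\\1 against own capture:      $backref_vs_self\n";
```

Each case sits in its own block because the capture variables are scoped to the enclosing block, so every test starts from the same primed state.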

The primary reason for this distinction is s///, in which the left hand
side is a pattern while the right hand side is a string (assuming no 'e'
modifier).  Therefore:

=over 4

=item *
C

=item *
C

=back

Note that, in the first example, the two $1s refer to different things,
whereas in the second example, $1 and C<\1> refer to the same thing.  This
is counterintuitive and non-Perlish; Perl should be intuitive and DWIMish.

A separate, though less important, problem with the way backreferences are
currently implemented is that it is difficult for a human to tell at a
glance whether \10 means "escape character 10" or "backreference 10"...the
only way to tell is to count the number of captured elements and see if
there actually are ten of them, in which case \10 is a backreference and
otherwise it is an escape character.  In general, this isn't a problem
because most patterns don't have ten sets of capturing parens.


=head2 The Solution

Ok, so the problem is that $1 and C<\1> are counterintuitive.  How do we
make them intuitive without losing any functionality?

First, let's get rid of the C<\1> form for backreferences.

Second, let's say that $n refers to the nth captured subelement of the
pattern match which occurred in this B<statement>--note that this is
distinct from "in this pattern match."  That means that, in
C, both $1s refer to the same thing (the string 'foo'),
even though one of them occurred inside a pattern and one occurred inside a
string.  (See note [1] in the IMPLEMENTATION section.)

Third, let's create a new special variable, @/ (mnemonic: the / is the
default delimiter for a pattern match; if the English module remains
extant, then @/ could have the long name of @LAST_MATCH, but there are
currently several threads concerning removal of the English module). Much
like the current C<$1, $2...> variables, this array will only be created
(and hence, the speed price will only be paid), if you access its members.
The 0th element of @/ will contain the qr()d form of the last pattern
match, while successive elements refer to the captured subelements.

Fourth, let's change when we update the variables which store the captures
(the current C<$1, $2>, etc).  @/ will only be updated when the entire
statement which contains a pattern match has finished running (e.g., when
the entire s/// is completed), rather than as soon as the pattern match is
done (and therefore before the substitution happens).  


=head2 Some Examples

=over 4

=item 1
If you did the following:

C<"Bilbo Baggins" =~ /((\w+)\s+(\w+))/>

Then @/ would contain the following:

C<$/[0]> the compiled equivalent of C, 

C<$/[1]> the string "Bilbo Baggins"

C<$/[2]> the string "Bilbo"

C<$/[3]> the string "Baggins"

Note that after the match, C<$/[1]>, C<$/[2]>, and C<$/[3]> contain
exactly what C<$1, $2>, and C<$3> would contain with present-day syntax.
Furthermore, the compiled form of the match is available so if you want to
repeat the match later (or insert it into a larger regex), you can simply
refer to it as $/[0].


=item 2
With references to the previous regex being handled by the @/ variable, we
are free to use $1 for the current statement only.  Therefore:

C
C 

Note that in the substitut

RFC 317 (v2) Access to optimisation information for regular expressions

2000-09-30 Thread Perl6 RFC Librarian

=head1 TITLE

Access to optimisation information for regular expressions

=head1 VERSION

  Maintainer: Hugo van der Sanden ([EMAIL PROTECTED])
  Date: 25 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 317
  Version: 2
  Status: Frozen

=head1 NOTES ON FREEZING

No comments to this except for a reference from RFC 72 (v4), which
hopes that the concept will be extended to permit the caller to
set study()-type information.

If/when time permits I'll try a patch to perl5 to see how easy it
is and to discover whether anyone other than Peter and I want it.

=head1 ABSTRACT

Currently you can see optimisation information for a regexp only
by running with -Dr in a debugging perl and looking at STDERR.
There should be an interface that allows us to read this information
programmatically and possibly to alter it.

=head1 DESCRIPTION

At its core, the regular expression matcher knows how to check
whether a pattern matches a string starting at a particular location.
When the regular expression is compiled, perl may also look for
optimisation information that can be used to rule out some or all
of the possible starting locations in advance.

Currently you can find out about the optimisation information
captured for a particular regexp only in a perl built with
DEBUGGING, by turning on -Dr:

  % perl -Dr -e 'qr{test.*pattern}'
  Compiling REx `test.*pattern'
  size 8 first at 1
  rarest char p at 0
  rarest char s at 2
     1: EXACT <test>(3)
     3: STAR(5)
     4:   REG_ANY(0)
     5: EXACT <pattern>(8)
     8: END(0)
  anchored `test' at 0 floating `pattern' at 4..2147483647 (checking floating) minlen 11
  Omitting $` $& $' support.
  
  EXECUTING...
  
  Freeing REx: `test.*pattern'
  %

For some purposes it would help to be able to get at this information
programmatically: the test suite could take advantage of this (to test
that optimisations occur as expected), and it could also be useful for
enhanced development tools, such as a graphical regexp debugger.

Additionally there are times that the programmer is able to supply
optimisation that the regexp engine cannot discover for itself. While
we could consider making it possible to modify these values, it is
important to remember that these are only hints: the regexp engine
is free to ignore them. So there is a danger that people will misuse
writable optimisation information to move part of the logic out of
the regexp, and then blame us when it breaks.

Suggested example usage:

  % perl -wl
  use re;
  $a = qr{test.*pattern};
  print join ':', $a->fixed_string, $a->floating_string, $a->minlen;
  __END__
  test:pattern:11
  %

.. but perhaps a single new method returning a hashref would be
cleaner and more extensible:

  $opt = $a->optimisation;
  print join ':', @$opt{qw/ fixed_string floating_string minlen /};
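The methods above are proposals only. For comparison, later Perl 5 releases (5.10 and up) expose part of this information through C<re::regmust>, which returns the anchored and floating substrings the optimiser found. The sketch below uses that existing function; it is a different interface from the one suggested in this RFC and does not report minlen.

```perl
use strict;
use warnings;
use re qw(regmust);

my $re = qr{test.*pattern};

# regmust returns a two-item list: (anchored string, floating string).
# Either item may be undef if the optimiser found nothing of that kind.
my ($anchored, $floating) = regmust($re);

printf "anchored: %s, floating: %s\n",
    defined $anchored ? "'$anchored'" : 'undef',
    defined $floating ? "'$floating'" : 'undef';
```

Like the values discussed above, these are read-only hints about what the engine decided, not something the caller can set.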

=head1 IMPLEMENTATION

Straightforward: add interface functions within the perl core to give
access to read and/or write the optimisation values; add methods in
re.pm that use XS code to reach the internal functions.

=head1 REFERENCES

Prompted by discussion of RFC 72: The regexp engine should go backward
as well as forward.




RFC 276 (v2) Localising Paren Counts in qr()s.

2000-09-30 Thread Perl6 RFC Librarian

=head1  TITLE

Localising Paren Counts in qr()s.

=head1 VERSION

  Maintainer: Richard Proctor <[EMAIL PROTECTED]>
  Date: 24 Sep 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 276
  Version: 2
  Status: Frozen

=head1 ABSTRACT

The Paren Counts and backreferences should be localised in each qr(), to prevent
surprises when qr()s are used in combination.

=head1 DESCRIPTION

TomC's perl storm #0040 has:

> Figure out way to do 
> 
> /$e1 $e2/
> 
> safely, where $e1 might have '(foo) \1' in it. 
> and $e2 might have '(bar) \1' in it.  Those won't work.
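A short sketch of the failure in Perl 5 follows. The two qr// objects are the ones from the quote; the test strings and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $e1 = qr/(foo) \1/;
my $e2 = qr/(bar) \1/;

# Each pattern behaves as its author intended when used alone.
my $alone = ("bar bar" =~ /^$e2$/) ? 1 : 0;

# Combined, the groups are renumbered: (foo) is group 1 and (bar) is
# group 2, so the \1 that came from $e2 now refers to 'foo'.
my $combined   = qr/^$e1 $e2$/;
my $intended   = ("foo foo bar bar" =~ $combined) ? 1 : 0;
my $unintended = ("foo foo bar foo" =~ $combined) ? 1 : 0;

print "alone: $alone, intended: $intended, unintended: $unintended\n";
```

The combined pattern rejects the string both authors would expect to match and accepts one that neither intended.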

=head2 DISCUSSION

Me: If e1 and e2 are qr// type things the answer might be to localise 
the backref numbers in each qr// expression.   Use of assignment in a regex
and named backrefs (RFC 112) would make this a lot safer.


Hugo: 
I think it is reasonable to ask whether the current handling of qr{}
subpatterns is correct:

  perl -wle '$a=qr/(a)\1/; $b=qr/(b).*\1/; /$a($b)/g and print join ":", $1, pos for "aabbac"'
  a:5

I'm tempted to suggest it isn't; that the paren count should be local
to each qr{}, so that the above prints 'bb:4'. I think that most people
currently construct their qr{} patterns as if they are going to be
handled in isolation, without regard to the context in which they are
embedded - why else do they override the embedder's flags if not to
achieve that?

The problem then becomes: do we provide a mechanism to access the
nested backreferences outside of the qr{} in which they were referenced,
and if so what syntax do we offer to achieve that? I don't have an answer
to the latter, which tempts me to answer 'no' to the former for all the
wrong reasons. I suspect (and suggest) that complication is the only
reason we don't currently have the behaviour I suggest the rest of the
semantics warrant - that backreferences are localised within a qr().

I lie: the other reason qr{} currently doesn't behave like that is that
when we interpolate a compiled regexp into a context that requires it be
recompiled, we currently ignore the compiled form and act only on the
original string. Perhaps this is also an insufficiently intelligent thing
to do.

MJD:
Interpolated qr() items shouldn't be recompiled anyway.  They should
be treated as subroutine calls.  Unfortunately, this requires a
reentrant regex engine, which Perl doesn't have.  But I think it's the
right way to go, and it would solve the backreference problem, as well
as many other related problems.

Me: You can access the nested backreferences outside of the qr{} in which
they were referenced by use of named backrefs; see RFC 112.

=head2 AGREEMENTS

The paren count in each qr() is localised to each qr().

There is no way to access the nested backreferences outside of the qr() by
number; they may be accessed by name (see RFC 112).

The regex engine must be made re-entrant.

The regex compiler should not need to recompile qr()s when used as part of
another regex.

=head1 CHANGES

V2 - Added some more comments under implementation and froze -
no significant changes.

=head1 IMPLEMENTATION

The Regex engine must be made re-entrant.

The expansion of variables in regexes must be driven by the regex compiler
(Same problem as for RFCs 112, 166 ...)

Hugo:

None of these are necessarily true - we could change the overloading
of the Regexp object instead. Currently we have:

  my $re = qr{pattern};
  print "$re";

.. giving 'pattern' by overloading stringification. If we overload it
instead to give '(??{ $re })' (or a moral equivalent) we have a nasty
hack, it is true, but it could allow us to defer the much trickier
proper solution. Of course it breaks every other use of the string
value, and I'm not sure how big a problem that might be.


=head1 REFERENCES

Perlstorm #0040 from TomC.

RFC 112: Assignment within a regex




RFC 150 (v2) Extend regex syntax to provide for return of a hash of matched subpatterns

2000-09-30 Thread Perl6 RFC Librarian

=head1 TITLE

Extend regex syntax to provide for return of a hash of matched subpatterns

=head1 VERSION

  Maintainer: Kevin Walker <[EMAIL PROTECTED]>
  Date: 23 Aug 2000
  Last Modified: 30 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 150
  Version: 2
  Status: Frozen

=head1 ABSTRACT

Currently regexes return matched subpatterns as a list.  This is
inconvenient in at least two situations: (1) long, complicated regexes,
where counting parentheses can be difficult and error-prone; and, more
importantly, (2) matching against a list of regexes, when the corresponding
fields of the various regexes do not occur in the same order.


=head1 DESCRIPTION

I suggest that (?% field_name : pattern) spit out 'field_name', in addition
to the matched pattern, when matching in a list context:

 $text = "abajace -- mailbox full";
 %hash = $text =~ /^ (?% username : \S+) \s*--\s* (?% reason : .*)$/xsi;

would result in %hash = (username => 'abajace', reason => 'mailbox full').

Suggestions for better syntax are hereby solicited.  (?% field_name ->
pattern) and (?% field_name => pattern) come immediately to mind.
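For comparison, later Perl 5 releases (5.10 and up) provide named captures with a different syntax, C<< (?<name>...) >>, and expose the results through the C<%+> hash rather than through the match's return value. The sketch below shows the same example in that existing syntax; it is not the notation proposed here.

```perl
use strict;
use warnings;

my $text = "abajace -- mailbox full";
my %hash;

# Named groups populate %+ on a successful match; copy it to keep the values.
if ($text =~ /^ (?<username> \S+) \s*--\s* (?<reason> .*)$/xsi) {
    %hash = %+;
}

print "username: $hash{username}, reason: $hash{reason}\n";
```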


Why This Would be Useful:

Often one wants to match a string against a list of patterns which extract
similar information from the string, but the fields occur in varying orders.
Also, some optional fields might get extracted by some patterns and not by
others.  Continuing with the (over-simplified) example of analyzing e-mail
bounce messages:

   my @regexps = (

       # 'abajace -- mailbox full' or 'abajace -- user unknown'
       q/^ \s* (?% username  : \S+) \s*--\s* (?% reason : .*)$/,

       # 'Unknown local part: flycrake'
       q/^ \s* (?% reason : Unknown\ local\ part): \s* (?% username  : \S+)/,

       # 'New address for abajace is [EMAIL PROTECTED]'
       q/(?% reason : new\ address\ for) \s+ (?% username  : \S+) \s+ is \s+
         (?% new_address : \S+\@\S+)/,

   );

   while (my $bounce_text = get_next_message()) {
       my %field = ();
       for my $regexp (@regexps) {
           if ( %field = $bounce_text =~ /$regexp/xsi ) {
               print "username: $field{username}, reason: $field{reason}\n";
               if ($field{new_address}) {
                   change_address($field{username}, $field{new_address});
               }
               last;
           }
       }
   }


Backrefs

It would also be useful to have named backrefs.  I propose that (\%field_name)
match a previous named bracket.  As before, I'm not attached to
the proposed syntax.
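As a point of reference, the named backreference that later Perl 5 releases (5.10 and up) adopted is written C<< \k<name> >>. The sketch below uses that existing syntax, not the (\%field_name) form proposed here, to find a doubled word; the sample strings are made up.

```perl
use strict;
use warnings;

# \k<word> must match the same text that the group named 'word' captured.
my $doubled_re = qr/\b(?<word>\w+)\s+\k<word>\b/;

my $found   = ("the the cat" =~ $doubled_re) ? $+{word} : undef;
my $nothing = ("the cat sat" =~ $doubled_re) ? $+{word} : undef;

print "doubled word: ", (defined $found ? $found : 'none'), "\n";
```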


=head1 IMPLEMENTATION

I confess that I'm not an expert in regex internals.  Nevertheless, I'll go
out on a limb and assert that this will be relatively easy to implement,
with relatively few entangling side-issues.


=head1 REFERENCES

See also: RFC 112: Assignment within a regex




Re: RFC 72 (v4) Variable-length lookbehind.

2000-09-30 Thread Hugo

:The original proposal suggested that it would be nice if C<pos> in list
:context were to return the offset into the target string where
:the match begins, as well as where it ends.  Some people seemed to like
:this idea, so I am leaving it in, even though it no longer has much to
:do with the content of this RFC.

I didn't spot this fragment before. Note that you can get this
with $-[0] (and pos() with $+[0]), though the info is associated
with the last successful match rather than tied to the string.
See @- and @+ in perlvar for the complete details.
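A minimal sketch of reading both offsets after a match, with a made-up target string:

```perl
use strict;
use warnings;

my $txt = "hello world";
$txt =~ /world/g or die "no match";

# $-[0] is the offset where the whole match starts, $+[0] where it ends.
my ($start, $end) = ($-[0], $+[0]);

# After a successful /g match in scalar context, pos() equals $+[0].
my $pos = pos $txt;

print "start=$start end=$end pos=$pos\n";
```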

Note that this aspect of the proposal needs a MIGRATION section to
add scalar() around existing uses in a list context. Such use is
quite common in parsing contexts; C::Scan does this for example:

push @out, pos $txt;

Hugo



Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Dave Storrs



On Sat, 30 Sep 2000, Bart Lateur wrote:

> I wrote this before, but apparently you didn't hear it. Let me repeat:

You're right, I missed your email when I was incorporating things
into the new version.  Apologies.


> $foo on the LHS allows metacharacter matching, for example "a.*b" can
> match "a foo b". But \1 only allows literal strings. If $1 captured

I don't believe it matters...my version of $1 works exactly like
the current \1 and my $/[1] works exactly like the current $1.  

Dave




Re: RFC 331 (v1) Consolidate the $1 and C<\1> notations

2000-09-30 Thread Bart Lateur

On 28 Sep 2000 20:57:39 -, Perl6 RFC Librarian wrote:

>Currently, C<\1> and $1 have only slightly different meanings within a
>regex.  Let's consolidate them together, eliminate the differences, and
>settle on $1 as the standard.

I wrote this before, but apparently you didn't hear it. Let me repeat:
$foo on the LHS allows metacharacter matching, for example "a.*b" can
match "a foo b". But \1 only allows literal strings. If $1 captured
"a.*b", then \1 will only match the literal string "a.*b", as if the
regex contained "a\.\*b".

I don't see how you can possibly consider this a "tiny difference".
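To make the distinction concrete, here is a small sketch in Perl 5. The strings and variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my $captured = 'a.*b';

# Interpolated, the value is compiled as a pattern: .* is a metasequence.
my $as_pattern = ("a foo b" =~ /^$captured$/) ? 1 : 0;

# A backreference matches only the literal text that was captured.
my $backref_literal = ("a.*b a.*b"    =~ /^(\S+) \1$/) ? 1 : 0;
my $backref_pattern = ("a.*b a foo b" =~ /^(\S+) \1$/) ? 1 : 0;

# \Q...\E makes an interpolated value behave literally, as \1 does.
my $quoted = ("a foo b" =~ /^\Q$captured\E$/) ? 1 : 0;

print "pattern=$as_pattern literal=$backref_literal ",
      "backref-as-pattern=$backref_pattern quoted=$quoted\n";
```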

-- 
Bart.



Re: RFC 72 (v4) Variable-length lookbehind.

2000-09-30 Thread Bart Lateur

On 30 Sep 2000 19:50:27 -, Perl6 RFC Librarian wrote:

>In Perl6, lookbehind in regular expressions should be extended to permit
>not only fixed-length, but also variable-length lookbehind.

I see no mention of negative lookbehind.

As I wrote before, in:

/(?


Re: RFC 316 (v1) Regex modifier for support of chunk processing and prefix matching

2000-09-30 Thread Bart Lateur

On Tue, 26 Sep 2000 11:55:32 +1100 (EST), Damian Conway wrote:

>Wouldn't this interact rather badly with the /gc option (which also leaves
>C<pos> set on failure)?

Yes.

The easy way out is to disallow combining /gc with /z. But, since this is
typically one of the applications it is aimed at, I should find a
solution. A different interface is one option.

>This question arose because I was trying to work out how one would write a
>lexer with the new /z option, and it made my head ache ;-)

Heheh. Your turn.   ;-)


>I'm not sure I see that this:
...
>is less intimidating or closer to the "ordinary program flow"  than:
>
>   \*FH =~ /(abcd|bc)/g;
>
>(as proposed in RFC 93).

Was that what was proposed? I think not. It was:

sub { ... } =~ /(abcd|bc)/g;


But I kinda like that syntax. But, in practice, it looks too much like
black magic:

 * where is the string stored? It looks like it disappears into thin air.
 
 * What about pushback? Your proposal depends on it, but standard
filehandles don't support it, IMO. Does this require a TIEHANDLE
implementation?

 * Your regex shouldn't consume any more characters from the filehandle
than it matches? Where are the remaining characters pushed back into?

>   > After every single keystroke, you can test what he just 
>   > entered against a regex matching the valid format for a number, so that 
>   > C<1234E> can be recognized as a prefix for the regex
>   > 
>   > /^\d+\.?\d*(?:E[+-]?\d+)$/
>
>Isn't this just:
>
>   \*STDIN =~ /^\d+\.?\d*(?:E[+-]?\d+)$/
>   or die "Not a number";
>
>???

No. First of all, you can't override the behaviour of STDIN. That reads
a whole line, then checks it, and then your script dies if it's not
right.

I want a test on every single keystroke, see if it's in sync with the
regex, and if it's not, reject it, i.e. no insertion in the input
buffer, and no echo on screen. Besides, you can't be sure your data
comes from a filehandle (or compatible handle). Not in a GUI.
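Without a /z modifier, the per-keystroke check can only be approximated by writing a second, "prefix" regex by hand. The sketch below does that for the number format quoted above. The prefix pattern, the simulated keystrokes and the variable names are assumptions for this example, and deriving such a pattern by hand is exactly the chore the proposal wants to avoid.

```perl
use strict;
use warnings;

# The full format, as quoted above.
my $full   = qr/^\d+\.?\d*(?:E[+-]?\d+)$/;

# Hand-derived pattern accepting every prefix of a string that $full accepts.
my $prefix = qr/^(?:\d+\.?\d*(?:E[+-]?\d*)?)?$/;

# Simulate keystrokes: accept a character only if the buffer stays a prefix.
my $buffer = '';
my @rejected;
for my $key (split //, '12x.5E+3') {
    if (($buffer . $key) =~ $prefix) {
        $buffer .= $key;        # in sync with the regex: insert and echo
    } else {
        push @rejected, $key;   # out of sync: drop the keystroke
    }
}

my $complete = ($buffer =~ $full) ? 1 : 0;
print "buffer=$buffer rejected=@rejected complete=$complete\n";
```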

-- 
Bart.



Re: regexp RFCs: freeze 'em or lose 'em

2000-09-30 Thread Bart Lateur

On Sat, 30 Sep 2000 13:55:40 +0100, Hugo wrote:

>The RFCs listed below are still listed as 'developing'. The deadline is
>given as 1st October, but I'm not sure where the precise cutoff point
>is - Nat, can you confirm?
>
>As I understand it, RFCs not frozen by the deadline will be treated as
>withdrawn. So please hurry.

Is that midnight (wherever) from 30 September to 1 October, or midnight
from 1 October to 2 October?

Surely, giving comment on other people's RFC's is a lot easier -- and
faster -- than writing one of your own. It took me hours. That just
proves that I'm not cut out to be a professional writer...  :-)

-- 
Bart.



Re: RFC 348 (v1) Regex assertions in plain Perl code

2000-09-30 Thread Bart Lateur

On Sat, 30 Sep 2000 00:57:47 +0100, Hugo wrote:

>:"local" inside embedded code will no longer be supported, nor will
>:conditional regexes. The Perl5 -> Perl6 translator should warn if it
>:ever encounters one of these.
>
>I'm not convinced that removing either of these is necessary to the
>main thrust of the proposal. They may both still be useful in their
>own right, and you seem to offer little evidence against them other
>than that you don't like them.

"local" promotes the idea of semi-permanently changing global data. That
is a very bad coding practice; it shouldn't be encouraged. The fact that
it's pretty hard to predict precisely when embedded code will be called
(see the example in the RFC) conflicts with this, too.

It most definitely doesn't fit into the spirit of assertions.

There's an RFC requesting that *all* of these advanced features should
go. There's no justification there, either. I'm limiting myself here to
mentioning the features I do not consider essential for assertions to be
useful. It doesn't need local. Is that good enough for you? You may keep
it if you wish, but it is not essential.

And I do think that the semantics of "local" don't fit well into the
rest of Perl. Clearly, in

(?{local $c = $c+1 })

the scope of $c should be limited to this embedded code block!?!

>I do like the idea of making (?{...}) an assertion, all the more
>because we have a simple migration path that avoids unnecessarily
>breaking existing scripts: wrap $code as '$^R = do { $code }; 1'.

Good. :-)
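For context, the wrapper is needed because in Perl 5 the value of a (?{...}) block is not a pass/fail result; it is stored in $^R. A minimal sketch of that current behaviour, with a made-up pattern:

```perl
use strict;
use warnings;

# The value of the embedded block does not decide whether the match
# succeeds; Perl 5 stores it in $^R instead.
my $matched = ("abc" =~ /b(?{ 42 })c/) ? 1 : 0;
my $result  = $^R;

print "matched=$matched result=$result\n";
```

Under the proposal the block's value would decide success instead, so translated code would assign to $^R explicitly and then return a true value.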

>If you want to remove support for 'local' in embedded code, it is
>worth a full proposal in its own right that will explain what will
>happen if people try to do that. (I think it will make perl
>unnecessarily more complex to detect and disable it in this case.)

Quite the contrary, I think. My guess is that this support for local
*complicates* implementation, and probably by a substantial amount.

>Similarly if you want to remove support for (?(...)) completely,
>you need to address the utility and options for migration for all
>the available uses of it, not just the one addressed by the new
>handling of (?{...}).

You're talking about conditional regexes? I am curious to see just *one*
good reason to keep them in. I've not yet seen anything using a regex
that makes use of it (apart from Perl5's embedded code assertions),
that can't be done without it. Anybody is free to prove me wrong. 

-- 
Bart.



regexp RFCs: freeze 'em or lose 'em

2000-09-30 Thread Hugo

The RFCs listed below are still listed as 'developing'. The deadline is
given as 1st October, but I'm not sure where the precise cutoff point
is - Nat, can you confirm?

As I understand it, RFCs not frozen by the deadline will be treated as
withdrawn. So please hurry.

If you make any substantial changes from the last published version of
the RFC, it would help if you could let the list have a last look at it
before submitting to the librarian to give us a chance to correct any
major new problems before they're writ in stone. I accept the short
timescales impose stringent limits on this.

I'm sorry I was unable to give an updated summary this week; if anyone
was waiting to discover specific information from it, please ask me.

Hugo
---
RFC 72: The regexp engine should go backward as well as forward. (Peter Heslin)
RFC 112: Assignment within a regex  (Richard Proctor)
RFC 145: Brace-matching for Perl Regular Expressions  (Eric Roode)
RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns (Kevin Walker)
RFC 166: Alternative lists and quoting of things  (Richard Proctor)
RFC 198: Boolean Regexes (Richard Proctor)
RFC 274: Generalised Additions to Regexs (Richard Proctor)
RFC 276: Localising Paren Counts in qr()s (Richard Proctor)
RFC 308: Ban Perl hooks into regexes (Simon Cozens)
RFC 316: Regex modifier for support of chunk processing and prefix matching (Bart Lateur)
RFC 317: Access to optimisation information for regular expression (Hugo)
RFC 331: Consolidate the $1 and \1 notations (David Storrs)
RFC 332: Regex: Make /$/ equivalent to /\z/ under the '/s' modifier (Bart Lateur)
RFC 347: Remove long-deprecated $* (aka $MULTILINE_MATCHING) (Hugo)
RFC 348: Regex assertions in plain Perl code (Bart Lateur)




More on RFC 93 (was Re: RFC 316 (v1) ...)

2000-09-30 Thread Hugo

In <[EMAIL PROTECTED]>, Bart Lateur writes:
:Yes, but RFC 93 has some other disadvantages.

In respect of the number of calls, there seems nothing in RFC 93
to stop us permitting the callback to return more or fewer than the
requested number of characters. So a filehandle, for example, could
choose to return some multiple of 4K blocks for every request. A
socket connection that applies a line-based protocol would probably
read a line at a time, while another socket might return just those
characters available to read without blocking.
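
As a rough illustration of the filehandle case, here is a minimal Perl 5
sketch of a callback that treats the requested count as a hint and
answers in whole blocks. It follows the two-argument shape of the
template quoted from RFC 93 in this thread; make_block_reader and the
4096-character block size are assumptions made for the example, and
whether the engine would accept over-long answers is exactly the open
question.

```perl
use strict;
use warnings;

# Hypothetical factory: returns a callback shaped like the RFC 93 template.
sub make_block_reader {
    my ($fh, $block_size) = @_;
    my $pending = '';    # data handed back by a "putback" request

    return sub {
        my ($arg, $is_putback) = @_;
        if ($is_putback) {
            # "putback unused data" request: keep it for the next read
            $pending = $arg . $pending;
            return;
        }
        # "send more data" request: round the hint up to whole blocks
        my $blocks = int(($arg + $block_size - 1) / $block_size) || 1;
        my $want   = $blocks * $block_size;
        my $data   = $pending;
        $pending   = '';
        if (length($data) < $want) {
            read($fh, my $more, $want - length($data));
            $data .= $more if defined $more;
        }
        return $data;
    };
}

# Usage: ask for one character, receive a whole block.
my $text = 'x' x 10_000;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $reader = make_block_reader($fh, 4096);
my $chunk  = $reader->(1);
print "asked for 1 character, received ", length($chunk), "\n";
```

At end of input the callback returns an empty string; a real interface
would need to say how end of data is signalled.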

:Furthermore, where is the resulting buffer stored? People usually still
:want a copy of their data, to do yet other things with. Here, the data
:has disappeared into thin air. The only way to get it, is putting
:capturing parens in the regex.

It seems to me that $` and $& are the right solutions here. I assume
that perl6 will not allow this to cause an overreaching performance
problem. In this context we have the additional advantage that the
only copy of the accumulated string is owned by the regexp engine,
so no additional copy need be made to protect it.
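
For reference, this is how the two variables already behave in Perl 5 on
an ordinary string. The helper name split_at_match is made up for the
example, and how $` and $& would behave when the data comes from an
RFC 93 callback is not specified anywhere, so treat this only as a
reminder of the existing semantics.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text before the match and the match itself.
sub split_at_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($`, $&);    # prematch, matched text
}

my ($before, $matched) = split_at_match("header: 1234E2 trailer", qr/\d+E\d+/);
print "before: [$before] matched: [$matched]\n";
```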

:Compared to that, RFC 93 feels like a straightjacket. To me.

Strangely it feels uncommonly liberating to me.

:You may have to completely rewrite your script. So much for code reuse.

I don't believe that it need be so painful to take advantage of it
in existing code. We can ease that by providing a selection of
helpful ready-rolled routines for common tasks.

Hugo



Re: RFC 316 (v1) Regex modifier for support of chunk processing and prefix matching

2000-09-30 Thread Bart Lateur

On Sat, 30 Sep 2000 00:23:13 +0100, Hugo wrote:

>This is a strength of RFC 93 however, since in that context we
>don't need to restart the match each time we go off to fetch more
>data. In that situation if we run out of data after the 1234E2+2
>we fail the attempt to widen the \d+, match forward to the $, and
>are immediately finished.

Yes, but RFC 93 has some other disadvantages.

Look at the template of the sub we need for every callback function:

 sub s {
     if ($_[1]) {    # "putback unused data" request
         recache($_[0]);
     }
     else {          # "send more data" request
         return get_chars(max=>$_[0])
     }
 }

This is not pretty, especially since recache() is not even defined yet.

Furthermore, where is the resulting buffer stored? People usually still
want a copy of their data, to do yet other things with. Here, the data
has disappeared into thin air. The only way to get it, is putting
capturing parens in the regex.

As a consequence, the regex shouldn't read any more characters than it
actually eats. So, reading and pushing back of the data will almost have
to be per byte. That's what RFC 93 says, too:

>The single
>argument would specify how many characters should be returned (typically
>this would be 1, unless internal analysis by the regex engine can deduce
>that more than one character will be required)

Imagine that you have a data file of 1 MB that has to be processed. That
is a minimum: it hardly makes sense to process much smaller files in
chunks, because they will likely just fit in memory as a whole. Imagine
that typically, your regex needs to process and backtrack over each
character five times. That is 5 reads and 4 pushbacks. That is a rather
conservative estimate, I think, for complex regexes. That means 9
invocations of this sub *per character*, or 9 million callback function
calls for your 1 MB data file. I don't even like to start thinking about
the effect this will have on the processing time required. The idea
would probably be OK if this was C, but it is not.

Imagine how my mechanism would do it. First of all, getting and storing
the data all happens manually, so you have it at your disposal for
whatever else you'd like to use it for. Let's make it small chunks of 1k.
A 1 MB file will then be processed in (roughly) 1000 chunks. Add the need
for redoing the regex without the '/z' modifier at the end of the file,
and that makes a total of 1001. Compare that to the 9 million callback
calls of RFC 93.
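
The arithmetic behind those two figures, as a small Perl sketch. The
sizes are rounded to powers of ten, as in the text, and the figure of
five passes per character is the estimate made above, not a measurement.

```perl
use strict;
use warnings;

my $file_size  = 1_000_000;    # characters in the file (1 MB, rounded)
my $passes     = 5;            # reads per character (estimate from the text)
my $chunk_size = 1_000;        # characters per chunk (1k, rounded)

# RFC 93 style: each character is read 5 times and pushed back 4 times.
my $calls_per_char = $passes + ($passes - 1);
my $callback_calls = $file_size * $calls_per_char;

# Chunked style: one match attempt per chunk, plus a final one without /z.
my $chunk_attempts = $file_size / $chunk_size + 1;

print "callback calls: $callback_calls\n";    # 9000000
print "chunk attempts: $chunk_attempts\n";    # 1001
```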

Look, I don't think that these two approaches really exclude one
another. There's no conflict. It is possible to implement both.

And finally: I'm not married to the interface. That might change
completely. All suggestions welcome. But I like the cheap way of making
the regex tell me that it needs more data to make up its mind for 100%.

Modifying a script that was written to process data in lines, so that now
it can work with multiline data (multiline CSV files, HTML files with
tags split over several lines, ...) really requires a relatively small
change to your script. *That* is one of the features I really like.
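
To give an idea of the size of that change, here is a minimal sketch in
today's Perl 5 of a line-by-line loop that copes with tags split over
several lines. It does not use the proposed '/z' modifier (which does
not exist yet); instead it approximates the "needs more data" signal by
checking for an unclosed "<" in the leftover text. The helper name
extract_tags and the simplistic tag pattern are assumptions made for the
example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect complete <...> tags from line-oriented
# input, pulling in further lines whenever a tag is still open.
sub extract_tags {
    my ($fh) = @_;
    my @tags;
    my $buffer = '';
    while (defined(my $line = <$fh>)) {
        $buffer .= $line;
        # Take every complete tag currently in the buffer.
        while ($buffer =~ /\G[^<]*(<[^>]*>)/gc) {
            push @tags, $1;
        }
        # Keep only the unprocessed tail. If it holds an unclosed "<",
        # the next line is appended to it and the match is retried.
        $buffer = substr($buffer, pos($buffer) // 0);
        $buffer = '' unless $buffer =~ /</;
    }
    return @tags;
}

my $html = "text <b>bold</b> and <a\n  href=\"x\">link</a>\ntrailing\n";
open my $in, '<', \$html or die "cannot open in-memory file: $!";
my @found = extract_tags($in);
print scalar(@found), " tags found\n";
```

Compared with a plain line loop, the only additions are the buffer and
the decision about what to keep for the next round, which is the part
the regex itself would report under the proposal.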

Compared to that, RFC 93 feels like a straightjacket. To me. You may
have to completely rewrite your script. So much for code reuse.

-- 
Bart.