Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-12 Thread mike mulligan

From: Hugo [EMAIL PROTECTED]
Sent: Monday, September 11, 2000 11:59 PM


 mike mulligan replied to Peter Heslin:
 : ... it is greedy in the sense of the forward matching "*" or "+"
constructs.
 : [snip]

 This is nothing to do with greediness and everything to do with
 left-to-rightness. The regexp engine does not look for x* except
 in those positions where the lookbehind has already matched.

I was trying to understand at what point the lookbehind was attempted, and
confused myself and posted a bad example.  My apologies to everyone.  Let's
see if I can make sense of it on a second try.

My question is: if I have the regex  /(?=[aeiou]X[yz]+/  then does Perl: 1.
scan first for 'X', test the lookbehind, and then test the '[yz]',  or 2.
scan for 'X[yz]' and then test the lookbehind?

I am expecting these two alternatives to give the same result, but certain
test strings might run slower or faster depending on the approach.

Running perl -Dr shows that alternative 1 is used:

qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/

Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'...
Found anchored substr `X' at offset 1...
Guessed: match at offset 1
Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz'
  Setting an EVAL scope, savestack=3
   1 a XuhXvoXyz  |  1:  IFMATCH[-1]
   0  aXuhXvoXyz  |  3:ANYOF[aeiou]
   1 a XuhXvoXyz  | 12:SUCCEED
  could match...
   1 a XuhXvoXyz  | 14:  EXACT X
   2 aX uhXvoXyz  | 16:  ANYOF[yz]
failed...
  Setting an EVAL scope, savestack=3
   4 aXuh XvoXyz  |  1:  IFMATCH[-1]
   3 aXu hXvoXyz  |  3:ANYOF[aeiou]
  failed...
failed...
  Setting an EVAL scope, savestack=3
   7 aXuhXvo Xyz  |  1:  IFMATCH[-1]
   6 aXuhXv oXyz  |  3:ANYOF[aeiou]
   7 aXuhXvo Xyz  | 12:SUCCEED
  could match...
   7 aXuhXvo Xyz  | 14:  EXACT X
   8 aXuhXvoX yz  | 16:  ANYOF[yz]
   9 aXuhXvoXy z  | 25:  END
Match successful!





Re: RFC 166 (v1) Additions to regexs

2000-09-12 Thread Richard Proctor

On Mon 11 Sep, Mark-Jason Dominus wrote:
 
  (?@foo) is sort of equivalent to (??{join('|',@foo)}), ie it expands into
  a list of alternatives.  One could possible use just @foo, for this.
 
 It just occurs to me that this is already possible.  I've written a
 module, 'atq', such that if you write
 
 use atq;
 
 then your regexes may contain the sequence
 
 (?\@foo)
 
 with the meaning that you asked for.  

Yes, but is this a very good way to go forward, the use of overload is
"heavy", if somthing might be useful.  (See other note about generalised
additions to regexes).

 
 (The \ is necessary here because (?@foo) already has a meaning under
 Perl 5, and I think your proposal must address this.)

(?@foo) has no meaning I checked the code

 
 Anyway, since this is possible under Perl 5 with a fairly simple
 module, I wonder if it really needs to be in the Perl 6 core.  Perhaps
 it would be better to propose that the module be added to the Perl 6
 standard library?

The module is small, but this does not mean that adding functionality to the
core is necesarily a bad idea.

Richard

-- 

[EMAIL PROTECTED]




Generalised Additions to regexes

2000-09-12 Thread Richard Proctor

(proto RFC possibly, and some generalised ramblings)

Given that expansion of regexes could include (+...) and (*...) I have been
thinking about providing a general purpose way of adding functionality. 

I propose that the entire (+...) syntax is kept free from formal
specification for this and is available for pluggable (module) expansion. 
(+ = addition).

A module or anything that wants to support some enhanced syntax
registers something that handles "regex enhancements".

At regex compile time, if and when (+foo) is found perl calls
each of the registered regex enhancements in turn, these:

1) Are passed the foo string as a parameter exactly as is.  (There is
an issue of actually finding the end of the foo.)

2) The regex enhancement can either recognise the content or not.

3) If not it returns undef and perl goes to the next regex enhancement
(Does it handle the enhancements as a stack (Last checked first) or a list
(First checked first?) how are they scoped?  Job here for the OO fanatics)

4) If perl runs out of regex enhancements it reports an error.  

5) if an enhancement recognises the content it could do either of:

a) return replacement expanded regex using existing capabilities perl will
then pass this back through the regex compiler.

b) return a coderef that is called at run time when the regex gets to this
point.  The referenced code needs to have enough access to the regex
internals to be able to see the current sub-expression, request more
characters, access to relevant flags and visability of greediness.  It may
also need a coderef that is simarly called when the regex is being unwound
when it backtracks.


Thinking from that - the last case should be generalised (it is sort of
like my (?*{...}) from RFC 198.  If so both cases a and b are the same,
b is just a case of returning (?*{...}).  

Following on, if (?{...}) etc code is evaluated in forward match, it would
be a good idea to likewise support some code block that is ignored on a
forward match but is executed when the code is unwound due to backtracking. 
Thus (?{ foo })(?\{ bar }) could be defined to execute foo on the forward
case and bar if it unwinds.  

For example - Think about foo putting something on a stack (eg the
bracket to match [RFC 145]) and bar taking it off.

I dont care at the moment what the syntax is - what about the concepts?

Richard






-- 

[EMAIL PROTECTED]




RFC 110 (v5) counting matches

2000-09-12 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1  TITLE

counting matches

=head1 VERSION

Maintainer: Richard Proctor [EMAIL PROTECTED]
Date: 16 Aug 2000
Last Modified: 12 Sep 2000
Mailing List: [EMAIL PROTECTED]
Number: 110
Version: 5
Status: Developing

=head1 ABSTRACT

Provide a simple way of giving a count of matches of a pattern.

=head1 DESCRIPTION

Have you ever wanted to count the number of matches of a patten?  s///g 
returns the number of matches it finds.  m//g just returns 1 for matching.
Counts can be made using s//$/g but this is wastefull, or by putting some 
counting loop round a m//g.  But this all seams rather messy. 

TomC (and a couple of others) have said that it can also be done as :
$count = () = $string =~ /pattern/g;

However many people do not like this construct, here are a couple of quotes:

jhi: Which I find cute as a demonstration of the Perl's context concept,
but ugly as hell from usability viewpoint.  

Bart Lateur: '()=' is not perfect. It is also butt ugly. It is a "dirty hack".

This construct is also likely to be inefficient as perl will have to
build up a list of all the matches, store them somewhere, count them, then
throw them away.

Therefore I would like a way of counting matches.

=head2 Proposal

m//gt (or m//t see below) would be defined to do the match, and return the
count of matches, this leaves all existing uses consistent and unaffected.
/t is suggested for "counT", as /c is already taken.

Relationship of m//t and m//g - there are three possibilities, my original:

m//gt, where /t adds counting to a group match (/t without /g would just
return 0 or 1).  However \G loses its meaning.

The Alternative By Uri :

m//t and m//g are mutually exclusive and m//gt should be regarded as an error.

Hugo:

 I like this too. I'd suggest /t should mean a) return a scalar of
 the number of matches and b) don't set any special variables. Then
 /t without /g would return 0 or 1, but be faster since no extra
 information need be captured (except internally for (.)\1 type
 matching - compile time checks could determine if these are needed,
 though (?{..}) and (??{..}) patterns would require disabling of
 that optimisation). /tg would give a scalar count of the total
 number of matches. \G would retain its meaning.

I think Hugo's wording about the relationship makes the best sense, and
this is the suggested way forward.

=head1 CHANGES

RFC110 V1 - Original posting to perl6-language

RFC110 V2 - Reposted to perl6-language-regex

RFC110 V3 - Added Uri's alternitive m//t

RFC110 V4 - Added notes about $count = () = $string =~ /pattern/g

RFC110 V5 - Added Hugo's wording about /g and /t relationship, suggested this
is the way forward.

Unless any significant discussion takes place this RFC will move to frozen
within a week.

=head1 IMPLENTATION

Hugo:
 Implementation should be fairly straightforward,
 though ensuring that optimisations occurred precisely when they
 are safe would probably involve a few bug-chasing cycles.


=head1 REFERENCES

I brought this up on p5p a couple of years ago, but it was lost in the noise...






Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.

2000-09-12 Thread Hugo

In 085601c01cc8$2c94f390$[EMAIL PROTECTED], "mike mulligan" w
rites:
:From: Hugo [EMAIL PROTECTED]
:Sent: Monday, September 11, 2000 11:59 PM
:
:
: mike mulligan replied to Peter Heslin:
: : ... it is greedy in the sense of the forward matching "*" or "+"
:constructs.
: : [snip]
:
: This is nothing to do with greediness and everything to do with
: left-to-rightness. The regexp engine does not look for x* except
: in those positions where the lookbehind has already matched.
:
:I was trying to understand at what point the lookbehind was attempted, and
:confused myself and posted a bad example.  My apologies to everyone.  Let's
:see if I can make sense of it on a second try.
:
:My question is: if I have the regex  /(?=[aeiou]X[yz]+/  then does Perl: 1.
:scan first for 'X', test the lookbehind, and then test the '[yz]',  or 2.
:scan for 'X[yz]' and then test the lookbehind?

3. The regexp is matched left to right: first the lookbehind, then 'X',
then '[yz]'.

:I am expecting these two alternatives to give the same result, but certain
:test strings might run slower or faster depending on the approach.
:
:Running perl -Dr shows that alternative 1 is used:

Running perl -Dr shows that alternative 3 is used. However the -Dr data
is confused by the optimiser, which happens to have chosen the fixed
string 'X' as something worth searching for first. So the optimiser
permits the main matching engine to look only at those positions where
there is an 'X' immediately following.

I've annotated the -Dr output below to try and clarify. Note that if
you replace 'X' with '(x|X)', no optimisations take place (other than
a 'minimum length' check) and -Dr will give a much clearer picture of
the flow; again, if you replace 'X[yz]' with '(x|X)y' the optimiser
will now pick 'y' as the most significant thing worth searching for.

Hope this helps,

Hugo
---
:qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/
:
:Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'...

The optimiser is entered.

:Found anchored substr `X' at offset 1...

This is what the optimiser is looking for.

:Guessed: match at offset 1

This is what the optimiser found.

:Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz'

The real matcher is entered.

:  Setting an EVAL scope, savestack=3
:   1 a XuhXvoXyz  |  1:  IFMATCH[-1]
:   0  aXuhXvoXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   1 a XuhXvoXyz  | 12:SUCCEED

Ok.

:  could match...
:   1 a XuhXvoXyz  | 14:  EXACT X

Checking 'X' ...

:   2 aX uhXvoXyz  | 16:  ANYOF[yz]

Checking '[yz]' ...

:failed...

Failed: try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   4 aXuh XvoXyz  |  1:  IFMATCH[-1]
:   3 aXu hXvoXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:  failed...

Failed.

:failed...

Try the next position permitted by the optimiser.

:  Setting an EVAL scope, savestack=3
:   7 aXuhXvo Xyz  |  1:  IFMATCH[-1]
:   6 aXuhXv oXyz  |  3:ANYOF[aeiou]

Checking lookbehind ...

:   7 aXuhXvo Xyz  | 12:SUCCEED

Ok.

:  could match...
:   7 aXuhXvo Xyz  | 14:  EXACT X

Checking 'X' ...

:   8 aXuhXvoX yz  | 16:  ANYOF[yz]

Checking '[yz]' ...

:   9 aXuhXvoXy z  | 25:  END
:Match successful!

Match successful.