Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.
From: Hugo [EMAIL PROTECTED] Sent: Monday, September 11, 2000 11:59 PM mike mulligan replied to Peter Heslin: : ... it is greedy in the sense of the forward matching "*" or "+" constructs. : [snip] This is nothing to do with greediness and everything to do with left-to-rightness. The regexp engine does not look for x* except in those positions where the lookbehind has already matched. I was trying to understand at what point the lookbehind was attempted, and confused myself and posted a bad example. My apologies to everyone. Let's see if I can make sense of it on a second try. My question is: if I have the regex /(?=[aeiou]X[yz]+/ then does Perl: 1. scan first for 'X', test the lookbehind, and then test the '[yz]', or 2. scan for 'X[yz]' and then test the lookbehind? I am expecting these two alternatives to give the same result, but certain test strings might run slower or faster depending on the approach. Running perl -Dr shows that alternative 1 is used: qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/ Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'... Found anchored substr `X' at offset 1... Guessed: match at offset 1 Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz' Setting an EVAL scope, savestack=3 1 a XuhXvoXyz | 1: IFMATCH[-1] 0 aXuhXvoXyz | 3:ANYOF[aeiou] 1 a XuhXvoXyz | 12:SUCCEED could match... 1 a XuhXvoXyz | 14: EXACT X 2 aX uhXvoXyz | 16: ANYOF[yz] failed... Setting an EVAL scope, savestack=3 4 aXuh XvoXyz | 1: IFMATCH[-1] 3 aXu hXvoXyz | 3:ANYOF[aeiou] failed... failed... Setting an EVAL scope, savestack=3 7 aXuhXvo Xyz | 1: IFMATCH[-1] 6 aXuhXv oXyz | 3:ANYOF[aeiou] 7 aXuhXvo Xyz | 12:SUCCEED could match... 7 aXuhXvo Xyz | 14: EXACT X 8 aXuhXvoX yz | 16: ANYOF[yz] 9 aXuhXvoXy z | 25: END Match successful!
Re: RFC 166 (v1) Additions to regexs
On Mon 11 Sep, Mark-Jason Dominus wrote: (?@foo) is sort of equivalent to (??{join('|',@foo)}), ie it expands into a list of alternatives. One could possible use just @foo, for this. It just occurs to me that this is already possible. I've written a module, 'atq', such that if you write use atq; then your regexes may contain the sequence (?\@foo) with the meaning that you asked for. Yes, but is this a very good way to go forward, the use of overload is "heavy", if somthing might be useful. (See other note about generalised additions to regexes). (The \ is necessary here because (?@foo) already has a meaning under Perl 5, and I think your proposal must address this.) (?@foo) has no meaning I checked the code Anyway, since this is possible under Perl 5 with a fairly simple module, I wonder if it really needs to be in the Perl 6 core. Perhaps it would be better to propose that the module be added to the Perl 6 standard library? The module is small, but this does not mean that adding functionality to the core is necesarily a bad idea. Richard -- [EMAIL PROTECTED]
Generalised Additions to regexes
(proto RFC possibly, and some generalised ramblings) Given that expansion of regexes could include (+...) and (*...) I have been thinking about providing a general purpose way of adding functionality. I propose that the entire (+...) syntax is kept free from formal specification for this and is available for pluggable (module) expansion. (+ = addition). A module or anything that wants to support some enhanced syntax registers something that handles "regex enhancements". At regex compile time, if and when (+foo) is found perl calls each of the registered regex enhancements in turn, these: 1) Are passed the foo string as a parameter exactly as is. (There is an issue of actually finding the end of the foo.) 2) The regex enhancement can either recognise the content or not. 3) If not it returns undef and perl goes to the next regex enhancement (Does it handle the enhancements as a stack (Last checked first) or a list (First checked first?) how are they scoped? Job here for the OO fanatics) 4) If perl runs out of regex enhancements it reports an error. 5) if an enhancement recognises the content it could do either of: a) return replacement expanded regex using existing capabilities perl will then pass this back through the regex compiler. b) return a coderef that is called at run time when the regex gets to this point. The referenced code needs to have enough access to the regex internals to be able to see the current sub-expression, request more characters, access to relevant flags and visability of greediness. It may also need a coderef that is simarly called when the regex is being unwound when it backtracks. Thinking from that - the last case should be generalised (it is sort of like my (?*{...}) from RFC 198. If so both cases a and b are the same, b is just a case of returning (?*{...}). Following on, if (?{...}) etc code is evaluated in forward match, it would be a good idea to likewise support some code block that is ignored on a forward match but is executed when the code is unwound due to backtracking. Thus (?{ foo })(?\{ bar }) could be defined to execute foo on the forward case and bar if it unwinds. For example - Think about foo putting something on a stack (eg the bracket to match [RFC 145]) and bar taking it off. I dont care at the moment what the syntax is - what about the concepts? Richard -- [EMAIL PROTECTED]
RFC 110 (v5) counting matches
This and other RFCs are available on the web at http://dev.perl.org/rfc/ =head1 TITLE counting matches =head1 VERSION Maintainer: Richard Proctor [EMAIL PROTECTED] Date: 16 Aug 2000 Last Modified: 12 Sep 2000 Mailing List: [EMAIL PROTECTED] Number: 110 Version: 5 Status: Developing =head1 ABSTRACT Provide a simple way of giving a count of matches of a pattern. =head1 DESCRIPTION Have you ever wanted to count the number of matches of a patten? s///g returns the number of matches it finds. m//g just returns 1 for matching. Counts can be made using s//$/g but this is wastefull, or by putting some counting loop round a m//g. But this all seams rather messy. TomC (and a couple of others) have said that it can also be done as : $count = () = $string =~ /pattern/g; However many people do not like this construct, here are a couple of quotes: jhi: Which I find cute as a demonstration of the Perl's context concept, but ugly as hell from usability viewpoint. Bart Lateur: '()=' is not perfect. It is also butt ugly. It is a "dirty hack". This construct is also likely to be inefficient as perl will have to build up a list of all the matches, store them somewhere, count them, then throw them away. Therefore I would like a way of counting matches. =head2 Proposal m//gt (or m//t see below) would be defined to do the match, and return the count of matches, this leaves all existing uses consistent and unaffected. /t is suggested for "counT", as /c is already taken. Relationship of m//t and m//g - there are three possibilities, my original: m//gt, where /t adds counting to a group match (/t without /g would just return 0 or 1). However \G loses its meaning. The Alternative By Uri : m//t and m//g are mutually exclusive and m//gt should be regarded as an error. Hugo: I like this too. I'd suggest /t should mean a) return a scalar of the number of matches and b) don't set any special variables. Then /t without /g would return 0 or 1, but be faster since no extra information need be captured (except internally for (.)\1 type matching - compile time checks could determine if these are needed, though (?{..}) and (??{..}) patterns would require disabling of that optimisation). /tg would give a scalar count of the total number of matches. \G would retain its meaning. I think Hugo's wording about the relationship makes the best sense, and this is the suggested way forward. =head1 CHANGES RFC110 V1 - Original posting to perl6-language RFC110 V2 - Reposted to perl6-language-regex RFC110 V3 - Added Uri's alternitive m//t RFC110 V4 - Added notes about $count = () = $string =~ /pattern/g RFC110 V5 - Added Hugo's wording about /g and /t relationship, suggested this is the way forward. Unless any significant discussion takes place this RFC will move to frozen within a week. =head1 IMPLENTATION Hugo: Implementation should be fairly straightforward, though ensuring that optimisations occurred precisely when they are safe would probably involve a few bug-chasing cycles. =head1 REFERENCES I brought this up on p5p a couple of years ago, but it was lost in the noise...
Re: RFC 72 (v3) Variable-length lookbehind: the regexp engine should also go backward.
In 085601c01cc8$2c94f390$[EMAIL PROTECTED], "mike mulligan" w rites: :From: Hugo [EMAIL PROTECTED] :Sent: Monday, September 11, 2000 11:59 PM : : : mike mulligan replied to Peter Heslin: : : ... it is greedy in the sense of the forward matching "*" or "+" :constructs. : : [snip] : : This is nothing to do with greediness and everything to do with : left-to-rightness. The regexp engine does not look for x* except : in those positions where the lookbehind has already matched. : :I was trying to understand at what point the lookbehind was attempted, and :confused myself and posted a bad example. My apologies to everyone. Let's :see if I can make sense of it on a second try. : :My question is: if I have the regex /(?=[aeiou]X[yz]+/ then does Perl: 1. :scan first for 'X', test the lookbehind, and then test the '[yz]', or 2. :scan for 'X[yz]' and then test the lookbehind? 3. The regexp is matched left to right: first the lookbehind, then 'X', then '[yz]'. :I am expecting these two alternatives to give the same result, but certain :test strings might run slower or faster depending on the approach. : :Running perl -Dr shows that alternative 1 is used: Running perl -Dr shows that alternative 3 is used. However the -Dr data is confused by the optimiser, which happens to have chosen the fixed string 'X' as something worth searching for first. So the optimiser permits the main matching engine to look only at those positions where there is an 'X' immediately following. I've annotated the -Dr output below to try and clarify. Note that if you replace 'X' with '(x|X)', no optimisations take place (other than a 'minimum length' check) and -Dr will give a much clearer picture of the flow; again, if you replace 'X[yz]' with '(x|X)y' the optimiser will now pick 'y' as the most significant thing worth searching for. Hope this helps, Hugo --- :qq(aXuhXvoXyz) =~ /(?=[aeiou])X[yz]/ : :Guessing start of match, REx `(?=[aeiou])X[yz]' against `aXuhXvoXyz'... The optimiser is entered. :Found anchored substr `X' at offset 1... This is what the optimiser is looking for. :Guessed: match at offset 1 This is what the optimiser found. :Matching REx `(?=[aeiou])X[yz]' against `XuhXvoXyz' The real matcher is entered. : Setting an EVAL scope, savestack=3 : 1 a XuhXvoXyz | 1: IFMATCH[-1] : 0 aXuhXvoXyz | 3:ANYOF[aeiou] Checking lookbehind ... : 1 a XuhXvoXyz | 12:SUCCEED Ok. : could match... : 1 a XuhXvoXyz | 14: EXACT X Checking 'X' ... : 2 aX uhXvoXyz | 16: ANYOF[yz] Checking '[yz]' ... :failed... Failed: try the next position permitted by the optimiser. : Setting an EVAL scope, savestack=3 : 4 aXuh XvoXyz | 1: IFMATCH[-1] : 3 aXu hXvoXyz | 3:ANYOF[aeiou] Checking lookbehind ... : failed... Failed. :failed... Try the next position permitted by the optimiser. : Setting an EVAL scope, savestack=3 : 7 aXuhXvo Xyz | 1: IFMATCH[-1] : 6 aXuhXv oXyz | 3:ANYOF[aeiou] Checking lookbehind ... : 7 aXuhXvo Xyz | 12:SUCCEED Ok. : could match... : 7 aXuhXvo Xyz | 14: EXACT X Checking 'X' ... : 8 aXuhXvoX yz | 16: ANYOF[yz] Checking '[yz]' ... : 9 aXuhXvoXy z | 25: END :Match successful! Match successful.