Re: Perlstorm #0040
==
> I lie: the other reason qr{} currently doesn't behave like that is that
> when we interpolate a compiled regexp into a context that requires it be
> recompiled,

Interpolated qr() items shouldn't be recompiled anyway. They should be
treated as subroutine calls. Unfortunately, this requires a reentrant regex
engine, which Perl doesn't have. But I think it's the right way to go, and it
would solve the backreference problem, as well as many other related problems.
==

The REx engine is reentrant enough right now. All you need to do is to add
the //p switch (or, meanwhile, rewrite each $qrn into (?p{ $qrn })).

Ilya
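The "backreference problem" mentioned above can be reproduced in present-day Perl, where the construct written here as (?p{ ... }) is spelled (??{ ... }) (the postponed regular subexpression). The following is only a sketch; the pattern, the sample text and the variable names are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# Illustrative pattern: one word character followed by the same character.
my $double = qr/(\w)\1/;
my $text   = "ab cc";

# Plain interpolation: the pattern is recompiled as part of the outer regex,
# so the inner group becomes group 2 and \1 now points at the outer (\w+).
my $interpolated = $text =~ /^(\w+) $double\z/ ? 1 : 0;

# Postponed subexpression: $double is matched as a separate pattern with its
# own group numbering, so its \1 still means "the character just captured".
my $postponed = $text =~ /^(\w+) (??{ $double })\z/ ? 1 : 0;

print "interpolated: $interpolated, postponed: $postponed\n";
```

With this input the interpolated form fails and the postponed form matches, which is the "treated as a subroutine call" behaviour asked for in the quoted text.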
Re: perl6-language-regex summary for 20000920
==
RFC 72: The regexp engine should go backward as well as forward. (Peter Heslin)

Peter says (edited):

:If the regexp code is unlikely to be rewritten from the ground up, then
:there may be little chance of this feature being implemented. I'll make
:a pitch for it anyway at the end of my talk at YAPC::Europe, and then
:I'll freeze the RFC.
==

As I have said many times: this is absolutely trivial to implement. First of
all, if you agree to rewrite

  (?<= \w\s*\d )   # Semantic X: match "a 1"

as

  (?<= \d\s*\w )   # Semantic Y: match "a 1"

then it is as simple as inserting go-back-by(1) nodes before each node for
\s \d and \w. And to support the more intuitive ;-) semantic X, the only
more-or-less tricky part is to recursively go through the compile tree, and
put "concatenated" nodes in the opposite order. A piece of cake.

==
RFC 145: Brace-matching for Perl Regular Expressions (Eric Roode)

The closest we have to an emerging consensus appears to be that it is very
difficult to pin down a precise problem to solve - the areas in which we want
to match pairs of delimiters (such as numeric expressions, C code, perl code,
HTML and XML) each seem to require a variety of special cases, each different
from the other.
==

Emacs gives a bare minimum to support: mark chars by syntax classes. Which
classes there are is a tricky question. Emacs's way is too C-centric.

==

I have no time to summarize the things I feel are needed. But since they can
be easily done in the Perl5 track as well, maybe they are not proper for this
list. And I discussed all of them many times already...

  "unfinished strings" (allows $/ = /fo*ba*r/);

  \g< and \g> (report start/end of $& at these pos);

  onion rings: (?<> A <> B &! C & D) (substring matched by A such that B and
  D match against it, but C does not; in B, C, D, \A and \z denote the
  boundaries of what was matched by A);

  \F{-*}, \F{-.}, \F+ (finish and restart the match "where"), here "where" is
  nowhere/at-the-current-position/as-usual, and -/+ mean whether one needs to
  report this match to the caller;

  applying a REx to a substring (two versions: with/without allowing
  lookahead/behind outside of the range);

  (*@arr: REx )   # Make @arr the default-match-array instead of ($1,$2,...)
                  # (@arr is not interpolated)
  (*%hash: REx )  # Make %hash the default-match-hash instead of %^MATCH
  (*id:REx )      # Put what-is-matched into $default_match_hash{id}
  (*id*: REx )*   # As REx*, but put what-is-matched during each REx
                  # into separate elements of @{$default_match_hash{id}}
  (*id[]: REx )   # make @{$default_match_hash{id}} into default-match-array
  (*id{}: REx )   # make %{$default_match_hash{id}} into default-match-hash
                  # all of the above are localized for the duration of REx

as well as many performance improvements.

Yours,
Ilya
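The "Semantic Y" reading of a backward match in the message above can be imitated from user code by reversing the text that precedes a position and anchoring a back-to-front pattern at its start. This is only a sketch of the idea: it is not the engine-level go-back-by(1) implementation described, and the helper name preceded_by and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: does the text ending at $pos satisfy a lookbehind
# whose body has been written back to front ("Semantic Y")?
sub preceded_by {
    my ($string, $pos, $reversed_re) = @_;
    my $reversed_prefix = scalar reverse substr($string, 0, $pos);
    return $reversed_prefix =~ /\A$reversed_re/ ? 1 : 0;
}

my $string = "a   1z";

# Semantic X would be written \w\s*\d; its Semantic Y form is \d\s*\w.
my $semantic_y = qr/\d\s*\w/;

my $before_z = preceded_by($string, 5, $semantic_y);   # text before "z" is "a   1"
my $after_a  = preceded_by($string, 1, $semantic_y);   # text before pos 1 is "a"

print "before z: $before_z, after a: $after_a\n";
```

Supporting Semantic X directly would mean reversing the order of the concatenated pieces automatically, which is the tree walk the message calls the only tricky part.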
Re: is \1 vs $1 a necessary distinction?
On Wed 27 Sep, Dave Storrs wrote: > > > On Wed, 27 Sep 2000, Richard Proctor wrote: > > > Both \1 and $1 refer to what is matched by the first set of parens in a > > > regex. AFAIK, the only difference between these two notation is that > > > \1 is used within the regex itself and $1 is used outside of the > > > regex. Is there any reason not to standardize these down to one > > > notation (i.e., eliminate one or the other)? > > > > I think this is fixable. > > The way you phrase that makes it sound that other people perceive > this as a problem as well, which gives me all sorts of warm fuzzies. :> > > > The only real need for this at the moment is to overcome limitations in > > the order of expansion of regexes. RFCs 112, 166, 276... all depend on > > fixing this. > > Ok, here's another question. How the _HELL_ does everyone else on > this bloody list keep track of every detail in every frigging RFC? Some > random comment comes up, and someone will go, "Oh, the third paragraph of > the second section in RFC 0x97A already mentioned this as a parenthetical > aside, despite the fact that its title and primary topic had no relation > to the issue." I still have (mumble-mumble) RFCs that I haven't even had > time to *read*, let alone memorize every detail of! In this context I was the author of guess what 112, 166 and 276 (though I admit to having to look up the number of the last one) > > Grr*grumble, grumble, moan, winge* > > Ok, back to rationality now. > > > If the regex compiler gets in before the expansion of the variables to > > make these work, it could handle $1 in all cases \1 can be retained for > > compatibility. > > Do we *want* to maintain \1? Why have two notations to do the > same thing when one is clearly superior? (\1 can only go up to \9 while > the other could theoretically go to ${...}.) Perl6 is breaking > backwards compatibility and eliminating all deprecated features...let's > get rid of \n as backreference notation. 
>

The principal issue would be what to do about the use of $1 on the LHS, which keeps its current meaning there. That is rather good for obfuscated code, but not terribly kind to normal programming.

Note that RFC 112 covers assignment within a regex, naming rather than numbering the brackets one wishes to capture; it also covers named back references.

Currently $1 is expanded by the quoting before the regex compiler gets to play; the regex compiler sees the \1 and knows what to do. I am reasonably happy with \ meaning "refer back"; the numbers I am not.

Richard
--
[EMAIL PROTECTED]
Re: is \1 vs $1 a necessary distinction?
On 27 Sep 2000, Piers Cawley wrote:

> > Do we *want* to maintain \1? Why have two notations to do the
>
> I'm kind of curious about what happens when you want to do, say:
>
>     if (m/(\S+)/) {
>         $reg = qr{<(em|i|b)>($1)};
>     }
>
> where the $1 in the regex quote is referring to $1 from the previous
> regex match.

Well, how about this:

    $reg = qr{<(em|i|b)>(${P1})};
    NOTE:                  ^

If you assume that $1 and ${1} are equivalent (which makes it possible to have as many backrefs as you want), then you could say that, if the first character after the { is a P, it means "in the previous regex match."

Dave
Re: is \1 vs $1 a necessary distinction?
> "Jonathan" == Jonathan Scott Duff <[EMAIL PROTECTED]> writes: Jonathan> On Wed, Sep 27, 2000 at 08:15:53AM -0700, Dave Storrs wrote: >> Both \1 and $1 refer to what is matched by the first set of parens in a >> regex. AFAIK, the only difference between these two notation is that \1 >> is used within the regex itself and $1 is used outside of the regex. Is >> there any reason not to standardize these down to one notation (i.e., >> eliminate one or the other)? Jonathan> \1 can be used on the LHS of a s/// whereas $1 there probably won't do Jonathan> what you expect. Also, \1, \2, \3 only takes you as far as \9 ;-) Wrong. If you have more than 10 parens visible so far, \10 works just fine. Jonathan> If $1 could be made to work properly on the LHS of s///, I'd vote for Jonathan> that being The Way. It can't ever. It means $1 from the previous match. -- Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095 <[EMAIL PROTECTED]> http://www.stonehenge.com/merlyn/> Perl/Unix/security consulting, Technical writing, Comedy, etc. etc. See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
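The correction about \10 can be checked with a short script. This is a sketch with made-up test strings; \g{10} is a later addition to Perl that was not available when this message was written, and is included only for comparison.

```perl
use strict;
use warnings;

# Ten capture groups are open, so \10 is a backreference to the tenth one.
my $tenth_backref =
    "abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10\z/ ? 1 : 0;

# Only one group is open here, so \10 is read as octal 010 (a backspace).
my $octal_escape = "a\x08" =~ /^(a)\10\z/ ? 1 : 0;

# \g{10} is always a backreference, however many digits the number has.
my $explicit_backref =
    "abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\g{10}\z/ ? 1 : 0;

print "$tenth_backref $octal_escape $explicit_backref\n";
```

All three matches succeed, which shows both halves of the rule: \10 works once ten groups are visible, and means something else entirely before that.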
Re: is \1 vs $1 a necessary distinction?
Dave Storrs <[EMAIL PROTECTED]> writes:

> On Wed, 27 Sep 2000, Richard Proctor wrote:
>
> > > Both \1 and $1 refer to what is matched by the first set of parens in a
> > > regex. AFAIK, the only difference between these two notation is that \1
> > > is used within the regex itself and $1 is used outside of the regex. Is
> > > there any reason not to standardize these down to one notation (i.e.,
> > > eliminate one or the other)?
> >
> > I think this is fixable.
>
> The way you phrase that makes it sound that other people perceive
> this as a problem as well, which gives me all sorts of warm fuzzies. :>
>
> > The only real need for this at the moment is to
> > overcome limitations in the order of expansion of regexes. RFCs 112, 166,
> > 276... all depend on fixing this.
>
> [...]
>
> > If the regex compiler gets in before the
> > expansion of the variables to make these work, it could handle $1 in all cases
> > \1 can be retained for compatibility.
>
> Do we *want* to maintain \1? Why have two notations to do the
> same thing when one is clearly superior? (\1 can only go up to \9 while
> the other could theoretically go to ${...}.) Perl6 is breaking
> backwards compatibility and eliminating all deprecated features...let's
> get rid of \n as backreference notation.

I'm kind of curious about what happens when you want to do, say:

    if (m/(\S+)/) {
        $reg = qr{<(em|i|b)>($1)};
    }

    while (<>) {
        next unless m{$reg};
        ...
    }

where the $1 in the regex quote is referring to $1 from the previous regex match.

--
Piers
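For reference, this is how the scenario above behaves in Perl 5: the $1 inside qr{} is interpolated as plain text when the pattern is built. The sketch below replaces the while (<>) loop with a fixed list of sample lines so that it can run on its own; the sample strings, the @hits array and the \Q...\E quoting are additions for illustration.

```perl
use strict;
use warnings;

my $reg;
if ("bold words" =~ m/(\S+)/) {
    # $1 is "bold" here; its text is frozen into the pattern at this point.
    # \Q...\E (not in the original snippet) quotes any metacharacters in it.
    $reg = qr{<(em|i|b)>(\Q$1\E)};
}

my @hits;
for my $line ("<b>bold</b>", "<i>italic</i>", "<em>bold</em>") {
    next unless $line =~ $reg;
    # After this match, $1 and $2 are the groups of $reg itself.
    push @hits, "$1:$2";
}

print "@hits\n";
```

The word captured by the first match survives only as literal text inside $reg, so a single $1 notation would need some way to tell "the earlier match" from "this match" apart.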
Re: is \1 vs $1 a necessary distinction?
> "DS" == Dave Storrs <[EMAIL PROTECTED]> writes: DS> Both \1 and $1 refer to what is matched by the first set of parens DS> in a regex. AFAIK, the only difference between these two notation DS> is that \1 is used within the regex itself and $1 is used outside DS> of the regex. Is there any reason not to standardize these down DS> to one notation (i.e., eliminate one or the other)? because $1 having be set previously will be interpolated INTO the new regex. so you have to have another notation to refer to grabbed stuff from the current regex. uri -- Uri Guttman - [EMAIL PROTECTED] -- http://www.sysarch.com SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting The Perl Books Page --- http://www.sysarch.com/cgi-bin/perl_books The Best Search Engine on the Net -- http://www.northernlight.com
Re: is \1 vs $1 a necessary distinction?
On Wed, 27 Sep 2000, Richard Proctor wrote: > > Both \1 and $1 refer to what is matched by the first set of parens in a > > regex. AFAIK, the only difference between these two notation is that \1 > > is used within the regex itself and $1 is used outside of the regex. Is > > there any reason not to standardize these down to one notation (i.e., > > eliminate one or the other)? > > I think this is fixable. The way you phrase that makes it sound that other people perceive this as a problem as well, which gives me all sorts of warm fuzzies. :> >The only real need for this at the moment is to > overcome limitations in the order of expansion of regexes. RFCs 112, 166, > 276... all depend on fixing this. Ok, here's another question. How the _HELL_ does everyone else on this bloody list keep track of every detail in every frigging RFC? Some random comment comes up, and someone will go, "Oh, the third paragraph of the second section in RFC 0x97A already mentioned this as a parenthetical aside, despite the fact that its title and primary topic had no relation to the issue." I still have (mumble-mumble) RFCs that I haven't even had time to *read*, let alone memorize every detail of! Grr*grumble, grumble, moan, winge* Ok, back to rationality now. >If the regex compiler gets in before the > expansion of the variables to make these work, it could handle $1 in all cases > \1 can be retained for compatibility. Do we *want* to maintain \1? Why have two notations to do the same thing when one is clearly superior? (\1 can only go up to \9 while the other could theoretically go to ${...}.) Perl6 is breaking backwards compatibility and eliminating all deprecated features...let's get rid of \n as backreference notation. Dave
Re: is \1 vs $1 a necessary distinction?
On Wed, 27 Sep 2000, Jonathan Scott Duff wrote: > If $1 could be made to work properly on the LHS of s///, I'd vote for > that being The Way. That was pretty much my thought?
Re: is \1 vs $1 a necessary distinction?
From: "Dave Storrs" <[EMAIL PROTECTED]>

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex. AFAIK, the only difference between these two notation is that \1
> is used within the regex itself and $1 is used outside of the regex. Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

\1 came from sed and friends. I think an early driving force was maintaining familiarity with things like awk and sed. Even today there are still people that switch to and from other reg-ex languages. Emacs is the most common for me (though I still dabble with awk). I don't see a real advantage in taking out \1; it is very likely to needlessly break legacy code, and additionally to confuse the various developers who have a habit of using \1.

On the other hand, the use of $1 with substitutions is important for consistency. When you write s/../.../e, you're going to need to use a substitution variable; "\1" just doesn't fit.

    s/(...)/pre\1post/;

works fine.

    s/(...)/pre$1post/;

is the question. I tend to use it only because I sometimes switch to:

    s/(...)/func() . "$1post"/e;

for various reasons. I just try to standardize on $1, but that's just me.

Additionally, the use of $1 in the matching reg-ex is ambiguous, as in:

    m/(...).*?$1/;

Does it refer to the internal set of (..), or does it mean the previous value of $1 before this match? This becomes non-obvious to the observer in the following case:

    m/($keyword).*?$1/;

Here, our mindset is substitution of external variables; the casual (non-seasoned) observer might not understand that it really means:

    m/($keyword).*?\1/;

My argument is that both \1 and $1 have their places, and limiting to one type can be troublesome. Plus, TMTOWTDI. :)

-Michael
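The cases in this message can be tried out as follows. This is a sketch under a few assumptions: shout() is a stand-in for the func() above, the strings are invented, and an unrelated earlier match is added on purpose so that $1 holds a stale value.

```perl
use strict;
use warnings;

sub shout { return uc shift }    # stand-in for func() in the message

(my $with_dollar = "abc") =~ s/(b)/<$1>/;               # $1 in the replacement
(my $with_eval   = "abc") =~ s/(b)/shout($1) . "!"/e;   # replacement is code

# Leave a stale $1 behind from an unrelated match.
"previous" =~ /(prev)/ or die "setup match failed";

my $keyword = "foo";
my $text    = "foo bar foo";

# $1 is interpolated before matching, so this is really /(foo).*?prev/.
my $with_stale_dollar = $text =~ /($keyword).*?$1/ ? 1 : 0;

# \1 refers to the group captured during this very match.
my $with_backslash = $text =~ /($keyword).*?\1/ ? 1 : 0;

print "$with_dollar $with_eval $with_stale_dollar $with_backslash\n";
```

In Perl 5 the $1 form silently picks up the value from the earlier match and fails here, while the \1 form matches, so the two notations are not interchangeable inside the pattern.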
Re: is \1 vs $1 a necessary distinction?
Dave, > Both \1 and $1 refer to what is matched by the first set of parens in a > regex. AFAIK, the only difference between these two notation is that \1 > is used within the regex itself and $1 is used outside of the regex. Is > there any reason not to standardize these down to one notation (i.e., > eliminate one or the other)? I think this is fixable. The only real need for this at the moment is to overcome limitations in the order of expansion of regexes. RFCs 112, 166, 276... all depend on fixing this. If the regex compiler gets in before the expansion of the variables to make these work, it could handle $1 in all cases \1 can be retained for compatibility. Richard
Re: is \1 vs $1 a necessary distinction?
On Wed, Sep 27, 2000 at 08:15:53AM -0700, Dave Storrs wrote: > Both \1 and $1 refer to what is matched by the first set of parens in a > regex. AFAIK, the only difference between these two notation is that \1 > is used within the regex itself and $1 is used outside of the regex. Is > there any reason not to standardize these down to one notation (i.e., > eliminate one or the other)? \1 can be used on the LHS of a s/// whereas $1 there probably won't do what you expect. Also, \1, \2, \3 only takes you as far as \9 ;-) If $1 could be made to work properly on the LHS of s///, I'd vote for that being The Way. -Scott -- Jonathan Scott Duff [EMAIL PROTECTED]
is \1 vs $1 a necessary distinction?
Both \1 and $1 refer to what is matched by the first set of parens in a regex. AFAIK, the only difference between these two notation is that \1 is used within the regex itself and $1 is used outside of the regex. Is there any reason not to standardize these down to one notation (i.e., eliminate one or the other)? Dave
Re: RFC 274 (v1) Generalised Additions to Regexs
> In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
> :Given that expansion of regexes could include (+...) and (*...) I
> :have been thinking about providing a general purpose way of adding
> :functionality. Hence I propose that the entire (+...) syntax is
> :kept free from formal specification for this. (+ = addition)
> :
> :A module or anything that wants to support some enhanced syntax
> :registers something that handles "regex enhancements".
> :
> :At regex compile time, if and when (+foo) is found perl calls
> :each of the registered regex enhancements in turn, these:
> :
> :1) Are passed the foo string as a parameter exactly as is. (There
> :is an issue of actually finding the end of the generic foo.)
> :
> :2) The regex enhancement can either recognise the content or not.
>
> Is this the right approach? If more than one callback is registered,
> this seems likely to lead to results dependent on the order of
> registration.

Maybe, maybe not. Does a newer localised definition replace the older one? The handling of multiple registrations has to be resolved. My initial thoughts are that a "last registered is checked first" approach may be best.

> I'd be more inclined to have callbacks registered for a word: that
> way we can complain earlier when two modules try to register the
> same word. Then at regexp-compile time we parse out the word
> following the (+ and immediately know who to pass it to (or fail).

This is equally possible; my thoughts were to leave the syntax completely open so that anything could be added (words, Chinese, $$$) and leave it to the enhancements to recognise it or not. I could add this as an alternative for V2.

> :5) if an enhancement recognises the content it could do either of:
> :
> :a) return replacement expanded regex using existing capabilities
> :perl will then pass this back through the regex compiler.
>
> Can we/should we detect (+...) loops? Or are you suggesting that the
> returned string should not permit (+...) expansion?

Should we detect? Probably not. Should we allow? Definitely yes. The only grounds for detection are to report infinite recursion.

> :b) return a coderef that is called at run time when the regex gets
> :to this point.
>
> Ok.
>
> : The referenced code needs to have enough access to the regex
> :internals to be able to see the current sub-expression, request
> :more characters ,access to relevant flags and visability of
> :greediness.
>
> I don't see that this is a good idea; it makes more sense to me that
> the coderef is treated exactly as if it had been compiled from (?{...}).

Let's look at these one at a time:

Access to subexpressions - OK, this can be done.

Visibility of flags - Not currently possible. The code might like to know that /i is in effect, it might want to know that /s is in effect; it probably does not need to know about /o. This is equally true of the enhancement regex handler that looks at the (+foo) in the first place. I think that these could be of use to (?{...}) code.

Greediness - maybe not necessary, but I think better visibility of internals might be beneficial.

> :Following on, if (?{...}) etc code is evaluated
> :in forward match, it would be a good idea to likewise support some
> :code block that is ignored on a forward match but is executed when the
> :code is unwound due to backtracking.
>
> The support in (?{...}) for localisation is (as I understand it) the
> intended mechanism for permitting such effects. Can you describe some
> specific problems you are trying to solve here?

Is localisation enough? It might be; it might be nicer, however, to provide a more explicit mechanism to handle more complex cases.

> Hugo

Richard
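To make the registration discussion concrete, here is a rough user-level sketch of the word-keyed variant: handlers are registered per word, a second registration of the same word is refused, and the returned string is expanded again, with a depth limit that exists only to report runaway recursion. Everything in it is hypothetical (the names register_enhancement and expand_enhancements, the limit of 20, the sample words); it preprocesses a pattern string and is not the regex-compiler hook the RFC proposes.

```perl
use strict;
use warnings;

my %enhancement;    # hypothetical registry: word => handler coderef

sub register_enhancement {
    my ($word, $handler) = @_;
    die "regex enhancement '$word' is already registered\n"
        if exists $enhancement{$word};
    $enhancement{$word} = $handler;
    return;
}

# Expand every (+word) in a pattern string, re-expanding whatever the
# handler returns. The depth limit only serves to report a loop.
sub expand_enhancements {
    my ($pattern, $depth) = @_;
    $depth //= 0;
    die "(+...) expansion nested too deeply, possible loop\n" if $depth > 20;
    $pattern =~ s{\(\+(\w+)\)}{
        my $word    = $1;
        my $handler = $enhancement{$word}
            or die "unknown regex enhancement '$word'\n";
        expand_enhancements($handler->($word), $depth + 1);
    }ge;
    return $pattern;
}

register_enhancement(octet => sub { '(?:25[0-5]|2[0-4]\d|1?\d?\d)' });
register_enhancement(ipv4  => sub { '(+octet)(?:\.(+octet)){3}' });

my $source = expand_enhancements('\A(+ipv4)\z');
my $ipv4   = qr/$source/;

print "192.168.0.1 ", ("192.168.0.1" =~ $ipv4 ? "matches" : "does not match"), "\n";
```

A "last registered is checked first" scheme would instead keep an ordered list of handlers and ask each in turn, at the cost of the early collision report.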
Re: RFC 274 (v1) Generalised Additions to Regexs
In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes: :Given that expansion of regexes could include (+...) and (*...) I have :been thinking about providing a general purpose way of adding :functionality. Hence I propose that the entire (+...) syntax is :kept free from formal specification for this. (+ = addition) : :A module or anything that wants to support some enhanced syntax :registers something that handles "regex enhancements". : :At regex compile time, if and when (+foo) is found perl calls :each of the registered regex enhancements in turn, these: : :1) Are passed the foo string as a parameter exactly as is. (There is :an issue of actually finding the end of the generic foo.) : :2) The regex enhancement can either recognise the content or not. Is this the right approach? If more than one callback is registered, this seems likely to lead to results dependent on the order of registration. I'd be more inclined to have callbacks registered for a word: that way we can complain earlier when two modules try to register the same word. Then at regexp-compile time we parse out the word following the (+ and immediately know who to pass it to (or fail). :5) if an enhancement recognises the content it could do either of: : :a) return replacement expanded regex using existing capabilities perl will :then pass this back through the regex compiler. Can we/should we detect (+...) loops? Or are you suggesting that the returned string should not permit (+...) expansion? :b) return a coderef that is called at run time when the regex gets to this :point. Ok. : The referenced code needs to have enough access to the regex internals :to be able to see the current sub-expression, request more characters, access :to relevant flags and visability of greediness. I don't see that this is a good idea; it makes more sense to me that the coderef is treated exactly as if it had been compiled from (?{...}). 
:Following on, if (?{...}) etc code is evaluated :in forward match, it would be a good idea to likewise support some :code block that is ignored on a forward match but is executed when the :code is unwound due to backtracking. The support in (?{...}) for localisation is (as I understand it) the intended mechanism for permitting such effects. Can you describe some specific problems you are trying to solve here? Hugo
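The localisation mechanism referred to here can be seen in a small counter, modelled on the example in the perlre documentation. It is a sketch: the string and the variable names are chosen for illustration, and the counter is a package variable because local cannot be applied to a lexical.

```perl
use strict;
use warnings;

our $count;         # package variable, so that it can be localised
my  $final_count;

"aaab" =~ m{
    \A
    (?{ $count = 0 })
    (?: a (?{ local $count = $count + 1 }) )*   # undone one step per backtrack
    aab
    (?{ $final_count = $count })                # copied out while still in effect
}x or die "no match";

print "iterations kept: $final_count\n";
```

The loop first consumes three "a" characters and then has to give two of them back so that "aab" can match; each give-back restores the previous value of the counter, so only one iteration is counted at the end. Code that should run only when the match is unwound has no such direct hook, which is the gap the RFC text points at.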
Re: RFC 198 (v2) Boolean Regexes
Hi Tom,

Welcome to England (I presume)

> This seems very complicated. Did you look at the Ram:6 recipe on
> expressing AND, OR, and NOT in a regex? For example, to do
> /FOO/ && /BAR/ you need not write /FOO.*BAR|BAR.*FOO/ -- and in
> fact, should not, as it doesn't work properly on some pairs!
> For example, /CAN/ && /ANAL/ can't be written /CAN.*ANAL|ANAL.*CAN/
> if you expect to match "CANAL". Overlaps bite you. You really
> need /(?=.*CAN)(?=.*ANAL)/ instead, which permits multiple assertions.
> Please check out the recipe I'm talking about.
>
> --tom, from a strange place

I will start by admitting I don't have the Ram. I was brainstorming ideas (my day job involves a lot of brainstorming) and trying to think of new/better ways to do things. I am more interested in concepts than syntax.

Richard
RFC 198 (v2) Boolean Regexes
This seems very complicated. Did you look at the Ram:6 recipe on expressing AND, OR, and NOT in a regex? For example, to do /FOO/ && /BAR/ you need not write /FOO.*BAR|BAR.*FOO/ -- and in fact, should not, as it doesn't work properly on some pairs! For example, /CAN/ && /ANAL/ can't be written /CAN.*ANAL|ANAL.*CAN/ if you expect to match "CANAL". Overlaps bite you. You really need /(?=.*CAN)(?=.*ANAL)/ instead, which permits multiple assertions. Please check out the recipe I'm talking about. --tom, from a strange place PS: NB -- I cannot access my mail spool. And the mailing list archives are 4 days behind on the website, so there is no hope of me participating in real-time, nor in seeing any private replies.
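The CANAL case from this message is easy to verify. The sketch below uses the patterns quoted above; the last test, which expresses NOT with a negative lookahead, is an assumption about one common way to do it and is not taken from the message or the recipe.

```perl
use strict;
use warnings;

my $word = "CANAL";

# Alternation of the two orders: fails, because CAN and ANAL overlap.
my $by_alternation = $word =~ /CAN.*ANAL|ANAL.*CAN/ ? 1 : 0;

# Two lookaheads tried from the same position: both can succeed.
my $by_lookahead = $word =~ /(?=.*CAN)(?=.*ANAL)/ ? 1 : 0;

# The plain two-match form that the single pattern is meant to replace.
my $by_two_matches = ($word =~ /CAN/ && $word =~ /ANAL/) ? 1 : 0;

# Assumed form for "CAN and not BAR": a negative lookahead for the NOT part.
my $can_without_bar = $word =~ /\A(?=.*CAN)(?!.*BAR)/s ? 1 : 0;

print "$by_alternation $by_lookahead $by_two_matches $can_without_bar\n";
```

Only the alternation fails, since it requires the two words to appear one after the other, while each lookahead starts its search from the same place.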