Re: Perlstorm #0040

2000-09-27 Thread Ilya Zakharevich

==
> I lie: the other reason qr{} currently doesn't behave like that is that
> when we interpolate a compiled regexp into a context that requires it be
> recompiled,

Interpolated qr() items shouldn't be recompiled anyway.  They should
be treated as subroutine calls.  Unfortunately, this requires a
reentrant regex engine, which Perl doesn't have.  But I think it's the
right way to go, and it would solve the backreference problem, as well
as many other related problems.
==

The REx engine is reentrant enough right now.  All you need to do is
to add the //p switch (or, meanwhile, rewrite each $qrn into (?p{ $qrn })).

Ilya
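
The difference between recompiling an interpolated qr// and calling it
like a subroutine can be seen in later perls, where the construct is
spelled (??{ ... }) rather than (?p{ ... }).  A minimal sketch; the
variable names ($doubled, $interpolated, $deferred) are made up for the
illustration:

```perl
use strict;
use warnings;

# Hypothetical sub-pattern with its own capture group and backreference.
our $doubled = qr/(\w)\1/;

# Plain interpolation recompiles $doubled inside the outer pattern, so its
# \1 now refers to the OUTER first group, (x).
my $interpolated = qr/(x)-$doubled/;

# Deferred form: $doubled runs as an independent pattern at match time,
# so its \1 still refers to its own group.
my $deferred = qr/(x)-(??{ $doubled })/;

for my $s ('x-aa', 'x-ax') {
    printf "%-5s interpolated=%s deferred=%s\n", $s,
        ($s =~ $interpolated ? 'yes' : 'no'),
        ($s =~ $deferred     ? 'yes' : 'no');
}
```

The interpolated form accepts 'x-ax' (a letter followed by the outer "x"),
while the deferred form accepts 'x-aa' (a doubled letter), which is the
backreference problem mentioned in the quoted text.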



Re: perl6-language-regex summary for 20000920

2000-09-27 Thread Ilya Zakharevich

==
RFC 72: The regexp engine should go backward as well as
forward. (Peter Heslin)

Peter says (edited):
:If the regexp code is unlikely to be rewritten from the ground up, then
:there may be little chance of this feature being implemented. I'll make
:a pitch for it anyway at the end of my talk at YAPC::Europe, and then
:I'll freeze the RFC.
==

As I have said many times: this is absolutely trivial to implement.
First of all, if you agree to rewrite

 (?<= \w\s*\d ) # Semantic X: match "a  1"

as

 (?<= \d\s*\w ) # Semantic Y: match "a  1"

then it is as simple as inserting go-back-by(1) nodes before each node
for \s \d and \w.

And to support the more intuitive ;-) semantic X, the only
more-or-less tricky part is to recursively go through the compile
tree, and put "concatenated" nodes in the opposite order.  A piece of
cake.
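
Until the engine itself can step backward, the "semantic Y" ordering can
be tried out in plain Perl by reversing the subject string, so that the
lookbehind becomes an ordinary lookahead.  This is only a sketch of the
idea, not the go-back-by(1) implementation described above; the helper
name is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: is there an "x" that is preceded by \w\s*\d ?
# Against the reversed string, the lookbehind (?<= \w\s*\d ) becomes the
# lookahead (?= \d\s*\w ), i.e. the pattern written in "semantic Y" order.
sub x_after_word_gap_digit {
    my ($text) = @_;
    my $reversed = reverse $text;    # scalar context: reverses the characters
    return $reversed =~ / x (?= \d \s* \w ) /x ? 1 : 0;
}

print x_after_word_gap_digit('a  1x'), "\n";   # "a  1" precedes the x
print x_after_word_gap_digit('a  x'),  "\n";   # no digit before the x
```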

==
RFC 145: Brace-matching for Perl Regular Expressions  (Eric Roode)

The closest we have to an emerging consensus appears to be that
it is very difficult to pin down a precise problem to solve - the
areas in which we want to match pairs of delimiters (such as
numeric expressions, C code, perl code, HTML and XML) each seem
to require a variety of special cases, each different from the
other.
==

Emacs gives a bare minimum to support: mark chars by syntax classes.
Which classes there are is a tricky question.  Emacs's way is too C-centric. 

==

I have no time to summarize the things I feel are needed.  But since
they can be easily done in the Perl5 track as well, maybe they are not
proper for this list.  And I discussed all of them many times already...

   "unfinished strings",   (allows $/ = /fo*ba*r/)

   \g< and \g> (report start/end of $& at these pos);

   onion rings: (?<> A <> B &! C & D)  (substring matched by A
                                        such that B and D match against
                                        it, but C does not; in B, C, D
                                        \A and \z denote boundaries of
                                        what was matched by A);

   \F{-*}, \F{-.}, \F+  (finish and restart the match "where"), here
   "where" is nowhere/at-the-current-position/as-usual, and -/+ mean
   whether one needs to report this match to the caller;

   applying a REx to a substring (two versions: with/without allowing
 lookahead/behind outside of the range);

   (*@arr:  REx )  # Make @arr the default-match-array instead of ($1,$2,...)
   # (@arr is not interpolated)

   (*%hash: REx )  # Make %hash the default-match-hash instead of %^MATCH

   (*id:REx )  # Put what-is-matched into $default_match_hash{id}

   (*id*:   REx )* # As REx*, but put what-is-matched during each REx
   # into separate elements of @{$default_match_hash{id}}

   (*id[]:  REx )  # make @{$default_match_hash{id}} into default-match-array

   (*id{}:  REx )  # make %{$default_match_hash{id}} into default-match-hash

   # all of the above are localized for the duration of REx

as well as many performance improvements.

Yours,
Ilya
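
One item on this list, storing a match under a name instead of a number,
later became available as named captures in Perl 5.10: (?<name>...) with
the %+ hash.  The (*id: ...) spellings above are proposals only; the
sketch below uses the 5.10 syntax, and the sample data is invented:

```perl
use strict;
use warnings;
use 5.010;    # named captures need Perl 5.10 or later

my $line = 'width=640 height=480';

# (?<key>...) stores the matched text in $+{key}, much like the proposed
# (*id: REx ) storing into $default_match_hash{id}.
if ($line =~ /(?<key>\w+)=(?<value>\d+)/) {
    print "$+{key} is $+{value}\n";
}

# Collecting what each repetition matched into an array still takes an
# explicit loop with /g; there is no direct equivalent of (*id*: REx )*.
my @values;
while ($line =~ /\w+=(?<value>\d+)/g) {
    push @values, $+{value};
}
print "@values\n";
```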



Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Richard Proctor

On Wed 27 Sep, Dave Storrs wrote:
> 
> 
> On Wed, 27 Sep 2000, Richard Proctor wrote:
> > > Both \1 and $1 refer to what is matched by the first set of parens in a
> > > regex.  AFAIK, the only difference between these two notation is that
> > > \1 is used within the regex itself and $1 is used outside of the
> > > regex.  Is there any reason not to standardize these down to one
> > > notation (i.e., eliminate one or the other)?
> > 
> > I think this is fixable.  
> 
>   The way you phrase that makes it sound that other people perceive
> this as a problem as well, which gives me all sorts of warm fuzzies. :>
> 
> > The only real need for this at the moment is to overcome limitations in
> > the order of expansion of regexes.  RFCs 112, 166, 276... all depend on
> > fixing this.  
> 
>   Ok, here's another question.  How the _HELL_ does everyone else on
> this bloody list keep track of every detail in every frigging RFC?  Some
> random comment comes up, and someone will go, "Oh, the third paragraph of
> the second section in RFC 0x97A already mentioned this as a parenthetical
> aside, despite the fact that its title and primary topic had no relation
> to the issue."  I still have (mumble-mumble) RFCs that I haven't even had
> time to *read*, let alone memorize every detail of!

In this context, I was the author of (guess what) 112, 166 and 276 (though
I admit to having to look up the number of the last one).

> 
>   Grr*grumble, grumble, moan, winge*
> 
>   Ok, back to rationality now.
> 
> > If the regex compiler gets in before the expansion of the variables to
> > make these work, it could handle $1 in all cases \1 can be retained for
> > compatibility.
> 
>   Do we *want* to maintain \1?  Why have two notations to do the
> same thing when one is clearly superior?  (\1 can only go up to \9 while
> the other could theoretically go to ${...}.)  Perl6 is breaking
> backwards compatibility and eliminating all deprecated features...let's
> get rid of \n as backreference notation.
> 

The principal issue would be what to do about use of $1 on the LHS having
its current meaning, which is rather good for obfuscated code, but not
terribly kind to normal programming.

Note that RFC 112 covers assignment within a regex, naming rather than
numbering the brackets one wishes to capture; it also covers named back
references.

Currently $1 is expanded by the quoting before the regex compiler gets to
play; the regex compiler sees the \1 and knows what to do.  I am reasonably
happy with \ meaning "refer back"; with the numbers I am not.

Richard

-- 

[EMAIL PROTECTED]




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Dave Storrs



On 27 Sep 2000, Piers Cawley wrote:

> > Do we *want* to maintain \1?  Why have two notations to do the
> 
> I'm kind of curious about what happens when you want to do, say:
> 
>   if (m/(\S+)/) {
>  $reg = qr{<(em|i|b)>($1)};
>   }
> 
> where the $1 in the regex quote is referring to $1 from the previous
> regex match.

Well, how about this:

  $reg = qr{<(em|i|b)>(${P1})};
NOTE:                     ^

If you assume that $1 and ${1} are equivalent (which makes it
possible to have as many backrefs as you want), then you could say that,
if the first character after the { is a P, it means "in the previous regex
match."

Dave





Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Randal L. Schwartz

> "Jonathan" == Jonathan Scott Duff <[EMAIL PROTECTED]> writes:

Jonathan> On Wed, Sep 27, 2000 at 08:15:53AM -0700, Dave Storrs wrote:
>> Both \1 and $1 refer to what is matched by the first set of parens in a
>> regex.  AFAIK, the only difference between these two notation is that \1
>> is used within the regex itself and $1 is used outside of the regex.  Is
>> there any reason not to standardize these down to one notation (i.e.,
>> eliminate one or the other)?

Jonathan> \1 can be used on the LHS of a s/// whereas $1 there probably won't do
Jonathan> what you expect.  Also, \1, \2, \3 only takes you as far as \9 ;-)

Wrong.  If you have at least 10 parens visible so far, \10 works just fine.

Jonathan> If $1 could be made to work properly on the LHS of s///, I'd vote for
Jonathan> that being The Way.

It can't ever.  It means $1 from the previous match.
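
Both points can be checked with a short script.  This is a sketch with
invented sample strings:

```perl
use strict;
use warnings;

# Ten capture groups are open before \10, so \10 is a backreference to the
# tenth group rather than an octal escape.
my $ten = qr/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/;
print "abcdefghijj" =~ $ten ? "match\n" : "no match\n";
print "abcdefghij"  =~ $ten ? "match\n" : "no match\n";

# $1 in the pattern is interpolated before matching starts, so it holds
# the capture of the previous successful match ("foo" here).
"foo" =~ /(\w+)/ or die "setup match failed";
my $with_dollar = "bar bar foo";
$with_dollar =~ s/(\w+) $1/<$1>/;      # the pattern is really /(\w+) foo/

# \1 refers to the first group of the match in progress.
my $with_backslash = "bar bar foo";
$with_backslash =~ s/(\w+) \1/<$1>/;   # matches the doubled word

print "$with_dollar\n$with_backslash\n";
```

The first substitution rewrites "bar foo", the second rewrites "bar bar".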

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[EMAIL PROTECTED]> <http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Piers Cawley

Dave Storrs <[EMAIL PROTECTED]> writes:

> On Wed, 27 Sep 2000, Richard Proctor wrote:
> > > Both \1 and $1 refer to what is matched by the first set of parens in a
> > > regex.  AFAIK, the only difference between these two notation is that \1
> > > is used within the regex itself and $1 is used outside of the regex.  Is
> > > there any reason not to standardize these down to one notation (i.e.,
> > > eliminate one or the other)?
> > 
> > I think this is fixable.  
> 
>   The way you phrase that makes it sound that other people perceive
> this as a problem as well, which gives me all sorts of warm fuzzies. :>
> 
> >The only real need for this at the moment is to
> > overcome limitations in the order of expansion of regexes.  RFCs 112, 166,
> > 276... all depend on fixing this.  
>
> [...]
> 
> >If the regex compiler gets in before the
> > expansion of the variables to make these work, it could handle $1 in all cases
> > \1 can be retained for compatibility.
> 
>   Do we *want* to maintain \1?  Why have two notations to do the
> same thing when one is clearly superior?  (\1 can only go up to \9 while
> the other could theoretically go to ${...}.)  Perl6 is breaking
> backwards compatibility and eliminating all deprecated features...let's
> get rid of \n as backreference notation.

I'm kind of curious about what happens when you want to do, say:

  if (m/(\S+)/) {
 $reg = qr{<(em|i|b)>($1)};
  }

  while (<>) {
next unless m{$reg};
...
  }

where the $1 in the regex quote is referring to $1 from the previous
regex match.
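
For reference, this is how the example behaves in Perl 5: $1 is
interpolated once, when qr{} builds the pattern.  A runnable sketch with
invented input lines, plus \Q...\E added to quote metacharacters in the
captured text:

```perl
use strict;
use warnings;

my $reg;
if ('strong' =~ m/(\S+)/) {
    # $1 ("strong") is interpolated here, once, when qr{} builds the pattern.
    $reg = qr{<(em|i|b)>(\Q$1\E)};
}

for my $line ('<b>strong', '<i>weak') {
    # Inside this match, $1 and $2 are the groups of $reg itself.
    next unless $line =~ m{$reg};
    print "tag=$1 word=$2\n";
}
```

Any scheme that makes $1 inside a pattern mean "the first group of this
pattern" has to keep this use expressible somehow.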

-- 
Piers




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Uri Guttman

> "DS" == Dave Storrs <[EMAIL PROTECTED]> writes:

  DS> Both \1 and $1 refer to what is matched by the first set of parens
  DS> in a regex.  AFAIK, the only difference between these two notation
  DS> is that \1 is used within the regex itself and $1 is used outside
  DS> of the regex.  Is there any reason not to standardize these down
  DS> to one notation (i.e., eliminate one or the other)?


because $1, having been set previously, will be interpolated INTO the new
regex. so you have to have another notation to refer to grabbed stuff
from the current regex.

uri

-- 
Uri Guttman  -  [EMAIL PROTECTED]  --  http://www.sysarch.com
SYStems ARCHitecture, Software Engineering, Perl, Internet, UNIX Consulting
The Perl Books Page  ---  http://www.sysarch.com/cgi-bin/perl_books
The Best Search Engine on the Net  --  http://www.northernlight.com



Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Dave Storrs



On Wed, 27 Sep 2000, Richard Proctor wrote:
> > Both \1 and $1 refer to what is matched by the first set of parens in a
> > regex.  AFAIK, the only difference between these two notation is that \1
> > is used within the regex itself and $1 is used outside of the regex.  Is
> > there any reason not to standardize these down to one notation (i.e.,
> > eliminate one or the other)?
> 
> I think this is fixable.  

The way you phrase that makes it sound that other people perceive
this as a problem as well, which gives me all sorts of warm fuzzies. :>

>The only real need for this at the moment is to
> overcome limitations in the order of expansion of regexes.  RFCs 112, 166,
> 276... all depend on fixing this.  

Ok, here's another question.  How the _HELL_ does everyone else on
this bloody list keep track of every detail in every frigging RFC?  Some
random comment comes up, and someone will go, "Oh, the third paragraph of
the second section in RFC 0x97A already mentioned this as a parenthetical
aside, despite the fact that its title and primary topic had no relation
to the issue."  I still have (mumble-mumble) RFCs that I haven't even had
time to *read*, let alone memorize every detail of!

Grr*grumble, grumble, moan, whinge*

Ok, back to rationality now.

>If the regex compiler gets in before the
> expansion of the variables to make these work, it could handle $1 in all cases
> \1 can be retained for compatibility.

Do we *want* to maintain \1?  Why have two notations to do the
same thing when one is clearly superior?  (\1 can only go up to \9 while
the other could theoretically go to ${...}.)  Perl6 is breaking
backwards compatibility and eliminating all deprecated features...let's
get rid of \n as backreference notation.

Dave




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Dave Storrs



On Wed, 27 Sep 2000, Jonathan Scott Duff wrote:

> If $1 could be made to work properly on the LHS of s///, I'd vote for
> that being The Way.

That was pretty much my thought?




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Michael Maraist


From: "Dave Storrs" <[EMAIL PROTECTED]>

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex.  AFAIK, the only difference between these two notation is that \1
> is used within the regex itself and $1 is used outside of the regex.  Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

\1 came from sed and friends.  I think an early driving force was
maintaining familiarity with things like awk and sed.  Even today there are
still people that switch to and from other reg-ex languages.  Emacs is the
most common for me (though I still dabble with awk).  I don't see a real
advantage in taking out \1, and it is very likely to needlessly break legacy
code, and additionally confuse various developers that have a habit of
using \1.

On the other hand, the use of $1 with substitutions is important for
consistency.  When you write s/../.../e, you're going to need to use a
substitution variable; "\1" just doesn't fit.
s/(...)/pre\1post/;  works fine
s/(...)/pre$1post/; is the question. I tend to use it only because I
sometimes switch to:
s/(...)/func() . "$1post"/e;  for various reasons.  I just try to
standardize on $1, but that's just me.

Additionally, the use of $1 in the matching reg-ex is ambiguous, as in:
m/(...).*?$1/;
Does it refer to the internal set of (..), or does it mean the previous
value of $1 before this match?  This becomes non-obvious to the observer in
the following case:
m/($keyword).*?$1/;
Here, our mindset is substitution of external variables; the casual
(non-seasoned) observer might not understand that it really means:
m/($keyword).*?\1/;
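
A small script makes the three cases concrete.  This is only a sketch;
the sample strings and variable names are invented:

```perl
use strict;
use warnings;

my $text = 'abc';

# Plain replacement: $1 is interpolated as in a double-quoted string.
(my $plain = $text) =~ s/(b)/pre$1post/;

# With /e the replacement is Perl code, where only $1 makes sense.
(my $evaluated = $text) =~ s/(b)/uc($1) . "post"/e;

# Inside the pattern itself, \1 refers to the group of the current match.
my $keyword  = 'na';
my $repeated = 'banana' =~ /($keyword).*?\1/ ? 1 : 0;

print "$plain $evaluated $repeated\n";
```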

My argument is that both \1 and $1 have their places, and limiting to one
type can be troublesome.  Plus, TMTOWTDI. :)

-Michael




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Richard Proctor



Dave,

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex.  AFAIK, the only difference between these two notation is that \1
> is used within the regex itself and $1 is used outside of the regex.  Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

I think this is fixable.  The only real need for this at the moment is to
overcome limitations in the order of expansion of regexes.  RFCs 112, 166,
276... all depend on fixing this.  If the regex compiler gets in before the
expansion of the variables to make these work, it could handle $1 in all
cases; \1 can be retained for compatibility.

Richard





Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Jonathan Scott Duff

On Wed, Sep 27, 2000 at 08:15:53AM -0700, Dave Storrs wrote:
> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex.  AFAIK, the only difference between these two notation is that \1
> is used within the regex itself and $1 is used outside of the regex.  Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

\1 can be used on the LHS of a s/// whereas $1 there probably won't do
what you expect.  Also, \1, \2, \3 only takes you as far as \9 ;-)

If $1 could be made to work properly on the LHS of s///, I'd vote for
that being The Way.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



is \1 vs $1 a necessary distinction?

2000-09-27 Thread Dave Storrs

Both \1 and $1 refer to what is matched by the first set of parens in a
regex.  AFAIK, the only difference between these two notation is that \1
is used within the regex itself and $1 is used outside of the regex.  Is
there any reason not to standardize these down to one notation (i.e.,
eliminate one or the other)?

Dave




Re: RFC 274 (v1) Generalised Additions to Regexs

2000-09-27 Thread Richard Proctor




> In <[EMAIL PROTECTED]>, Perl6 RFC
> Librarian writes:
> :Given that expansion of regexes could include (+...) and (*...) I
> :have been thinking about providing a general purpose way of adding
> :functionality.  Hence I propose that the entire (+...) syntax is
> :kept free from formal specification for this. (+ = addition)
> :
> :A module or anything that wants to support some enhanced syntax
> :registers something that handles "regex enhancements".
> :
> :At regex compile time, if and when (+foo) is found perl calls
> :each of the registered regex enhancements in turn, these:
> :
> :1) Are passed the foo string as a parameter exactly as is.  (There
> :is an issue of actually finding the end of the generic foo.)
> :
> :2) The regex enhancement can either recognise the content or not.
>
> Is this the right approach? If more than one callback is registered,
> this seems likely to lead to results dependent on the order of
> registration.

Maybe, maybe not.  Does a newer localised definition replace the older
one?  The handling of multiple registrations has to be resolved.
My initial thoughts are that a "last registered is checked first"
approach may be best.

>
> I'd be more inclined to have callbacks registered for a word: that
> way we can complain earlier when two modules try to register the
> same word. Then at regexp-compile time we parse out the word
> following the (+ and immediately know who to pass it to (or fail).

This is equally possible; my thoughts were to leave the syntax
completely open so that anything could be added - words, Chinese,
$$$ - and leave it to the enhancements to recognise it or not.  I
could add this as an alternative for V2.
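
The word-keyed alternative can be prototyped today as a preprocessing
step on the pattern string.  Everything in this sketch is hypothetical
(register_enhancement, expand_enhancements, the ipv4/octet handlers and
the depth limit of 10); it is not the RFC's interface, only one way to
try the idea, including the complaint on duplicate registration and a
crude check for runaway recursion:

```perl
use strict;
use warnings;

# Hypothetical registry: one handler per word.
my %enhancement;

sub register_enhancement {
    my ($word, $handler) = @_;
    die "regex enhancement '$word' is already registered\n"
        if exists $enhancement{$word};
    $enhancement{$word} = $handler;
    return;
}

# Expand every (+word) or (+word args) in a pattern string.  A handler may
# return text that itself contains (+...), so expansion repeats, with a
# depth limit to report runaway recursion.
sub expand_enhancements {
    my ($pattern, $max_depth) = @_;
    $max_depth = 10 unless defined $max_depth;
    for (1 .. $max_depth) {
        my $changed = $pattern =~ s{\(\+(\w+)(?:\s+([^()]*))?\)}{
            my $handler = $enhancement{$1}
                or die "unknown regex enhancement '$1'\n";
            $handler->(defined $2 ? $2 : '');
        }ge;
        return $pattern unless $changed;
    }
    die "regex enhancements nested deeper than $max_depth levels\n";
}

register_enhancement(ipv4  => sub { '(?:(+octet)\.){3}(+octet)' });
register_enhancement(octet => sub { '(?:25[0-5]|2[0-4]\d|1?\d?\d)' });

my $re = expand_enhancements('^(+ipv4)$');
my $compiled = qr/$re/;
print "192.168.0.1" =~ $compiled ? "ok\n" : "not ok\n";
```

Returning a string that goes back through the compiler corresponds to
option (a) in the RFC; option (b), returning a coderef to run at match
time, is not covered here.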

>
> :5) if an enhancement recognises the content it could do either of:
> :
> :a) return replacement expanded regex using existing capabilities
> :perl will then pass this back through the regex compiler.
>
> Can we/should we detect (+...) loops? Or are you suggesting that the
> returned string should not permit (+...) expansion?
>

Should we detect?  Probably not.  Should we allow it?  Definitely yes.  The
only grounds for detection are to report infinite recursion.

> :b) return a coderef that is called at run time when the regex gets
> :to this point.
>
> Ok.
>
> :  The referenced code needs to have enough access to the regex
> :internals to be able to see the current sub-expression, request
> :more characters, access to relevant flags and visibility of
> :greediness.
>
> I don't see that this is a good idea; it makes more sense to me that
> the coderef is treated exactly as if it had been compiled from (?{...}).

Let's look at these one at a time:

Access to subexpressions - OK, this can be done.

Visibility of flags - Not currently possible.  The code might
like to know that /i is in effect, it might want to know that /s is
in effect; it probably does not need to know about /o.  This is equally
true of the enhancement regex handler that looks at the (+foo) in the
first place.  I think that these could be of use to (?{...}) code.

Greediness - maybe not necessary, but I think better visibility of
internals might be beneficial.

>
> :Following on, if (?{...}) etc code is evaluated
> :in forward match, it would be a good idea to likewise support some
> :code block that is ignored on a forward match but is executed when the
> :code is unwound due to backtracking.
>
> The support in (?{...}) for localisation is (as I understand it) the
> intended mechanism for permitting such effects. Can you describe some
> specific problems you are trying to solve here?

Is localisation enough?  It might be, it might be nicer however to
provide a more explicit mechanism to handle more complex cases.
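
For reference, the localisation mechanism Hugo mentions looks like this
in practice.  The sketch follows the counting example in the perlre
documentation; $count and $result are package variables because local()
does not apply to lexicals:

```perl
use strict;
use warnings;

our ($count, $result);

'aaaaaaaa' =~ m<
    (?{ $count = 0 })                            # initialise
    (?: a (?{ local $count = $count + 1 }) )*    # count each "a" taken
    aaaa                                         # forces four to be given back
    (?{ $result = $count })                      # copy out on success
>x or die "no match";

print "$result\n";
```

Each time the engine backtracks out of an iteration, the matching local
is undone, so $result ends up as the number of iterations that survived
rather than the number attempted.  What this does not give is a hook that
runs arbitrary code at the moment of unwinding, which is what the RFC
asks for.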

>
> Hugo
>

Richard





Re: RFC 274 (v1) Generalised Additions to Regexs

2000-09-27 Thread Hugo

In <[EMAIL PROTECTED]>, Perl6 RFC Librarian writes:
:Given that expansion of regexes could include (+...) and (*...) I have
:been thinking about providing a general purpose way of adding
:functionality.  Hence I propose that the entire (+...) syntax is
:kept free from formal specification for this. (+ = addition)
:
:A module or anything that wants to support some enhanced syntax
:registers something that handles "regex enhancements".
:
:At regex compile time, if and when (+foo) is found perl calls
:each of the registered regex enhancements in turn, these:
:
:1) Are passed the foo string as a parameter exactly as is.  (There is
:an issue of actually finding the end of the generic foo.)
:
:2) The regex enhancement can either recognise the content or not.

Is this the right approach? If more than one callback is registered,
this seems likely to lead to results dependent on the order of
registration.

I'd be more inclined to have callbacks registered for a word: that
way we can complain earlier when two modules try to register the
same word. Then at regexp-compile time we parse out the word
following the (+ and immediately know who to pass it to (or fail).

:5) if an enhancement recognises the content it could do either of:
:
:a) return replacement expanded regex using existing capabilities perl will
:then pass this back through the regex compiler.

Can we/should we detect (+...) loops? Or are you suggesting that the
returned string should not permit (+...) expansion?

:b) return a coderef that is called at run time when the regex gets to this
:point.

Ok.

:  The referenced code needs to have enough access to the regex internals
:to be able to see the current sub-expression, request more characters, access
:to relevant flags and visibility of greediness.

I don't see that this is a good idea; it makes more sense to me that the
coderef is treated exactly as if it had been compiled from (?{...}).

:Following on, if (?{...}) etc code is evaluated
:in forward match, it would be a good idea to likewise support some
:code block that is ignored on a forward match but is executed when the
:code is unwound due to backtracking.

The support in (?{...}) for localisation is (as I understand it) the
intended mechanism for permitting such effects. Can you describe some
specific problems you are trying to solve here?

Hugo



Re: RFC 198 (v2) Boolean Regexes

2000-09-27 Thread Richard Proctor



Hi Tom,

Welcome to England (I presume)

> This seems very complicated.  Did you look at the Ram:6 recipe on
> expressing AND, OR, and NOT  in a regex?  For example, to do
> /FOO/ && /BAR/ you need not write /FOO.*BAR|BAR.*FOO/ -- and in
> fact, should not, as it doesn't work properly on some pairs!
> For example, /CAN/ && /ANAL/ can't be written /CAN.*ANAL|ANAL.*CAN/
> if you expect to match "CANAL".   Overlaps bite you.  You really
> need /(?=.*CAN)(?=.*ANAL)/ instead, which permits multiple assertions.
> Please check out the recipe I'm talking about.
>
> --tom, from a strange place

I will start by admitting I don't have the Ram.  I was brainstorming ideas (my
day job involves a lot of brainstorming) and trying to think of new/better ways
to do things.  I am more interested in concepts than syntax.

Richard





RFC 198 (v2) Boolean Regexes

2000-09-27 Thread Tom Christiansen

This seems very complicated.  Did you look at the Ram:6 recipe on
expressing AND, OR, and NOT  in a regex?  For example, to do
/FOO/ && /BAR/ you need not write /FOO.*BAR|BAR.*FOO/ -- and in 
fact, should not, as it doesn't work properly on some pairs!
For example, /CAN/ && /ANAL/ can't be written /CAN.*ANAL|ANAL.*CAN/
if you expect to match "CANAL".   Overlaps bite you.  You really
need /(?=.*CAN)(?=.*ANAL)/ instead, which permits multiple assertions.
Please check out the recipe I'm talking about.

--tom, from a strange place

PS: NB -- I cannot access my mail spool.  And the mailing list
archives are 4 days behind on the website, so there is no 
hope of me participating in real-time, nor in seeing any 
private replies.
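
The overlap problem is easy to reproduce.  A short sketch comparing the
two patterns from the message on a few invented test words:

```perl
use strict;
use warnings;

my $ordered = qr/CAN.*ANAL|ANAL.*CAN/;      # one word must follow the other
my $both    = qr/(?=.*CAN)(?=.*ANAL)/;      # two independent assertions

for my $word ('CANAL', 'CAN ANALYSE', 'CANOE') {
    printf "%-12s ordered=%-3s both=%s\n", $word,
        ($word =~ $ordered ? 'yes' : 'no'),
        ($word =~ $both    ? 'yes' : 'no');
}
```

Only the lookahead version accepts "CANAL", because each assertion scans
the string on its own and consumes nothing.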
