Re: RFC 112 (v3) Assignment within a regex

2000-09-29 Thread Richard Proctor




> On Fri, 29 Sep 2000 01:02:40 +0100, Hugo wrote:
>
> >It also isn't clear what parts of the expression are interpolated at
> >compile time; what should the following leave in %foo?
> >
> >  %foo = ();
> >  $bar = "one";
> >  "twothree" =~ / (?$bar=two) (?$foo{$bar}=three) /x;
>
> It's not just that. You act as if this assignment takes place
> whenever a submatch succeeds. So:
>
>  "twofour" =~ /(?$bar=two)(?$foo=three)/;
>
> Will $bar be set to "two", and $foo undef? I think not. Assignment
> should be postponed till the very end, when the match finally
> succeeds, as a whole.

In general, all assignments should wait until the very end of the match and
then be made together.  However, before code callouts such as (?{...}) and
related constructs, the named assignments that are currently defined should
be made (localised) so that the code can refer to them by name.  If the
expression finally fails, the localised values would unroll.
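
As a rough illustration of the "localise, then unroll" idea, Perl 5's existing (?{...}) blocks can approximate it.  This is only a sketch and not the RFC 112 syntax: the helper name try_match, the package variables and the use of $^N (the text of the most recently closed group) are assumptions made for the example.

```perl
use strict;
use warnings;

our ($bar, %result);   # package variables, so local() can be applied to them

# Hypothetical helper: the pattern stands in for something like /(?$bar=two)three/.
sub try_match {
    my ($string) = @_;
    %result = ();
    return $string =~ /
        (two) (?{ local $bar = $^N })   # tentative, localised assignment
        three
        (?{ $result{bar} = $bar })      # reached only if everything matched
    /x ? 1 : 0;
}

my $failed        = try_match("twofour");    # "three" never matches
my %after_failure = %result;                 # nothing was committed
my $succeeded     = try_match("twothree");
my %after_success = %result;                 # bar was committed at the end
```

The localised $bar is undone when the engine backtracks out of the first group, so only a complete match copies anything into %result.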

>
> Therefore, I think that allowing just any l-value on the left of the "="
> sign, is not practical. Or is it?

I think any simple scalar value is reasonable.

>
> OTOH I would rather have that all submatches would be assigned to a
> hash, not to global or lexical variables. I have no clue about what
> syntax that would need.

That is in RFC 150; I think there is a case for both.

Richard





Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Richard Proctor

On Wed 27 Sep, Dave Storrs wrote:
> 
> 
> On Wed, 27 Sep 2000, Richard Proctor wrote:
> > > Both \1 and $1 refer to what is matched by the first set of parens in a
> > > regex.  AFAIK, the only difference between these two notations is that
> > > \1 is used within the regex itself and $1 is used outside of the
> > > regex.  Is there any reason not to standardize these down to one
> > > notation (i.e., eliminate one or the other)?
> > 
> > I think this is fixable.  
> 
>   The way you phrase that makes it sound that other people perceive
> this as a problem as well, which gives me all sorts of warm fuzzies. :>
> 
> > The only real need for this at the moment is to overcome limitations in
> > the order of expansion of regexes.  RFCs 112, 166, 276... all depend on
> > fixing this.  
> 
>   Ok, here's another question.  How the _HELL_ does everyone else on
> this bloody list keep track of every detail in every frigging RFC?  Some
> random comment comes up, and someone will go, "Oh, the third paragraph of
> the second section in RFC 0x97A already mentioned this as a parenthetical
> aside, despite the fact that its title and primary topic had no relation
> to the issue."  I still have (mumble-mumble) RFCs that I haven't even had
> time to *read*, let alone memorize every detail of!

In this context I was the author of, guess what, RFCs 112, 166 and 276 (though
I admit to having to look up the number of the last one).

> 
>   Grr*grumble, grumble, moan, winge*
> 
>   Ok, back to rationality now.
> 
> > If the regex compiler gets in before the expansion of the variables to
> > make these work, it could handle $1 in all cases; \1 can be retained for
> > compatibility.
> 
>   Do we *want* to maintain \1?  Why have two notations to do the
> same thing when one is clearly superior?  (\1 can only go up to \9 while
> the other could theoretically go to ${...}.)  Perl6 is breaking
> backwards compatibility and eliminating all deprecated features...let's
> get rid of \n as backreference notation.
> 

The principal issue would be what to do about code that uses $1 on the LHS
with its current meaning, which is rather good for obfuscated code but not
terribly kind to normal programming.

Note that RFC 112 covers assignment within a regex, naming rather than
numbering the brackets one wishes to capture; it also covers named back
references.

Currently $1 is expanded by the quoting before the regex compiler gets to
play; the regex compiler sees the \1 and knows what to do.  I am reasonably
happy with \ meaning "refer back"; the numbers I am not.
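
A small sketch of the current Perl 5 behaviour being discussed may help; the variable names are only illustrative.  It shows that $1 inside a pattern is interpolated from the previous successful match before the regex compiler runs, while \1 is resolved by the regex compiler itself.

```perl
use strict;
use warnings;

"a" =~ /(a)/ or die "setup match failed";   # $1 now holds "a"

my $text = "bb";

# $1 is interpolated before the pattern is compiled, so this is really /(b)a/.
my $dollar_form = $text =~ /(b)$1/ ? 1 : 0;

# \1 is seen by the regex compiler and refers to group 1 of this pattern.
my $backslash_form = $text =~ /(b)\1/ ? 1 : 0;
```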

Richard

-- 

[EMAIL PROTECTED]




Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Richard Proctor



Dave,

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex.  AFAIK, the only difference between these two notations is that \1
> is used within the regex itself and $1 is used outside of the regex.  Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

I think this is fixable.  The only real need for this at the moment is to
overcome limitations in the order of expansion of regexes.  RFCs 112, 166,
276... all depend on fixing this.  If the regex compiler gets in before the
expansion of the variables to make these work, it could handle $1 in all
cases; \1 can be retained for compatibility.

Richard





Re: RFC 274 (v1) Generalised Additions to Regexs

2000-09-27 Thread Richard Proctor




> In <[EMAIL PROTECTED]>, Perl6 RFC
> Librarian writes:
> :Given that expansion of regexes could include (+...) and (*...) I
> :have been thinking about providing a general purpose way of adding
> :functionality.  Hence I propose that the entire (+...) syntax is
> :kept free from formal specification for this. (+ = addition)
> :
> :A module or anything that wants to support some enhanced syntax
> :registers something that handles "regex enhancements".
> :
> :At regex compile time, if and when (+foo) is found perl calls
> :each of the registered regex enhancements in turn, these:
> :
> :1) Are passed the foo string as a parameter exactly as is.  (There
> :is an issue of actually finding the end of the generic foo.)
> :
> :2) The regex enhancement can either recognise the content or not.
>
> Is this the right approach? If more than one callback is registered,
> this seems likely to lead to results dependent on the order of
> registration.

Maybe, maybe not.  Does a newer localised definition replace the older
one?  The handling of multiple registrations has to be resolved.
My initial thoughts are that a "Last registered is checked first"
approach may be best.

>
> I'd be more inclined to have callbacks registered for a word: that
> way we can complain earlier when two modules try to register the
> same word. Then at regexp-compile time we parse out the word
> following the (+ and immediately know who to pass it to (or fail).

This is equally possible; my thoughts were to leave the syntax
completely open so that anything could be added - words, Chinese,
$$$ - and leave it to the enhancements to recognise it or not.  I
could add this as an alternative for V2.

>
> :5) if an enhancement recognises the content it could do either of:
> :
> :a) return replacement expanded regex using existing capabilities
> :perl will then pass this back through the regex compiler.
>
> Can we/should we detect (+...) loops? Or are you suggesting that the
> returned string should not permit (+...) expansion?
>

Should we detect?  Probably not.  Should we allow them?  Definitely yes.  The
only grounds for detection are to report infinite recursion.

> :b) return a coderef that is called at run time when the regex gets
> :to this point.
>
> Ok.
>
> :  The referenced code needs to have enough access to the regex
> :internals to be able to see the current sub-expression, request
> :more characters, access to relevant flags and visibility of
> :greediness.
>
> I don't see that this is a good idea; it makes more sense to me that
> the coderef is treated exactly as if it had been compiled from (?{...}).

Let's look at these one at a time:

Access to subexpressions - OK, this can be done.

Visibility of flags - Not currently possible.  The code might
like to know that /i is in effect, and it might want to know that /s is
in effect; it probably does not need to know about /o.  This is equally
true of the enhancement regex handler that looks at the (+foo) in the
first place.  I think that these could also be of use to (?{...}) code.

Greediness - maybe not necessary, but I think better visibility of
internals might be beneficial.

>
> :Following on, if (?{...}) etc code is evaluated
> :in forward match, it would be a good idea to likewise support some
> :code block that is ignored on a forward match but is executed when the
> :code is unwound due to backtracking.
>
> The support in (?{...}) for localisation is (as I understand it) the
> intended mechanism for permitting such effects. Can you describe some
> specific problems you are trying to solve here?

Is localisation enough?  It might be; it might be nicer, however, to
provide a more explicit mechanism to handle more complex cases.

>
> Hugo
>

Richard





Re: RFC 198 (v2) Boolean Regexes

2000-09-27 Thread Richard Proctor



Hi Tom,

Welcome to England (I presume)

> This seems very complicated.  Did you look at the Ram:6 recipe on
> expressing AND, OR, and NOT  in a regex?  For example, to do
> /FOO/ && /BAR/ you need not write /FOO.*BAR|BAR.*FOO/ -- and in
> fact, should not, as it doesn't work properly on some pairs!
> For example, /CAN/ && /ANAL/ can't be written /CAN.*ANAL|ANAL.*CAN/
> if you expect to match "CANAL".   Overlaps bite you.  You really
> need /(?=.*CAN)(?=.*ANAL)/ instead, which permits multiple assertions.
> Please check out the recipe I'm talking about.
>
> --tom, from a strange place

I will start by admitting I don't have the Ram.  I was brainstorming ideas (my
day job involves a lot of brainstorming) and trying to think of new/better ways
to do things.  I am more interested in concepts than syntax.
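
For readers without the book, the overlap problem described in the quoted text can be reproduced with a few lines of Perl 5.  This is a sketch with illustrative variable names, not the recipe itself.

```perl
use strict;
use warnings;

my $str = "CANAL";

# Alternation of the two orderings: fails because CAN and ANAL overlap.
my $alternation = $str =~ /CAN.*ANAL|ANAL.*CAN/ ? 1 : 0;

# Two look-ahead assertions anchored at the start: each scans the whole
# string independently, so overlaps do not matter.
my $lookahead = $str =~ /^(?=.*CAN)(?=.*ANAL)/ ? 1 : 0;

# The plain two-match form, for comparison.
my $two_matches = ($str =~ /CAN/ && $str =~ /ANAL/) ? 1 : 0;
```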

Richard





Re: Perlstorm #0040

2000-09-24 Thread Richard Proctor

On Sun 24 Sep, Hugo wrote:
> In <[EMAIL PROTECTED]>, Richard Proctor 
> writes
> :
> :TomC's perl storm has:
> :
> :> Figure out way to do 
> :> 
> :> /$e1 $e2/
> :> 
> :> safely, where $e1 might have '(foo) \1' in it. 
> :> and $e2 might have '(bar) \1' in it.  Those won't work.
> :
> :If e1 and e2 are qr// type things the answer might be to localise 
> :the backref numbers in each qr// expression.  
> :
> :If they are not qr//s it might still be possible to achieve this: if the
> :expansion of variables in regexes is done by the regex compiler, it
> :could recognise this context and localise the backrefs.
> :
> :Any code like this is going to have real problems with $1 etc. if used
> :later; use of assignment in a regex and named backrefs (RFC 112) would
> :make this a lot safer.
> 
> I think it is reasonable to ask whether the current handling of qr{}
> subpatterns is correct:
> 
>   perl -wle '$a=qr/(a)\1/; $b=qr/(b).*\1/; /$a($b)/g and print join ":", $1, pos for "aabbac"'
>   a:5
> 
> I'm tempted to suggest it isn't; that the paren count should be local
> to each qr{}, so that the above prints 'bb:4'. I think that most people
> currently construct their qr{} patterns as if they are going to be
> handled in isolation, without regard to the context in which they are
> embedded - why else do they override the embedder's flags if not to
> achieve that?

This seems the right way to go.

> The problem then becomes: do we provide a mechanism to access the
> nested backreferences outside of the qr{} in which they were referenced,
> and if so what syntax do we offer to achieve that? I don't have an answer
> to the latter, which tempts me to answer 'no' to the former for all the
> wrong reasons. I suspect (and suggest) that complication is the only
> reason we don't currently have the behaviour I suggest the rest of the
> semantics warrant - that backreferences are localised within a qr().

With the suggestions from RFC 112 (assignment within the regex and
named backreferences), there is a solution for anyone trying to
get at a backref inside a nested qr(); I think this is a reasonable way
forward.
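
The numbering problem with embedded qr// patterns can be shown directly.  The sketch below also uses the relative backreference \g{-1}, which was added to Perl 5 in a later release (5.10) and is not part of any RFC discussed here; the variable names are illustrative.

```perl
use 5.010;
use strict;
use warnings;

# Absolute backreferences: \1 always means group 1 of the combined pattern.
my $abs1 = qr/(foo) \1/;
my $abs2 = qr/(bar) \1/;

# Relative backreferences: \g{-1} means "the group just before this point".
my $rel1 = qr/(foo) \g{-1}/;
my $rel2 = qr/(bar) \g{-1}/;

my $target = "foo foo bar bar";

# In the combined pattern the second \1 refers to (foo), so this fails.
my $absolute_ok = $target =~ /$abs1 $abs2/ ? 1 : 0;

# Each \g{-1} stays attached to its own group, so this matches.
my $relative_ok = $target =~ /$rel1 $rel2/ ? 1 : 0;
```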

> I lie: the other reason qr{} currently doesn't behave like that is that
> when we interpolate a compiled regexp into a context that requires it be
> recompiled, we currently ignore the compiled form and act only on the
> original string. Perhaps this is also an insufficiently intelligent thing
> to do.
> 
> Hugo
> 

Yes, this and MJD's comment about the reentrant regex engine.  I will stick
this in an RFC in a few minutes.

Richard

-- 

[EMAIL PROTECTED]




Perlstorm #0040

2000-09-23 Thread Richard Proctor

TomC's perl storm has:

> Figure out way to do 
> 
> /$e1 $e2/
> 
> safely, where $e1 might have '(foo) \1' in it. 
> and $e2 might have '(bar) \1' in it.  Those won't work.

If e1 and e2 are qr// type things the answer might be to localise 
the backref numbers in each qr// expression.  

If they are not qr//s it might still be possible to achieve this: if the
expansion of variables in regexes is done by the regex compiler, it could
recognise this context and localise the backrefs.

Any code like this is going to have real problems with $1 etc. if used later;
use of assignment in a regex and named backrefs (RFC 112) would make this
a lot safer.

Richard

-- 

[EMAIL PROTECTED]




Re: perl6-language-regex summary for 20000920

2000-09-20 Thread Richard Proctor

On Thu 21 Sep, Hugo wrote:
> perl6-language-regex
> 
> Summary report 20000920
> 
> Mark-Jason Dominus has relinquished the wg chair due to the pressure
> of other commitments; I'll be taking over the chair for the short
> time remaining. Thanks to Mark-Jason for all his hard work.
> 
> I'll be contacting the authors of all outstanding RFCs shortly to
> encourage them to work towards freezing them as soon as practical.
> 
> Hugo
> 

Welcome to the job


> RFC 112: Assignment within a regex  (Richard Proctor)
> 
> No discussion.

There was some before you joined the list - I have a couple of things
I want to tidy up.

> RFC 145: Brace-matching for Perl Regular Expressions  (Eric Roode)
> 
> No discussion directly about this RFC. The discussion of XML/HTML-
> -specific extensions continued for a short while, but has not
> resulted in an RFC.

I would like to see what was proposed in the discussions about bracket
matching written up in a new version of this RFC (even if it is eventually
dropped)

> RFC 165: Allow variables in tr///  (Richard Proctor)
> 
> Surprisingly, no discussion.

There was lots prior to you joining the list - this is almost frozen

> 
> RFC 166: Alternative lists and quoting of things  (Richard Proctor)
> 
> New version, with a new name (was 'Additions to regexs'). This RFC
> is not currently available from the archive due to a misfiling, but
> you'll find it here:
>   http://www.mail-archive.com/perl6-language-regex@perl.org/msg00350.html
> 
> This removes two of the three original suggestions, and expands on
> the remaining one. Mark-Jason pointed out that the (new) extension
> to (?\Q$foo) is not needed.

An update will follow

> RFC 198: Boolean Regexes (Richard Proctor)
> 
> No discussion.

About to go to issue 2.

> 
> New RFCS

Please take a look at my posting "Generalized additions to regexes"
this is a sort of proto RFC.

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 166 (v2) Alternative lists and quoting of things

2000-09-17 Thread Richard Proctor

On Sat 16 Sep, Mark-Jason Dominus wrote:
> 
> > (?Q$foo) Quotes the contents of the scalar $foo - equivalent to
> > (??{ quotemeta $foo }).
> 
> How is this different from
> 
> \Q$foo\E

Um - not at all - think of it as a brainstorming overrun...

BTW have you any thoughts about my "Generalised additions to regexes"
that I posted last week?  Some discussion and filling in of gaps would
really be needed before it becomes an RFC.

Richard
-- 

[EMAIL PROTECTED]




Re: RFC 166 (v1) Additions to regexs

2000-09-13 Thread Richard Proctor

On Wed 13 Sep, Bart Lateur wrote:
> On Tue, 12 Sep 2000 19:01:35 -0400, Mark-Jason Dominus wrote:
> 
> >I don't know what you mean, but you're mistaken, because it means to
> >interpolate @foo as in a double-quoted string.
> 
> Which is precisely the meaning he wants for it, with $" set to '|'.
> 
> I wonder if we're not trying too hard. What if, inside regexes, $" is
> always localized and set to '|'. What if we change the meaning of
> "\Q@foo" so it only metaquotes the contents of the array, not of the
> separator.
> 
>   @foo = ('a.b', 'a+b', 'a*b');
>   $" = '|';
>   print "\Q@foo";
> -->
>   a\.b\|a\+b\|a\*b
> 
> Hmm... We can't really use this result, can we?
> 

This is getting there, but not quite.  The problem is the order in which
the regex compiler performs the actions.  The expansion of @foo (and $bar)
currently occurs BEFORE the regex is examined as a regex and before the
compiler looks at \Q etc.  This construct also does not wrap the @foo in
a set of brackets.

I have no special desire for one type of syntax or another, so the
list expansion could be @foo or (?@foo).  What this also exposes is my
idea (in RFC 112) for regex assignment.  Both would suffer from the same
ordering problem.

If we want to allow more facilities within regexes, the order of expansion
may need to be changed to allow the regex compiler to get in before
the handling of $s and @s.  [This is feasible.]

As well as \Quoting an array, I also think we should have the complementary
\Quoting of a scalar.
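
For comparison, here is a sketch of how the "quoted alternatives" effect can be had in Perl 5 today by quoting each element before joining.  The variable names are illustrative; it also reproduces the "\Q@foo" result from the quoted message, where the separator gets quoted as well.

```perl
use strict;
use warnings;

my @foo = ('a.b', 'a+b', 'a*b');

# What "\Q@foo" gives today with $" set to '|': the separator is quoted too.
my $quoted_whole = do { local $" = '|'; "\Q@foo" };

# Quote each element first, then join with an unquoted '|' and wrap the
# result in a non-capturing group.
my $alternatives = join '|', map { quotemeta($_) } @foo;
my $re = qr/^(?:$alternatives)$/;

my $matches_literal = "a+b" =~ $re ? 1 : 0;   # one of the listed strings
my $matches_aab     = "aab" =~ $re ? 1 : 0;   # would match an unquoted a.b
```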

Richard

-- 

[EMAIL PROTECTED]




Generalised Additions to regexes

2000-09-12 Thread Richard Proctor

(proto RFC possibly, and some generalised ramblings)

Given that expansion of regexes could include (+...) and (*...) I have been
thinking about providing a general purpose way of adding functionality. 

I propose that the entire (+...) syntax is kept free from formal
specification for this and is available for pluggable (module) expansion. 
(+ = addition).

A module or anything that wants to support some enhanced syntax
registers something that handles "regex enhancements".

At regex compile time, if and when (+foo) is found perl calls
each of the registered regex enhancements in turn, these:

1) Are passed the foo string as a parameter exactly as is.  (There is
an issue of actually finding the end of the foo.)

2) The regex enhancement can either recognise the content or not.

3) If not, it returns undef and perl goes to the next regex enhancement.
(Does it handle the enhancements as a stack (last registered, checked first)
or a list (first registered, checked first)?  How are they scoped?  Job here
for the OO fanatics.)

4) If perl runs out of regex enhancements it reports an error.  

5) if an enhancement recognises the content it could do either of:

a) return a replacement, expanded regex using existing capabilities; perl will
then pass this back through the regex compiler.

b) return a coderef that is called at run time when the regex gets to this
point.  The referenced code needs to have enough access to the regex
internals to be able to see the current sub-expression, request more
characters, and have access to relevant flags and visibility of greediness.
It may also need a coderef that is similarly called when the regex is being
unwound when it backtracks.
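
To make steps 1 to 5a concrete, here is a rough sketch written as a preprocessor over a pattern string.  Everything in it is hypothetical: register_enhancement, expand_enhancements and the two sample handlers are invented for illustration, the "last registered, checked first" order is only one of the options raised above, and case 5b (returning a coderef) is not covered.

```perl
use strict;
use warnings;

my @enhancements;   # hypothetical registry of handler coderefs

# Register a handler; the most recently registered one is consulted first.
sub register_enhancement {
    my ($handler) = @_;
    unshift @enhancements, $handler;
}

# Expand every (+...) in a pattern string.  For simplicity the body may not
# contain parentheses, which sidesteps the "finding the end" issue.
sub expand_enhancements {
    my ($pattern) = @_;
    $pattern =~ s{\(\+([^()]*)\)}{
        my $body = $1;
        my $replacement;
        for my $handler (@enhancements) {
            $replacement = $handler->($body);
            last if defined $replacement;    # step 5: recognised
        }
        die "No regex enhancement recognises (+$body)\n"
            unless defined $replacement;     # step 4: nobody claimed it
        $replacement;
    }ge;
    return $pattern;
}

# Two sample handlers: each returns a replacement regex or undef.
register_enhancement(sub { $_[0] eq 'digits' ? '\d+' : undef });
register_enhancement(sub { $_[0] =~ /^word(\d+)$/ ? "\\w{$1}" : undef });

my $expanded = expand_enhancements('^(+word3)-(+digits)$');
my $matches  = "abc-42" =~ /$expanded/ ? 1 : 0;
```

This version makes a single pass, so a replacement that itself contains (+...) is not expanded again; a real implementation would have to decide that question along with loop detection.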


Thinking from that - the last case should be generalised (it is sort of
like my (?*{...}) from RFC 198).  If so, both cases a and b are the same;
b is just a case of returning (?*{...}).

Following on, if (?{...}) etc code is evaluated in forward match, it would
be a good idea to likewise support some code block that is ignored on a
forward match but is executed when the code is unwound due to backtracking. 
Thus (?{ foo })(?\{ bar }) could be defined to execute foo on the forward
case and bar if it unwinds.  

For example - Think about foo putting something on a stack (eg the
bracket to match [RFC 145]) and bar taking it off.

I don't care at the moment what the syntax is - what about the concepts?

Richard






-- 

[EMAIL PROTECTED]




Re: RFC 166 (v1) Additions to regexs

2000-09-12 Thread Richard Proctor

On Mon 11 Sep, Mark-Jason Dominus wrote:
> 
> > (?@foo) is sort of equivalent to (??{join('|',@foo)}), ie it expands into
> > a list of alternatives.  One could possibly use just @foo for this.
> 
> It just occurs to me that this is already possible.  I've written a
> module, 'atq', such that if you write
> 
> use atq;
> 
> then your regexes may contain the sequence
> 
> (?\@foo)
> 
> with the meaning that you asked for.  

Yes, but is this a very good way to go forward?  The use of overload is
"heavy" if something might be generally useful.  (See my other note about
generalised additions to regexes.)

> 
> (The \ is necessary here because (?@foo) already has a meaning under
> Perl 5, and I think your proposal must address this.)

(?@foo) has no meaning; I checked the code.

> 
> Anyway, since this is possible under Perl 5 with a fairly simple
> module, I wonder if it really needs to be in the Perl 6 core.  Perhaps
> it would be better to propose that the module be added to the Perl 6
> standard library?

The module is small, but this does not mean that adding functionality to the
core is necessarily a bad idea.

Richard

-- 

[EMAIL PROTECTED]




RFC 165: Allow Variables in tr/// (post hugo)

2000-09-11 Thread Richard Proctor

Hugo wrote:
> Definitely. Should be easy to implement. There is a potential for
> confusion, since it makes the tr/ lists look even more like
> m/ and s/ patterns, but I think it can only be less confusion than
> the current state of affairs. It is tempting to make it the default,
> and have a flag to turn it off (or just backwhack the dagnabbed
> dollar), and auto-translation of existing scripts would be pretty
> easy, except that it would presumably fail exactly where people
> are using the current workaround, by way of eval.
> 
> It would be helpful to tie down what should occur for @var and
> %var (but note that this one liner changed between 5.6.0 and 5.7.0:
>   crypt% setperl 5.6.0
>   crypt% perl -we '/.@x./'
>   In string, @x now must be written as \@x at -e line 1, near ".@x"
>   Execution of -e aborted due to compilation errors.
>   crypt% setperl 5.7.0
>   crypt% perl -we '/.@x./'
>   Possible unintended interpolation of @x in string at -e line 1.
>   Name "main::x" used only once: possible typo at -e line 1.
>   Use of uninitialized value in pattern match (m//) at -e line 1.
>   crypt% 
> ).

I propose adding the first para as a note and moving the RFC to frozen soon.
Should it do anything for @foo and %bar?  I can't think of any
good reason.
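
For context, the eval workaround Hugo mentions looks roughly like this in Perl 5.  It is a sketch with illustrative variable names, and it is only safe when the two lists come from a trusted source, since they are compiled as code.

```perl
use strict;
use warnings;

my ($from, $to) = ('a-c', 'A-C');
my $text = 'abcabd';

# tr/// does not interpolate: here the lists are the literal characters
# '$from' and '$to', so 'from' becomes 'tooo'.
(my $literal = 'from') =~ tr/$from/$to/;

# The workaround: splice the lists into source text and compile it at run
# time.  tr returns the number of characters it transliterated.
my $count = eval "\$text =~ tr/$from/$to/";
die $@ if $@;
```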

Richard

-- 

[EMAIL PROTECTED]




RFC 166 (postHugo)

2000-09-11 Thread Richard Proctor

This RFC had three concepts.  I propose dropping the "Not a pattern" concept from
here, as it is now in RFC 198, and also dropping the null element.  The list
expansion might benefit from a slight enhancement.

Hugo:
> (?@foo) and (?Q@foo) are both things I've wanted before now. I'm
> not sure if this is the right syntax, particularly if RFC 112 is
> adopted: it would be confusing to have (?@foo) to have so
> different a meaning from (?$foo=...), and even more so if the
> latter is ever extended to allow (?@foo=...).
> I see no reason that implementation should cause any problems
> since this is purely a regexp-compile time issue.

I don't have any problem with the (?@foo) syntax; does anybody else?
I can't imagine a (?@foo=...) style syntax (yet).

Thinking further about what I defined for (?Q@foo) as adding the list
as quoted alternatives, is there a case for (?Q$foo) to match the contents of
$foo quoted in a similar way?  (I think the answer is at least a "probably".)

Feedback desirable.  

Richard

(Still thinking on scoping in assignment and boolean regexes)


-- 

[EMAIL PROTECTED]




RFC 110 counting matches (post Hugo)

2000-09-11 Thread Richard Proctor

This list has gone a little quiet...

Hugo wrote:
> I like this too. I'd suggest /t should mean a) return a scalar of
> the number of matches and b) don't set any special variables. Then
> /t without /g would return 0 or 1, but be faster since no extra
> information need be captured (except internally for (.)\1 type
> matching - compile time checks could determine if these are needed,
> though (?{..}) and (??{..}) patterns would require disabling of
> that optimisation). /tg would give a scalar count of the total
> number of matches. \G would retain its meaning.
> 
> Any which way, implementation should be fairly straightforward,
> though ensuring that optimisations occurred precisely when they
> are safe would probably involve a few bug-chasing cycles.


I propose adding this note.  His preference for the working of
/t and /g seems the most appropriate.  Unless I hear any further
discussion I propose moving this RFC to frozen this week.
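
For comparison with the proposed /t flag, this is a sketch of how matches are usually counted in Perl 5 today (variable names are illustrative).  Unlike the proposal, these forms still set the usual special variables.

```perl
use strict;
use warnings;

my $str = "banana";

# List-context /g match assigned to an empty list, then taken in scalar
# context: yields the number of matches.
my $count = () = $str =~ /an/g;

# Without /g the result is effectively 0 or 1.
my $found = $str =~ /an/ ? 1 : 0;
```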

Richard


-- 

[EMAIL PROTECTED]




Re: RFC 150 (v1) Extend regex syntax to provide for return of a hash of matched subpatterns

2000-09-08 Thread Richard Proctor

On Fri 08 Sep, Kevin Walker wrote:
> (This thread has been inactive for a while.  See
> http://www.mail-archive.com/perl6-language-regex@perl.org/index.html#00015
> for its short history.)
> 
> Long ago Tom Christiansen wrote:
> 
> >This is useful in that it would stop being number dependent.
> >For example, you can't now safely say
> >
> >/$var (foo) \1/
> >
> >and guarantee for arbitrary contents of $var that you have
> >the right number backref anymore.
> >
> >If I recall correctly, the Python folks addressed this.  One
> >might check that.
> 
> Python does, indeed, have something similar.  See (?P<name>...) and 
> (?P=...) at http://www.python.org/doc/current/lib/re-syntax.html .
> 
> Tom's comment points out a shortcoming in the original RFC:  There's 
> no way to make, by name, a backref to a named group.  I propose to 
> fix that in a revised version of RFC 150.  I don't have strong 
> feelings about what the syntax should be.  Here is one idea:
> 
> The substring matched by (?%some_name: ... ) can be referred to as 
> $%{some_name}.
> 
> That's kind of ugly, so other suggestions are welcome.  (The idea was 
> to do something analogous to $1, $2, etc.  Unfortunately ${some_name} 
> is already taken.  Maybe $_{some_name} would also work -- though 
> %_ seems too valuable to use for this limited purpose.)
> 
> 

Kevin,

I have been having similar thoughts about my RFC 112 (assignment within
a regex).  At present it is worded that it does not generate the back
reference, but I now have some reservations.

Thinking about the comparison between the two RFCs, there is some common
ground, but there are cases where people will want your hash and cases where
people will want explicit variables.  Using RFC 112, you can do
hash assignment, but it would not clear the hash beforehand, whereas
your hash assignment would (I assume) set the hash to ONLY those elements
from the regex.

Your %hash = $string =~ /..(?%foo=..)/
is essentially the same as my %hash = (); $string =~ /..(?$hash{foo}=..)/

Do we need both?  I think the answer is possibly, but whatever is
decided about back references should apply to both.

My thoughts on the back references would be that if a variable is used
again later in the regex, assignment takes place and it is simply referred
to.

Thus $string =~ m#<(?$foo=\w+).*?</$foo>#;

The parser notices the reuse of $foo and performs the actual assignment
as and when the foo is matched (or at least acts as if it does).
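
For readers comparing this with what Perl 5 eventually offered: later releases (5.10 onward) added named captures, which cover both the "hash of matches" and the "named back reference" cases.  The sketch below uses that later syntax, not the RFC 112 or RFC 150 proposals, and the sample string and variable names are made up.

```perl
use 5.010;
use strict;
use warnings;

my $string = "<em>hello</em> world";
my %hash;

# (?<tag>...) names a group, \k<tag> refers back to it inside the pattern,
# and %+ exposes the named captures after a successful match.
if ($string =~ m{<(?<tag>\w+)>(?<body>.*?)</\k<tag>>}) {
    %hash = %+;
}
```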

Richard


-- 

[EMAIL PROTECTED]




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Richard Proctor

On Wed 06 Sep, Mark-Jason Dominus wrote:
> 
> I've been thinking the same thing.  It seems to me that the attempts to
> shoehorn parsers into regex syntax have either been unsuccessful
> (yielding an underpowered extension) or illegible or both.
> 
>SNOBOL: 
> parenstring = '(' *parenstring ')'  
> | *parenstring *parenstring
> | span('()')
> 
> 
> This is not exactly the same, but I tried a direct translation:
> 
>  $re = qr{ \( (??{$re}) \)
>  | (??{$re}) (??{$re})
>  | (?> [^()]+)
>  }x;
> 

I think what is needed is something along the line of :

   $re = qz{ '(' \$re ')'
           | \$re \$re
           | [^()]+
           };
   
Where qz is some hypothetical new quoting syntax
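
For reference, a working Perl 5 form of the recursive pattern, using the existing (??{...}) construct, looks roughly like this.  It is a sketch: the helper name is_balanced is made up, and the grammar is the common "parenthesised group containing plain text or nested groups" shape rather than a literal translation of the SNOBOL above.

```perl
use strict;
use warnings;

our $re;   # package variable so the pattern can refer to itself
$re = qr{
    \(                      # an opening parenthesis
    (?:
        (?> [^()]+ )        # a run of non-parenthesis characters, no backtracking
      |
        (??{ $re })         # or a nested balanced group, matched recursively
    )*
    \)                      # the matching closing parenthesis
}x;

# Hypothetical helper: true if the whole string is one balanced group.
sub is_balanced {
    my ($s) = @_;
    return $s =~ /\A$re\z/ ? 1 : 0;
}
```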

Richard

-- 

[EMAIL PROTECTED]




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Richard Proctor

On Wed 06 Sep, David Corbin wrote:
> Nathan Wiger wrote:
> > 
> > > It would be useful (and increasingly more common) to be able to match
> > > qr|<\s*(\w+)([^>]*)>| to qr|<\s*/\1\s*>|, and handle the case where
> > > those can nest as well.  Something like
> > >
> > > match this with
> > >
> > >   not this but
> > >this.
> > 
> > I suspect this is going to need a ?[ and ?] of its own. I've been
> > thinking about this since your email on the subject yesterday, and I
> > don't see how either RFC 145 or this alternative method could support
> > it, since there are two tags - > and  > asymmetrically, and neither approach gives any credence to what's
> > contained inside the tag. So  would be matched itself as "< matches
> > >".
> 
> Actually, in one of my responses I did outline a syntax which would handle
> this with reasonably ease, I think.  If the contents of (?[) is considered
> a pattern, then you can define a matching pattern.

I think it should be a list of patterns rather than a single pattern.

Each pattern in the list is attempted left to right until one matches.  I now
don't think it should be a hash, as it needs to be ordered, but using the =>
as the l/r separator does make it clear.

> 
> m:(?['<\w+>' => '').*(?]):
> 
> 
> I'll grant you it's not the simplest syntax, but it's a lot simpler than
> using the 5.6 method... :)

Actually that simple case is handled as m:<(\w+)>.*</\1>: but I 
think this is getting somewhere.  This is a rich syntax that has lots of
potential uses, not just for html.

> > 
> > What if we added special XML/HTML-parsing ?< and ?> operators?
> > Unfortunately, as Richard notes, ?> is already taken, but I will use it
> > for the examples to make things symmetrical.
> > 
> >?<  =  opening tag (with name specified)
> >?>  =  closing tag (matches based on nesting)

We are running out of (? syntax, we might want to find some other construct
before long.  But anyway, XML/HTML is important, but I am not convinced
that what is being covered here really helps.  I am working on an RFC
to allow boolean logic ( && and || and !) to apply a number of patterns to
the same substring to allow easier mining of information out of such
constructs. 

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 145 (alternate approach)

2000-09-06 Thread Richard Proctor

On Tue 05 Sep, Nathan Wiger wrote:
>    "normal"       "reversed"
>    --------       ----------
>    103            301
>    99a            a99
>    ((             ))
>    <+             +>
>    {{[!<_         _>!]}}
>    {__A1(         )A1__}
> 
> That is, when a bracket is encountered, the "reverse" of that is
> automatically interpreted as its closing counterpart. This is the same
> reason why qq// and qq() and qq{} all work without special notation. 
> 
> So we can replace @^g and @^G with simple precedence rules, the same
> that are actually invoked automatically throughout Perl already.
> 
> >   (?[( => ),{ => }, 01 => 10)
> > 
> > sort of hashish in style.
> 
> I actually think this is redundant, for the reasons I mentioned above.
> I'm not striking it down outright, but it seems simple rules could make
> all this unnecessary. 

I don't think you will ever come up with a set of rules that will satisfy
everybody all the time.  What about HTML comments <!-- -->; are they
brackets?  What about people doing 66/99 pairs?  The best you could
achieve is a set of default rules as you have suggested AND a way
of overriding them with an explicit hash of what is the closing
bracket for each opening bracket.

The two methods depend on what follows the (?[: is it a hash or not?

For the "Default" method, the list of brackets could be, as has been
suggested, a regex, or perhaps a simple comma-separated list.  For this
you should define what the "reverse" of each character is, at
least for Latin-1; what do you do about the full UTF-8...?  An \X type
construct that covers all the common brackets might be a useful addition
({


Re: RFC 145 (alternate approach)

2000-09-05 Thread Richard Proctor

On Tue 05 Sep, Nathan Wiger wrote:
> Eric Roode wrote:
> Now *that* sounds cool, I like it!
> 
> What if the RFC only suggested the addition of two new constructs, (?[)
> and (?]), which did nested matches. The rest would be bound by standard
> regex constructs and your imagination!
> 
> That is, the ?] simply takes whatever the closest ?[ matched and
> reverses it, verbatim, including ordering, case, and number of
> characters. The only trick would be a way to get what "reverses it"
> means correct.
> 

No.  ?] should match the closest ?[; the ?[s should nest, bounded by any
brackets in the regex, and ?] should act accordingly.

Also, this does not work as a definition of simple bracket matching, as you
need ( to match ), not ( to match (.  A ?[ list should specify, for each
element, what the matching element is, perhaps

  (?[( => ),{ => }, 01 => 10)
  
sort of hashish in style.

Perhaps the brackets could be defined as a hash allowing (?[%Hash)
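
As a rough illustration of the explicit-table idea, here is a sketch in
current Perl 5 that checks nesting against a hash of opener => closer pairs.
The %closer table and the balanced() helper are made up for this example;
none of it is proposed syntax, and it says nothing about how (?[%Hash)
would actually be implemented.

```perl
use strict;
use warnings;

# Hypothetical pairing table, in the spirit of (?[( => ),{ => }, 01 => 10)
my %closer = ( '(' => ')', '{' => '}', '01' => '10' );

# Build alternations from the table; longer tokens are tried first.
my $open_re = join '|', map { quotemeta($_) }
    sort { length $b <=> length $a } keys %closer;
my $close_re = join '|', map { quotemeta($_) }
    sort { length $b <=> length $a } values %closer;

# True if every opener in $text is closed by its own closer, properly
# nested.  Characters that are not brackets are skipped.
sub balanced {
    my ($text) = @_;
    my @stack;
    while ( $text =~ /\G(?:($open_re)|($close_re)|.)/gcs ) {
        if ( defined $1 ) {
            push @stack, $closer{$1};    # remember the closer we now expect
        }
        elsif ( defined $2 ) {
            return 0 unless @stack && pop(@stack) eq $2;
        }
    }
    return @stack ? 0 : 1;
}

for my $sample ( 'f(a{b}c)', '01 x (y) 10', '(a{b)c}' ) {
    print "$sample: ", ( balanced($sample) ? 'balanced' : 'not balanced' ), "\n";
}
```

Keeping the pairs in a hash means the 66/99 or HTML comment cases only need
another entry in the table, which is the point of allowing an override.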

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 145 (alternate approach)

2000-09-05 Thread Richard Proctor

On Tue 05 Sep, David Corbin wrote:
> Nathan Wiger wrote:
> > 
> > But, how about a new ?m operator?
> > 
> >/(?m<<|[).*?(?M>>|])/;
> > 
> 
> Let's combine your operator with my example from above where everything
> inside the (?m) or the ?(M)
> fits the syntax of a RE.  
> 
>   /(?m(<<)|\[).*?(?M(>>)|(\]))
> 
> > Then the ?M matches pairs with the previous ?m, if there was one that
> > was matched. The | character separates or'ed sets consistent with other
> > regex patterns.

There already is a (?m

The whole (?x set of thingies is getting complicated...  The list of what is
used at present (and in current suggestions) is:

Current Use in perl5

(?#       comment
(?imsx    flags
(?-imsx   flags
(?:       subexpression without bracket capture
(?=       zero-width positive look ahead
(?!       zero-width negative look ahead
(?<=      zero-width positive look behind
(?<!      zero-width negative look behind
(?>       Independent subexpression
(?(condition)yes-pattern
(?(condition)yes-pattern|no-pattern
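
For anyone who has not met all of these, a small sketch exercising some of
the existing perl5 constructs from the list above.  The sample string and
variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = 'price: $42 (approx)';

# (?= ) zero-width positive look ahead: digits followed by a space.
my ($n) = $text =~ /(\d+)(?= )/;

# (?<= ) zero-width positive look behind: digits preceded by a dollar sign.
my ($d) = $text =~ /(?<=\$)(\d+)/;

# (?! ) zero-width negative look ahead: "price" not followed by "s".
my $singular = $text =~ /price(?!s)/ ? 1 : 0;

# (?: ) groups without capturing, so the first capture is the bracketed word.
my ($word) = $text =~ /(?:price|cost):\s*\$\d+\s*\((\w+)\)/;

# (?i) turns a flag on inside the pattern; (?# ) is a comment.
my $ci = 'PRICE' =~ /(?i)price(?# case-insensitive)/ ? 1 : 0;

# (?(condition)yes-pattern): if group 1 matched an opening quote,
# a closing quote is required as well.
my $cond = qr/^(")?\w+(?(1)")$/;
my @quoted = map { $_ =~ $cond ? 'match' : 'no match' } '"abc"', 'abc', '"abc';

print "$n $d $singular $word $ci\n";
print join( ', ', @quoted ), "\n";
```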

Suggested in RFCs either current or in development

(?$foo=      suggested for assignment (RFC 112)
(?%foo=      suggested for hash assignment (RFC 150?)

(?@foo       suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
(?Q@foo)     Quote each item of lists (RFC 166)
(?^pattern)  matches anything that does not match pattern
             (RFC 166 but will be somewhere else on next rewrite [1])
(?F          Failure tokens (RFC in development by me [1])
(?r),(?f)    Suggested in Direction Control RFC 1
(?&          Boolean regexes (RFC in development [1])
(?*{code})   Execute code with pass/fail result (RFC in development [1])

[1] these will all be in an RFC which will probably be out in a day or so.

Unused (? sequences

a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w, ,y,z
A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
0,1,2,3,4,5,6,7,8,9
`_,."+[];'~)

(If I have forgotten any do tell, and I will try to keep this list up to
date.)

Richard



-- 

[EMAIL PROTECTED]




Re: perl6-language-regex summary for 20000831

2000-08-31 Thread Richard Proctor

On Thu 31 Aug, Mark-Jason Dominus wrote:
> Summary report 20000831

> RFC 110: counting matches  (Richard Proctor)
> 
> An extensive side discussion of
> 
> $count = () = m/PAT/g;
> 
> developed, including an excursion off into context issues.  I have
> asked the author to take this idiom into account in the next version
> of the RFC.

Expect this tomorrow.
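
For reference, the idiom under discussion looks like this in current Perl 5.
This is only a sketch with a made-up sample string: the empty list forces
list context on the match, and the outer scalar assignment then yields the
number of items the match returned.

```perl
use strict;
use warnings;

my $text = 'abracadabra';

# List assignment in scalar context returns the number of right-hand
# elements, so this counts every match of /a/ in $text.
my $count = () = $text =~ /a/g;

# The same count with an explicit loop, for comparison.
my $n = 0;
$n++ while $text =~ /a/g;

print "$count $n\n";
```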

> RFC 112: Assignment within a regex  (Richard Proctor)
> 
> Very little discussion.  This should be compared with RFC 150 and
> perhaps be folded into it.

There were four "This is wonderful" type messages between here and its
original posting on language.  I suspect that things people agree with
don't get much discussion.

> RFC 144: Behavior of empty regex should be simple  (Mark Dominus)

I agreed with the RFC - no discussion needed.

> RFC 165: Allow variables in tr///  (Richard Proctor)
> 
> I pointed out that the implementation would have to construct the
> translation table at run-time, and that this brings in the same issues
> as when a regex is constructed at run time.  For example, a new tr///o
> option becomes desirable for the same reason that m//o is desirable.

I have captured that for the next issue.
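
To make the run-time point concrete, this is roughly how variables get into
tr/// today, via string eval.  It is a sketch only: $from, $to and the sample
strings are invented, and the values are assumed not to contain "/" or
anything else that would break the generated code.

```perl
use strict;
use warnings;

my ( $from, $to ) = ( 'a-c', 'A-C' );    # hypothetical run-time values
my $s = 'aabbccdd';

# tr/// does not interpolate, so the whole statement is compiled at run
# time and the translation table is rebuilt on every eval.
my $changed = eval "\$s =~ tr/$from/$to/";
die $@ if $@;

# Compile once and reuse the closure: roughly the effect a tr///o option
# would give.
my $tr = eval "sub { \$_[0] =~ tr/$from/$to/ }" or die $@;
my $t = 'cab';
$tr->($t);

print "$s ($changed changed), $t\n";
```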

> RFC 166: Additions to regexs  (Richard Proctor)
> This RFC unfortunately proposes three totally unrelated changes.

I should split it.

> Richard proposed a 'does not match' operator, with the example that
> 
> Richard said he would tighten up the definition, but
> version 2 has not appeared yet.

Working on it - There are big caverns opening up.  When I have a solution
I will post this in a new RFC, as it will probably have to include other 
things.

> Richard also proposed a (?) operator that would match the empty
> string.  You would use this in cases like /$foo(?)bar/ where it is
> inappropriate to abut $foo and bar.  It was pointed out that
> /${foo}bar/ and /$foo(?:)bar/ already work for this purpose.  Richard
> agreed that this was what he wanted.

Consider it dropped.
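
For the record, a short sketch of the two existing spellings mentioned above,
using invented values.  Written as /$foobar/ the pattern would try to
interpolate a variable called $foobar, which is the problem both forms avoid.

```perl
use strict;
use warnings;

my $foo = 'ab';
my $str = 'abbar';

# Braces end the variable name explicitly.
my $braces = $str =~ /${foo}bar/ ? 1 : 0;

# An empty non-capturing group separates $foo from the literal "bar".
my $group = $str =~ /$foo(?:)bar/ ? 1 : 0;

print "braces: $braces, empty group: $group\n";
```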

> The third proposal was that (?@foo) be taken to interpolate the string
> (join "|", @foo).  There was no discussion of this.

I will make an RFC 166 v2 that has just this.  It also proposed a
quoted variant (?Q@foo) that effectively did quotemeta on each item.
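
What the two forms would stand for can be sketched by hand in current Perl 5.
The list contents and the names $alt and $quoted are made up for the example;
the proposed (?@foo) and (?Q@foo) syntax itself does not exist.

```perl
use strict;
use warnings;

my @foo = ( 'cat', 'dog', 'a.b' );

# What (?@foo) is described as interpolating: the items joined with "|".
my $alt = join '|', @foo;

# What (?Q@foo) would add: quotemeta applied to each item first.
my $quoted = join '|', map { quotemeta($_) } @foo;

# Unquoted, the "." in 'a.b' is a wildcard, so 'axb' matches.
my $plain_hit = 'axb' =~ /^(?:$alt)$/ ? 1 : 0;

# Quoted, only the literal string 'a.b' matches.
my $quoted_hit   = 'axb' =~ /^(?:$quoted)$/ ? 1 : 0;
my $quoted_exact = 'a.b' =~ /^(?:$quoted)$/ ? 1 : 0;

print "$plain_hit $quoted_hit $quoted_exact\n";
```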

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 110 (v3) counting matches

2000-08-29 Thread Richard Proctor

On Tue 29 Aug, Mark-Jason Dominus wrote:
> 
> OK, I think this discussion should be closed.
> 
> Richard should add a section to RFC110 that discusses the
> 
> $count = () = m/PAT/g;
> 
> locution and its advantages and disadvantages compared to his
> proposal, duly taking into account the many valuable comments that
> have been made.

Will do.

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 166 (disambiguator)

2000-08-29 Thread Richard Proctor

On Tue 29 Aug, Mark-Jason Dominus wrote:
> 
> 2. You can already write /$foo(?:)bar/ to get what you wanted.  This
>    is almost identical to what Richard proposed anyway.

This has the effect I was after.

> 
> It is really not clear to me that this problem needs to be solved any
> better than it is already.
> 
> I suggest that this section be removed from the RFC.
> 

OK.  I was throwing up some ideas.   (I have a few more in development)



-- 

[EMAIL PROTECTED]




Re: RFC 166 (does-not-match)

2000-08-29 Thread Richard Proctor

On Tue 29 Aug, Mark-Jason Dominus wrote:
> 
> Richard Proctor's RFC166 says:
> 
> > =head2 Matching Not a pattern
> > 
> > (?^pattern) matches anything that does not match the pattern.  On
> > its own, one can use !~ etc to negatively match patterns, but to
> > match a pattern that has foo(anything but not baz)bar is currently
> > difficult.  With this syntax it would simply be /foo(?^baz)bar/.
> 
> The problem with this proposal is that it's really unclear what it
> means.

This is going to need a much better definition...

> 
> The reason we don't have this feature today is not that it has never
> been thought of before.  People have thought of this a hundred times.
> The problem is that nobody has ever figured out how it should work.
> I don't mean that the implementation is difficult. I mean that nobody
> understands what such a feature actually means.   Richard doesn't say
> this in his RFC, even for the simple examples he raises.  He just
> assumes that it will be obvious, but it isn't.  
> 
> "foo-bazbar"  =~ /foo(?^baz)bar/    # true or false?
> "foo-baz-bar" =~ /foo(?^baz)bar/    # true or false?

The simple answer is both are false.

> OK, I'm going to try to invent a meaning for (?^baz).  I'm going to
> choose what appears to be a reasonable choice, and see what happens.
> 
> Let's suppose that what (?^baz) means is "match any substring that is
> not 'baz'."  That is a reasonably clear meaning.  Then it behaves like
> (.*)(?{$1 ne 'baz'}) does today.  Then the examples above are both
> true.

No, your example is wrong: it should behave as (.*)(?{$1 !~ /baz/}), and both
examples are false.  (?^foo) matches any substring that does not match the
pattern foo.
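
The two readings can be compared with today's Perl 5, at least roughly.
This is a sketch under assumptions: the first reading is approximated by a
check after the match rather than inside it, the second by a negative
look ahead applied at every character, and the helper names are invented.
Neither is a definition of (?^baz).

```perl
use strict;
use warnings;

# Reading 1 (from the quoted text): the gap is any substring other than
# the exact string 'baz'.
sub reading_not_equal {
    my ($s) = @_;
    return ( $s =~ /foo(.*)bar/ && $1 ne 'baz' ) ? 1 : 0;
}

# Reading 2 (from the reply): the gap must not contain a match for /baz/.
# (?:(?!baz).)* consumes a character only where 'baz' does not start.
sub reading_not_match {
    my ($s) = @_;
    return ( $s =~ /foo(?:(?!baz).)*bar/ ) ? 1 : 0;
}

for my $s ( 'foo-bazbar', 'foo-baz-bar', 'foo-qux-bar' ) {
    printf "%-12s reading 1: %d  reading 2: %d\n",
        $s, reading_not_equal($s), reading_not_match($s);
}
```

Under the first reading both of the quoted examples come out true, under the
second both come out false, which is exactly the disagreement above.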

> 
> Now let's see how that choice works out.
> 
> "foobaz" =~ /foo.*(?^baz)/
> 
> This is TRUE, because "foo" matches "foo", ".*" matches "baz", and
> "(?^baz)" matches the empty string at the end, which is a substring
> that is not "baz".

This is a traditional problem with a greedy .*; it does, however, beg
the question: is (?^baz) greedy?  I think the right answer is that it should
not be (but I am open to debate on that).

> 
> In fact, with this apparently reasonable choice of meaning for
> (?^baz), /foo.*(?^baz)/ will match anything that /foo.*/ will.  The
> (?^baz) has hardly any effect at all.

With a greedy .* the (?^baz) has no effect, unless something follows that
has to be matched.
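
The greedy case can be seen with a stand-in for (?^baz) in current Perl 5.
The $not_baz pattern below is only an assumed approximation (a negative
look ahead at each character), used to show where the text goes.

```perl
use strict;
use warnings;

my $not_baz = qr/(?:(?!baz).)*/;    # hypothetical stand-in for (?^baz)

my ( $dotstar, $negated );
if ( 'foobaz' =~ /foo(.*)($not_baz)/ ) {
    ( $dotstar, $negated ) = ( $1, $2 );
    print "matched: .* took '$dotstar', the negated part took '$negated'\n";
}
```

The greedy .* swallows 'baz' and the negated part is left matching the
empty string, so the match succeeds.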

> 
> It is a good thing that we did not implement it that way, because it
> is sure to become an instant FAQ:  "Why does /foo.*(?^baz)/ match
> 'foobaz'?"  You are going to see this question in comp.lang.perl.misc
> every week.

I think one should outlaw .* before or after a (?^foo) construct, as
the result is meaningless.

> 
> So this choice I made for the meaning of (?^baz) appears to have been
> the wrong one. I could go on and make a different reasonable-seeming
> choice and show what was wrong with it, but I don't want to belabor my
> point, which is:
> 
> Every choice anyone has ever made for the meaning of (?^baz) has
> always been the wrong one for one reason or another.  So without a
> detailed explanation of what (?^baz) might mean, suggesting that Perl
> 6 have one is not helpful.  

I can tighten the definition up.  If there have been calls for a 
(?^baz) type construct before, there will be again.  It is a matter of
getting the definition straightforward and useable.

Richard

-- 

[EMAIL PROTECTED]




Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()

2000-08-28 Thread Richard Proctor

On Mon 28 Aug, Tom Christiansen wrote:
> >> It's nearly part of Perl's language signature.  I wouldn't count
> >> on this going away if you still think to call this "Perl".  It is
> >> of course much more likely in the renamed "Frob" language, however.
> 
> >First off, this argument is just a little too grandiose, because if we
> >can't change anything because of precedent, then we're stuck and Perl 6
> >should just be Perl 5.9 instead.
> 
> How nice of you to put words in my mouth.  Please cite me the precise
> message ID, date, and appropriate text in which I said "we can't
> change anything because of precedent".  

This is getting a little heated and silly.  (It's not supposed to be p5p.)

When we eventually have perl6 there will be a LONG transition time when
programs will need to work both with perl5 and perl6. 

By all means suggest ADDITIONS to perl to make the programmer's life easier
(all of my RFCs have been).

If you want to change STUPID behaviour that should be avoided by current
programs (such as empty regexes) fine.

If you want to take little used things out of the core such as formats,
fine provided they can simply be brought back in with a use statement or two.

If you want to change heavily used aspects of perl and ruin most programs
out there you had better come up with some really serious reasons why.

Be reasonable, be nice.

Richard

-- 

[EMAIL PROTECTED]