RFC 158 (v3) Regular Expression Special Variables

2000-09-22 Thread Perl6 RFC Librarian

This and other RFCs are available on the web at
  http://dev.perl.org/rfc/

=head1 TITLE

Regular Expression Special Variables

=head1 VERSION

  Maintainer: Uri Guttman [EMAIL PROTECTED]
  Date: 25 Aug 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 158
  Version: 3
  Status: Frozen
  Frozen since: v2

=head1 ABSTRACT

This RFC addresses ways to stop the regex special variables $`, $& and
$' from being the pariahs they are now.

=head1 CHANGES

I dropped the local scoping of $`, $& and $', as they are already
localized now.

=head1 DESCRIPTION

$`, $& and $' are useful variables which experienced Perl hackers never
use, since they have well-known problems with efficiency. Since they are
globals, any use of them anywhere in your code forces every regex to copy
its data in case one of them is referenced later. I will describe an idea
to make this issue go away and return these variables to the toolbox where
they belong.

=head1 IMPLEMENTATION

The copy-all-regex-data problem is solved by a new modifier k (for
keep). This tells the regex to do the copy so that the three variables
will work properly. So you would use code like this:

    $str = 'prefoopost' ;

    if ( $str =~ /foo/k ) {

        print "pre is [$`]\n" ;
        print "match is [$&]\n" ;
        print "post is [$']\n" ;
    }
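
For comparison, here is a sketch of the same per-regex opt-in as it exists in Perl 5.10 and later, which added a /p modifier together with the variables ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}. This is not the /k syntax proposed above, only a runnable illustration of the idea; the names $pre, $match and $post are chosen for the example.

```perl
use strict;
use warnings;
use 5.010;    # /p and the ${^...} variables need Perl 5.10 or later

my $str = 'prefoopost';
my ( $pre, $match, $post );

if ( $str =~ /foo/p ) {
    # /p asks this one regex to preserve the pieces of the matched string
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
    print "pre is [$pre]\n";
    print "match is [$match]\n";
    print "post is [$post]\n";
}
```

Only the regex carrying the modifier opts in, which is the behaviour this RFC asks for.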

=head1 IMPACT

None

=head1 UNKNOWNS

None

=head1 REFERENCES

None.





Re: RFC 197 (v1) Numeric Value Ranges In Regular Expressions

2000-09-22 Thread David L. Nicol

Hugo wrote:
 
 In [EMAIL PROTECTED], "David L. Nicol" writes:
 :I think I did -- I guess v2 didn't make it in; I sent it again; what
 :were your and mjd's comments again?
 
 Here are the messages:
 http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00306.html
 http://www.mail-archive.com/perl6-language-regex%40perl.org/msg00294.html
 
 However if you didn't see them it is too late now, since I see that
 your v2 freezes the RFC. I think it is a shame there was not more
 discussion of this - I'm sure the functionality would be useful, but
 I'm not at all convinced about the syntax.
 
 Hugo



Thanks.  Yes, I had seen them, and they are both valid criticisms.  There
are more examples in v2.

The syntax matches exactly the syntax used for specifying segments of the number line
in algebra classes.

If this goes into the language, people who are writing nonlinear number systems
would have to decide whether to support it or not, and if so how; that goes without saying.

The inspiration for it, along with its companion piece on an implied grep in certain
hash accesses, is a way to ease slicing of "traditional" multidimensional arrays.


%Center_of_4x4x4_cube = %FourCube{/[2,3] [2,3] [2,3]/} ;


That's an old-fashioned fake multidimensional array, of course, not one of these
new creatures.




-- 
  David Nicol 816.235.1187 [EMAIL PROTECTED]
   "The most powerful force in the universe is gossip"



RFC 165 (v3) Allow Variables in tr///

2000-09-22 Thread Perl6 RFC Librarian

=head1 TITLE

Allow Variables in tr///

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 165
  Version: 3
  Status: Frozen

=head1 ABSTRACT

Allow variables in a tr///.  At present the only way to do a tr/$foo/$bar/
is to wrap it up in an eval.  I don't like using evals for this sort of thing.

=head1 DESCRIPTION

Suggested syntax: tr/$foo/$bar/e

With a /e, tr will expand both the LHS and RHS of the translate function.
Either or both could be variables. I am suggesting /e as it is sort of like
/e for s///e.
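
For reference, a minimal sketch of the eval workaround mentioned in the abstract. The values of $foo, $bar and $str are made up for illustration.

```perl
use strict;
use warnings;

my ( $foo, $bar ) = ( 'a-c', 'A-C' );
my $str = 'abcabc xyz';

# tr/// does not interpolate, so the statement is built as a string
# and compiled at run time with eval.
my $count = eval "\$str =~ tr/$foo/$bar/";
die $@ if $@;

print "$str ($count characters changed)\n";    # ABCABC xyz (6 characters changed)
```

Every call recompiles the statement, and the contents of $foo and $bar are spliced into code, which is the sort of thing the proposed /e would avoid.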

These words from MJD:

The way tr/// works is that a 256-byte table is constructed at compile
time that says, for each input character, what output character is
produced.  Then when it's time to apply the tr/// to a string, Perl
iterates over the string one character at a time, looks up each
character in the table, and replaces it with the corresponding
character from the table.

With tr///e, you would have to generate the table at run-time.
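
A rough sketch of the table approach described above, built at run time in plain Perl. The helpers build_table and apply_table are hypothetical names for this example, not part of perl, and the sketch assumes simple byte lists with no ranges or escapes.

```perl
use strict;
use warnings;

# Build a 256-entry table: the index is the input byte, the value is the
# output character.  Assumes plain byte lists with no ranges or escapes.
sub build_table {
    my ( $from, $to ) = @_;
    my @table = map { chr($_) } 0 .. 255;    # start with the identity mapping
    my @f = split //, $from;
    my @t = split //, $to;
    @t = @f unless @t;                       # empty replacement: map to self
    # Walk backwards so the first occurrence of a repeated character wins.
    for my $i ( reverse 0 .. $#f ) {
        # A short replacement list reuses its last character.
        $table[ ord( $f[$i] ) ] = $i <= $#t ? $t[$i] : $t[-1];
    }
    return \@table;
}

# Apply the table one character at a time.
sub apply_table {
    my ( $table, $str ) = @_;
    return join '', map { $table->[ ord($_) ] } split //, $str;
}

my $table = build_table( 'abc', 'xyz' );
my $out   = apply_table( $table, 'aabbcc cab' );
print "$out\n";    # xxyyzz zxy
```

The cost of build_table is what tr///e would move from compile time to run time.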

This would suggest that you want the same sorts of optimizations that
Perl applies when it encounters a regex that contains variables:

1. Perl should examine the strings to see if they have changed
   since the last time it executed the code

2. It should rebuild the tables only if the strings changed

3. There should be a /o modifier that promises Perl that the
   variables will never change.

The implementation could be analogous to the way m/.../o is
implemented, with two separate op nodes: One that tells Perl
'construct the tables' and one that tells Perl 'transform the
string'.  The 'construct the tables' node would remove itself from the
op tree if it saw that the tr//o modifier was used.
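
The first two optimizations can be sketched in user code as follows. This is only an illustration under stated assumptions: tr_vars is a hypothetical helper, and an eval-compiled closure stands in for the 'construct the tables' op.

```perl
use strict;
use warnings;

my ( $last_from, $last_to, $translator );
my $rebuilds = 0;    # counts how often the translator had to be rebuilt

# Hypothetical helper: translate $str using search list $from and
# replacement list $to, recompiling only when either list has changed.
sub tr_vars {
    my ( $from, $to, $str ) = @_;
    if (   !defined $translator
        || $from ne $last_from
        || $to ne $last_to )
    {
        # Stand-in for the 'construct the tables' op: compile a tr///
        # closure with the lists spliced in.
        $translator = eval "sub { my \$s = shift; \$s =~ tr/$from/$to/; \$s }"
          or die $@;
        ( $last_from, $last_to ) = ( $from, $to );
        $rebuilds++;
    }
    return $translator->($str);    # the 'transform the string' op
}

my @out = map { tr_vars( 'a-z', 'A-Z', $_ ) } qw(foo bar baz);
print "@out after $rebuilds rebuild(s)\n";    # FOO BAR BAZ after 1 rebuild(s)
```

A /o modifier would amount to skipping the comparison entirely after the first build.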

Hugo wrote:
 Definitely. Should be easy to implement. There is a potential for
 confusion, since it makes the tr/ lists look even more like
 m/ and s/ patterns, but I think it can only be less confusion than
 the current state of affairs. It is tempting to make it the default,
 and have a flag to turn it off (or just backwhack the dagnabbed
 dollar), and auto-translation of existing scripts would be pretty
 easy, except that it would presumably fail exactly where people
 are using the current workaround, by way of eval.
 

Comments by me:

Therefore tr///o might be a good idea as well.  

If Hugo's idea of making this the normal behaviour is adopted, the problem of
existing evals is avoided by p52p6 changing the eval to a perl5_eval
which acts accordingly (one of MJD's ideas).

=head1 IMPLEMENTATION

Hugo:  Should be easy to implement.  

Me: Should not be too complicated; this is just a case of doing existing
things in a different context.

=head1 CHANGES

V2 - Added words from MJD and Hugo. This is hopefully in a pre-freeze state.

V3 - Reissued due to an error in posting V2, and now frozen.

=head1 REFERENCES

None yet.





RFC 166 (v3) Alternative lists and quoting of things

2000-09-22 Thread Perl6 RFC Librarian

=head1 TITLE

Alternative lists and quoting of things

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 27 Aug 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 166
  Version: 3
  Status: Developing

=head1 ABSTRACT

Expand Alternate Lists from Arrays and Quote the contents of things 
inside regexes.


=head1 DESCRIPTION

These are a couple of constructs to make it easy to build up regexes
from other things.

=head2 Alternative Lists from arrays

The basic idea is to expand an array as a list of alternatives.  There
are two possible syntaxes: (?@foo) and just plain @foo.  Plain @foo might
have existing uses (just), therefore I prefer the (?@foo) syntax.

(?@foo) is just syntactic sugar for (?:(??{ join('|',@foo) })), a bracketed
list of alternatives.
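
As an illustration, the long-hand form can be run in Perl 5 today. The array contents and the test words below are made up for the example.

```perl
use strict;
use warnings;

my @foo = qw(cat dog bird);

# The long-hand form that (?@foo) would abbreviate.
my @hits = grep { /\A(?:(??{ join('|', @foo) }))\z/ } qw(dog fish bird);
print "@hits\n";    # dog bird
```

Note that each element is treated as a regex fragment here, so any special characters in @foo keep their special meaning.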

=head2 Quoting the contents of things

If a regex uses $foo or @bar there are problems if the contents of
the variables include special characters.  What is needed is a way
of \Quoting the contents of scalars $foo or arrays (?@foo).

Suggested syntax:

(?Q$foo) Quotes the contents of the scalar $foo - equivalent to
(??{ quotemeta $foo }).

(?Q@foo) Quotes each item in a list (as above); this is equivalent to
(?:(??{ join ('|', map quotemeta, @foo)})).
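
A small sketch of what the quoted expansion amounts to, written out by hand with quotemeta and join. The sample strings are chosen so that an unquoted expansion would match more than intended.

```perl
use strict;
use warnings;

my @foo = ( 'a.b', 'c+d', '(e)' );

# Quote each element, then join the results into one alternation.
my $alternatives = join '|', map { quotemeta($_) } @foo;
my $re = qr/\A(?:$alternatives)\z/;

my @hits = grep { $_ =~ $re } ( 'a.b', 'axb', 'c+d', 'ccd', '(e)' );
print "@hits\n";    # a.b c+d (e)
```

Without the quoting, 'axb' and 'ccd' would also match, because . and + would act as metacharacters.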

In this syntax the Q is used as it represents a more intelligent \Quot\E.

It is recognised that (?Q$foo) is equivalent to \Q$foo\E, but that does not
make it a bad idea to add it at the same time as (?Q@foo), for
reasons of symmetry and perl DWIM.

=head2 Comments

Hugo:
 (?@foo) and (?Q@foo) are both things I've wanted before now. I'm
 not sure if this is the right syntax, particularly if RFC 112 is
 adopted: it would be confusing to have (?@foo) to have so
 different a meaning from (?$foo=...), and even more so if the
 latter is ever extended to allow (?@foo=...).
 I see no reason that implementation should cause any problems
 since this is purely a regexp-compile time issue.

Me: I can't see any reasonable meaning for (?@foo=...). This seems an appropriate
syntax, but I am open to other suggestions.

=head1 CHANGES

V1 of this RFC had three ideas, one has been dropped, the other is now part
of RFC 198.

V2 Extends the list expansion and quoting with quoting of scalars and
implementation issues.

V3 Through an error, what should have been 165 V2 was issued as 166 V2, so this is V3
with a change in (?Q$foo).  This is in a pre-frozen state.

=head1 MIGRATION

As (?@foo) and (?Q...) are additions, there are no compatibility issues.

The option of just @foo for list expansion might represent a small problem if
people already use the construct.

=head1 IMPLEMENTATION

Both of these changes are regex compile-time issues.

Generating lists from arrays almost works by localising $" as '|' for the 
regex and just using @foo.
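
A minimal sketch of that $" trick in current Perl, with example data made up for illustration:

```perl
use strict;
use warnings;

my @foo = qw(cat dog bird);
my @hits;
{
    # $" is the separator used when an array is interpolated.
    local $" = '|';
    @hits = grep { /\A(?:@foo)\z/ } qw(dog fish bird);
}
print "@hits\n";    # dog bird
```

It only "almost" works because the elements are interpolated as regex text, so the grouping has to be added by hand and nothing is quoted.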

MJD has demonstrated implementing (?@foo) as (?\@foo) by means of an overload
of regexes; this slight change was necessary because of the expansion of
@foo - see below.

Both of these changes are currently affected by the expansion of variables in
the regex before the regex compiler gets to work on it.  This problem also
affects several other RFCs.  For these (and other RFCs), the expansion of
variables in regexes needs to be driven from within the regex compiler so
that the regex can expand as and where appropriate.  Changing this should not
affect any existing behaviour.

=head1 REFERENCES

RFC 198





RFC 198 (v2) Boolean Regexes

2000-09-22 Thread Perl6 RFC Librarian

=head1 TITLE

Boolean Regexes

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 6 Sep 2000
  Last Modified: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 198
  Version: 2
  Status: Developing

=head1 ABSTRACT

This is a development of the proposal for the "not a pattern" concept in RFC
166 V1.  Looking deeper into the handling of advanced regexes, there are
potential needs for many other concepts, to allow a regex to extract
information directly from a complex file in one go, rather than a mixture
of splits and nested regexes as is typically needed today.  With these,
parsing data should become easier (in some cases).

=head1 CHANGES

V2 - Changed the "Fail Pattern", enhanced the wording for many things.

=head1 DESCRIPTION

It would be nice (in my opinion) to be able to build more elaborate regexes
allowing data to be mined out of a string in one go.  These ideas allow
you to apply several patterns to one substring (each must match), to
fail a match from within, to look for patterns that do not contain other
patterns, and to handle looking for cases such as (foo.*bar)|(bar.*foo) in
a more general way of saying "A substring that contains both foo and bar".
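
For comparison, the "contains both foo and bar" case can already be written in Perl 5 with two lookaheads. This is a sketch of the existing idiom, not of the syntax proposed here.

```perl
use strict;
use warnings;

# Each lookahead is an independent test anchored at the same position,
# so both must succeed regardless of the order of foo and bar.
my $both = qr/\A(?=.*foo)(?=.*bar)/s;

my @results = map { $_ =~ $both ? 1 : 0 }
    'foo then bar', 'bar then foo', 'only foo';
print "@results\n";    # 1 1 0
```

The lookaheads test the rest of the string rather than a delimited substring, which is the gap the boolean construct is meant to fill.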

These are ideas, at present with some proposed syntax.  The ideas are more
important than the exact syntax at this stage.  This is very much work in
progress.

I have called these boolean regexes as they bring the concepts of and (&&),
or (||) and not (!) into the realm of regexes.

Within a boolean regex (or the boolean part of a regex), several new
symbols have meanings, and some have enhanced meanings.

=head2 The Ideas

Are these part of a boolean (?&...) construct within an existing regex, or
is the advanced syntax (and meaning of &|!^$) invoked by a new flag such
as /B?

These can look like line noise, so white space with /x is used
throughout, and it might be appropriate to enforce (or assume) /x within
(?&...).

=head3 Boolean construct

(?&...) grabs a substring, and applies one or more tests to the substring.

=head3 Substring matching multiple patterns (&&)

(?& pattern1 && pattern2 && pattern3 )

A substring is defined that matches each pattern.

For example, the first pattern may specify a substring of at least
30 chars, and the next two that it contains a foo and a bar.

=head3 Substring matching alternative patterns (||)

(?& pattern1 || pattern2 || pattern3)

This is similar to the existing alternative syntax "|", but each
alternative of "||" behaves as /^pattern$/ rather than /pattern/ (^ and $
taken as referring to the substring in this case - see below).

(pattern1 || pattern2 || pattern3) can be mixed in with the && case above to
build up more advanced cases.  && and || operators can be nested with brackets
in normal ways.

=head3 Brackets within boolean regexes

Within a complex boolean regex there are likely to be lots and lots of
brackets to nest and control the behaviour of the regex.  Rather than having
to sprinkle the regex with (?:) line noise, it would be nicer to just use
ordinary brackets () and only support capturing of elements by using one of
the (?$=) or (?%=) constructs that have been proposed elsewhere (RFC 112
and RFC 150).  There might be some case for this as a general capability
using some flag /b = brackets? 

=head3 Substring not matching a pattern

In RFC 166 I originally proposed (?^ pattern ).  This proposal replaces that,
though it could be used outside of the (?&) construct as well.

!pattern matches anything that does not match the pattern.  On its own it is
not terribly useful, but in conjunction with && and || one can do things
such as /(?& img && ! alt=)/, i.e. does it have an image and not have an alt.
 
! is chosen as it has the same basic meaning outside of regexes.

!pattern is a non-greedy construct that matches any string/substring that
does not match the pattern.
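
The img-without-alt case can be approximated in Perl 5 with a negative lookahead. The sketch below assumes simple tags with no ">" inside attribute values; the sample tags are made up.

```perl
use strict;
use warnings;

# Match an <img ...> tag only if no alt= attribute appears before its ">".
my $img_no_alt = qr/<img\b(?![^>]*\balt=)[^>]*>/i;

my @tags = ( '<img src="a.png">', '<img src="b.png" alt="B">' );
my @missing = grep { $_ =~ $img_no_alt } @tags;
print "@missing\n";    # <img src="a.png">
```

Here the substring being tested has to be delimited by hand with [^>]*, whereas the proposed construct would apply both tests to one grabbed substring.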

=head3 Meaning of $ and ^ inside a boolean regex

^ and $ are taken to mean the beginning and end of the substring, not the beginning
and end of the line/string, from within a boolean regex.

=head3 Greediness

Should the (?&...) construct be greedy or non-greedy?  To some extent this
depends on the elements it contains.  If all the matching set of patterns are
greedy then it will be greedy; if they are not greedy then it will not be.
This might or might not be sufficient.

If the situation is ambiguous (or might be), the boolean can be expressed as
(?&? ...) to force non-greediness.

=head3 Delivering a substring to some code that generates a pass/fail

(?*{code}) delivers a substring to the code, which returns success
or failure.  The code sees the substring as $_.  This is not dependent on the
Boolean regex concept and could be used for other things, though it is most
useful in this context.

This is sort of equivalent to (?: (.*)(??{$_ = $1; code})), i.e. it matches an
arbitrarily long substring and delivers it to the code.  But not dependent on
how many brackets have been 

RFC 274 (v1) Generalised Additions to Regexs

2000-09-22 Thread Perl6 RFC Librarian

=head1 TITLE

Generalised Additions to Regexs

=head1 VERSION

  Maintainer: Richard Proctor [EMAIL PROTECTED]
  Date: 22 Sep 2000
  Mailing List: [EMAIL PROTECTED]
  Number: 274
  Version: 1
  Status: Developing

=head1 ABSTRACT

This proposes a way for generalised additions to regex capabilities.

=head1 DESCRIPTION

Given that expansion of regexes could include (+...) and (*...) I have
been thinking about providing a general purpose way of adding
functionality.  Hence I propose that the entire (+...) syntax is
kept free from formal specification for this. (+ = addition)

A module or anything that wants to support some enhanced syntax
registers something that handles "regex enhancements".

At regex compile time, if and when (+foo) is found, perl calls
each of the registered regex enhancements in turn. These:

1) Are passed the foo string as a parameter exactly as is.  (There is
an issue of actually finding the end of the generic foo.)

2) The regex enhancement can either recognise the content or not.

3) If not, the enhancement returns undef and perl goes to the next regex
enhancement.  (Does it handle the enhancements as a stack (last checked
first) or a list (first checked first)?  How are they scoped?  A job here
for the OO/scoping fanatics.)

4) If perl runs out of registered regex enhancements it reports an error.

5) If an enhancement recognises the content it could do either of:

a) Return a replacement regex, expanded using existing capabilities; perl will
then pass this back through the regex compiler.

b) Return a coderef that is called at run time when the regex gets to this
point.  The referenced code needs to have enough access to the regex internals
to be able to see the current sub-expression, request more characters, and have
access to relevant flags and visibility of greediness.  It may also need a coderef
that is similarly called when the regex is being unwound as it backtracks.
These features would also be of interest to the existing code inside regexes
as well.
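
Steps 1 to 4 and case (a) can be sketched in ordinary Perl as a source-level preprocessor. Everything here is an assumption made for illustration: register_enhancement and expand_enhancements are hypothetical names, handlers are checked as a stack (last registered first), and the end of foo is found by assuming it contains no brackets.

```perl
use strict;
use warnings;

my @enhancements;    # registered handlers, most recently added first

sub register_enhancement { unshift @enhancements, shift }

# Expand every (+foo) in a pattern source string by asking each handler
# in turn; a handler returns undef if it does not recognise the content.
sub expand_enhancements {
    my ($source) = @_;
    $source =~ s{\(\+([^()]*)\)}{
        my $content = $1;
        my $replacement;
        for my $handler (@enhancements) {
            $replacement = $handler->($content);
            last if defined $replacement;
        }
        die "No regex enhancement recognises (+$content)\n"
          unless defined $replacement;
        "(?:$replacement)";
    }ge;
    return $source;
}

register_enhancement( sub { $_[0] eq 'digit4' ? '\d{4}' : undef } );
register_enhancement( sub { $_[0] eq 'word'   ? '\w+'   : undef } );

my $pattern = expand_enhancements('(+word)-(+digit4)');
print "$pattern\n";    # (?:\w+)-(?:\d{4})
```

Case (b), a coderef run at match time, has no equivalent in this sketch, since it needs the access to regex internals described above.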


Thinking from that - the last case should be generalised (it is sort of
like my (?*{...}) from RFC 198 or an enhancement to (??{...})).  If so, cases
(a) and (b) are the same, as case (b) is just a case of returning (?*{...}) with the
appropriate code.

Following on, if (?{...}) etc. code is evaluated
in a forward match, it would be a good idea to likewise support some
code block that is ignored on a forward match but is executed when the
code is unwound due to backtracking.  Thus (?{ foo })(?\{ bar })
executes foo on the forward case and bar if it unwinds.  I don't
care at the moment what the syntax is - what about the concepts?
Think about foo putting something on a stack (e.g. the bracket to match
[RFC 145]) and bar taking it off, for example.
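
Perl 5 already has one limited mechanism along these lines: a variable assigned with local inside (?{ ... }) is restored when the regex backtracks past that point. The sketch below follows the counting example in the perlre documentation; it shows automatic undoing of state, not the arbitrary unwind code proposed here.

```perl
use strict;
use warnings;

our ( $cnt, $res );    # local() needs package variables

'aaaaaaaa' =~ m<
    (?{ $cnt = 0 })                       # initialise the counter
    ( a (?{ local $cnt = $cnt + 1 })      # count each a that is consumed
    )*
    aaaa                                  # forces the loop to give back four
    (?{ $res = $cnt })                    # record the count that survived
>x;

print "$res\n";    # 4
```

The loop first consumes all eight characters; as it backtracks to leave room for the trailing aaaa, each localised increment is undone, so the final count is 4.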

Note:

I don't consider this RFC complete, but after posting this on the regex list
to no effect I am making it an RFC to see if it gets a little more feedback...

=head1 MIGRATION

This is a new feature - no compatibility problems.

=head1 IMPLEMENTATION

This has not been looked at in detail, but the description above provides
some views as to how it may operate.

=head1 REFERENCES

RFC 145 - Bracket matching

RFC 198 - Boolean Regexes