Re: is \1 vs $1 a necessary distinction?

2000-09-27 Thread Michael Maraist


From: "Dave Storrs" <[EMAIL PROTECTED]>

> Both \1 and $1 refer to what is matched by the first set of parens in a
> regex.  AFAIK, the only difference between these two notations is that \1
> is used within the regex itself and $1 is used outside of the regex.  Is
> there any reason not to standardize these down to one notation (i.e.,
> eliminate one or the other)?

\1 came from sed and friends.  I think an early driving force was
maintaining familiarity with things like awk and sed.  Even today there are
still people who switch to and from other reg-ex languages.  Emacs is the
most common for me (though I still dabble with awk).  I don't see a real
advantage in taking out \1; doing so is very likely to needlessly break
legacy code, and to confuse developers who have a habit of using \1.

On the other hand, the use of $1 with substitutions is important for
consistency.  When you write s/../.../e, you're going to need to use a
substitution variable; "\1" just doesn't fit.
s/(...)/pre\1post/;  works fine
s/(...)/pre$1post/;  is the question.  I tend to use it only because I
sometimes switch to:
s/(...)/func() . "$1post"/e;  for various reasons.  I just try to
standardize on $1, but that's just me.
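
To make the two replacement forms concrete, here is a small sketch.  The
func() helper and the sample string are made up for illustration; only the
$1 spelling is shown, since that is the one that carries over to /e.

```perl
use strict;
use warnings;

# Hypothetical helper, standing in for the func() mentioned above.
sub func { return "pre" }

my $input = "abc";

# Plain replacement: $1 interpolates as in a double-quoted string.
(my $plain = $input) =~ s/(b)/pre$1post/;

# With /e the replacement is Perl code, so $1 is the spelling that fits.
(my $evaluated = $input) =~ s/(b)/func() . "$1post"/e;

print "$plain\n$evaluated\n";
```

Both substitutions produce the same string here; the point is only that
the $1 form survives the move to /e unchanged.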

Additionally, the use of $1 in the matching reg-ex is ambiguous, as in:
m/(...).*?$1/;
Does it refer to the internal set of (...), or does it mean the previous
value of $1 before this match?  This becomes non-obvious to the observer in
the following case:
m/($keyword).*?$1/;
Here, our mindset is substitution of external variables; the casual
(non-seasoned) observer might not understand that it really means:
m/($keyword).*?\1/;
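
In Perl 5 as it stands, the two spellings already behave differently
inside a pattern: $1 is interpolated from whatever the previous successful
match left behind, while \1 refers to the current match's own group.  A
small sketch (the strings are invented for the example):

```perl
use strict;
use warnings;

my $text    = "foo bar foo z";
my $keyword = "foo";

# Seed $1 with an unrelated earlier match.
"zz" =~ /(z)/ or die "seed match failed";

# $1 is interpolated before matching, so this pattern is really /(foo).*?z/.
my $interpolated = $text =~ /($keyword).*?$1/ ? $& : undef;

# \1 is a true backreference to this pattern's own first group.
my $backreference = $text =~ /($keyword).*?\1/ ? $& : undef;

print "interpolated:  $interpolated\n";
print "backreference: $backreference\n";
```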

My argument is that both \1 and $1 have their places, and limiting to one
type can be troublesome.  Plus, TMTOWTDI. :)

-Michael




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-26 Thread Michael Maraist

> There is, but as MJD wrote: "it ain't pretty". Now, semantic checks or
> assertions would be the only reason why I'd expect to be able to execute
> perl code every time a part of a regex is successfully parsed. Simply
> look at RFC 197: a syntactic extension to regexes just to check if a
> number is within a range! That is absurd, isn't it? Would a simple way
> to include localized tests, *any* test, make more sense?

I'm trying to stick to a general philosophy of what's in a reg-ex, and I can
almost justify assertions since, as you say, \d, ^, $, (?=), etc. are these
very sorts of things.  I've been avoiding most of this discussion because
it's been so odd that I can't believe these proposals will ultimately get
accepted.  Given the argument that it's unlikely that (?{code}) has been
implemented in production, I can almost see changing its semantics.  From
what I understand, the point would be to run some sort of perl-code that
returns defined / undefined, where undefined forces a back-track.

As you said, we shouldn't encourage full-fledged execution (since core dumps
are common).  I can definitely see simple optimizations such as (?{$1 op
const}), though other interesting things such as (?{exists $keywords{ $1 }})
might proliferate.  That would expand to the general purpose (?{
isKeyword( $1 ) }), which then allows function calls within the reg-ex,
which is just asking for trouble.

One restriction might be to disallow various op-codes within the reg-ex
assertion.  Namely user-function calls, reg-ex's, and most OS or IO
operations.

A very common thing could be an optimal /((?>\d+))(?{MIN < $1 && $1 < MAX})/,
where MIN and MAX are constants.
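
Perl 5 can already approximate a range assertion of this kind by using a
code block as the condition of (?(...)...) and forcing failure with (?!).
This is a rough sketch rather than the proposed syntax; the bounds and the
sub name first_in_range are chosen arbitrarily for illustration.

```perl
use strict;
use warnings;
use constant { MIN => 10, MAX => 100 };   # illustrative bounds only

# Returns the first whole number strictly between MIN and MAX, or undef.
# The code condition is true when the number is out of range, in which
# case the (?!) branch fails and the engine moves on to the next candidate.
sub first_in_range {
    my ($text) = @_;
    return $text =~ /\b((?>\d+))\b(?(?{ $1 <= MIN || $1 >= MAX })(?!))/
        ? $1
        : undef;
}

my $found = first_in_range("5 50 500");
print defined $found ? "$found\n" : "no match\n";
```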

-Michael




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-26 Thread Michael Maraist


> On 25 Sep 2000 20:14:52 -, Perl6 RFC Librarian wrote:
>
> >Remove C<?{ code }>, C<??{ code }> and friends.
>
> I'm putting the finishing touches on an RFC to drop (?{...}) and replace
> it with something far more localized, hence cleaner: assertions, also in
> Perl code. That way,
>
> /(?
> would only match integers between 0 and 255.
>
> Communications between Perl code snippets inside a regex would be
> strongly discouraged.

I can't believe that there currently isn't a means of killing a back-track
based on perl-code.  Looking through perlre, it seems like you're right.  I'm
not really crazy about breaking backward compatibility like this, though.  It
shouldn't be too hard to find another character sequence to perform your
above job.

Beyond that, there's a growing rift between reg-ex extenders and purifiers.
I assume the functionality you're trying to produce above is to find the
first bare number that is less than 256 (your above would match the 25 in
256).  That is easily fixed by inserting (?!\d) between the second and third
aggregates.  If you were to be more strict, you could more simply apply
\b(\d+)\b...

In any case, the above is not as intuitive to the casual observer as the
following might be:

while ( /(\d+)/g ) {
    if ( $1 < 256 ) {
        $answer = $1;
        last;
    }
}
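
A self-contained version of that loop, using the stricter \b(\d+)\b form,
might look like the sketch below.  The sub name and the sample strings are
invented for the example.

```perl
use strict;
use warnings;

# Returns the first whole number below 256 found in $text, or undef.
sub first_small_number {
    my ($text) = @_;
    while ( $text =~ /\b(\d+)\b/g ) {
        return $1 if $1 < 256;
    }
    return;
}

my $answer = first_small_number("id 256, port 80");
print defined $answer ? "$answer\n" : "none\n";
```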

Likewise, complex matching tokens are the realm of a parser (I'm almost
getting tired of saying that).  Please be kind to your local maintainer,
don't proliferate n'th order code complexities such as recursive or
conditional reg-ex's.  Yes, I can mandate that my work doesn't use them, but
it doesn't mean that CPAN won't (and I often have to reverse engineer CPAN
modules to figure out why something isn't working).

That said, nobody should touch the various relative reg-ex operators.  I
look at reg-ex as a tokenizer, and things like (?>...) which optimizes
reading, and (?


Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Michael Maraist

From: "Simon Cozens" <[EMAIL PROTECTED]>

> > A lot of what is trying to happen in (?{..}) and friends is parsing.
>
> That's not the problem that I'm trying to solve. The problem I'm trying
> to solve is interdependence. Parsing is neither here nor there.

Well, I recognize that your focus was not on parsing.  However, I don't feel
that perl-abstractness is a key deliverable of perl.  My comment was
primarily on how the world might be a better place with reg-ex's not getting
into algorithms that are better solved elsewhere.  I just thought it might
help your cause if you expanded your rationale.

-Michael




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Michael Maraist

From: "Hugo" <[EMAIL PROTECTED]>



> :Remove C<?{ code }>, C<??{ code }> and friends.
>
> Whoops, I missed this bit - what 'friends' do you mean?

Going by the topic, I would assume it involves (?(cond) true-exp |
false-exp).
There's also $^R, or whatever it was, that holds the result of (?{ }).
Basically the code-like operations found in perl 5.005 and 5.6's perlre.

-Michael




Re: RFC 308 (v1) Ban Perl hooks into regexes

2000-09-25 Thread Michael Maraist

> Ban Perl hooks into regexes
>
> =head1 ABSTRACT
>
> Remove C<?{ code }>, C<??{ code }> and friends.
>

At first, I thought you were crazy, then I read

>It would be preferable to keep the regular expression engine as
>self-contained as possible, if nothing else to enable it to be used
>either outside Perl or inside standalone translated Perl programs
>without a Perl runtime.

Which makes a lot of sense in the development field.

Tom has mentioned that the reg-ex engine is getting really out of hand;
it's hard enough to document clearly, much less be understandable to the
maintainer (or even the debugger).

A lot of what is trying to happen in (?{..}) and friends is parsing.  To
quote Star Trek Undiscovered Country, "Just because we can do a thing,
doesn't mean we should."  Tom and I have commented that parsing should be
done in a PARSER, not a lexer (like our beloved reg-ex engine).  RecDescent
and Yacc do a wonderful job of providing parsing power within perl.

I'd suggest you modify your RFC to summarize the above; that (?{}) and
friends are parsers, and we already have RecDescent / etc. which are much
easier to understand, and don't require too much additional overhead.

Other than the inherent coolness of having hooks into the reg-ex code, I
don't really see much real use from it other than debugging; e.g. (?{ print
"Still here\n" }).  I could go either way on the topic, but I'm definitely
of the opinion that we shouldn't continue down this dark path any further.


-Michael




Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-07 Thread Michael Maraist


- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))


> How about qy() for Quote Yacc  :-)  This stuff is starting to look
> more and more like we're trying to fold lex and yacc into perl.  We
> already have lex through (?{code}) in REs, but we have to hand-write
> our own yacc-a-likes.

Though you can do cool stuff in (?{code}), I wouldn't quite call it lex.
First off, we're dealing with an NFA instead of a DFA, and at the very
least, that gives you back-tracking.  True, local allows you to preserve
state to some degree.  But the following is as close as I can come to
considering (?{code}) a lexer:

sub lex_init {
    my $str = shift;
    our @tokens;
    $str =~ / \G (?{ local @tokens; })
              (?: TokenDelim(\d+) (?{ push @tokens, [ 'digit', $1 ] })
                | TokenDelim(\w+) (?{ push @tokens, [ 'word', $1 ] })
              )
            /gx;
}

sub getNextToken {  shift @tokens; }

I'm not even suggesting this is a good design.  Just showing how awkward it
is.
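
For comparison, the more conventional way to tokenize in Perl 5 keeps the
code outside the pattern and walks the string with \G and the /gc flags.
This is only a minimal sketch: it assumes whitespace as the token
delimiter, and the sub name tokenize is made up for illustration.

```perl
use strict;
use warnings;

# Splits $str into [type, text] pairs, scanning left to right.
sub tokenize {
    my ($str) = @_;
    my @tokens;
    pos($str) = 0;
    while ( pos($str) < length $str ) {
        if    ( $str =~ /\G\s+/gc )   { next }                          # skip delimiters
        elsif ( $str =~ /\G(\d+)/gc ) { push @tokens, [ 'digit', $1 ] }
        elsif ( $str =~ /\G(\w+)/gc ) { push @tokens, [ 'word',  $1 ] }
        else  { die "unexpected character at offset " . pos($str) . "\n" }
    }
    return @tokens;
}

my @tokens = tokenize("foo 42 bar");
print join( " ", map { "$_->[0]:$_->[1]" } @tokens ), "\n";
```

It still needs the whole string up front, so it does not address the
streaming concern; it only avoids embedding code in the pattern.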

Another problem with lexing in perl is that you pretty much need the
entire string before you begin processing, while a good lexer only needs the
next character.  Ideally, the input is a character stream.  Already we're
talking about a lot of alteration and work here.  Not something I'd be crazy
about putting into the core.

-Michael






Re: RFC 145 (alternate approach)

2000-09-06 Thread Michael Maraist


- Original Message -
From: "Richard Proctor" <[EMAIL PROTECTED]>
Sent: Tuesday, September 05, 2000 1:49 PM
Subject: Re: RFC 145 (alternate approach)


> On Tue 05 Sep, David Corbin wrote:
> > Nathan Wiger wrote:
> > > But, how about a new ?m operator?
> > >/(?m<<|[).*?(?M>>|])/;
> There already is a (?m
> Current Use in perl5
> (?# comment
> (?imsx flags
> (?-imsx flags
> (?: subexpression without bracket capture
> (?= zero-width positive look ahead
> (?! zero-width negative look ahead
> (?<= zero-width positive look behind
> (?<! zero-width negative look behind
> (?{code} Execute code
> (??{code} Execute code and use result as pattern
> (?> Independent subexpression
> (?(condition)yes-pattern
> (?(condition)yes-pattern|no-pattern
>
> Suggested in RFCs either current or in development
>
> (?$foo= suggested for assignment (RFC 112)
> (?%foo= suggested for hash assignment (RFC 150?)
>
> (?@foo suggested list expansion (?:$foo[0] | $foo[1] | ...) ? (RFC 166)
> (?Q@foo) Quote each item of lists (RFC 166)
> (?^pattern) matches anything that does not match pattern
> (RFC 166 but will be somewhere else on next rewrite [1])
> (?F Failure tokens (RFC in development by me [1])
> (?r),(?f) Suggested in Direction Control RFC 1
> (?& Boolean regexes (RFC in development [1])
> (?*{code}) Execute code with pass/fail result (RFC in development [1])
>
> a,b,c,d,e, ,g,h, ,j,k,l, ,n,o,p,q, , ,t,u,v,w,x,y,z
> A,B,C,D,E, ,G,H,I,J,K,L,M,N,O,P, ,R,S,T,U,V,W,X,Y,Z
> 0,1,2,3,4,5,6,7,8,9
> `_,."+[];'~)

Ok, I've read through some of the archives, and thought this was a good
starting point.
I haven't seen any discussion on an obvious solution (though in another
email, I suggested that this approach should be foregone in favor of a
parsing approach.. But one thing at a time).

There are two general problems as I see it.  First, you have to be able to
specify exactly what you're matching.  Obviously generically matching "[<(`"
etc is going to be upset if your nesting has simple things like " a < 5 " or
"I'm going home, it's hot".  A design goal, therefore should be to
explicitly state the matching characters.  Second, you need to be able to
apply additional expression-syntax to match inside the nesting.

An additional problem occurs when you suggest using pragmas to specify
delimiters.  It could be a performance hit, if not a developer's nightmare.
When I run eval, must I always set the pragma, just in case there is some
weird scoping problem?  It's the same problem as when using all global
variables (and the 'local' keyword.  God I hate that thing).

Therefore, I suggest a commonly used form:

/(?N [ { ] . )/x

Note that I use N, which stands for nesting, instead of the redundant 'M'atch.
I don't know how well character-based op-codes will be accepted.  As pointed
out above, the symbol-space is shrinking fast.

The dots describe further matching / capturing within the delimiters.  Thus
/A (?N [ { ] ) B/x
will match 'A' followed by a bracket grouping (anything therein is fine),
then followed by 'B'.

/A (?N [ { ] ( .* ) ) B/x
does the same as above, but captures the internal contents (excluding the
delimiters).

/A ( (?N [ { ]  ) ) B/x
will capture all the contents, including the delimiters.

/A (?N [ [ ( ]  ( .* )  ) B/x
Same as before, but with squares and parentheses.  Note delim specifiers can
obey the same rules as normal character classes, thus [ [ ( { < ] means
collect the entire group.  POSIX classes can be used for all of them, as in
[=open_braces=] (don't care what the phrase actually is).  The reason I
chose this is because we are essentially doing a character class, so we
might as well explicitly use one; it makes more logical sense.  Note that to
make emacs happy, you should be able to escape all the one-way delimiters,
as in [ \[ \( \{ \< ].  That might also make it easier to read, explicitly
showing that these are being treated as characters, and not as actual
operators.
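
For reference, Perl 5.10 and later can express balanced delimiters with
recursive subpatterns, which gives roughly the capture behaviour described
above without new syntax.  The sketch below handles braces only and uses
an invented sample string; (?N ...) itself remains a hypothetical
notation.

```perl
use strict;
use warnings;

my $text = "A{x{y}z}B";

# Group 1 is the whole braced group; (?1) recurses into it for nested braces.
# Group 2 is the contents without the outer delimiters.
my ( $with_delims, $contents ) =
    $text =~ / A ( \{ ( (?: [^{}]++ | (?1) )* ) \} ) B /x;

print "$with_delims\n$contents\n";
```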

As for special operations such as (/* ... */ ), then I would recommend the
usage of named-character classes.  [=c_comment=], for example.  I'm not sure
how those classes are defined, but this obviously requires the system to be
extensible (RFC anyone?).  Course this violates my issue of using pragmas to
alter the operation of reg-ex's.  Most likely only built-in types should
work.

Another feature could be to treat the end of matching-brace as an
end-of-line.  Thus the above .* will properly exit.  If this turns out to
not work, then .* can necessarily be replaced by .*?.  The advantage of this
is in nested expressions, as in:

$r_kw = qr/Keyword \s* .* /x;
$r_lisp_expr = qr/ (?N [ ( ] $r_kw ) /x;
$line = <>;
$line =~ $r_lisp_expr;

But this would also have worked with:
$r_kw = qr/Keyword \s* .* $/x;
Since '$' would treat ')' as '\n'.

The main advantages of this approach are:
That you can still pre-compile an expression and guarantee that it won't
need recompiling, and that it'll always act the same.
That you can nest the puppies with complete lack of ambiguity, and
littl

Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145 (alternate approach))

2000-09-06 Thread Michael Maraist


- Original Message -
From: "Jonathan Scott Duff" <[EMAIL PROTECTED]>
Subject: Re: XML/HTML-specific ?< and ?> operators? (was Re: RFC 145
(alternate approach))


> On Wed, Sep 06, 2000 at 08:40:37AM -0700, Nathan Wiger wrote:
> > What if we added special XML/HTML-parsing ?< and ?> operators?
>
> What if we just provided deep enough hooks into the RE engine that
> specialized parsing constructs like these could easily be added by
> those who need them?
>
> -Scott

Ok, I've avoided this thread for a while, but I'll make my comment now.
I've played with several ideas of reg-ex extensions that would allow
arbitrary "parsing".  My first goal was to be able to parse perl-like text,
then later a simple nested parentheses, then later nested xml as with this
thread.

I have been able to solve these problems using perl5.6's recursive reg-ex's
and inserted procedure code.  Unfortunately this isn't very safe, nor is it
'pretty' for a non-perl-guru to figure out.  What's more, what I'm attempting
to do with these nested parens and xml is to _parse_ the data.  Well, guess
what guys, we've had decades of research into the area of parsing, and we
came out with yacc and lex.  My point is that I think we're approaching this
the wrong way.  We're trying to apply more and more parser power into what
classically has been the lexer / tokenizer, namely our beloved
regular-expression engine.

A great deal of string processing is possible with Perl's enhanced NFA
engine, but at some point we're looking at perl code that is inside out: all
code embedded within a reg-ex.  That, boys and girls, is a parser, and I'm
not convinced it's the right approach for rapid design, and definitely not
for large-scale robust design.

As for XML, we already have lovely C modules that take care of that.  You
even get your choice: call per tag, or generate a tree (where you can search
for sub-trees).  What else could you want?  (Ok, stupid question, but you
could still accomplish it via a customized parser).

My suggestion, therefore, would be to discuss a method of incorporating more
powerful and convenient parsing within _perl_; not necessarily directly
within the reg-ex engine, and most likely not within a reg-ex statement.  I
know we have Yacc and Parser modules.  But try this out for size: Perl's
very name is about extraction and reporting.  Reg-ex's are fundamental to
this, but for complex jobs, so is parsing.  After I think about this some
more, I'm going to make an RFC for it.  If anyone has any hardened opinions
on the matter, I'd like to hear from you while my brain churns.

-Michael





Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()

2000-08-28 Thread Michael Maraist



> >Simple solution.
>
> >If you want to require formats such as m/.../ (which I actually think is a
> >good idea), then make it part of -w, -W, -ww, or -WW, which would be a perl6
> >enhancement of strictness.
>
> That's like having "use strict" enable mandatory perlstyle compliance
> checks, and rejecting the program otherwise.  Doesn't seem sensible.
>
> --tom

Well, use strict refuses symbolic (soft) references, and -w warns about use
of undefined values.  It may be that these are easy to check for at the low
level, and therefore are candidates for flag-based operations.  But for
style, I don't see why the interpreter can't also check for various
non-obscure syntaxes / styles.  I doubt that requiring m/.../ really would
help parsing performance any, though.
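
As a reminder of what those two existing checks look like in practice,
here is a small sketch.  The variable names are invented, and the warning
is captured with a __WARN__ handler only so the example can inspect it.

```perl
use strict;
use warnings;

our $counter = 1;
my $name = "counter";

# With strict refs switched off, a string can stand in for a variable name.
my $via_symref = do { no strict 'refs'; ${$name} };

# Under strict, the same dereference dies at run time.
my $strict_error = eval { my $value = ${$name}; 1 } ? '' : $@;

# Under warnings, using an undefined value warns but execution carries on.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $undefined;
    my $joined = "x" . $undefined;
}

print "symbolic reference gave: $via_symref\n";
print "strict said: $strict_error";
print "warnings said: $warnings[0]";
```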

Compatibility is going to have to be maintained somehow.  And we can either
have some sort of perl6 designator (such as the pragma) to designate
incompatible (and otherwise ambiguous) code, or we're going to have to
continue tacking on syntactic sugar to legacy code.

-Michael





Re: RFC 164 (v1) Replace =~, !~, m//, and s/// with match() and subst()

2000-08-28 Thread Michael Maraist

> If you want to change STUPID behaviour that should be avoided by current
> programs (such as empty regexes) fine.

Simple solution.

If you want to require formats such as m/.../ (which I actually think is a
good idea), then make it part of -w, -W, -ww, or -WW, which would be a perl6
enhancement of strictness.

Likewise, things like legacy Formats would not be allowed in -WW.  This
gives flexibility to the programmer, and can help the interpreter to make
optimizations where necessary.

If you needed legacy module compatibility, then maybe we should use pragmas
like the following:

use 6.0;

or

use 6.0 ':no-compat';

Programs and modules could assign themselves to a compatibility contract
(lacking the require statement defaults to perl5 compat).  The reason for
having 'use' instead of 'require' is that the interpreter can turn on
compile-time warnings / optimizations as it goes from module to module.

Maybe this should be an RFC.

-Michael