all regexp RFCs

Hugo Fri, 08 Sep 2000 18:52:25 -0700
Hi guys, I'm sorry that time has not permitted me to join and take an
active part in the perl6-language-regex list; however, I have grabbed
an opportunity to look through the RFCs generated to date, and thought
I should throw some comments at you.

Apologies in advance for so rudely dumping this lot and _still_ not
joining the list; sorry also if I duplicate stuff that's already
been said. Feel free to ignore all or any of this. You'll need to cc
me if you want me to see replies, and in that case you might want to
do what I didn't, and tailor the subject to be more specific.

I've tried in particular to add a note about implementation issues
in each case.

Enjoy,

Hugo
---
RFC 72: Variable-length lookbehind: the regexp engine should also go backward.
======

This is an interesting idea. However, it is not obvious to me that
there is any practical difference between the existing:
  /(?<= a+ ) b/x
.. and the proposed:
  /b (?`= a+ )/x
.. which implies that implementing one would be as difficult as the
other. And if that is the case, fixing (?<=...) to support variable
length would be preferable, since it is more general. (Consider
/\d+ (?<! 00) \. \d+/x, for example: AFAICS the proposed (?`=...)
does not allow the lookbehind to be anchored anywhere other than the
start of the match.)

While it would be great to have a working variable-length lookbehind,
it is not obvious how you would implement it: the internal structure
of a compiled regexp, as currently implemented, does not (I believe)
hold enough information to allow you to walk it backwards. It might
still be possible, though, with a fair amount of effort; you would,
for example, have to rewrite (?<= ([abc]) ([def]) g \2 \1 ) into
(?<= \1 \2 g ([def]) ([abc]) ), or maybe swap the \1 and \2.
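
For illustration, one way to see what "walking backwards" involves is to reverse the subject string and mirror the pattern by hand, so the lookbehind becomes an ordinary lookahead with no length restriction. This is only a sketch of that workaround, not the RFC's mechanism; the helper name `b_preceded_by_as` is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: does $str contain a "b" immediately preceded by
# one or more "a"s?  Emulates /(?<= a+ ) b/x by reversing the string,
# which turns the lookbehind into a lookahead.
sub b_preceded_by_as {
    my ($str) = @_;
    my $rev = reverse $str;    # scalar context: reverses the characters
    return $rev =~ /b (?= a+ )/x ? 1 : 0;
}

print b_preceded_by_as("xaaab"), "\n";
print b_preceded_by_as("xb"),    "\n";
```

The trick only works when the whole pattern can be mirrored by hand; anything with backreferences would need the reordering described above.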

RFC 93: Regex: Support for incremental pattern matching
======

I love this to bits. You might consider changing the arguments to
the fetcher($n;$s), such that if $n is positive it requests the
next $n characters, else it is a final call returning the -$n bytes
of $s to the stream. Not sure if this is any better than the current
proposal, but it might be easier to understand if the first argument
always represented a number of bytes.
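
A minimal sketch of a fetcher following the ($n; $s) convention suggested above, using an in-memory string as the stream. Everything here (`make_fetcher`, the closure it returns) is hypothetical and only illustrates the calling convention; it is not an existing Perl interface.

```perl
use strict;
use warnings;

# Hypothetical fetcher factory: $data stands in for the input stream.
sub make_fetcher {
    my ($data) = @_;
    return sub {
        my ($n, $s) = @_;
        if ($n > 0) {
            # Request: remove and return the next $n characters
            # (fewer if the stream is nearly exhausted).
            return substr($data, 0, $n, '');
        }
        # Final call: the last -$n characters of $s were not consumed,
        # so put them back at the front of the stream.
        $data = substr($s, $n) . $data if $n < 0;
        return '';
    };
}

my $fetch = make_fetcher("hello world");
my $got   = $fetch->(8);       # takes "hello wo"
$fetch->(-3, $got);            # hands back " wo"
print $fetch->(100), "\n";     # rest of the stream
```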

I do not think implementation should be too difficult, though I
assume all optimisation should be turned off for such matches. It
might also be desirable to have a new regexp flag 'no optimisation
desired' to avoid the compile-time work done for optimisation's
sake, for optimisation's sake. IYSWIM.

RFC 110: counting matches
=======

I like this too. I'd suggest /t should mean a) return a scalar of
the number of matches and b) don't set any special variables. Then
/t without /g would return 0 or 1, but be faster since no extra
information need be captured (except internally for (.)\1 type
matching - compile time checks could determine if these are needed,
though (?{..}) and (??{..}) patterns would require disabling of
that optimisation). /tg would give a scalar count of the total
number of matches. \G would retain its meaning.
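
For comparison, a sketch of how a match count is usually obtained in current Perl, which is roughly what /tg would replace. The helper name `count_matches` is made up for the example.

```perl
use strict;
use warnings;

# Count matches with the list-assignment idiom: "() = ..." forces list
# context on the match, and the outer scalar assignment yields the
# number of elements returned.
sub count_matches {
    my ($str, $re) = @_;
    my $count = () = $str =~ /$re/g;
    return $count;
}

print count_matches("abcabcab", qr/ab/), "\n";
```

Note that this idiom builds the full list of matches and sets the usual special variables along the way, and that a pattern containing capture groups returns one element per group per match, so the count is only a match count for capture-free patterns.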

Any which way, implementation should be fairly straightforward,
though ensuring that optimisations occurred precisely when they
are safe would probably involve a few bug-chasing cycles.

RFC 112: Assignment within a regex
=======

This is cool, and has been requested several times in the past.
There is an outstanding issue of how variable references should
be scoped when encountered within regexps, however. Consider:

  {
    local $a = 1;
    my $re = qr{ (?$a = .) }x;
    {
      my $a = 2;
      "3" =~ $re;
      print $a;
    }
    print $a;
  }

This is a problem that needs to be solved in any case, for proper
understanding of how (?{..}) and (??{..}) should be interpreted,
and I assume this proposed feature should be handled the same way.
Implementation should not be particularly difficult once that
knotty issue is resolved.
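
For illustration, something close to the proposed assignment can be spelled with (?{..}) in later perls (assuming 5.8 or newer, where $^N is available). This sketch uses a package variable precisely to sidestep the lexical-scoping question raised above; the names `$got` and `first_char_via_code_block` are made up.

```perl
use strict;
use warnings;

our $got;    # package variable, so no lexical-scoping ambiguity

# (?{ ... }) runs Perl code at that point in the match; $^N holds the
# text of the most recently closed capture group.
sub first_char_via_code_block {
    my ($str) = @_;
    local $got;
    $str =~ /(.)(?{ $got = $^N })/;
    return $got;
}

print first_char_via_code_block("3"), "\n";
```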

RFC 144: Behavior of empty regex should be simple
=======

Absolutely. <snip>

RFC 145: Brace-matching for Perl Regular Expressions
=======

This is an interesting idea. I'm not sure how useful it would
actually be: as far as I can see it would not match the block
on code such as:

  use matchpairs '{' => '}';
  <<EOF =~ /\m.*\M/;
  {
    my $brace = '{';
    ...
  }
  EOF

.. and most of the pair-matching patterns I've tried to write in
the past have needed to cope with embedded oddities such as
quoted-strings, comments etc.
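
The difficulty can be demonstrated with any purely syntactic brace matcher. The sketch below uses a recursive pattern, which assumes Perl 5.10 or later, rather than the \m and \M of the RFC; `first_balanced_block` is a made-up helper.

```perl
use strict;
use warnings;

# Return the first brace-balanced block in $text, or undef if none.
# (?1) recurses into capture group 1.
sub first_balanced_block {
    my ($text) = @_;
    return $text =~ / ( \{ (?: [^{}]++ | (?1) )* \} ) /x ? $1 : undef;
}

my $clean  = "{ a { b } c }";
my $tricky = "{\n  my \$brace = '{';\n}\n";

print first_balanced_block($clean),  "\n";
print first_balanced_block($tricky), "\n";
```

On the first string the whole block is returned. On the second, the quoted brace is counted as an opener, so the outer block never balances and the match instead starts at the brace inside the string literal.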

It might be useful to add some more complex examples to show
how you'd deal with such things. Another type of example that
would be useful is HTML parsing:
  <table border=1>
    <tr>stuff...</tr>
    <tr>stuff...
  </TABLE>
.. since it also isn't clear to me whether you'd be able to
extract the table contents, or the rows, using the mechanisms
of this proposal.

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns
=======

This is cool - I don't think I've seen this suggested before.

Implementation might be a bit more work: the backreferences are
currently stored as offsets (relative to the start of the string)
to the beginning and end of the contents of the backref, and it
might be a bit expensive for normal use to extend that either by
replacing the start offset with a pointer or by adding an extra
per-backref flag. Faster alternatives are possible, but would be
more complex.
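
The offset representation is visible from Perl itself through the @- and @+ arrays (available from 5.6.0), which hold the start and end offsets of each group. A small sketch that rebuilds the captures from those offsets; `captures_from_offsets` is a made-up helper.

```perl
use strict;
use warnings;

# Rebuild each capture of the last match from its start/end offsets.
# Groups that did not take part in the match come back as undef.
sub captures_from_offsets {
    my ($str, $re) = @_;
    return unless $str =~ $re;
    return map {
        defined $-[$_] ? substr($str, $-[$_], $+[$_] - $-[$_]) : undef
    } 1 .. $#+;
}

my @parts = captures_from_offsets("key=value", qr/(\w+)=(\w+)/);
print "@parts\n";
```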

RFC 158: Regular Expression Special Variables
=======

I'd love to see the performance penalty removed. I'm not sure that
an extra /k flag is the right solution, though I don't have any
concrete alternative to offer.

There has been much discussion of this problem on p5p in the past;
it would be handy to have some references in the RFC to any of the
more informative parts of those threads.

RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()
=======

I don't particularly dislike =~, but I can see that others might.
I think this RFC actually has two distinct parts, which should
probably be separated: the syntax change, and the changes to
behaviour under various contexts. I'm not sure I clearly
understand what the latter are, or why they are necessary. I'm
particularly confused about:

   1. If called in a void context, [the new operators] act on and modify C<$_>,
      consistent with current behavior.

Was this supposed to say 'the C<$str> arguments (or C<$_>)'?

The syntax change does not impact on the regexp engine at all as far
as I can see; I'm not sure whether implementation would make the
perl parser more or less complex. I don't think I understand the
other changes well enough to guess at implementation issues.

RFC 165: Allow Varibles in tr///
=======

Definitely. Should be easy to implement. There is a potential for
confusion, since it makes the tr/ lists look even more like
m/ and s/ patterns, but I think it can only cause less confusion than
the current state of affairs. It is tempting to make it the default,
and have a flag to turn it off (or just backwhack the dagnabbed
dollar), and auto-translation of existing scripts would be pretty
easy, except that it would presumably fail exactly where people
are using the current workaround, by way of eval.
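
For reference, a sketch of the eval workaround mentioned above. The helper name `tr_with_vars` is made up, and the code assumes $from and $to are trusted and contain no "/" characters.

```perl
use strict;
use warnings;

# tr/// does not interpolate variables, so the operator is built as a
# string and compiled at run time with eval.
sub tr_with_vars {
    my ($str, $from, $to) = @_;
    eval "\$str =~ tr/$from/$to/; 1" or die $@;
    return $str;
}

print tr_with_vars("hello", "a-y", "b-z"), "\n";
```

An automatic translator would have to recognise strings like the one passed to eval here, which is why the existing workaround is exactly where it would be likely to fail.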

It would be helpful to tie down what should occur for @var and
%var (but note that this one-liner changed between 5.6.0 and 5.7.0:
  crypt% setperl 5.6.0
  crypt% perl -we '/.@x./'
  In string, @x now must be written as \@x at -e line 1, near ".@x"
  Execution of -e aborted due to compilation errors.
  crypt% setperl 5.7.0
  crypt% perl -we '/.@x./'
  Possible unintended interpolation of @x in string at -e line 1.
  Name "main::x" used only once: possible typo at -e line 1.
  Use of uninitialized value in pattern match (m//) at -e line 1.
  crypt% 
).

RFC 166: Additions to regexs
=======

(?@foo) and (?Q@foo) are both things I've wanted before now. I'm
not sure if this is the right syntax, particularly if RFC 112 is
adopted: it would be confusing for (?@foo) to have so
different a meaning from (?$foo=...), and even more so if the
latter is ever extended to allow (?@foo=...).
I see no reason that implementation should cause any problems
since this is purely a regexp-compile time issue.
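
A sketch of how this is typically spelled by hand today, on the assumption that (?Q@foo) is meant to match any one element of @foo literally. The helper name `alternation_of` is made up.

```perl
use strict;
use warnings;

# Build a non-capturing alternation from a list, quoting each element
# so that regex metacharacters are matched literally.
sub alternation_of {
    my @items = @_;
    my $alt = join '|', map { quotemeta($_) } @items;
    return qr/(?:$alt)/;
}

my @foo = ('a.b', 'c+d');
my $re  = alternation_of(@foo);
print "a.b" =~ /\A$re\z/ ? "match\n" : "no match\n";
print "axb" =~ /\A$re\z/ ? "match\n" : "no match\n";
```

Dropping the quotemeta call gives the unquoted variant, where each element is treated as a pattern in its own right.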

(?^pattern) is interesting; I'm not sure I've ever felt a need
for it, but I'm sure I'd find a use for it if it appeared. I'm
guessing that 'pattern' is anchored to the matches either side,
so that the example would match 'fooXbazXbar' but not 'foobazbar',
but it wasn't quite clear whether that was intended. I _think_
implementation should be easy, but it is the sort of thing that
could throw up some unexpected boojums.

(?) doesn't seem necessary, since (?=) does the same already.
In general, I feel that /x and generous dollops of whitespace
are a better solution. Dead easy to implement, though.

RFC 170: Generalize =~ to a special-purpose assignment operator
=======

Fascinating stuff. I'm not convinced that this proposal does
require RFC 164 - it seems perfectly happy to stand on its
own. The use of '($name) =~ ...' in an earlier example does
seem to me to require the same support as the later example
of '($name, $email) =~ ...', but I may have misunderstood
this.

Not sure how you'd implement all this, but I don't see that
it impacts the regexp engine at all, so I'll leave others to
guess at that.

Note that the title given for RFC 164 in REFERENCES is no
longer correct.

RFC 197: Numeric Value Ranges In Regular Expressions
=======

'[]' and '()' already play multiple roles within regexps, to the
confusion of many (in part because of the heuristics involved in
deciding whether /$a[12345]/ refers to an array element or not).
There are also many other number formats that people may want to
recognise, particularly unsigned numbers and the mantissa/exponent
form.

The implementation of the runtime aspect would be pretty easy;
I'm not sure how easy the parsing would be.

RFC 198: Boolean Regexes
=======

Note that it is currently possible to match multiple patterns at
the same point, with:
  /(?=pattern1)(?=pattern2).../
.. and to check unanchored with:
  /(?=.*foo)(?=.*bar)/
.. but your proposal is very interesting. I don't think 'boolean'
expresses the new capabilities very well though (and I can't
think of a better word right now), and I hope there is a better
syntax possible because I find it scary that within (?*...)
two more punctuation characters are suddenly going to acquire a
special meaning.
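
A runnable version of the unanchored lookahead check shown above, wrapped in a made-up helper `has_foo_and_bar` for illustration:

```perl
use strict;
use warnings;

# True if $str contains both "foo" and "bar", in either order.
# Each lookahead is tried from the same position and consumes nothing.
sub has_foo_and_bar {
    my ($str) = @_;
    return $str =~ /(?=.*foo)(?=.*bar)/ ? 1 : 0;
}

print has_foo_and_bar("a bar, then a foo"), "\n";
print has_foo_and_bar("only foo here"),     "\n";
```

Without /s the dot does not cross newlines, so as written both words have to appear on the same line.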

I don't think anything except the /x flag should turn on /x, and
nothing should require it - the programmer should make the choice.

The 'Brackets within boolean regexes' subsection is interesting;
it seems independent of the 'boolean' concept, so it might be
worth splitting off into a separate RFC - you'd probably need a
new regexp flag to turn this on in that case.

It isn't clear to me _what_ substring is delivered to the code
in (?*{code}).

You'll find discussion of how to implement 'fail a match from
within' in the p5p archives, both as an operator (sorry, can't
remember whether the proposed operator was the same (?F) you
suggest) and using (?{ last }) or (?{ return }) (which could
also tie up with RFC 199). It is not too obvious what such a
construct should do, though - as far as I remember Ilya pointed
out some nasty problems in previous discussions. In any case,
this also seems independent of the 'boolean' stuff, and should
probably be in a separate RFC.

---
Have you ever sat 10 minutes motionless in front of a screen
debating whether to put in that optional semicolon? Man, you
haven't lived...