[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

Jeffrey C. Jacobs Thu, 24 Apr 2008 09:06:43 -0700

Jeffrey C. Jacobs <[EMAIL PROTECTED]> added the comment:

Thanks Jim for your thoughts!


Armaury has already explained about Perl 5.10.0.  I suppose it's like
Macintosh version numbering, since Mac Tiger went from version 10.4.9 to
10.4.10 and 10.4.11 a few years ago.  Maybe we should call Python 2.6
Python 2.06 just in case.  But 2.6 is the known last in the 2 series so
it's not a problem for us!  :)

>> as well as add a few python-specific
>
> because this also adds to the scope.

At this point the only python-specific changes I am proposing would be
items 2, 3 (discussed below), 5 (discussed below), 6 and 7.  6 is only a
documentation change, the code is already implemented.  7 is just a
better behavior.  I think it is RARE one compiles more than 100 unique
regular expressions, but you never know as projects tend to grow over
time, and in the old code the 101st would be recompiled even if it was
just compiled 2 minutes ago.  The patch is available so I leave it to
the community to judge for themselves whether it is worth it, but as you
can see, it's not a very large change.

>> 2) Make named matches direct attributes 
>> of the match object; i.e. instead of m.group('foo'), 
>> one will be able to write simply m.foo.
>
>> 3) (maybe) make Match objects subscriptable, such 
>> that m[n] is equivalent to m.group(n) and allow slicing.
>
> (2) and (3) would both be nice, but I'm not sure it makes sense to do 
> *both* instead of picking one.

Well, I think named matches are better than numbered ones, so I'd
definitely go with 2.  The problem with 2, though, is that it still
leaves the rather typographically intense m.group(n), since I cannot
write m.3.  However, since capture groups are always numbered
sequentially, it models a list very nicely.  So I think for indexing by
group number, the subscripting operator makes sense.  I was not
originally suggesting m['foo'] be supported, but I can see how that may
come out of 3.  But there is a restriction on python named matches that
they have to be valid python and that strikes me as 2 more than 3
because 3 would not require such a restriction but 2 would.  So at least
I want 2, but it seems IMHO m[1] is better than m.group(1) and not in
the least hard or a confusing way of retrieving the given group.  Mind
you, the Match object is a C-struct with python binding and I'm not
exactly sure how to add either feature to it, but I'm sure the C-API
manual will help with that.

>> 5) Add a well-formed, python-specific comment modifier, 
>> e.g. (?P#...);  
>
> [handles parens in comments without turning on verbose, but is slower]
>
> Why?  It adds another incompatibility, so it has to be very useful or 
> clear.  What exactly is the advantage over just turning on verbose?

Well, Larry Wall and Guido agreed long ago that we, the python
community, own all expressions of the form (?P...) and although I'd be
my preference to make (?#...) more in conformance with understanding
parenthesis nesting, changing the logic behind THAT would make python
non-standard.  So as far as any conflicting design, we needn't worry.

As for speed, the this all occurs in the parser and does not effect the
compiler or engine.  It occurs only after a (?P has been read and then
only as the last check before failure, so it should not be much slower
except when the expression is invalid.  The actual execution time to
find the closing brace of (?P#...) is a bit slower than that for (?#...)
but not by much.

Verbose is generally a good idea for anything more than a trivial
Regular Expression.  However, it can have overhead if not included as
the first flag: an expression is always checked for verbose
post-compilation and if it is encountered, the expression is compiled a
second time, which is somewhat wasteful.  But the reason I like the
(?P#...) over (?#...) is because I think people would more tend to assume:

r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't.

That expression only matches "He ls)llo", so I created the (?P#...) to
make the comment match type more intuitive:

r'He(?P# 2 (TWO) ls)llo' matches "Hello".

>> 9) C-Engine speed-ups. ...
>> a number of Macros are being eliminated where appropriate.
>
> Be careful on those, particular on str/unicode and different
> compile options.

Will do; thanks for the advice!  I have only observed the UNICODE flag
controlling whether certain code is used (besides the ones I've added)
and have tried to stay true to that when I encounter it.  Mind you,
unless I can get my extra 10% it's unlikely I'd actually go with item 9
here, even if it is easier to read IMHO.  However, I want to run the new
engine proposal through gprof to see if I can track down some bottlenecks.

At some point, I hope to get my current changes on Launchpad if I can
get that working.  If I do, I'll give a link to how people can check out
my working code here as well.

__________________________________
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2636>
__________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue2636] Regexp 2.6 (modifications to current re 2.2.2)

Reply via email to