Re: [Python-Dev] New regex module for 3.2?

2010-07-28 Thread Gregory P. Smith
On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray rdmur...@bitdance.comwrote:

 On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de
 wrote:
  Gregory P. Smith, 27.07.2010 07:40:
   A max cache size of 100 was too small.  I just increased it to 500 in
 the
   py3k branch along with implementing a random replacement cache overflow
   policy.  It now randomly drops 20% of the compiled regular expression
 cache
   instead of simply dropping the entire cache on overflow.
  
   With the regex_v8 benchmark, the better cache replacement policy sped
 it up
   ~7% while raising the cache size on top of that (likely meaning the
 cache
   was never overflowing) sped it up ~25%.
  
   Random replacement without dropping everything at least means apps
 thrashing
   the cache degrade much more gracefully.
 
  The same algorithm should be helpful in ElementTree's ElementPath module.

 We recently added the old re cache-clearing strategy to
 fnmatch, because previously its cache would grow indefinitely.
 It sounds like this should be applied there as well.

 That's three...time to figure out how to share the code?


No doubt.

Its already a standalone _shrink_cache function with unit tests that doesn't
care the dictionaries it is shrinking are composed of.  Easy enough to move
somewhere more useful.  Any proposed stdlib locations?  I'll be offline on
vacation soon so I may not get to it for a couple weeks but feel free to
move it without me if anyone is interested.

-gps
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-28 Thread Nick Coghlan
On Wed, Jul 28, 2010 at 4:50 PM, Gregory P. Smith g...@krypto.org wrote:
 On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray rdmur...@bitdance.com
 wrote:
 On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de
 wrote:
  Gregory P. Smith, 27.07.2010 07:40:
   Random replacement without dropping everything at least means apps
   thrashing
   the cache degrade much more gracefully.
 
  The same algorithm should be helpful in ElementTree's ElementPath
  module.

 We recently added the old re cache-clearing strategy to
 fnmatch, because previously its cache would grow indefinitely.
 It sounds like this should be applied there as well.

 That's three...time to figure out how to share the code?

 No doubt.

Anyone remember off the top of their head what linecache does? ... ah,
it's just unbounded unless you call clearcache(). Probably OK for that
particular application.

 Its already a standalone _shrink_cache function with unit tests that doesn't
 care the dictionaries it is shrinking are composed of.  Easy enough to move
 somewhere more useful.  Any proposed stdlib locations?  I'll be offline on
 vacation soon so I may not get to it for a couple weeks but feel free to
 move it without me if anyone is interested.

I created a tracker issue for the idea so we don't forget about it:
http://bugs.python.org/issue9396

collections seems like a fairly obvious possibility, but the flavour
doesn't quite seem right for that. Nowhere else springs to mind
though.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-27 Thread Stefan Behnel

Gregory P. Smith, 27.07.2010 07:40:

A max cache size of 100 was too small.  I just increased it to 500 in the
py3k branch along with implementing a random replacement cache overflow
policy.  It now randomly drops 20% of the compiled regular expression cache
instead of simply dropping the entire cache on overflow.

With the regex_v8 benchmark, the better cache replacement policy sped it up
~7% while raising the cache size on top of that (likely meaning the cache
was never overflowing) sped it up ~25%.

Random replacement without dropping everything at least means apps thrashing
the cache degrade much more gracefully.


The same algorithm should be helpful in ElementTree's ElementPath module.

Stefan

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-27 Thread R. David Murray
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de wrote:
 Gregory P. Smith, 27.07.2010 07:40:
  A max cache size of 100 was too small.  I just increased it to 500 in the
  py3k branch along with implementing a random replacement cache overflow
  policy.  It now randomly drops 20% of the compiled regular expression cache
  instead of simply dropping the entire cache on overflow.
 
  With the regex_v8 benchmark, the better cache replacement policy sped it up
  ~7% while raising the cache size on top of that (likely meaning the cache
  was never overflowing) sped it up ~25%.
 
  Random replacement without dropping everything at least means apps thrashing
  the cache degrade much more gracefully.
 
 The same algorithm should be helpful in ElementTree's ElementPath module.

We recently added the old re cache-clearing strategy to
fnmatch, because previously its cache would grow indefinitely.
It sounds like this should be applied there as well.

That's three...time to figure out how to share the code?

--
R. David Murray  www.bitdance.com
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-26 Thread Georg Brandl
Am 22.07.2010 12:53, schrieb Guido van Rossum:
 On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl g.bra...@gmx.net wrote:
 Nevertheless, the authoritative reference for our regex engine is its
 docs, i.e. http://docs.python.org/library/re.html -- and that states
 clearly that inline flags apply to the whole regex.

 I think with a new regex implementation, not all of this historical
 semantics must be copied, unless there are major real usecases, which
 would be affected by this.

 As I already said, I *have* seen this in real code.  As MRAB indicated,
 this was the only silent change in semantics as compared to the old
 regex engine.  If we replace re by regex, which I think is the only
 way to get the new features in the stdlib, changing this one aspect is
 a) not backwards compatible and b) in a subtle way that forces everyone
 to review his/her regular expressions.  That's definitely not
 acceptable.
 
 I wonder if the solution could be a new flag that you have to specify
 if you want the flags to apply locally only. Let's say add 'L' to the
 flags. Then for existing code (not using 'L') the flags should apply
 globally, while someone wanting flags to apply to only part of a regex
 can add 'L' to the flags syntax.
 
 Is that acceptable?

I guess it would be; however it remains to be proven that this scoping
is actually needed in addition to the (backwards compatible) grouped
scoping.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-26 Thread Gregory P. Smith
On Thu, Jul 22, 2010 at 3:26 PM, Nick Coghlan ncogh...@gmail.com wrote:

 On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl g.bra...@gmx.net wrote:
  Sure -- I don't think this is a showstopper for regex.  However if we
 don't
  include regex in a future version, we might think about increasing
 MAXCACHE
  a bit, and maybe not clear the cache when it reaches its max length, but
  rather remove another element.

 Yikes, I didn't know it did that. That certainly sounds like it should
 be an RFE in its own right - some basic form of Least Recently Used
 accounting should be beneficial (although the extra bookkeeping might
 hurt scripts that aren't hitting the cache limit).


A max cache size of 100 was too small.  I just increased it to 500 in the
py3k branch along with implementing a random replacement cache overflow
policy.  It now randomly drops 20% of the compiled regular expression cache
instead of simply dropping the entire cache on overflow.

With the regex_v8 benchmark, the better cache replacement policy sped it up
~7% while raising the cache size on top of that (likely meaning the cache
was never overflowing) sped it up ~25%.

Random replacement without dropping everything at least means apps thrashing
the cache degrade much more gracefully.

 http://svn.python.org/view?view=revrevision=83173

This change should be incorporated into MRAB's regex module in order to keep
comparisons fair.

-gps
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-23 Thread Hrvoje Niksic

On 07/22/2010 01:34 PM, Georg Brandl wrote:

Timings (seconds to run the test suite):

re 26.689  26.015  26.008
regex  26.066  25.797  25.865

So, I thought there wasn't a difference in performance for this use case
(which is compiling a lot of regexes and matching most of them only a
few times in comparison).  However, I found that looking at the regex
caching is very important in this case: re._MAXCACHE is by default set to
100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
running the test suite, I get times around 18 (!) seconds for re.


This seems to point to re being significantly *faster* than regexp, even 
in matching, and as such may be something the author would want to look 
into.


Nick writes:

 That still fits with the compile/match performance trade-off changes
 between re and regex though.

The performance trade-off should make regex slower with sufficiently 
small compiled regex cache, when a lot of time is wasted on compilation. 
 But as the cache gets larger (and, for fairness, of the same size in 
both implementations), regex should outperform re.  Georg, would you 
care to measure if there is a difference in performance with an even 
larger cache?

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-23 Thread Nick Coghlan
On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic hrvoje.nik...@avl.com wrote:
 The performance trade-off should make regex slower with sufficiently small
 compiled regex cache, when a lot of time is wasted on compilation.  But as
 the cache gets larger (and, for fairness, of the same size in both
 implementations), regex should outperform re.  Georg, would you care to
 measure if there is a difference in performance with an even larger cache?

That's not quite accurate. The figure that matters is matches per
compilation, and that is ultimately governed by the nature of the
application once the cache is sufficiently large (e.g. an application
that compiles 10 different regular expressions, but also only matches
each one against a single string is going to be slowed down
significantly by a switch to regex no matter how big the compilation
cache may be).

The total runtime for a particular re matching exercise is (average
compilation time)*(number of times compiled) + (average match
time)*(number of matches performed)

We know that the new regex module speeds up matching at the cost of
slowing down compilation. Whether that is a net increase for the
application as a whole depends on 4 things:

X: the average speed change in compilation for the application
Y: the average speed change in matching for the application
A: the proportion of time the application currently spends compiling
regular expressions
B: the proportion of time the application currently spends matching
regular expressions

The overall proportional speed change for re operations in a given
application is then just X*A + Y*B. For regex to be an improvement for
that application, the resulting figure needs to be less than 1.

For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on
average, but compilation also takes twice as long), then the overall
speed change would be 2*A + 0.5*B. For that to be a wash, the original
application needs to spend 1/3 of its expression matching time
compiling the expressions and 2/3 actually matching them (the
application with the new engine would then spend 2/3 of its time
compiling and only 1/3 matching).

That's going to be a tough assessment to make in practice, but I think
Georg's pygments test suite example shows that there is real world
code out there that already spends a lot of time on compilation.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-23 Thread Georg Brandl
Am 23.07.2010 11:16, schrieb Hrvoje Niksic:
 On 07/22/2010 01:34 PM, Georg Brandl wrote:
 Timings (seconds to run the test suite):

 re 26.689  26.015  26.008
 regex  26.066  25.797  25.865

 So, I thought there wasn't a difference in performance for this use case
 (which is compiling a lot of regexes and matching most of them only a
 few times in comparison).  However, I found that looking at the regex
 caching is very important in this case: re._MAXCACHE is by default set to
 100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
 running the test suite, I get times around 18 (!) seconds for re.
 
 This seems to point to re being significantly *faster* than regexp, even 
 in matching, and as such may be something the author would want to look 
 into.
 
 Nick writes:
 
   That still fits with the compile/match performance trade-off changes
   between re and regex though.
 
 The performance trade-off should make regex slower with sufficiently 
 small compiled regex cache, when a lot of time is wasted on compilation. 
   But as the cache gets larger (and, for fairness, of the same size in 
 both implementations), regex should outperform re.  Georg, would you 
 care to measure if there is a difference in performance with an even 
 larger cache?

I did measure that, and there are no significant differences in timing.

I also did the check the other way around, and restricting regex._MAXCACHE
to 100 I got from 26 seconds to 42 seconds. (Nick, is that enough data to
calculate A and B now? ;)

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Georg Brandl
Am 13.07.2010 15:35, schrieb Antoine Pitrou:
 On Tue, 13 Jul 2010 15:20:23 +0100
 Michael Foord fuzzy...@voidspace.org.uk wrote:
 On 13/07/2010 15:17, Reid Kleckner wrote:
  On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com  wrote:
 
  MRAB's module offers a superset of re's features rather than a subset
  though, so once it has had more of a chance to bake on PyPI it may be
  worth another look.
   
  I feel like the new module is designed to replace the current re
  module, and shouldn't need to spend time in PyPI.  A faster regex
  library isn't going to motivate users to add external dependencies to
  their projects.
 
 
 If the backwards compatibility issues can be addressed and MRAB is 
 willing to remain as maintainer then the advantages seem well worth it 
 to me.
 
 To me as well. The code needs a full review before integrating, though.

FWIW, I've now run the Pygments test suite (Pygments has about 2500 regular
expressions that are exercised there) and only had two problems:

* Scoped flags: A few lexers use (?s) and similar flags at the end of
  the expression, which has no effect in regex currently.

* POSIX character classes: One regex used a class '[][:xyz]', so the [:
  was seen as the start of a character class.  I'm not sure how common
  this is, as most people seem to escape brackets in character classes.
  Also, it gives a clear error on regex.compile(), not mysterious
  failures.

Timings (seconds to run the test suite):

re 26.689  26.015  26.008
regex  26.066  25.797  25.865

So, I thought there wasn't a difference in performance for this use case
(which is compiling a lot of regexes and matching most of them only a
few times in comparison).  However, I found that looking at the regex
caching is very important in this case: re._MAXCACHE is by default set to
100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
running the test suite, I get times around 18 (!) seconds for re.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Guido van Rossum
On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl g.bra...@gmx.net wrote:
 Nevertheless, the authoritative reference for our regex engine is its
 docs, i.e. http://docs.python.org/library/re.html -- and that states
 clearly that inline flags apply to the whole regex.

 I think with a new regex implementation, not all of this historical
 semantics must be copied, unless there are major real usecases, which
 would be affected by this.

 As I already said, I *have* seen this in real code.  As MRAB indicated,
 this was the only silent change in semantics as compared to the old
 regex engine.  If we replace re by regex, which I think is the only
 way to get the new features in the stdlib, changing this one aspect is
 a) not backwards compatible and b) in a subtle way that forces everyone
 to review his/her regular expressions.  That's definitely not
 acceptable.

I wonder if the solution could be a new flag that you have to specify
if you want the flags to apply locally only. Let's say add 'L' to the
flags. Then for existing code (not using 'L') the flags should apply
globally, while someone wanting flags to apply to only part of a regex
can add 'L' to the flags syntax.

Is that acceptable?

-- 
--Guido van Rossum (python.org/~guido)
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Nick Coghlan
On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote:
 So, I thought there wasn't a difference in performance for this use case
 (which is compiling a lot of regexes and matching most of them only a
 few times in comparison).  However, I found that looking at the regex
 caching is very important in this case: re._MAXCACHE is by default set to
 100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
 running the test suite, I get times around 18 (!) seconds for re.

That still fits with the compile/match performance trade-off changes
between re and regex though. It does make it clear this isn't going to
be a win across the board though - things like test suites are going
to have more one-off regex operations than a long-running web server
or a filesystem or database scanning operation.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Georg Brandl
Am 22.07.2010 14:12, schrieb Nick Coghlan:
 On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote:
 So, I thought there wasn't a difference in performance for this use case
 (which is compiling a lot of regexes and matching most of them only a
 few times in comparison).  However, I found that looking at the regex
 caching is very important in this case: re._MAXCACHE is by default set to
 100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
 running the test suite, I get times around 18 (!) seconds for re.
 
 That still fits with the compile/match performance trade-off changes
 between re and regex though. It does make it clear this isn't going to
 be a win across the board though - things like test suites are going
 to have more one-off regex operations than a long-running web server
 or a filesystem or database scanning operation.

Sure -- I don't think this is a showstopper for regex.  However if we don't
include regex in a future version, we might think about increasing MAXCACHE
a bit, and maybe not clear the cache when it reaches its max length, but
rather remove another element.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Reid Kleckner
On Thu, Jul 22, 2010 at 7:42 AM, Georg Brandl g.bra...@gmx.net wrote:
 Am 22.07.2010 14:12, schrieb Nick Coghlan:
 On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote:
 So, I thought there wasn't a difference in performance for this use case
 (which is compiling a lot of regexes and matching most of them only a
 few times in comparison).  However, I found that looking at the regex
 caching is very important in this case: re._MAXCACHE is by default set to
 100, and regex._MAXCACHE to 1024.  When I set re._MAXCACHE to 1024 before
 running the test suite, I get times around 18 (!) seconds for re.

It might be fun to do a pygments based macro benchmark.  Just have it
syntax highlight itself and time it.

 Sure -- I don't think this is a showstopper for regex.  However if we don't
 include regex in a future version, we might think about increasing MAXCACHE
 a bit, and maybe not clear the cache when it reaches its max length, but
 rather remove another element.

+50 for the last idea.  Collin encountered a problem two summers ago
in Mondrian where we were relying on the regex cache and were
surprised to find that it cleared itself after filling up, rather than
using LRU or random eviction.

Reid
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-22 Thread Nick Coghlan
On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl g.bra...@gmx.net wrote:
 Sure -- I don't think this is a showstopper for regex.  However if we don't
 include regex in a future version, we might think about increasing MAXCACHE
 a bit, and maybe not clear the cache when it reaches its max length, but
 rather remove another element.

Yikes, I didn't know it did that. That certainly sounds like it should
be an RFE in its own right - some basic form of Least Recently Used
accounting should be beneficial (although the extra bookkeeping might
hurt scripts that aren't hitting the cache limit).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-16 Thread Vlastimil Brom
2010/7/9 Georg Brandl g.bra...@gmx.net:
 Am 09.07.2010 02:35, schrieb MRAB:


 1. Some of the inline flags are scoped; for example, putting (?i) at
 the end of a regex will now have no effect because it's no longer a
 global, all-or-nothing, flag.

 That is problematic.  I've often seen people put these flags at the end
 of a regex, probably for readability purposes.  IMHO it would be better
 to limit flag scoping to the explicit (?flags-flags: ) groups.


I just noticed the formulation on the reference page
regular-expressions.info on this kind of flags:
(?i)   Turn on case insensitivity for the remainder of the regular
expression. (Older regex flavors may turn it on for the entire
regex.) and likewise for other flags.

http://www.regular-expressions.info/refadv.html

I am not sure, how authoritative this page by Jan Goyvaerts is for
various implementations, but it looks like a very comprehensive
reference.
I think with a new regex implementation, not all of this historical
semantics must be copied, unless there are major real usecases, which
would be affected by this.
Just a thought;
Vlastimil Brom
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-16 Thread Georg Brandl
Am 16.07.2010 17:08, schrieb Vlastimil Brom:
 2010/7/9 Georg Brandl g.bra...@gmx.net:
 Am 09.07.2010 02:35, schrieb MRAB:


 1. Some of the inline flags are scoped; for example, putting (?i) at
 the end of a regex will now have no effect because it's no longer a
 global, all-or-nothing, flag.

 That is problematic.  I've often seen people put these flags at the end
 of a regex, probably for readability purposes.  IMHO it would be better
 to limit flag scoping to the explicit (?flags-flags: ) groups.

 
 I just noticed the formulation on the reference page
 regular-expressions.info on this kind of flags:
 (?i) Turn on case insensitivity for the remainder of the regular
 expression. (Older regex flavors may turn it on for the entire
 regex.) and likewise for other flags.
 
 http://www.regular-expressions.info/refadv.html
 
 I am not sure, how authoritative this page by Jan Goyvaerts is for
 various implementations, but it looks like a very comprehensive
 reference.

Nevertheless, the authoritative reference for our regex engine is its
docs, i.e. http://docs.python.org/library/re.html -- and that states
clearly that inline flags apply to the whole regex.

 I think with a new regex implementation, not all of this historical
 semantics must be copied, unless there are major real usecases, which
 would be affected by this.

As I already said, I *have* seen this in real code.  As MRAB indicated,
this was the only silent change in semantics as compared to the old
regex engine.  If we replace re by regex, which I think is the only
way to get the new features in the stdlib, changing this one aspect is
a) not backwards compatible and b) in a subtle way that forces everyone
to review his/her regular expressions.  That's definitely not
acceptable.


Georg


-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-13 Thread Gregory P. Smith
On Thu, Jul 8, 2010 at 12:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:

 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 How much interest would there be in putting it in Python 3.2?


A big +1 from me.  I'm all for putting it in Python 3.2 alphas and beta1 as
a *replacement* for the existing 're' module in order to get it visibility
and a shakeout and make this available to the world sooner rather than
later.  This module was designed for compatibility for exactly that reason.
 Lets give it a high publicity fair shake it iron out any unknown
compatibility issues early on so that we can decide how to proceed.

Also, +1 to Antoine's point on volunteering to be the maintainer of it for
the next couple major releases.

(I do expect someone to balk at this idea with but python 3.x doesn't have
enough users yet for a beta cycle to shake out hidden regular expression
engine assumption bugs to which I'll respond in advance with thats a good
thing; we can add a regular expression sniffer to 2to3 that flags potential
problems in code being ported over)

-gps
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-13 Thread Vlastimil Brom
2010/7/8 MRAB pyt...@mrabarnett.plus.com:
 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

    http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 How much interest would there be in putting it in Python 3.2?


Hi,
please, let me apologize for posting here, not being a python developer;
I'd like to support the inclusion of the new regex library in the standard lib.
I use it since the early development versions in my internal app for
searching, modifying, ordering, extracting data from text - mainly
using the manually created regex patterns. I see, that it is only one
specific usecase, and the app isn't  time or space critical (input
texts up to a few MB, mostly smaller; the processing time is often
rather negligible compared to the gui presentation, styling etc.)
However, I see it as a great enhancement both in terms of regex
features (listed on http://pypi.python.org/pypi/regex ) as well as the
behaviour in some cornercases, which aren't effectively usable in the
current re (e.g. nested subexpressions with quantifiers - while many
of these are more effectively solved with the added possesive
quantifiers).
I think, this is a far more feature complete engine, which doesn't
induce any significant drawbacks (IMO) comparing to the current re and
is backward compatible.
(The mentioned exception in the scoped flags might be fixable by
allowing only explicit scoping (?flags)...(?-flags) or using the
current paren, if possible (?flag:...) and treating the bare flag
setting parens as global;
however, I would consider it quite misleading in the current re, if
these flags are set at some other place than at the beginning of the
pattern. I don't see the readability enhanced in any way with these
flags set at the end, while they should apply from the beginning of
the pattern.)

Having seen MRABs commitment in the development phase in both -
bugfixes and feature additions - including some rather complex ones
(in my opinion) like unicode properties, I am also confident, that he
could be a competent maintainer of this package in the standardlib as
well.

just my biased opinion...
Regards,

   Vlastimil Brom
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-13 Thread Reid Kleckner
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan ncogh...@gmail.com wrote:
 MRAB's module offers a superset of re's features rather than a subset
 though, so once it has had more of a chance to bake on PyPI it may be
 worth another look.

I feel like the new module is designed to replace the current re
module, and shouldn't need to spend time in PyPI.  A faster regex
library isn't going to motivate users to add external dependencies to
their projects.

Reid
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-13 Thread Michael Foord

On 13/07/2010 15:17, Reid Kleckner wrote:

On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com  wrote:
   

MRAB's module offers a superset of re's features rather than a subset
though, so once it has had more of a chance to bake on PyPI it may be
worth another look.
 

I feel like the new module is designed to replace the current re
module, and shouldn't need to spend time in PyPI.  A faster regex
library isn't going to motivate users to add external dependencies to
their projects.

   
If the backwards compatibility issues can be addressed and MRAB is 
willing to remain as maintainer then the advantages seem well worth it 
to me.


Michael


Reid
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk
   



--
http://www.ironpythoninaction.com/

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-13 Thread Antoine Pitrou
On Tue, 13 Jul 2010 15:20:23 +0100
Michael Foord fuzzy...@voidspace.org.uk wrote:
 On 13/07/2010 15:17, Reid Kleckner wrote:
  On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com  wrote:
 
  MRAB's module offers a superset of re's features rather than a subset
  though, so once it has had more of a chance to bake on PyPI it may be
  worth another look.
   
  I feel like the new module is designed to replace the current re
  module, and shouldn't need to spend time in PyPI.  A faster regex
  library isn't going to motivate users to add external dependencies to
  their projects.
 
 
 If the backwards compatibility issues can be addressed and MRAB is 
 willing to remain as maintainer then the advantages seem well worth it 
 to me.

To me as well. The code needs a full review before integrating, though.

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-12 Thread Nick Coghlan
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano st...@pearwood.info wrote:
 On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
  re2 comparison is interesting from the point of if it should be
  included in stdlib.

 Is it re2 or regex? I don't see having 2 regular expression engines
 in the stdlib.

 There's precedence though... the old regex engine and the new re engine
 were side-by-side for many years before regex was deprecated and
 finally removed in 2.5. Hypothetically, re2 could similarly be added to
 the standard library while re is deprecated.

re2 deliberately omits some features for efficiency reasons, hence is
not even on the table as a possible replacement for the standard
library version. If someone is in a position where re2 can solve their
problems with the re module, they should also be in a position where
they can track it down for themselves.

MRAB's module offers a superset of re's features rather than a subset
though, so once it has had more of a chance to bake on PyPI it may be
worth another look.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-12 Thread Michael Foord

On 12/07/2010 15:07, Nick Coghlan wrote:

On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info  wrote:
   

On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
 

re2 comparison is interesting from the point of if it should be
included in stdlib.
 

Is it re2 or regex? I don't see having 2 regular expression engines
in the stdlib.
   

There's precedence though... the old regex engine and the new re engine
were side-by-side for many years before regex was deprecated and
finally removed in 2.5. Hypothetically, re2 could similarly be added to
the standard library while re is deprecated.
 

re2 deliberately omits some features for efficiency reasons, hence is
not even on the table as a possible replacement for the standard
library version. If someone is in a position where re2 can solve their
problems with the re module, they should also be in a position where
they can track it down for themselves.

   


If it has *partial* compatibility, and big enough performance 
improvements for common cases, it could perhaps be used where the regex 
doesn't use unsupported features. This would have some extra cost in the 
compile phase, but would mean Python could ship with two regex engines 
but only one interface exposed to the programmer...


Michael


MRAB's module offers a superset of re's features rather than a subset
though, so once it has had more of a chance to bake on PyPI it may be
worth another look.

Cheers,
Nick.

   



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog

READ CAREFULLY. By accepting and reading this email you agree, on behalf of 
your employer, to release me from all obligations and waivers arising from any 
and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, 
clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and 
acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your 
employer, its partners, licensors, agents and assigns, in perpetuity, without 
prejudice to my ongoing rights and privileges. You further represent that you 
have the authority to release me from any BOGUS AGREEMENTS on behalf of your 
employer.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-12 Thread Antoine Pitrou
On Mon, 12 Jul 2010 16:18:38 +0100
Michael Foord fuzzy...@voidspace.org.uk wrote:
 On 12/07/2010 15:07, Nick Coghlan wrote:
  On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info  
  wrote:
 
  On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
   
  re2 comparison is interesting from the point of if it should be
  included in stdlib.
   
  Is it re2 or regex? I don't see having 2 regular expression engines
  in the stdlib.
 
  There's precedence though... the old regex engine and the new re engine
  were side-by-side for many years before regex was deprecated and
  finally removed in 2.5. Hypothetically, re2 could similarly be added to
  the standard library while re is deprecated.
   
  re2 deliberately omits some features for efficiency reasons, hence is
  not even on the table as a possible replacement for the standard
  library version. If someone is in a position where re2 can solve their
  problems with the re module, they should also be in a position where
  they can track it down for themselves.
 
 
 
 If it has *partial* compatibility, and big enough performance 
 improvements for common cases, it could perhaps be used where the regex 
 doesn't use unsupported features. This would have some extra cost in the 
 compile phase, but would mean Python could ship with two regex engines 
 but only one interface exposed to the programmer...

You're forgetting the maintenance cost.

Regards

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-12 Thread Tim Wintle
On Mon, 2010-07-12 at 16:18 +0100, Michael Foord wrote: 
 On 12/07/2010 15:07, Nick Coghlan wrote:
  On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info  
  wrote:
 
  re2 deliberately omits some features for efficiency reasons, hence is
  not even on the table as a possible replacement for the standard
  library version. If someone is in a position where re2 can solve their
  problems with the re module, they should also be in a position where
  they can track it down for themselves.
 
 
 If it has *partial* compatibility, and big enough performance 
 improvements for common cases, it could perhaps be used where the regex 
 doesn't use unsupported features. This would have some extra cost in the 
 compile phase, but would mean Python could ship with two regex engines 
 but only one interface exposed to the programmer...

I'm not sure how common those cases are - I played around with RE2 a few
months ago and found that the majority of regular expressions in my own
code were noticeably slower running under RE2 than python's re module -
RE2 just puts much nicer theoretical bounds on cases that were (in my
code at least) unusual.

The really good use case for RE2 is for applications where users can
write regular expressions as input (exactly what RE2 was designed for) -
but I'm not sure how common those applications are.

Tim Wintle

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-12 Thread Collin Winter
On Mon, Jul 12, 2010 at 8:18 AM, Michael Foord
fuzzy...@voidspace.org.uk wrote:
 On 12/07/2010 15:07, Nick Coghlan wrote:

 On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info
  wrote:


 On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:


 re2 comparison is interesting from the point of if it should be
 included in stdlib.


 Is it re2 or regex? I don't see having 2 regular expression engines
 in the stdlib.


 There's precedence though... the old regex engine and the new re engine
 were side-by-side for many years before regex was deprecated and
 finally removed in 2.5. Hypothetically, re2 could similarly be added to
 the standard library while re is deprecated.


 re2 deliberately omits some features for efficiency reasons, hence is
 not even on the table as a possible replacement for the standard
 library version. If someone is in a position where re2 can solve their
 problems with the re module, they should also be in a position where
 they can track it down for themselves.



 If it has *partial* compatibility, and big enough performance improvements
 for common cases, it could perhaps be used where the regex doesn't use
 unsupported features. This would have some extra cost in the compile phase,
 but would mean Python could ship with two regex engines but only one
 interface exposed to the programmer...

FWIW, this has all been discussed before:
http://aspn.activestate.com/ASPN/Mail/Message/python-dev/3829265. In
particular, I still believe that, it's not obvious that enough Python
regexes would benefit from re2's performance/restrictions tradeoff to
make such a hybrid system worthwhile in the long term. (There is no
representative corpus of real-world Python regexes weighted for
dynamic execution frequency to use in assessing such tradeoffs
empirically like there is for JavaScript.)

Collin

 MRAB's module offers a superset of re's features rather than a subset
 though, so once it has had more of a chance to bake on PyPI it may be
 worth another look.

 Cheers,
 Nick.




 --
 http://www.ironpythoninaction.com/
 http://www.voidspace.org.uk/blog

 READ CAREFULLY. By accepting and reading this email you agree, on behalf of
 your employer, to release me from all obligations and waivers arising from
 any and all NON-NEGOTIATED agreements, licenses, terms-of-service,
 shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure,
 non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have
 entered into with your employer, its partners, licensors, agents and
 assigns, in perpetuity, without prejudice to my ongoing rights and
 privileges. You further represent that you have the authority to release me
 from any BOGUS AGREEMENTS on behalf of your employer.


 ___
 Python-Dev mailing list
 Python-Dev@python.org
 http://mail.python.org/mailman/listinfo/python-dev
 Unsubscribe:
 http://mail.python.org/mailman/options/python-dev/collinwinter%40google.com

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-11 Thread anatoly techtonik
On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin jyass...@gmail.com wrote:

 While the re2 comparison might be interesting from an abstract
 standpoint, it intentionally supports a different regex language from
 Python so that it can run faster and use less memory. Since re2 can
 never replace Python's re module, it doesn't make sense to hold MRAB's
 new module to that standard.

re2 comparison is interesting from the point of if it should be
included in stdlib.
-- 
anatoly t.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-11 Thread Eric Smith

On 7/11/2010 5:19 AM, anatoly techtonik wrote:

On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskinjyass...@gmail.com  wrote:


While the re2 comparison might be interesting from an abstract
standpoint, it intentionally supports a different regex language from
Python so that it can run faster and use less memory. Since re2 can
never replace Python's re module, it doesn't make sense to hold MRAB's
new module to that standard.


re2 comparison is interesting from the point of if it should be
included in stdlib.


Is it re2 or regex? I don't see having 2 regular expression engines in 
the stdlib.


Eric.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-11 Thread Nick Coghlan
On Sun, Jul 11, 2010 at 7:19 PM, anatoly techtonik techto...@gmail.com wrote:
 On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin jyass...@gmail.com wrote:

 While the re2 comparison might be interesting from an abstract
 standpoint, it intentionally supports a different regex language from
 Python so that it can run faster and use less memory. Since re2 can
 never replace Python's re module, it doesn't make sense to hold MRAB's
 new module to that standard.

 re2 comparison is interesting from the point of if it should be
 included in stdlib.

No it isn't. re2 is a wrapper for a third party engine that surrenders
some functionality in the pursuit of better bounds on memory and CPU
usage. It is not a drop-in replacement for re and cannot be by design:
The one significant exception is that RE2 drops support for
backreferences and generalized zero-width assertions, because they
cannot be implemented efficiently. There is no reason to have two
distinct regex engines in the standard library - if someone knows
enough to realise they need the performance assurances of re2, they're
also likely to be able to find the Python wrappers for it.

regex is potentially interesting for the standard library as it *is*
intended to be a drop-in replacement for re that trades longer
compilation times (typically once per application) for faster match
times (potentially many times per application). The performance of re2
has nothing to do with the comparison between the current re module
and MRAB's regex module.

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-11 Thread Steven D'Aprano
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
  re2 comparison is interesting from the point of if it should be
  included in stdlib.

 Is it re2 or regex? I don't see having 2 regular expression engines
 in the stdlib.

There's precedence though... the old regex engine and the new re engine 
were side-by-side for many years before regex was deprecated and 
finally removed in 2.5. Hypothetically, re2 could similarly be added to 
the standard library while re is deprecated.


-- 
Steven D'Aprano
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-11 Thread geremy condra
On Sun, Jul 11, 2010 at 7:42 PM, Steven D'Aprano st...@pearwood.info wrote:
 On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote:
  re2 comparison is interesting from the point of if it should be
  included in stdlib.

 Is it re2 or regex? I don't see having 2 regular expression engines
 in the stdlib.

 There's precedence though... the old regex engine and the new re engine
 were side-by-side for many years before regex was deprecated and
 finally removed in 2.5. Hypothetically, re2 could similarly be added to
 the standard library while re is deprecated.

While I realize I'm not the target audience for this, there are a lot
of things I'd like to see more in the stdlib than a second re engine.

Geremy Condra
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread Georg Brandl
Am 09.07.2010 02:35, schrieb MRAB:

 That's not what I'm asking. I'm asking what happens if you take an
 existing Python installation's re module, move it aside, and drop
 regex in its place as re.py.
 
 Doing that and then running Python's own test suite as well as the
 test suites of major Python applications and frameworks like Twisted,
 Zope and Django would provide solid evidence that the new version
 really *is* backwards compatible, rather than that it is *meant* to be
 backwards compatible.
 
 I had to recompile the .pyd to change its internal name from regex to
 re, but apart from that it passed Python's own test suite except for
 where I expected it to fail:
 
 1. Some of the inline flags are scoped; for example, putting (?i) at
 the end of a regex will now have no effect because it's no longer a
 global, all-or-nothing, flag.

That is problematic.  I've often seen people put these flags at the end
of a regex, probably for readability purposes.  IMHO it would be better
to limit flag scoping to the explicit (?flags-flags: ) groups.

 2. The .sub method will treat unmatched groups in an expansion as empty
 strings. The re module raises an exception in such cases, which means
 that users currently need a workaround.

That is a change for the better, I would say.

Georg

-- 
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread anatoly techtonik
On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

    http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

And where are the benchmarks?
In particular it would be interesting to see it compared both to re
from stdlib and re2 from  http://code.google.com/p/re2/

-- 
anatoly t.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread Jeffrey Yasskin
On Fri, Jul 9, 2010 at 7:06 AM, anatoly techtonik techto...@gmail.com wrote:
 On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

    http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 And where are the benchmarks?
 In particular it would be interesting to see it compared both to re
 from stdlib and re2 from  http://code.google.com/p/re2/

While the re2 comparison might be interesting from an abstract
standpoint, it intentionally supports a different regex language from
Python so that it can run faster and use less memory. Since re2 can
never replace Python's re module, it doesn't make sense to hold MRAB's
new module to that standard.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread MRAB

anatoly techtonik wrote:

On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:

Hi all,

I re-implemented the re module, adding new features and speed
improvements. It's available at:

   http://pypi.python.org/pypi/regex

under the name regex so that it can be tried alongside re.

I'd be interested in any comments or feedback. How does it compare with
re in terms of speed on real-world data? The benchmarks suggest it
should be faster, or at worst comparable.


And where are the benchmarks?
In particular it would be interesting to see it compared both to re
from stdlib and re2 from  http://code.google.com/p/re2/


The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform
multiple runs of the tests multiple times, giving just the total times
for each set. Here are the averages:

Python26
BENCHMARKre regex  ratio
bm_regex_effbot  0.135secs  0.083secs  1.63
bm_regex_v8  0.153secs  0.085secs  1.80


Python31
BENCHMARKre regex  ratio
bm_regex_effbot  0.138secs  0.083secs  1.66
bm_regex_v8  0.170secs  0.091secs  1.87
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread Collin Winter
On Fri, Jul 9, 2010 at 10:28 AM, MRAB pyt...@mrabarnett.plus.com wrote:
 anatoly techtonik wrote:

 On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:

 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

   http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 And where are the benchmarks?
 In particular it would be interesting to see it compared both to re
 from stdlib and re2 from  http://code.google.com/p/re2/

 The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform
 multiple runs of the tests multiple times, giving just the total times
 for each set. Here are the averages:

 Python26
 BENCHMARK        re         regex      ratio
 bm_regex_effbot  0.135secs  0.083secs  1.63
 bm_regex_v8      0.153secs  0.085secs  1.80


 Python31
 BENCHMARK        re         regex      ratio
 bm_regex_effbot  0.138secs  0.083secs  1.66
 bm_regex_v8      0.170secs  0.091secs  1.87

Out of curiosity, what are the results for the bm_regex_compile benchmark?

Collin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread MRAB

Collin Winter wrote:

On Fri, Jul 9, 2010 at 10:28 AM, MRAB pyt...@mrabarnett.plus.com wrote:

anatoly techtonik wrote:

On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote:

Hi all,

I re-implemented the re module, adding new features and speed
improvements. It's available at:

  http://pypi.python.org/pypi/regex

under the name regex so that it can be tried alongside re.

I'd be interested in any comments or feedback. How does it compare with
re in terms of speed on real-world data? The benchmarks suggest it
should be faster, or at worst comparable.

And where are the benchmarks?
In particular it would be interesting to see it compared both to re
from stdlib and re2 from  http://code.google.com/p/re2/


The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform
multiple runs of the tests multiple times, giving just the total times
for each set. Here are the averages:

Python26
BENCHMARKre regex  ratio
bm_regex_effbot  0.135secs  0.083secs  1.63
bm_regex_v8  0.153secs  0.085secs  1.80


Python31
BENCHMARKre regex  ratio
bm_regex_effbot  0.138secs  0.083secs  1.66
bm_regex_v8  0.170secs  0.091secs  1.87


Out of curiosity, what are the results for the bm_regex_compile benchmark?


I concentrated my efforts on the matching speed because regexes tend to
be compiled only once, and are cached anyway, so I don't think it's as
important. The results are:

Python26
BENCHMARK re regex  ratio
bm_regex_compile  0.897secs  2.792secs  0.32


Python31
BENCHMARK re regex  ratio
bm_regex_compile  0.902secs  2.731secs  0.33

If anyone can demonstrate that it'll have a significant impact in
practice then I will, of course, look into it further.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread Fred Drake
On Fri, Jul 9, 2010 at 3:35 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 I concentrated my efforts on the matching speed because regexes tend to
 be compiled only once, and are cached anyway, so I don't think it's as
 important.

I think most here will agree with that, but it might be good to keep
in mind that the sre implementation has already made that trade-off
once.  I don't remember what the compilation slowdown was at the time,
but I'd be surprised if Google can't find it, given sufficient fu.


  -Fred

-- 
Fred L. Drake, Jr.fdrake at gmail.com
A storm broke loose in my mind.  --Albert Einstein
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-09 Thread William Wahl
H as long as we aren't the ones writing the check:)

BJ

--Original Message--

From: Fred Drake fdr...@acm.org
Sent: Fri, July 09, 2010 1:16 PM
To: MRAB pyt...@mrabarnett.plus.com
Cc: Python-Dev python-dev@python.org
Subject: Re: [Python-Dev] New regex module for 3.2?


On Fri, Jul 9, 2010 at 3:35 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 I concentrated my efforts on the matching speed because regexes tend to
 be compiled only once, and are cached anyway, so I don't think it's as
 important.

I think most here will agree with that, but it might be good to keep
in mind that the sre implementation has already made that trade-off
once.  I don't remember what the compilation slowdown was at the time,
but I'd be surprised if Google can't find it, given sufficient fu.


  -Fred

-- 
Fred L. Drake, Jr.fdrake at gmail.com
A storm broke loose in my mind.  --Albert Einstein
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/bj.wahl%40oracle.com
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread Nick Coghlan
On Fri, Jul 9, 2010 at 5:52 AM, MRAB pyt...@mrabarnett.plus.com wrote:
 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

    http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 How much interest would there be in putting it in Python 3.2?

The list of fixed bugs/new features is certainly impressive. How does
Python's test suite go if you drop it in place of the current re
module? (Ditto for test suites of major applications and frameworks
like Django, etc).

Off the top of my head, I would say that this won't have enough time
to bake properly for inclusion in 3.2, but if the potential benefits
and intended backwards compatibility are borne out by real world usage
and the code fares well under review then it may be a contender for
3.3. If the backwards compatibility isn't quite there (and can't be
improved), then adding it under a name other than re wouldn't be
impossible, but it would be a harder sell.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread Benjamin Peterson
2010/7/8 MRAB pyt...@mrabarnett.plus.com:
 Hi all,

 I re-implemented the re module, adding new features and speed
 improvements. It's available at:

    http://pypi.python.org/pypi/regex

 under the name regex so that it can be tried alongside re.

 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

 How much interest would there be in putting it in Python 3.2?

It would really be nice if you explained incrementally everything you
changed, so it could better be evaluated.


-- 
Regards,
Benjamin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread Antoine Pitrou
On Thu, 08 Jul 2010 20:52:44 +0100
MRAB pyt...@mrabarnett.plus.com wrote:
 
 I'd be interested in any comments or feedback. How does it compare with
 re in terms of speed on real-world data? The benchmarks suggest it
 should be faster, or at worst comparable.

Can you publish these benchmarks somewhere?
(or send them here)

 How much interest would there be in putting it in Python 3.2?

I think there's certainly interest (especially given that the original
re module doesn't really have an expert and active maintainer).
Since it's a very big change (and a rather annoying to undo one),
though, it must really not add any maintenance problems, and ideally
you should promise to maintain it at least for a couple of years.
Bonus points if the internals are sufficiently documented, too.

Thanks

Antoine.


___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread MRAB

Nick Coghlan wrote:

On Fri, Jul 9, 2010 at 5:52 AM, MRAB pyt...@mrabarnett.plus.com wrote:

Hi all,

I re-implemented the re module, adding new features and speed
improvements. It's available at:

   http://pypi.python.org/pypi/regex

under the name regex so that it can be tried alongside re.

I'd be interested in any comments or feedback. How does it compare with
re in terms of speed on real-world data? The benchmarks suggest it
should be faster, or at worst comparable.

How much interest would there be in putting it in Python 3.2?


The list of fixed bugs/new features is certainly impressive. How does
Python's test suite go if you drop it in place of the current re
module? (Ditto for test suites of major applications and frameworks
like Django, etc).

Off the top of my head, I would say that this won't have enough time
to bake properly for inclusion in 3.2, but if the potential benefits
and intended backwards compatibility are borne out by real world usage
and the code fares well under review then it may be a contender for
3.3. If the backwards compatibility isn't quite there (and can't be
improved), then adding it under a name other than re wouldn't be
impossible, but it would be a harder sell.


You should be able to replace:

import re

with:

import regex as re

and still have everything work the same, ie it's backwards compatible
with re.

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread Nick Coghlan
On Fri, Jul 9, 2010 at 7:54 AM, MRAB pyt...@mrabarnett.plus.com wrote:
 You should be able to replace:

    import re

 with:

    import regex as re

 and still have everything work the same, ie it's backwards compatible
 with re.

That's not what I'm asking. I'm asking what happens if you take an
existing Python installation's re module, move it aside, and drop
regex in its place as re.py.

Doing that and then running Python's own test suite as well as the
test suites of major Python applications and frameworks like Twisted,
Zope and Django would provide solid evidence that the new version
really *is* backwards compatible, rather than that it is *meant* to be
backwards compatible.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New regex module for 3.2?

2010-07-08 Thread MRAB

Nick Coghlan wrote:

On Fri, Jul 9, 2010 at 7:54 AM, MRAB pyt...@mrabarnett.plus.com wrote:

You should be able to replace:

   import re

with:

   import regex as re

and still have everything work the same, ie it's backwards compatible
with re.


That's not what I'm asking. I'm asking what happens if you take an
existing Python installation's re module, move it aside, and drop
regex in its place as re.py.

Doing that and then running Python's own test suite as well as the
test suites of major Python applications and frameworks like Twisted,
Zope and Django would provide solid evidence that the new version
really *is* backwards compatible, rather than that it is *meant* to be
backwards compatible.


I had to recompile the .pyd to change its internal name from regex to
re, but apart from that it passed Python's own test suite except for
where I expected it to fail:

1. Some of the inline flags are scoped; for example, putting (?i) at
the end of a regex will now have no effect because it's no longer a
global, all-or-nothing, flag.

2. The .sub method will treat unmatched groups in an expansion as empty
strings. The re module raises an exception in such cases, which means
that users currently need a workaround.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com