Re: [Python-Dev] New regex module for 3.2?
On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray rdmur...@bitdance.comwrote: On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de wrote: Gregory P. Smith, 27.07.2010 07:40: A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow. With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%. Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. The same algorithm should be helpful in ElementTree's ElementPath module. We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well. That's three...time to figure out how to share the code? No doubt. Its already a standalone _shrink_cache function with unit tests that doesn't care the dictionaries it is shrinking are composed of. Easy enough to move somewhere more useful. Any proposed stdlib locations? I'll be offline on vacation soon so I may not get to it for a couple weeks but feel free to move it without me if anyone is interested. -gps ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Wed, Jul 28, 2010 at 4:50 PM, Gregory P. Smith g...@krypto.org wrote: On Tue, Jul 27, 2010 at 6:43 PM, R. David Murray rdmur...@bitdance.com wrote: On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de wrote: Gregory P. Smith, 27.07.2010 07:40: Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. The same algorithm should be helpful in ElementTree's ElementPath module. We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well. That's three...time to figure out how to share the code? No doubt. Anyone remember off the top of their head what linecache does? ... ah, it's just unbounded unless you call clearcache(). Probably OK for that particular application. Its already a standalone _shrink_cache function with unit tests that doesn't care the dictionaries it is shrinking are composed of. Easy enough to move somewhere more useful. Any proposed stdlib locations? I'll be offline on vacation soon so I may not get to it for a couple weeks but feel free to move it without me if anyone is interested. I created a tracker issue for the idea so we don't forget about it: http://bugs.python.org/issue9396 collections seems like a fairly obvious possibility, but the flavour doesn't quite seem right for that. Nowhere else springs to mind though. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Gregory P. Smith, 27.07.2010 07:40: A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow. With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%. Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. The same algorithm should be helpful in ElementTree's ElementPath module. Stefan ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Tue, 27 Jul 2010 08:27:35 +0200, Stefan Behnel stefan...@behnel.de wrote: Gregory P. Smith, 27.07.2010 07:40: A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow. With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%. Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. The same algorithm should be helpful in ElementTree's ElementPath module. We recently added the old re cache-clearing strategy to fnmatch, because previously its cache would grow indefinitely. It sounds like this should be applied there as well. That's three...time to figure out how to share the code? -- R. David Murray www.bitdance.com ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 22.07.2010 12:53, schrieb Guido van Rossum: On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl g.bra...@gmx.net wrote: Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex. I think with a new regex implementation, not all of this historical semantics must be copied, unless there are major real usecases, which would be affected by this. As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable. I wonder if the solution could be a new flag that you have to specify if you want the flags to apply locally only. Let's say add 'L' to the flags. Then for existing code (not using 'L') the flags should apply globally, while someone wanting flags to apply to only part of a regex can add 'L' to the flags syntax. Is that acceptable? I guess it would be; however it remains to be proven that this scoping is actually needed in addition to the (backwards compatible) grouped scoping. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, Jul 22, 2010 at 3:26 PM, Nick Coghlan ncogh...@gmail.com wrote: On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl g.bra...@gmx.net wrote: Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element. Yikes, I didn't know it did that. That certainly sounds like it should be an RFE in its own right - some basic form of Least Recently Used accounting should be beneficial (although the extra bookkeeping might hurt scripts that aren't hitting the cache limit). A max cache size of 100 was too small. I just increased it to 500 in the py3k branch along with implementing a random replacement cache overflow policy. It now randomly drops 20% of the compiled regular expression cache instead of simply dropping the entire cache on overflow. With the regex_v8 benchmark, the better cache replacement policy sped it up ~7% while raising the cache size on top of that (likely meaning the cache was never overflowing) sped it up ~25%. Random replacement without dropping everything at least means apps thrashing the cache degrade much more gracefully. http://svn.python.org/view?view=revrevision=83173 This change should be incorporated into MRAB's regex module in order to keep comparisons fair. -gps ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On 07/22/2010 01:34 PM, Georg Brandl wrote: Timings (seconds to run the test suite): re 26.689 26.015 26.008 regex 26.066 25.797 25.865 So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. This seems to point to re being significantly *faster* than regexp, even in matching, and as such may be something the author would want to look into. Nick writes: That still fits with the compile/match performance trade-off changes between re and regex though. The performance trade-off should make regex slower with sufficiently small compiled regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 23, 2010 at 8:16 PM, Hrvoje Niksic hrvoje.nik...@avl.com wrote: The performance trade-off should make regex slower with sufficiently small compiled regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache? That's not quite accurate. The figure that matters is matches per compilation, and that is ultimately governed by the nature of the application once the cache is sufficiently large (e.g. an application that compiles 10 different regular expressions, but also only matches each one against a single string is going to be slowed down significantly by a switch to regex no matter how big the compilation cache may be). The total runtime for a particular re matching exercise is (average compilation time)*(number of times compiled) + (average match time)*(number of matches performed) We know that the new regex module speeds up matching at the cost of slowing down compilation. Whether that is a net increase for the application as a whole depends on 4 things: X: the average speed change in compilation for the application Y: the average speed change in matching for the application A: the proportion of time the application currently spends compiling regular expressions B: the proportion of time the application currently spends matching regular expressions The overall proportional speed change for re operations in a given application is then just X*A + Y*B. For regex to be an improvement for that application, the resulting figure needs to be less than 1. For example, suppose X=2, Y=0.5 (i.e. matching is twice as fast on average, but compilation also takes twice as long), then the overall speed change would be 2*A + 0.5*B. For that to be a wash, the original application needs to spend 1/3 of its expression matching time compiling the expressions and 2/3 actually matching them (the application with the new engine would then spend 2/3 of its time compiling and only 1/3 matching). That's going to be a tough assessment to make in practice, but I think Georg's pygments test suite example shows that there is real world code out there that already spends a lot of time on compilation. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 23.07.2010 11:16, schrieb Hrvoje Niksic: On 07/22/2010 01:34 PM, Georg Brandl wrote: Timings (seconds to run the test suite): re 26.689 26.015 26.008 regex 26.066 25.797 25.865 So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. This seems to point to re being significantly *faster* than regexp, even in matching, and as such may be something the author would want to look into. Nick writes: That still fits with the compile/match performance trade-off changes between re and regex though. The performance trade-off should make regex slower with sufficiently small compiled regex cache, when a lot of time is wasted on compilation. But as the cache gets larger (and, for fairness, of the same size in both implementations), regex should outperform re. Georg, would you care to measure if there is a difference in performance with an even larger cache? I did measure that, and there are no significant differences in timing. I also did the check the other way around, and restricting regex._MAXCACHE to 100 I got from 26 seconds to 42 seconds. (Nick, is that enough data to calculate A and B now? ;) Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 13.07.2010 15:35, schrieb Antoine Pitrou: On Tue, 13 Jul 2010 15:20:23 +0100 Michael Foord fuzzy...@voidspace.org.uk wrote: On 13/07/2010 15:17, Reid Kleckner wrote: On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com wrote: MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects. If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me. To me as well. The code needs a full review before integrating, though. FWIW, I've now run the Pygments test suite (Pygments has about 2500 regular expressions that are exercised there) and only had two problems: * Scoped flags: A few lexers use (?s) and similar flags at the end of the expression, which has no effect in regex currently. * POSIX character classes: One regex used a class '[][:xyz]', so the [: was seen as the start of a character class. I'm not sure how common this is, as most people seem to escape brackets in character classes. Also, it gives a clear error on regex.compile(), not mysterious failures. Timings (seconds to run the test suite): re 26.689 26.015 26.008 regex 26.066 25.797 25.865 So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 16, 2010 at 6:08 PM, Georg Brandl g.bra...@gmx.net wrote: Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex. I think with a new regex implementation, not all of this historical semantics must be copied, unless there are major real usecases, which would be affected by this. As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable. I wonder if the solution could be a new flag that you have to specify if you want the flags to apply locally only. Let's say add 'L' to the flags. Then for existing code (not using 'L') the flags should apply globally, while someone wanting flags to apply to only part of a regex can add 'L' to the flags syntax. Is that acceptable? -- --Guido van Rossum (python.org/~guido) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote: So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. That still fits with the compile/match performance trade-off changes between re and regex though. It does make it clear this isn't going to be a win across the board though - things like test suites are going to have more one-off regex operations than a long-running web server or a filesystem or database scanning operation. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 22.07.2010 14:12, schrieb Nick Coghlan: On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote: So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. That still fits with the compile/match performance trade-off changes between re and regex though. It does make it clear this isn't going to be a win across the board though - things like test suites are going to have more one-off regex operations than a long-running web server or a filesystem or database scanning operation. Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, Jul 22, 2010 at 7:42 AM, Georg Brandl g.bra...@gmx.net wrote: Am 22.07.2010 14:12, schrieb Nick Coghlan: On Thu, Jul 22, 2010 at 9:34 PM, Georg Brandl g.bra...@gmx.net wrote: So, I thought there wasn't a difference in performance for this use case (which is compiling a lot of regexes and matching most of them only a few times in comparison). However, I found that looking at the regex caching is very important in this case: re._MAXCACHE is by default set to 100, and regex._MAXCACHE to 1024. When I set re._MAXCACHE to 1024 before running the test suite, I get times around 18 (!) seconds for re. It might be fun to do a pygments based macro benchmark. Just have it syntax highlight itself and time it. Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element. +50 for the last idea. Collin encountered a problem two summers ago in Mondrian where we were relying on the regex cache and were surprised to find that it cleared itself after filling up, rather than using LRU or random eviction. Reid ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 23, 2010 at 12:42 AM, Georg Brandl g.bra...@gmx.net wrote: Sure -- I don't think this is a showstopper for regex. However if we don't include regex in a future version, we might think about increasing MAXCACHE a bit, and maybe not clear the cache when it reaches its max length, but rather remove another element. Yikes, I didn't know it did that. That certainly sounds like it should be an RFE in its own right - some basic form of Least Recently Used accounting should be beneficial (although the extra bookkeeping might hurt scripts that aren't hitting the cache limit). Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
2010/7/9 Georg Brandl g.bra...@gmx.net: Am 09.07.2010 02:35, schrieb MRAB: 1. Some of the inline flags are scoped; for example, putting (?i) at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag. That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups. I just noticed the formulation on the reference page regular-expressions.info on this kind of flags: (?i) Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.) and likewise for other flags. http://www.regular-expressions.info/refadv.html I am not sure, how authoritative this page by Jan Goyvaerts is for various implementations, but it looks like a very comprehensive reference. I think with a new regex implementation, not all of this historical semantics must be copied, unless there are major real usecases, which would be affected by this. Just a thought; Vlastimil Brom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 16.07.2010 17:08, schrieb Vlastimil Brom: 2010/7/9 Georg Brandl g.bra...@gmx.net: Am 09.07.2010 02:35, schrieb MRAB: 1. Some of the inline flags are scoped; for example, putting (?i) at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag. That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups. I just noticed the formulation on the reference page regular-expressions.info on this kind of flags: (?i) Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.) and likewise for other flags. http://www.regular-expressions.info/refadv.html I am not sure, how authoritative this page by Jan Goyvaerts is for various implementations, but it looks like a very comprehensive reference. Nevertheless, the authoritative reference for our regex engine is its docs, i.e. http://docs.python.org/library/re.html -- and that states clearly that inline flags apply to the whole regex. I think with a new regex implementation, not all of this historical semantics must be copied, unless there are major real usecases, which would be affected by this. As I already said, I *have* seen this in real code. As MRAB indicated, this was the only silent change in semantics as compared to the old regex engine. If we replace re by regex, which I think is the only way to get the new features in the stdlib, changing this one aspect is a) not backwards compatible and b) in a subtle way that forces everyone to review his/her regular expressions. That's definitely not acceptable. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, Jul 8, 2010 at 12:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. How much interest would there be in putting it in Python 3.2? A big +1 from me. I'm all for putting it in Python 3.2 alphas and beta1 as a *replacement* for the existing 're' module in order to get it visibility and a shakeout and make this available to the world sooner rather than later. This module was designed for compatibility for exactly that reason. Lets give it a high publicity fair shake it iron out any unknown compatibility issues early on so that we can decide how to proceed. Also, +1 to Antoine's point on volunteering to be the maintainer of it for the next couple major releases. (I do expect someone to balk at this idea with but python 3.x doesn't have enough users yet for a beta cycle to shake out hidden regular expression engine assumption bugs to which I'll respond in advance with thats a good thing; we can add a regular expression sniffer to 2to3 that flags potential problems in code being ported over) -gps ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
2010/7/8 MRAB pyt...@mrabarnett.plus.com: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. How much interest would there be in putting it in Python 3.2? Hi, please, let me apologize for posting here, not being a python developer; I'd like to support the inclusion of the new regex library in the standard lib. I use it since the early development versions in my internal app for searching, modifying, ordering, extracting data from text - mainly using the manually created regex patterns. I see, that it is only one specific usecase, and the app isn't time or space critical (input texts up to a few MB, mostly smaller; the processing time is often rather negligible compared to the gui presentation, styling etc.) However, I see it as a great enhancement both in terms of regex features (listed on http://pypi.python.org/pypi/regex ) as well as the behaviour in some cornercases, which aren't effectively usable in the current re (e.g. nested subexpressions with quantifiers - while many of these are more effectively solved with the added possesive quantifiers). I think, this is a far more feature complete engine, which doesn't induce any significant drawbacks (IMO) comparing to the current re and is backward compatible. (The mentioned exception in the scoped flags might be fixable by allowing only explicit scoping (?flags)...(?-flags) or using the current paren, if possible (?flag:...) and treating the bare flag setting parens as global; however, I would consider it quite misleading in the current re, if these flags are set at some other place than at the beginning of the pattern. I don't see the readability enhanced in any way with these flags set at the end, while they should apply from the beginning of the pattern.) Having seen MRABs commitment in the development phase in both - bugfixes and feature additions - including some rather complex ones (in my opinion) like unicode properties, I am also confident, that he could be a competent maintainer of this package in the standardlib as well. just my biased opinion... Regards, Vlastimil Brom ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlan ncogh...@gmail.com wrote: MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects. Reid ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On 13/07/2010 15:17, Reid Kleckner wrote: On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com wrote: MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects. If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me. Michael Reid ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/fuzzyman%40voidspace.org.uk -- http://www.ironpythoninaction.com/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Tue, 13 Jul 2010 15:20:23 +0100 Michael Foord fuzzy...@voidspace.org.uk wrote: On 13/07/2010 15:17, Reid Kleckner wrote: On Mon, Jul 12, 2010 at 2:07 PM, Nick Coghlanncogh...@gmail.com wrote: MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. I feel like the new module is designed to replace the current re module, and shouldn't need to spend time in PyPI. A faster regex library isn't going to motivate users to add external dependencies to their projects. If the backwards compatibility issues can be addressed and MRAB is willing to remain as maintainer then the advantages seem well worth it to me. To me as well. The code needs a full review before integrating, though. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Aprano st...@pearwood.info wrote: On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On 12/07/2010 15:07, Nick Coghlan wrote: On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info wrote: On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer... Michael MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. Cheers, Nick. -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Mon, 12 Jul 2010 16:18:38 +0100 Michael Foord fuzzy...@voidspace.org.uk wrote: On 12/07/2010 15:07, Nick Coghlan wrote: On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info wrote: On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer... You're forgetting the maintenance cost. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Mon, 2010-07-12 at 16:18 +0100, Michael Foord wrote: On 12/07/2010 15:07, Nick Coghlan wrote: On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info wrote: re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer... I'm not sure how common those cases are - I played around with RE2 a few months ago and found that the majority of regular expressions in my own code were noticeably slower running under RE2 than python's re module - RE2 just puts much nicer theoretical bounds on cases that were (in my code at least) unusual. The really good use case for RE2 is for applications where users can write regular expressions as input (exactly what RE2 was designed for) - but I'm not sure how common those applications are. Tim Wintle ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Mon, Jul 12, 2010 at 8:18 AM, Michael Foord fuzzy...@voidspace.org.uk wrote: On 12/07/2010 15:07, Nick Coghlan wrote: On Mon, Jul 12, 2010 at 9:42 AM, Steven D'Apranost...@pearwood.info wrote: On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. re2 deliberately omits some features for efficiency reasons, hence is not even on the table as a possible replacement for the standard library version. If someone is in a position where re2 can solve their problems with the re module, they should also be in a position where they can track it down for themselves. If it has *partial* compatibility, and big enough performance improvements for common cases, it could perhaps be used where the regex doesn't use unsupported features. This would have some extra cost in the compile phase, but would mean Python could ship with two regex engines but only one interface exposed to the programmer... FWIW, this has all been discussed before: http://aspn.activestate.com/ASPN/Mail/Message/python-dev/3829265. In particular, I still believe that, it's not obvious that enough Python regexes would benefit from re2's performance/restrictions tradeoff to make such a hybrid system worthwhile in the long term. (There is no representative corpus of real-world Python regexes weighted for dynamic execution frequency to use in assessing such tradeoffs empirically like there is for JavaScript.) Collin MRAB's module offers a superset of re's features rather than a subset though, so once it has had more of a chance to bake on PyPI it may be worth another look. Cheers, Nick. -- http://www.ironpythoninaction.com/ http://www.voidspace.org.uk/blog READ CAREFULLY. By accepting and reading this email you agree, on behalf of your employer, to release me from all obligations and waivers arising from any and all NON-NEGOTIATED agreements, licenses, terms-of-service, shrinkwrap, clickwrap, browsewrap, confidentiality, non-disclosure, non-compete and acceptable use policies (”BOGUS AGREEMENTS”) that I have entered into with your employer, its partners, licensors, agents and assigns, in perpetuity, without prejudice to my ongoing rights and privileges. You further represent that you have the authority to release me from any BOGUS AGREEMENTS on behalf of your employer. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/collinwinter%40google.com ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin jyass...@gmail.com wrote: While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard. re2 comparison is interesting from the point of if it should be included in stdlib. -- anatoly t. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On 7/11/2010 5:19 AM, anatoly techtonik wrote: On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskinjyass...@gmail.com wrote: While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard. re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. Eric. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Sun, Jul 11, 2010 at 7:19 PM, anatoly techtonik techto...@gmail.com wrote: On Fri, Jul 9, 2010 at 6:59 PM, Jeffrey Yasskin jyass...@gmail.com wrote: While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard. re2 comparison is interesting from the point of if it should be included in stdlib. No it isn't. re2 is a wrapper for a third party engine that surrenders some functionality in the pursuit of better bounds on memory and CPU usage. It is not a drop-in replacement for re and cannot be by design: The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently. There is no reason to have two distinct regex engines in the standard library - if someone knows enough to realise they need the performance assurances of re2, they're also likely to be able to find the Python wrappers for it. regex is potentially interesting for the standard library as it *is* intended to be a drop-in replacement for re that trades longer compilation times (typically once per application) for faster match times (potentially many times per application). The performance of re2 has nothing to do with the comparison between the current re module and MRAB's regex module. Regards, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. -- Steven D'Aprano ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Sun, Jul 11, 2010 at 7:42 PM, Steven D'Aprano st...@pearwood.info wrote: On Sun, 11 Jul 2010 09:37:22 pm Eric Smith wrote: re2 comparison is interesting from the point of if it should be included in stdlib. Is it re2 or regex? I don't see having 2 regular expression engines in the stdlib. There's precedence though... the old regex engine and the new re engine were side-by-side for many years before regex was deprecated and finally removed in 2.5. Hypothetically, re2 could similarly be added to the standard library while re is deprecated. While I realize I'm not the target audience for this, there are a lot of things I'd like to see more in the stdlib than a second re engine. Geremy Condra ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Am 09.07.2010 02:35, schrieb MRAB: That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as re.py. Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible. I had to recompile the .pyd to change its internal name from regex to re, but apart from that it passed Python's own test suite except for where I expected it to fail: 1. Some of the inline flags are scoped; for example, putting (?i) at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag. That is problematic. I've often seen people put these flags at the end of a regex, probably for readability purposes. IMHO it would be better to limit flag scoping to the explicit (?flags-flags: ) groups. 2. The .sub method will treat unmatched groups in an expansion as empty strings. The re module raises an exception in such cases, which means that users currently need a workaround. That is a change for the better, I would say. Georg -- Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ -- anatoly t. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 7:06 AM, anatoly techtonik techto...@gmail.com wrote: On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ While the re2 comparison might be interesting from an abstract standpoint, it intentionally supports a different regex language from Python so that it can run faster and use less memory. Since re2 can never replace Python's re module, it doesn't make sense to hold MRAB's new module to that standard. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
anatoly techtonik wrote: On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages: Python26 BENCHMARKre regex ratio bm_regex_effbot 0.135secs 0.083secs 1.63 bm_regex_v8 0.153secs 0.085secs 1.80 Python31 BENCHMARKre regex ratio bm_regex_effbot 0.138secs 0.083secs 1.66 bm_regex_v8 0.170secs 0.091secs 1.87 ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 10:28 AM, MRAB pyt...@mrabarnett.plus.com wrote: anatoly techtonik wrote: On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages: Python26 BENCHMARK re regex ratio bm_regex_effbot 0.135secs 0.083secs 1.63 bm_regex_v8 0.153secs 0.085secs 1.80 Python31 BENCHMARK re regex ratio bm_regex_effbot 0.138secs 0.083secs 1.66 bm_regex_v8 0.170secs 0.091secs 1.87 Out of curiosity, what are the results for the bm_regex_compile benchmark? Collin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Collin Winter wrote: On Fri, Jul 9, 2010 at 10:28 AM, MRAB pyt...@mrabarnett.plus.com wrote: anatoly techtonik wrote: On Thu, Jul 8, 2010 at 10:52 PM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. And where are the benchmarks? In particular it would be interesting to see it compared both to re from stdlib and re2 from http://code.google.com/p/re2/ The benchmarks bm_regex_effbot.py and bm_regex_v8.py both perform multiple runs of the tests multiple times, giving just the total times for each set. Here are the averages: Python26 BENCHMARKre regex ratio bm_regex_effbot 0.135secs 0.083secs 1.63 bm_regex_v8 0.153secs 0.085secs 1.80 Python31 BENCHMARKre regex ratio bm_regex_effbot 0.138secs 0.083secs 1.66 bm_regex_v8 0.170secs 0.091secs 1.87 Out of curiosity, what are the results for the bm_regex_compile benchmark? I concentrated my efforts on the matching speed because regexes tend to be compiled only once, and are cached anyway, so I don't think it's as important. The results are: Python26 BENCHMARK re regex ratio bm_regex_compile 0.897secs 2.792secs 0.32 Python31 BENCHMARK re regex ratio bm_regex_compile 0.902secs 2.731secs 0.33 If anyone can demonstrate that it'll have a significant impact in practice then I will, of course, look into it further. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 3:35 PM, MRAB pyt...@mrabarnett.plus.com wrote: I concentrated my efforts on the matching speed because regexes tend to be compiled only once, and are cached anyway, so I don't think it's as important. I think most here will agree with that, but it might be good to keep in mind that the sre implementation has already made that trade-off once. I don't remember what the compilation slowdown was at the time, but I'd be surprised if Google can't find it, given sufficient fu. -Fred -- Fred L. Drake, Jr.fdrake at gmail.com A storm broke loose in my mind. --Albert Einstein ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
H as long as we aren't the ones writing the check:) BJ --Original Message-- From: Fred Drake fdr...@acm.org Sent: Fri, July 09, 2010 1:16 PM To: MRAB pyt...@mrabarnett.plus.com Cc: Python-Dev python-dev@python.org Subject: Re: [Python-Dev] New regex module for 3.2? On Fri, Jul 9, 2010 at 3:35 PM, MRAB pyt...@mrabarnett.plus.com wrote: I concentrated my efforts on the matching speed because regexes tend to be compiled only once, and are cached anyway, so I don't think it's as important. I think most here will agree with that, but it might be good to keep in mind that the sre implementation has already made that trade-off once. I don't remember what the compilation slowdown was at the time, but I'd be surprised if Google can't find it, given sufficient fu. -Fred -- Fred L. Drake, Jr.fdrake at gmail.com A storm broke loose in my mind. --Albert Einstein ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/bj.wahl%40oracle.com ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 5:52 AM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. How much interest would there be in putting it in Python 3.2? The list of fixed bugs/new features is certainly impressive. How does Python's test suite go if you drop it in place of the current re module? (Ditto for test suites of major applications and frameworks like Django, etc). Off the top of my head, I would say that this won't have enough time to bake properly for inclusion in 3.2, but if the potential benefits and intended backwards compatibility are borne out by real world usage and the code fares well under review then it may be a contender for 3.3. If the backwards compatibility isn't quite there (and can't be improved), then adding it under a name other than re wouldn't be impossible, but it would be a harder sell. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
2010/7/8 MRAB pyt...@mrabarnett.plus.com: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. How much interest would there be in putting it in Python 3.2? It would really be nice if you explained incrementally everything you changed, so it could better be evaluated. -- Regards, Benjamin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Thu, 08 Jul 2010 20:52:44 +0100 MRAB pyt...@mrabarnett.plus.com wrote: I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. Can you publish these benchmarks somewhere? (or send them here) How much interest would there be in putting it in Python 3.2? I think there's certainly interest (especially given that the original re module doesn't really have an expert and active maintainer). Since it's a very big change (and a rather annoying to undo one), though, it must really not add any maintenance problems, and ideally you should promise to maintain it at least for a couple of years. Bonus points if the internals are sufficiently documented, too. Thanks Antoine. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Nick Coghlan wrote: On Fri, Jul 9, 2010 at 5:52 AM, MRAB pyt...@mrabarnett.plus.com wrote: Hi all, I re-implemented the re module, adding new features and speed improvements. It's available at: http://pypi.python.org/pypi/regex under the name regex so that it can be tried alongside re. I'd be interested in any comments or feedback. How does it compare with re in terms of speed on real-world data? The benchmarks suggest it should be faster, or at worst comparable. How much interest would there be in putting it in Python 3.2? The list of fixed bugs/new features is certainly impressive. How does Python's test suite go if you drop it in place of the current re module? (Ditto for test suites of major applications and frameworks like Django, etc). Off the top of my head, I would say that this won't have enough time to bake properly for inclusion in 3.2, but if the potential benefits and intended backwards compatibility are borne out by real world usage and the code fares well under review then it may be a contender for 3.3. If the backwards compatibility isn't quite there (and can't be improved), then adding it under a name other than re wouldn't be impossible, but it would be a harder sell. You should be able to replace: import re with: import regex as re and still have everything work the same, ie it's backwards compatible with re. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
On Fri, Jul 9, 2010 at 7:54 AM, MRAB pyt...@mrabarnett.plus.com wrote: You should be able to replace: import re with: import regex as re and still have everything work the same, ie it's backwards compatible with re. That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as re.py. Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible. Cheers, Nick. -- Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] New regex module for 3.2?
Nick Coghlan wrote: On Fri, Jul 9, 2010 at 7:54 AM, MRAB pyt...@mrabarnett.plus.com wrote: You should be able to replace: import re with: import regex as re and still have everything work the same, ie it's backwards compatible with re. That's not what I'm asking. I'm asking what happens if you take an existing Python installation's re module, move it aside, and drop regex in its place as re.py. Doing that and then running Python's own test suite as well as the test suites of major Python applications and frameworks like Twisted, Zope and Django would provide solid evidence that the new version really *is* backwards compatible, rather than that it is *meant* to be backwards compatible. I had to recompile the .pyd to change its internal name from regex to re, but apart from that it passed Python's own test suite except for where I expected it to fail: 1. Some of the inline flags are scoped; for example, putting (?i) at the end of a regex will now have no effect because it's no longer a global, all-or-nothing, flag. 2. The .sub method will treat unmatched groups in an expansion as empty strings. The re module raises an exception in such cases, which means that users currently need a workaround. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com