[issue8064] Large regex handling very slow on Linux

2010-03-05 Thread Michael Foord
Michael Foord added the comment: Interestingly, the code olivers is using was originally written by Martin v. Loewis: http://www.velocityreviews.com/forums/t646421-unicode-regex-and-hindi-language.html In response to a still open bug report on \w in the Python re module: http://bugs.python.o

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Ezio Melotti
Ezio Melotti added the comment: This is a proof that you can have an equivalent regex without including all the 'letter chars' (tested on both narrow and wide builds): >>> s = u''.join(unichr(c) for c in range(sys.maxunicode)) >>> diff = set(re.findall(u'[^\W\d]', s, re.U)) ^ set(re.findall(u'[

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Ezio Melotti
Ezio Melotti added the comment: A workaround could be using [^\W\d], but this includes some extra chars in the categories Pc, Nl, and No that maybe you don't want. Generating a list of the chars in these 3 categories and adding them to the regex should be cheaper, though. Since this is not a bug of Pyth
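
A minimal sketch of the workaround described above, written for Python 3 rather than the Python 2 used in the thread (so `chr` instead of `unichr`, and no `re.U` flag). The helper names `extra_chars` and `letters_only_pattern` are hypothetical, and scanning only the BMP by default is an assumption made to keep the example quick. It lists the non-letter characters that [^\W\d] accepts, grouped by Unicode category, and then builds a pattern that rejects them with a negative lookahead.

```python
import re
import unicodedata

# The workaround pattern from the comment above: word chars minus digits.
WORD_NOT_DIGIT = re.compile(r'[^\W\d]')


def extra_chars(limit=0x10000):
    """Map Unicode category -> non-letter chars accepted by WORD_NOT_DIGIT.

    `limit` is an assumed scan bound (BMP only by default).
    """
    extras = {}
    for cp in range(limit):
        ch = chr(cp)
        if WORD_NOT_DIGIT.match(ch):
            cat = unicodedata.category(ch)
            if not cat.startswith('L'):  # keep only the non-letter extras
                extras.setdefault(cat, []).append(ch)
    return extras


def letters_only_pattern(limit=0x10000):
    """Compile a pattern matching WORD_NOT_DIGIT minus the extras found above."""
    unwanted = ''.join(ch for chars in extra_chars(limit).values() for ch in chars)
    # Negative lookahead rejects the unwanted chars before the broad class runs.
    return re.compile(r'(?![' + re.escape(unwanted) + r'])[^\W\d]')


if __name__ == '__main__':
    for cat, chars in sorted(extra_chars().items()):
        print(cat, len(chars))
```

The list of excluded characters is a few hundred entries at most, which is much smaller than a class enumerating every letter, so the resulting pattern stays compact.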

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Jean-Paul Calderone
Jean-Paul Calderone added the comment: > So is it reasonable / unavoidable that UCS4 builds should be 1200 times > slower at regex handling? No, but it's probably reasonable / unavoidable that a more complex regex should be some number of times slower than a simpler regex. On Linux, the rege
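
To get a feel for how pattern complexity affects matching time, the following is a rough sketch in Python 3. It is not the attached regextest.py, whose contents are not shown here. The function names, the sample text, and the BMP-only scan bound are all assumptions for illustration. It builds a character class that enumerates every letter explicitly and times it against the short [^\W\d] form on the same input.

```python
import re
import timeit
import unicodedata


def explicit_letter_class(limit=0x10000):
    """Build a '[...]' class listing every letter below `limit` one by one."""
    letters = [
        chr(cp) for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith('L')
    ]
    return '[' + ''.join(letters) + ']'


def compare(text, repeat=100):
    """Return (matches_big, matches_small, secs_big, secs_small) for `text`."""
    big = re.compile(explicit_letter_class())   # large, explicit pattern
    small = re.compile(r'[^\W\d]')              # compact near-equivalent
    secs_big = timeit.timeit(lambda: big.findall(text), number=repeat)
    secs_small = timeit.timeit(lambda: small.findall(text), number=repeat)
    return big.findall(text), small.findall(text), secs_big, secs_small


if __name__ == '__main__':
    sample = 'hello world 123 \u0928\u092e\u0938\u094d\u0924\u0947 ' * 200
    found_big, found_small, t_big, t_small = compare(sample)
    print('same matches:', found_big == found_small)
    print('explicit class: %.4f s, compact class: %.4f s' % (t_big, t_small))
```

The absolute numbers depend on the Python version and build, so they are only useful for comparing the two patterns on the same interpreter.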

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Michael Foord
Michael Foord added the comment: So is it reasonable / unavoidable that UCS4 builds should be 1200 times slower at regex handling?

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread STINNER Victor
STINNER Victor added the comment: Ooops, my benchmark was wrong. It looks like the result depends on sys.maxunicode: $ python2.4 -c "import sys; print sys.maxunicode" 1114111 $ python2.5 -c "import sys; print sys.maxunicode" 1114111 $ python2.6 -c "import sys; print sys.maxunicode" 1114111 $ ./pyt

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread STINNER Victor
STINNER Victor added the comment: Results on Linux (Debian Sid) with different Python versions: * Python 2.4.6: 14112.8 ms * Python 2.5.5: 14246.7 ms * Python 2.6.4+: 14753.4 ms * Python trunk (2.7a3+): 69.3 ms It looks like the re engine was optimized in trunk :-) Note: I replaced stopwatch b

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Jean-Paul Calderone
Jean-Paul Calderone added the comment: I think it's likely that the test program does drastically different things on Linux than it does on OS X: Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15) [GCC 4.4.1] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> im

[issue8064] Large regex handling very slow on Linux

2010-03-04 Thread Oliver Sturm
New submission from Oliver Sturm : The code in regextest.py (attached) uses a large regex to analyze a piece of text. I have tried this test program on two Macs, using the standard Python distributions. On a MacBook, 2.4 GHz dual core, Snow Leopard with Python 2.6.1, it takes 0.08 seconds On