Jacques Grove <aquara...@gmail.com> added the comment:

More an observation than a bug:

I understand that we're trading memory for performance, but I've noticed that 
the peak memory usage is rather high, e.g.:

$ cat test.py
import os
import regex as re

def resident():
    for line in open('/proc/%d/status' % os.getpid(), 'r').readlines():
        if line.startswith("VmRSS:"):
            return line.split(":")[-1].strip()

cache = {}

print resident()
for i in xrange(0,1000):
    cache[i] = re.compile(str(i) +
        "(abcd12kl|efghlajsdf|ijkllakjsdf|mnoplasjdf|qrstljasd|sdajdwxyzlasjdf|kajsdfjkasdjkf|kasdflkasjdflkajsd|klasdfljasdf)")

print resident()
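
As an aside (not part of the report itself): parsing /proc/<pid>/status only
works on Linux. A more portable sketch of the same measurement uses the stdlib
resource module; note that ru_maxrss reports the *peak* RSS rather than the
current one, and its unit differs by platform (kB on Linux, bytes on macOS):

```python
import resource
import sys

def peak_resident_kb():
    """Return peak resident set size in kB (rough analogue of peak VmRSS)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kB on Linux but in bytes on macOS
    if sys.platform == 'darwin':
        return usage.ru_maxrss // 1024
    return usage.ru_maxrss

print(peak_resident_kb())
```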


Execution output on my machine (Linux x86_64, Python 2.6.5):
4328 kB
32052 kB

with the standard regex library:
3688 kB
5428 kB

So the regex module costs roughly (32052 - 4328) / 1000 ≈ 27.7 kB per pattern,
vs (5428 - 3688) / 1000 ≈ 1.7 kB for the standard re module — around 16x the
memory per pattern.

Now, the example is pretty silly, and the difference is even larger for more 
complex regexes.  I also understand that once the patterns are GC-ed, Python 
can reuse the memory (pymalloc doesn't return it to the OS, unfortunately).  
However, I have some applications that use large numbers (many thousands) of 
regexes and need to keep them cached (compiled) indefinitely, especially 
because compilation is expensive.  This causes some pain (long story).
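
One way to soften that pain (a sketch of my own, not something the regex module
provides; LazyPattern is a hypothetical helper name) is to defer compilation
until a pattern is first used, so only the patterns actually exercised pay the
per-pattern memory cost:

```python
import re  # could equally be 'import regex as re'

class LazyPattern(object):
    """Wrap a pattern string; compile on first use and cache the result."""
    def __init__(self, pattern):
        self._pattern = pattern
        self._compiled = None

    def _get(self):
        # Compile lazily, so unused patterns never consume compiled-form memory
        if self._compiled is None:
            self._compiled = re.compile(self._pattern)
        return self._compiled

    def match(self, text):
        return self._get().match(text)

    def search(self, text):
        return self._get().search(text)

cache = dict((i, LazyPattern(str(i) + "(abcd|efgh)")) for i in range(1000))
print(cache[42].match("42abcd") is not None)  # prints True
```

This trades the first-use latency of each pattern for a much smaller resident
footprint when only a fraction of the cached patterns are hit.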

I've played around with increasing RE_MIN_FAST_LENGTH, and it makes a 
significant difference, e.g.:

RE_MIN_FAST_LENGTH = 10:
4324 kB
25976 kB

In my use-cases, having a larger RE_MIN_FAST_LENGTH doesn't make a huge 
performance difference, so that might be the way I'll go.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue2636>
_______________________________________