Re: regex over files

2005-04-29 Thread Peter Otten
Robin Becker wrote:

 #sscan1.py thanks to Skip
 import sys, time, mmap, os, re
 fn = sys.argv[1]
 fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
 s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
 l=n=0
 t0 = time.time()
 for mat in re.split('X', s):

re.split() returns a list, not a generator, and this list may consume a lot
of memory.

     n += 1
     l += len(mat)
 t1 = time.time()

 print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

I wrote a generator replacement for re.split(), but as you might expect, the
performance is nowhere near re.split(). For your large data it might help
somewhat because of its smaller memory footprint.

def splititer(regex, data):
    # like re.split(), but never yields the separators.
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]
        start = new_start
    yield data[start:]

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-29 Thread Robin Becker
Peter Otten wrote:
Robin Becker wrote:

#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split('X', s):

re.split() returns a list, not a generator, and this list may consume a lot
of memory.

. that would certainly be the case and may answer why the simple way is so
bad for larger memory. I'll have a go at this experiment as well. My original
intention was to find the start of each match as a scanner would and this would
certainly do that. However, my observation with the trivial byte scan would seem
to imply that just scanning the file causes vm problems (at least in xp). I suppose
it's hard to explain to the os that I actually only need the relevant few pages.
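
(For reference: the mmap module of this vintage offers no way to pass such a hint, but a
modern POSIX Python (3.8 or later; not on Windows) exposes madvise(), which is roughly the
knob being asked for here. A minimal sketch, with an illustrative file name and pattern:

    import mmap, os, re

    fh = os.open("xxx_200mb.dat", os.O_RDONLY)
    s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
    if hasattr(s, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        # hint: one sequential pass, so read ahead and drop pages behind the scan sooner
        s.madvise(mmap.MADV_SEQUENTIAL)
    n = 0
    for m in re.finditer(b"X", s):
        n += 1
    print(n)

Whether the VM actually honours the hint is still up to the OS.)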
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-29 Thread Robin Becker
Peter Otten wrote:
Robin Becker wrote:

#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split('X', s):

re.split() returns a list, not a generator, and this list may consume a lot
of memory.

    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

I wrote a generator replacement for re.split(), but as you might expect, the
performance is nowhere near re.split(). For your large data it might help
somewhat because of its smaller memory footprint.
def splititer(regex, data):
    # like re.split(), but never yields the separators.
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]
        start = new_start
    yield data[start:]
Peter
OK now the split scan times are much more comparable for 200Mb (which is what I 
have freely available according to taskmanager), but things start getting bad 
for 300Mb.

C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=23.05
C:\code\reportlab\demos\gadflypaper\tmp\sscan2.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=27.63
C:\code\reportlab\demos\gadflypaper\tmp\sscan2.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=28.13
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=22.66
C:\code\reportlab\demos\gadflypaper\tmp\sscan2.py xxx_300mb.dat
fn=xxx_300mb.dat n=5696206 l=271519105 time=45.45
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_300mb.dat
fn=xxx_300mb.dat n=5696206 l=271519105 time=32.14
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_300mb.dat
fn=xxx_300mb.dat n=5696206 l=271519105 time=33.17
C:\code\reportlab\demos\gadflypaper\tmp\sscan2.py xxx_300mb.dat
fn=xxx_300mb.dat n=5696206 l=271519105 time=45.27
here sscan0.py is Bengt's adaptive buffer splitter and sscan2.py is Peter's 
generator splitter.
C:\code\reportlab\demos\gadflypaper>cat \tmp\sscan0.py
import sys, time, re
fn = sys.argv[1]
rxo = re.compile('X')

def frxsplit(path, rxo, chunksize=4096):
    buffer = ''
    for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)),''):
        buffer += chunk
        pieces = rxo.split(buffer)
        for piece in pieces[:-1]: yield piece
        buffer = pieces[-1]
    yield buffer

l=n=0
t0 = time.time()
for mat in frxsplit(fn,rxo):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
C:\code\reportlab\demos\gadflypaper>cat \tmp\sscan2.py
import sys, time, mmap, os, re
def splititer(regex, data):
    # like re.split(), but never yields the separators.
    if not hasattr(regex, "finditer"):
        regex = re.compile(regex)
    start = 0
    for match in regex.finditer(data):
        end, new_start = match.span()
        yield data[start:end]
        start = new_start
    yield data[start:]

fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in splititer('X', s):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote:
..
I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
some systems files opened for reading are mmap'd under the covers.  I'm sure
it's highly platform-dependent.  (My results on MacOSX - see below - are
somewhat better.)
I'll have a go at doing the experiment on some other platforms I have 
available. The problem is certainly paging related. Perhaps the fact 
that we don't need to write dirty pages is moot when the system is 
actually writing out other processes' pages to make room for the 
incoming ones needed by the cpu hog. I do know that I cannot control 
that in detail. Also it's entirely possible that file caching/readahead 
etc etc can skew the results.

All my old compiler texts recommend the buffered read approach, but that 
might be because mmap etc weren't around. Perhaps some compiler expert 
can say? Also I suspect that in a low level language the minor overhead 
caused by the book keeping is lower than that for the paging code.

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:
..
I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file.  I then ran the above two programs against it:
% python tscan1.py splitX
n=2112001 time=8.88
% python tscan0.py splitX
n=2139845 time=10.26
So the mmap'd version is within 15% of the performance of the buffered read
version and we don't have to solve the problem of any corner cases (note the
different values of n).  I'm happy to take the extra runtime in exchange for
simpler code.
Skip
I will have a go at repeating this on my system. Perhaps with Bengt's 
code in the buffered case as that would be more realistic.

It has been my experience that all systems crawl when driven into the 
swapping region and some users of our code seem anxious to run huge 
print jobs.
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Skip Montanaro

Bengt To be fairer, I think you'd want to hoist the re compilation out
Bengt of the loop.

The re module compiles and caches regular expressions, so I doubt it would
affect the runtime of either version.
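
A rough way to check that claim (modern Python, made-up data and repeat counts, numbers not
from this thread) is to time a string pattern, which is compiled once and then served from
re's internal cache, against an explicitly precompiled one:

    import re, timeit

    data = "spam X eggs X ham X " * 10000
    rxo = re.compile("X")

    t_string   = timeit.timeit(lambda: re.split("X", data), number=200)
    t_compiled = timeit.timeit(lambda: rxo.split(data), number=200)
    print("string pattern %.3fs, precompiled %.3fs" % (t_string, t_compiled))

The only extra cost of the string-pattern form is the per-call cache lookup, which is tiny
next to the split itself.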

Bengt But also to be fairer, maybe include the overhead of splitting
Bengt correctly, at least for the simple case regex in my example -- or
Bengt is a you-goofed post for me in the usenet forwarding queues
Bengt somewhere still? ;-)

I was just too lazy to incorporate (something like) your change.  You will
note that I was also lazy enough to simply steal your X file. wink

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote:
...
I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
some systems files opened for reading are mmap'd under the covers.  I'm sure
it's highly platform-dependent.  (My results on MacOSX - see below - are
somewhat better.)
Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:
I'm sure my results are dependent on something other than the coding
style; I suspect file/disk caching and paging operate here. Note that we
now agree on total match length and split count. However, when the 
windows VM goes into paging mode the mmap thing falls off the world as I 
would expect for a thrashing system.

eg small memory (relatively)
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=3.55
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=8.25
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=9.77
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=5.09
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=6.17
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=4.64
and large memory
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=20.16
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=136.42
At the end of this run I had to wait quite a long time for other things 
to become responsive (ie things were entirely paged out).

as another data point with sscan0/1.py (slight mods of your code) I get 
this with a 200mb file on freeBSD 4.9

/usr/RL_HOME/users/robin/sstest:
$ python sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=7.37
/usr/RL_HOME/users/robin/sstest:
$ python sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=129.65
/usr/RL_HOME/users/robin/sstest:
ie the freeBSD vm seems to thrash just as nastily as xp :(

Here I've implemented slightly modified versions of the scanners that 
you put forward.

eg
#sscan0.py thanks to Bengt
import sys, time, re
fn = sys.argv[1]
rxo = re.compile('X')
def frxsplit(path, rxo, chunksize=4096):
    buffer = ''
    for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)),''):
        buffer += chunk
        pieces = rxo.split(buffer)
        for piece in pieces[:-1]: yield piece
        buffer = pieces[-1]
    yield buffer

l=n=0
t0 = time.time()
for mat in frxsplit(fn,rxo):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split('X', s):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote:


 As you try to understand mmap, make sure your mental model can take into
 account the fact that it is easy and quite common to mmap a file several
 times larger than your physical memory, and it does not even *try* to read
 the whole thing in at any given time. You may benefit from
 reviewing/studying the difference between virtual memory and physical
 memory.
I've been using vm systems for 30 years and I suspect my mental model is a bit
decrepit. However, as convincingly demonstrated by testing, my mental model seems
able to predict low memory problems. When systems run out of memory they tend to
perform poorly. I'm not sure the horrible degradation I see with large files is
necessary, but I know it occurs on at least two common vm implementations.
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Robin Becker wrote:
Skip Montanaro wrote:
..
I'm not sure why the mmap() solution is so much slower for you.  
Perhaps on
some systems files opened for reading are mmap'd under the covers.  
I'm sure
it's highly platform-dependent.  (My results on MacOSX - see below - are
somewhat better.)

. as a data point with sscan0/1.py (slight mods of your code) I get this 
with a 200mb file on freeBSD 4.9

/usr/RL_HOME/users/robin/sstest:
$ python sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=7.37
/usr/RL_HOME/users/robin/sstest:
$ python sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=129.65
/usr/RL_HOME/users/robin/sstest:
ie the freeBSD vm seems to thrash just as nastily as xp :(
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote:
.
As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may benefit from
reviewing/studying the difference between virtual memory and physical
memory.
I've been using vm systems for 30 years and I suspect my mental model is a bit 
decrepit. However, as convincingly demonstrated by testing, my mental model seems 
able to predict low memory problems. When systems run out of memory they tend to 
perform poorly. I'm not sure the horrible degradation I see with large files is 
necessary, but I know it occurs on at least one common vm implementation.
--
Robin Becker

--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote:
.

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:
.
Skip
I'm sure my results are dependent on something other than the coding style;
I suspect file/disk caching and paging operate here. Note that we now agree on
total match length and split count. However, when the windows VM goes into 
paging mode the mmap thing falls off the world as I would expect for a thrashing system.

eg small memory (relatively)
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=3.55
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=8.25
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=9.77
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=5.09
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=6.17
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_100mb.dat
fn=xxx_100mb.dat n=1898737 l=90506416 time=4.64
and large memory
C:\code\reportlab\demos\gadflypaper\tmp\sscan0.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=20.16
C:\code\reportlab\demos\gadflypaper\tmp\sscan1.py xxx_200mb.dat
fn=xxx_200mb.dat n=3797470 l=181012689 time=136.42
At the end of this run I had to wait quite a long time for other things to 
become responsive (ie things were entirely paged out).

Here I've implemented slightly modified versions of the scanners that you put 
forward.

eg
#sscan0.py thanks to Bengt
import sys, time, re
fn = sys.argv[1]
rxo = re.compile('X')
def frxsplit(path, rxo, chunksize=4096):
    buffer = ''
    for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)),''):
        buffer += chunk
        pieces = rxo.split(buffer)
        for piece in pieces[:-1]: yield piece
        buffer = pieces[-1]
    yield buffer

l=n=0
t0 = time.time()
for mat in frxsplit(fn,rxo):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
l=n=0
t0 = time.time()
for mat in re.split('X', s):
    n += 1
    l += len(mat)
t1 = time.time()
print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-28 Thread Bengt Richter
On Thu, 28 Apr 2005 20:35:43 +, Robin Becker [EMAIL PROTECTED] wrote:

Jeremy Bowers wrote:

 
  As you try to understand mmap, make sure your mental model can take into
  account the fact that it is easy and quite common to mmap a file several
  times larger than your physical memory, and it does not even *try* to read
  the whole thing in at any given time. You may benefit from
  reviewing/studying the difference between virtual memory and physical
  memory.
I've been using vm systems for 30 years and I suspect my mental model is a bit
decrepit. However, as convincingly demonstrated by testing, my mental model seems
able to predict low memory problems. When systems run out of memory they tend to
perform poorly. I'm not sure the horrible degradation I see with large files is
necessary, but I know it occurs on at least two common vm implementations.

It's interesting. One could envisage an mmap that would have a parameter for its own
lru working set max page count, so mmap would only displace up to that many
pages from normal system paged-in file data. Then you could induce extra
reads by referring back to abandoned mmap-lru pages, but you wouldn't be
displacing anything else, and if you were moving sequentially and staying within
your page residency count allotment, things would work like the best of both worlds
(assuming this idea doesn't have a delusion-busting gotcha lurking ;-) 
But this kind of partitioning of VM lru logic would take some kernel changes IWT.

IIRC, don't mmap VM access ideas date back to multics at least?
Anyway, what with fancy controllers as well as fancy file systems and kernels,
it's easy to get hard-to-interpret results, but your large-file examples seem
pretty conclusive.

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-27 Thread Robin Becker
Jeremy Bowers wrote:
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote:

Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory.  As I access
bits of the file's contents, it will page in only what's necessary.  If I
mmap() a huge file, then print out a few bytes from the middle, only the
page containing the interesting bytes is actually copied into physical
memory.

my simple rather stupid experiment indicates that windows mmap at least 
will reserve 25Mb of paged file for a linear scan through a 25Mb file. I 
probably only need 4096b to scan. That's a lot less than even the page 
table requirement. This isn't rocket science just an old style observation.

Are you trying to claim Skip is wrong, or what? There's little value in
saying that by mapping a file of 25MB into VM pages, you've increased your
allocated paged file space by 25MB. That's effectively tautological. 

If you are trying to claim Skip is wrong, you *do not understand* what you
are talking about. Talk less, listen and study more. (This is my best
guess, as like I said, observing that allocating things increases the
number of things that are allocated isn't worth posting so my thought is
you think you are proving something. If you really are just posting
something tautological, my apologies and disregard this paragraph but,
well, it's certainly not out of line at this point.)
Well, I obviously don't understand, so perhaps you can explain these results.
I implemented a simple scanning algorithm in two ways. First buffered scan 
tscan0.py; second mmapped scan tscan1.py.

For small file sizes the times are comparable.
C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py bingo.pdf
len=27916653 w=103 time=22.13
C:\code\reportlab\demos\gadflypaper\tmp\tscan1.py bingo.pdf
len=27916653 w=103 time=22.20
For large file sizes, when paging becomes of interest, buffered scan wins even
though it has to do a lot more python statements. If this were coded in C the
results would be plainer still. As I said, this isn't about right or wrong; it's
an observation. If I inspect the performance monitor, tscan0 is at 100%, but
tscan1 is at 80-90% and all of memory gets used up, so paging is important. This
may be an effect of the poor design of xp; if so, perhaps it won't hold for other
os's.

C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat
len=139583265 w=103 time=110.91
C:\code\reportlab\demos\gadflypaper\tmp\tscan1.py dingo.dat
len=139583265 w=103 time=140.53
C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan0.py
import sys, time
fn = sys.argv[1]
f=open(fn,'rb')
n=0
w=0
t0 = time.time()
while 1:
    buf = f.read(4096)
    lb = len(buf)
    if not lb: break
    n += lb
    for i in xrange(lb):
        w ^= ord(buf[i])
t1 = time.time()
print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))
C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan1.py
import sys, time, mmap, os
fn = sys.argv[1]
fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
n=len(s)
w=0
t0 = time.time()
for i in xrange(n):
    w ^= ord(s[i])
t1 = time.time()
print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))

--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-27 Thread Skip Montanaro

Robin I implemented a simple scanning algorithm in two ways. First buffered scan
Robin tscan0.py; second mmapped scan tscan1.py.

...

Robin C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat
Robin len=139583265 w=103 time=110.91

Robin C:\code\reportlab\demos\gadflypaper\tmp\tscan1.py dingo.dat
Robin len=139583265 w=103 time=140.53

I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
some systems files opened for reading are mmap'd under the covers.  I'm sure
it's highly platform-dependent.  (My results on MacOSX - see below - are
somewhat better.)

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:

tscan0.py:

import sys, time, re
fn = sys.argv[1]
f=open(fn,'rb')
n=0
t0 = time.time()
while 1:
    buf = f.read(4096)
    if not buf: break
    for i in re.split("X", buf):
        n += 1
t1 = time.time()

print "n=%d time=%.2f" % (n, (t1-t0))

tscan1.py:

import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
t0 = time.time()
n = 0
for mat in re.split('X', s):
    n += 1
t1 = time.time()

print "n=%d time=%.2f" % (n, (t1-t0))

The mmap version is almost obviously correct, assuming what we want to do is
split the file on X.  The buffered read version is almost certainly
incorrect, given our understanding that corner cases lurk at buffer
boundaries.
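
A toy illustration of that corner case (made-up data, not from the thread): a piece that
happens to straddle a read boundary is reported as two pieces by the per-buffer split,
which is exactly why the two values of n below disagree.

    import re

    data = "aaXbbbXcc"               # re.split("X", data) -> ['aa', 'bbb', 'cc']
    chunks = ["aaXb", "bbXc", "c"]   # what fixed-size reads would hand you

    naive = []
    for buf in chunks:
        naive.extend(re.split("X", buf))

    print(naive)                     # ['aa', 'b', 'bb', 'c', 'c'] -- 5 pieces, 'bbb' and 'cc' broken
    print(re.split("X", data))       # ['aa', 'bbb', 'cc']         -- 3 pieces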

I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file.  I then ran the above two programs against it:

% python tscan1.py splitX
n=2112001 time=8.88
% python tscan0.py splitX
n=2139845 time=10.26

So the mmap'd version is within 15% of the performance of the buffered read
version and we don't have to solve the problem of any corner cases (note the
different values of n).  I'm happy to take the extra runtime in exchange for
simpler code.

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-27 Thread Bengt Richter
On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro [EMAIL PROTECTED] wrote:


Robin I implemented a simple scanning algorithm in two ways. First buffered scan
Robin tscan0.py; second mmapped scan tscan1.py.

...

Robin C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat
Robin len=139583265 w=103 time=110.91

Robin C:\code\reportlab\demos\gadflypaper\tmp\tscan1.py dingo.dat
Robin len=139583265 w=103 time=140.53

I'm not sure why the mmap() solution is so much slower for you.  Perhaps on
some systems files opened for reading are mmap'd under the covers.  I'm sure
it's highly platform-dependent.  (My results on MacOSX - see below - are
somewhat better.)

Let me return to your original problem though, doing regex operations on
files.  I modified your two scripts slightly:

tscan0.py:

import sys, time, re
fn = sys.argv[1]
f=open(fn,'rb')
n=0
t0 = time.time()
while 1:
 buf = f.read(4096)
 if not buf: break
 for i in re.split("X", buf):
To be fairer, I think you'd want to hoist the re compilation out of the loop.
But also to be fairer, maybe include the overhead of splitting correctly, at
least for the simple case regex in my example -- or is a you-goofed post
for me in the usenet forwarding queues somewhere still? ;-)

 n += 1
t1 = time.time()

print "n=%d time=%.2f" % (n, (t1-t0))

tscan1.py:

import sys, time, mmap, os, re
fn = sys.argv[1]
fh=os.open(fn,os.O_RDONLY)
s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
t0 = time.time()
n = 0
for mat in re.split('X', s):
    n += 1
t1 = time.time()

print "n=%d time=%.2f" % (n, (t1-t0))

The mmap version is almost obviously correct, assuming what we want to do is
split the file on X.  The buffered read version is almost certainly
incorrect, given our understanding that corner cases lurk at buffer
boundaries.
I took the file from Bengt Richter's example and replicated it a bunch of
times to get a 122MB file.  I then ran the above two programs against it:

% python tscan1.py splitX
n=2112001 time=8.88
% python tscan0.py splitX
n=2139845 time=10.26

So the mmap'd version is within 15% of the performance of the buffered read
with regex recompilation in loop ;-)
version and we don't have to solve the problem of any corner cases (note the
different values of n).  I'm happy to take the extra runtime in exchange for
simpler code.

Agree. Hm, I wonder if the OS notices sequential page faults and schedules 
speculative
read-ahead. Hm2, I wonder if you can just touch bytes from another coordinated 
thread
to cause that, if it isn't happening ;-) Not for 15% though ;-)

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Robin Becker
Richard Brodie wrote:
Robin Becker [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just tried
regexing through a 25Mb file and end up with 40Mb as working set (it rose
linearly as the loop progressed through the file). Am I actually saving anything
by not letting normal vm do its thing?

You aren't saving memory in that sense, no. If you have any RAM spare the
file will end up in it. However, if you are short on memory, mmapping the
file gives the VM the opportunity to discard pages from the file, instead of
paging them out. Try again with a 25Gb file and watch the difference ;) YMMV.

:)
So we avoid dirty page writes etc etc. However, I still think I could 
get away with a small window into the file which would be more efficient.
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote:
Richard Brodie wrote:
Robin Becker [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just tried
regexing through a 25Mb file and end up with 40Mb as working set (it rose
linearly as the loop progressed through the file). Am I actually saving anything
by not letting normal vm do its thing?

You aren't saving memory in that sense, no. If you have any RAM spare the
file will end up in it. However, if you are short on memory, mmapping the
file gives the VM the opportunity to discard pages from the file, instead of paging
them out. Try again with a 25Gb file and watch the difference ;) YMMV.


:)
So we avoid dirty page writes etc etc. However, I still think I could 
get away with a small window into the file which would be more efficient.
I seem to remember that the Medusa code contains a fairly good 
overlapped search for a terminator string, if you want to chunk the file.

Take a look at the handle_read() method of class async_chat in the 
standard library's asynchat.py.
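
The general trick (a sketch of the idea only, not asynchat's actual code) is to carry the
last len(terminator)-1 unconsumed bytes over into the next read, so a terminator that
straddles a read boundary is still found exactly once:

    def find_terminators(path, terminator, chunksize=4096):
        # yields the absolute file offset of each non-overlapping occurrence
        keep = len(terminator) - 1
        tail = b''
        offset = 0                     # file offset of buf[0]
        f = open(path, 'rb')
        while 1:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buf = tail + chunk
            pos = 0
            hit = buf.find(terminator)
            while hit != -1:
                yield offset + hit
                pos = hit + len(terminator)      # resume after the match
                hit = buf.find(terminator, pos)
            # keep only unconsumed bytes, and only enough to catch a straddler
            tail = buf[max(pos, len(buf) - keep):]
            offset += len(buf) - len(tail)
        f.close()

This only works for a literal terminator, of course; a general regex can't be chunked this
simply, which is the point of the buffer-splitting generators elsewhere in the thread.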

regards
 Steve
--
Steve Holden+1 703 861 4237  +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote:
..
I seem to remember that the Medusa code contains a fairly good 
overlapped search for a terminator string, if you want to chunk the file.

Take a look at the handle_read() method of class async_chat in the 
standard library's asynchat.py.
.
thanks I'll give it a whirl
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote:
Steve Holden wrote:
..
I seem to remember that the Medusa code contains a fairly good 
overlapped search for a terminator string, if you want to chunk the file.

Take a look at the handle_read() method of class async_chat in the 
standard library's asynchat.py.
.
thanks I'll give it a whirl
Whoops, I don't think it's a regex search :-(
You should be able to adapt the logic fairly easily, I hope.
regards
 Steve
--
Steve Holden+1 703 861 4237  +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote:

.
thanks I'll give it a whirl
Whoops, I don't think it's a regex search :-(
You should be able to adapt the logic fairly easily, I hope.

The buffering logic is half the problem; doing it quickly is the other half.
The third half of the problem is getting re to co-operate, and probably halves 4
and 5. :)
--
Robin Becker

--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Skip Montanaro

Robin So we avoid dirty page writes etc etc. However, I still think I
Robin could get away with a small window into the file which would be
Robin more efficient.

It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system. ;-)

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote:
Robin So we avoid dirty page writes etc etc. However, I still think I
Robin could get away with a small window into the file which would be
Robin more efficient.
It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system. ;-)
Skip
well it might be if I only want to scan forward through the file (think lexical 
analysis). Most lexical analyzers use a buffer and produce a stream of tokens ie 
a compressed version of the input. There are problems crossing buffers etc, but 
we never normally need the whole file in memory.

If the lexical analyzer reads the whole file into memory then we need more 
pages. The mmap thing might help as we need only read pages (for a lexical scanner).

Scanners work by detecting the transitions between tokens so even if the tokens 
are very long we don't need to store them twice (in the input stream and token 
accumulator); I suppose that could be true of regex pattern matchers, but it 
doesn't seem to be for re ie we need the entire pattern in the input before we 
can match and extract to an accumulator.
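
A sketch of that buffered-scanner idea (illustrative only; the token pattern is made up):
the buffer never has to hold more than one read plus whatever token is currently straddling
the boundary, because a match that touches the end of the buffer is held back until the
next chunk arrives.

    import re

    def tokens(path, token_rx=re.compile(r'[A-Za-z_]\w*|\d+|\S'), chunksize=4096):
        buf = ''
        f = open(path)
        while 1:
            chunk = f.read(chunksize)
            buf += chunk
            last = 0
            for m in token_rx.finditer(buf):
                if chunk and m.end() == len(buf):
                    break            # may be a prefix of a longer token; wait for more data
                yield m.group()
                last = m.end()
            buf = buf[last:]         # carry the unfinished tail into the next read
            if not chunk:
                break
        f.close()

re itself gives no resumable "failed because the input ended" signal, so holding back any
match that touches the end of the buffer is the usual workaround.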
--
Robin Becker

--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:

 Skip Montanaro wrote:
 Robin So we avoid dirty page writes etc etc. However, I still think I
 Robin could get away with a small window into the file which would be
 Robin more efficient.
 
 It's hard to imagine how sliding a small window onto a file within Python
 would be more efficient than the operating system's paging system. ;-)
 
 Skip
 well it might be if I only want to scan forward through the file (think lexical
 analysis). Most lexical analyzers use a buffer and produce a stream of tokens ie
 a compressed version of the input. There are problems crossing buffers etc, but
 we never normally need the whole file in memory.

I think you might have a misunderstanding here. mmap puts a file into
*virtual* memory. It does *not* read the whole thing into physical memory;
if it did, there would be no purpose to mmap support in the OS in the
first place, as a thin wrapper around existing file calls would work.

 If the lexical analyzer reads the whole file into memory then we need more 
 pages. The mmap thing might help as we need only read pages (for a lexical 
 scanner).

The read-write status of the pages is not why mmap is an advantage; the
advantage is that the OS naturally and transparently takes care of
loading just the portions you want, and intelligently discarding them when
you are done (more intelligently than you could, even in theory, since it
can take advantage of knowing the entire state of the system, your program
can't). 

In other words, as Skip was trying to tell you, mmap *already
does* what you are saying might be better, and it does it better than you
can, even in theory, from inside a process (as the OS will not reveal to
you the data structures it has that you would need to match that
performance).

As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may benefit from
reviewing/studying the difference between virtual memory and physical
memory.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Skip Montanaro
 It's hard to imagine how sliding a small window onto a file within Python
 would be more efficient than the operating system's paging system. ;-)

Robin well it might be if I only want to scan forward through the file
Robin (think lexical analysis). Most lexical analyzers use a buffer and
Robin produce a stream of tokens ie a compressed version of the
Robin input. There are problems crossing buffers etc, but we never
Robin normally need the whole file in memory.

If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory.  As I access
bits of the file's contents, it will page in only what's necessary.  If I
mmap() a huge file, then print out a few bytes from the middle, only the
page containing the interesting bytes is actually copied into physical
memory.
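
A tiny demonstration of that behaviour (file name illustrative; add os.O_BINARY on Windows):

    import mmap, os

    fh = os.open("dingo.dat", os.O_RDONLY)
    size = os.fstat(fh).st_size
    s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
    print(repr(s[size // 2]))      # faults in a page or two, not the whole file
    s.close()
    os.close(fh)

The process's virtual size jumps by the size of the file, but the resident set only grows
by the pages actually touched.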

Skip
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory.  As I access
bits of the file's contents, it will page in only what's necessary.  If I
mmap() a huge file, then print out a few bytes from the middle, only the
page containing the interesting bytes is actually copied into physical
memory.

my simple rather stupid experiment indicates that windows mmap at least 
will reserve 25Mb of paged file for a linear scan through a 25Mb file. I 
probably only need 4096b to scan. That's a lot less than even the page 
table requirement. This isn't rocket science just an old style observation.
--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote:

 Skip Montanaro wrote:
 ...
 If I mmap() a file, it's not slurped into main memory immediately, though as
 you pointed out, it's charged to my process's virtual memory.  As I access
 bits of the file's contents, it will page in only what's necessary.  If I
 mmap() a huge file, then print out a few bytes from the middle, only the
 page containing the interesting bytes is actually copied into physical
 memory.
 
 my simple rather stupid experiment indicates that windows mmap at least 
 will reserve 25Mb of paged file for a linear scan through a 25Mb file. I 
 probably only need 4096b to scan. That's a lot less than even the page 
 table requirement. This isn't rocket science just an old style observation.

Are you trying to claim Skip is wrong, or what? There's little value in
saying that by mapping a file of 25MB into VM pages, you've increased your
allocated paged file space by 25MB. That's effectively tautological. 

If you are trying to claim Skip is wrong, you *do not understand* what you
are talking about. Talk less, listen and study more. (This is my best
guess, as like I said, observing that allocating things increases the
number of things that are allocated isn't worth posting so my thought is
you think you are proving something. If you really are just posting
something tautological, my apologies and disregard this paragraph but,
well, it's certainly not out of line at this point.)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-26 Thread Bengt Richter
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker [EMAIL PROTECTED] wrote:

Is there any way to get regexes to work on non-string/unicode objects. I would
like to split large files by regex and it seems relatively hard to do so without
having the whole file in memory. Even with buffers it seems hard to get regexes
to indicate that they failed because of buffer termination and getting a partial
match to be resumable seems out of the question.

What interface does re actually need for its src objects?

ISTM splitting is a special situation where you can easily
chunk through a file and split as you go, since if splitting
the current chunk succeeds, you can be sure that all but the
tail piece is valid[1]. So you can make an iterator that yields
all but the last and then sets the buffer to last+newchunk
and goes on until there are no more chunks, and the tail part
will be a valid split piece. E.g., (not tested beyond what you see ;-)

 >>> def frxsplit(path, rxo, chunksize=8192):
 ...     buffer = ''
 ...     for chunk in iter((lambda f=open(path): f.read(chunksize)),''):
 ...         buffer += chunk
 ...         pieces = rxo.split(buffer)
 ...         for piece in pieces[:-1]: yield piece
 ...         buffer = pieces[-1]
 ...     yield buffer
 ...
 >>> import re
 >>> rxo = re.compile('X')

The test file:

 >>> print '\n%s'%open('tsplit.txt').read()
 
 This is going to be split on five X's
 like X but we will use a buffer of
 X length 2 to force buffer appending.
 We'll try a splitter at the end: X
 

 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
 ...
 This is going to be split on five X's\nlike 
 ' but we will use a buffer of\n'
  length 2 to force buffer appending.\nWe'll try a splitter at the end: 
 '\n'

 >>> rxo = re.compile('(X)')
 >>> for piece in frxsplit('tsplit.txt', rxo, 2): print repr(piece)
 ...
 This is going to be split on five X's\nlike 
 'X'
 ' but we will use a buffer of\n'
 'X'
  length 2 to force buffer appending.\nWe'll try a splitter at the end: 
 'X'
 '\n'

[1] In some cases of regexes with lookahead context, you might
have to check that the last piece not only exists but exceeds
max lookahead length, in case there is a withlookahead|plain
kind of thing in the regex  where lookahead would have succeeded
with another chunk appended to buffer, but plain did the split.
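
A concrete illustration of footnote [1] (made-up pattern and data, not from the post): with
an alternation like AX(?=Z)|X the plain branch can fire at a buffer edge where the lookahead
branch would have fired given one more chunk, so the chunked split disagrees with the
whole-string split.

    import re

    rxo = re.compile(r'AX(?=Z)|X')
    data = 'aAXZb'

    print(rxo.split(data))            # ['a', 'Zb']  -- whole-string answer

    # what a chunked splitter sees with a 3-byte buffer:
    buffer = 'aAX'                    # lookahead fails (no Z yet), plain X matches
    pieces = rxo.split(buffer)        # ['aA', '']
    out = pieces[:-1]                 # 'aA' is yielded too early
    buffer = pieces[-1] + 'Zb'        # 'Zb'
    out += rxo.split(buffer)          # no further match
    print(out)                        # ['aA', 'Zb'] -- differs from the whole-string answer

Requiring the tail piece to be longer than the maximum lookahead before yielding, as the
footnote suggests, would avoid this particular case.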

Regards,
Bengt Richter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-25 Thread Gerald Klix
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
HTH,
Gerald
Robin Becker schrieb:
Is there any way to get regexes to work on non-string/unicode objects. I 
would like to split large files by regex and it seems relatively hard to 
do so without having the whole file in memory. Even with buffers it 
seems hard to get regexes to indicate that they failed because of buffer 
termination and getting a partial match to be resumable seems out of the 
question.

What interface does re actually need for its src objects?
--
GPG-Key: http://keyserver.veridis.com:11371/search?q=0xA140D634
--
http://mail.python.org/mailman/listinfo/python-list


Re: regex over files

2005-04-25 Thread Robin Becker
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just tried 
regexing through a 25Mb file and end up with 40Mb as working set (it rose 
linearly as the loop progressed through the file). Am I actually saving anything 
by not letting normal vm do its thing?

HTH,
Gerald
Robin Becker schrieb:
Is there any way to get regexes to work on non-string/unicode objects. 
I would like to split large files by regex and it seems relatively 
hard to do so without having the whole file in memory. Even with 
buffers it seems hard to get regexes to indicate that they failed 
because of buffer termination and getting a partial match to be 
resumable seems out of the question.

What interface does re actually need for its src objects?


--
Robin Becker
--
http://mail.python.org/mailman/listinfo/python-list