On Tue, 26 Apr 2005 20:54:53 +0000, Robin Becker wrote:
Skip Montanaro wrote: ...
If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page in only what's necessary. If I mmap() a huge file, then print out a few bytes from the middle, only the page containing the interesting bytes is actually copied into physical memory.
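The behaviour Skip describes can be sketched in a few lines (a sketch of mine, not code from the thread; it uses the context-manager form of mmap from modern Python, and the function name is made up for illustration):

```python
# Map a whole file read-only but touch only a few bytes from its midpoint.
# Only the page(s) actually accessed should be faulted into physical memory.
import mmap
import os

def read_middle_bytes(path, count=16):
    """Map the whole file and return `count` bytes from its middle."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            mid = size // 2
            # Slicing here faults in roughly one page, not the whole file.
            return m[mid:mid + count]
```

However large the file, the slice at the end is the only access, so demand paging should bring in only the page or two containing those bytes.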
....
My simple, rather stupid experiment indicates that Windows mmap, at least, will reserve 25 MB of page file for a linear scan through a 25 MB file. I probably only need 4096 bytes at a time to do the scan. That's a lot less than even the page-table requirement. This isn't rocket science, just an old-style observation.
Are you trying to claim Skip is wrong, or what? There's little value in
saying that by mapping a file of 25MB into VM pages, you've increased your
allocated paged file space by 25MB. That's effectively tautological.
If you are trying to claim Skip is wrong, you *do not understand* what you are talking about. Talk less, listen and study more. (This is my best guess: as I said, observing that allocating things increases the number of things allocated isn't worth posting, so my thought is that you think you are proving something. If you really are just posting something tautological, my apologies; disregard this paragraph, though it's certainly not out of line at this point.)
Well, I obviously don't understand, so perhaps you can explain these results.
I implemented a simple scanning algorithm in two ways: first, a buffered scan (tscan0.py); second, an mmapped scan (tscan1.py).
For small file sizes the times are comparable.
C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py bingo.pdf len=27916653 w=103 time=22.13
C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py bingo.pdf len=27916653 w=103 time=22.20
For large file sizes, when paging becomes of interest, the buffered scan wins even though it has to execute a lot more Python statements. If this were coded in C the results would be plainer still. As I said, this isn't about right or wrong; it's an observation. If I inspect the performance monitor, tscan0 runs at 100% CPU, but tscan1 is at 80-90% and all of memory gets used up, so paging is important. This may be an effect of the poor design of XP; if so, perhaps it won't hold for other OSes.
C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat len=139583265 w=103 time=110.91
C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat len=139583265 w=103 time=140.53
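If the single 25 MB+ mapping really is what drives the paging, one workaround (a sketch of mine, not from the thread) is to scan through a sliding window of fixed size, so only one window's worth of address space is mapped at any moment. Note the `offset` argument to mmap.mmap needs a newer Python than 2.4 and must be a multiple of mmap.ALLOCATIONGRANULARITY:

```python
# XOR-scan a file through a sliding mmap window instead of one big mapping,
# limiting the mapped region to `window` bytes at a time. `window` must be
# a multiple of mmap.ALLOCATIONGRANULARITY for the offset to stay aligned.
import mmap
import os

def xor_scan_windowed(path, window=mmap.ALLOCATIONGRANULARITY):
    """Same XOR checksum as the scripts below, one window at a time."""
    w = 0
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                for byte in m[:]:  # bytes copy of the window
                    w ^= byte
            offset += length
    return w
```

Whether this actually changes the paging behaviour on XP is exactly the kind of thing the experiment above would have to measure.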
C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan0.py
import sys, time
fn = sys.argv[1]
f = open(fn, 'rb')
n = 0
w = 0
t0 = time.time()
while 1:
    buf = f.read(4096)
    lb = len(buf)
    if not lb: break
    n += lb
    for i in xrange(lb):
        w ^= ord(buf[i])
t1 = time.time()
print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))
C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan1.py
import sys, time, mmap, os
fn = sys.argv[1]
fh = os.open(fn, os.O_BINARY|os.O_RDONLY)
s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
n = len(s)
w = 0
t0 = time.time()
for i in xrange(n):
    w ^= ord(s[i])
t1 = time.time()
print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))
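For what it's worth, much of tscan1.py's time likely goes into the per-byte Python loop rather than into mmap itself. A variant (my sketch, not part of the original post) that pulls the data out in large slices and folds the XOR via functools.reduce should cut the interpreter overhead somewhat; the 64 KB chunk size is an arbitrary choice:

```python
# Same XOR checksum as tscan1.py, but reading the mapping in big slices so
# that each slice is an ordinary bytes object and the inner fold runs in
# functools.reduce rather than one Python bytecode loop iteration per byte.
import mmap
import functools
import operator

def xor_scan_sliced(path, chunk=1 << 16):
    w = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for off in range(0, len(m), chunk):
                # Each slice copies `chunk` bytes out of the mapping.
                w = functools.reduce(operator.xor, m[off:off + chunk], w)
    return w
```

The access pattern against the mapping is the same sequential scan, so any paging difference between the buffered and mapped versions should still show up.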
-- Robin Becker
-- http://mail.python.org/mailman/listinfo/python-list