Re: regex over files

2005-04-29 Thread Peter Otten
Robin Becker wrote:
    #sscan1.py thanks to Skip
    import sys, time, mmap, os, re
    fn = sys.argv[1]
    fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    l=n=0
    t0 = time.time()
    for mat in re.split(X, s):
re.split() returns a list, not a generator, and this list
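The point about re.split() building a full list can be illustrated with a lazy alternative. This is only a sketch, not code from the thread: `iter_split` is a hypothetical helper built on re.finditer(), and it assumes the pattern has no capture groups and never matches the empty string (re.split() handles both cases specially). The `data` argument can be any bytes-like object, which in Python 3 includes an mmap object.

```python
import re

def iter_split(pattern, data):
    """Lazily yield the pieces of data that lie between matches of pattern.

    Assumes pattern has no capture groups and never matches the empty
    string; re.split() treats both cases specially, this sketch does not.
    """
    pos = 0
    for m in re.finditer(pattern, data):
        yield data[pos:m.start()]
        pos = m.end()
    yield data[pos:]  # whatever follows the last match

# Pieces are produced one at a time instead of as one big list.
for piece in iter_split(rb"X+", b"aXbXXc"):
    print(piece)
```

Under those assumptions the pieces come out in the same order as from re.split(), but only one piece needs to exist at a time.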

Re: regex over files

2005-04-29 Thread Robin Becker
Peter Otten wrote:
Robin Becker wrote:
    #sscan1.py thanks to Skip
    import sys, time, mmap, os, re
    fn = sys.argv[1]
    fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    l=n=0
    t0 = time.time()
    for mat in re.split(X, s):
re.split() returns a list, not a generator, and

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: .. I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.) I'll have a go at doing

Re: regex over files

2005-04-28 Thread Skip Montanaro
    Bengt> To be fairer, I think you'd want to hoist the re compilation out
    Bengt> of the loop.
The re module compiles and caches regular expressions, so I doubt it would affect the runtime of either version.
    Bengt> But also to be fairer, maybe include the overhead of splitting
    Bengt>
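The two variants under discussion can be put side by side. This is a minimal sketch with made-up names (`PATTERN`, `DATA`, `count_inline`, `count_hoisted` are all hypothetical), not the scripts from the thread. Both functions return the same result; the only difference is whether the pattern goes through the re module's internal cache lookup on every iteration or is compiled once up front.

```python
import re

PATTERN = rb"\d+"      # hypothetical pattern, for illustration only
DATA = b"a1b22c333"    # hypothetical input

def count_inline(data, repeats):
    """Pass the pattern to re on every iteration (re caches the compiled form)."""
    total = 0
    for _ in range(repeats):
        total += len(re.findall(PATTERN, data))
    return total

def count_hoisted(data, repeats):
    """Compile once outside the loop and reuse the compiled pattern object."""
    compiled = re.compile(PATTERN)
    total = 0
    for _ in range(repeats):
        total += len(compiled.findall(data))
    return total

print(count_inline(DATA, 3), count_hoisted(DATA, 3))
```

Timing both with the timeit module on real data would show how much the per-call cache lookup costs relative to the matching itself.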

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: ... I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.) Let me return to your

Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote: As you try to understand mmap, make sure your mental model can take into account the fact that it is easy and quite common to mmap a file several times larger than your physical memory, and it does not even *try* to read the whole thing in at any given time. You may

Re: regex over files

2005-04-28 Thread Robin Becker
Robin Becker wrote: Skip Montanaro wrote: .. I'm not sure why the mmap() solution is so much slower for you. Perhaps on some systems files opened for reading are mmap'd under the covers. I'm sure it's highly platform-dependent. (My results on MacOSX - see below - are somewhat better.)

Re: regex over files

2005-04-28 Thread Robin Becker
Jeremy Bowers wrote: . As you try to understand mmap, make sure your mental model can take into account the fact that it is easy and quite common to mmap a file several times larger than your physical memory, and it does not even *try* to read the whole thing in at any given time. You may

Re: regex over files

2005-04-28 Thread Robin Becker
Skip Montanaro wrote: . Let me return to your original problem though, doing regex operations on files. I modified your two scripts slightly: . Skip, I'm sure my results are dependent on something other than the coding style; I suspect file/disk cache and paging operate here. Note that we

Re: regex over files

2005-04-28 Thread Bengt Richter
On Thu, 28 Apr 2005 20:35:43 +, Robin Becker [EMAIL PROTECTED] wrote: Jeremy Bowers wrote: As you try to understand mmap, make sure your mental model can take into account the fact that it is easy and quite common to mmap a file several times larger than your physical memory, and

Re: regex over files

2005-04-27 Thread Robin Becker
Jeremy Bowers wrote: On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote: Skip Montanaro wrote: ... If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page

Re: regex over files

2005-04-27 Thread Skip Montanaro
    Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
    Robin> tscan0.py; second mmapped scan tscan1.py.
    ...
    Robin> C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat
    Robin> len=139583265 w=103 time=110.91
    Robin>

Re: regex over files

2005-04-27 Thread Bengt Richter
On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro [EMAIL PROTECTED] wrote:
    Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
    Robin> tscan0.py; second mmapped scan tscan1.py.
    ...
    Robin> C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat

Re: regex over files

2005-04-26 Thread Robin Becker
Richard Brodie wrote: Robin Becker [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves on memory? I just

Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote: Richard Brodie wrote: Robin Becker [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves on

Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote: .. I seem to remember that the Medusa code contains a fairly good overlapped search for a terminator string, if you want to chunk the file. Take a look at the handle_read() method of class async_chat in the standard library's asynchat.py. . thanks I'll give it a

Re: regex over files

2005-04-26 Thread Steve Holden
Robin Becker wrote: Steve Holden wrote: .. I seem to remember that the Medusa code contains a fairly good overlapped search for a terminator string, if you want to chunk the file. Take a look at the handle_read() method of class async_chat in the standard library's asynchat.py. . thanks

Re: regex over files

2005-04-26 Thread Robin Becker
Steve Holden wrote: . thanks I'll give it a whirl Whoops, I don't think it's a regex search :-( You should be able to adapt the logic fairly easily, I hope. The buffering logic is half the problem; doing it quickly is the other half. The third half of the problem is getting re to

Re: regex over files

2005-04-26 Thread Skip Montanaro
    Robin> So we avoid dirty page writes etc etc. However, I still think I
    Robin> could get away with a small window into the file which would be
    Robin> more efficient.
It's hard to imagine how sliding a small window onto a file within Python would be more efficient than the operating

Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote:
    Robin> So we avoid dirty page writes etc etc. However, I still think I
    Robin> could get away with a small window into the file which would be
    Robin> more efficient.
It's hard to imagine how sliding a small window onto a file within Python would be more efficient than

Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:
Skip Montanaro wrote:
    Robin> So we avoid dirty page writes etc etc. However, I still think I
    Robin> could get away with a small window into the file which would be
    Robin> more efficient.
It's hard to imagine how sliding a

Re: regex over files

2005-04-26 Thread Skip Montanaro
    It's hard to imagine how sliding a small window onto a file within Python would be more efficient than the operating system's paging system. ;-)
    Robin> well it might be if I only want to scan forward through the file
    Robin> (think lexical analysis).
Most lexical analyzers use a

Re: regex over files

2005-04-26 Thread Robin Becker
Skip Montanaro wrote: ... If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page in only what's necessary. If I mmap() a huge file, then print out a few bytes

Re: regex over files

2005-04-26 Thread Jeremy Bowers
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote: Skip Montanaro wrote: ... If I mmap() a file, it's not slurped into main memory immediately, though as you pointed out, it's charged to my process's virtual memory. As I access bits of the file's contents, it will page in only what's

Re: regex over files

2005-04-26 Thread Bengt Richter
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker [EMAIL PROTECTED] wrote: Is there any way to get regexes to work on non-string/unicode objects. I would like to split large files by regex and it seems relatively hard to do so without having the whole file in memory. Even with buffers it seems

regex over files

2005-04-25 Thread Robin Becker
Is there any way to get regexes to work on non-string/unicode objects. I would like to split large files by regex and it seems relatively hard to do so without having the whole file in memory. Even with buffers it seems hard to get regexes to indicate that they failed because of buffer
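The buffer-end difficulty described here can be sketched as follows. This is an illustrative approach, not one proposed in the thread: `split_stream` and `chunk_size` are hypothetical names. Python's re module does not report partial matches, so the sketch assumes that a match ending before the end of the buffer is final (more input could not have extended or changed it) and that the pattern never matches the empty string. Any match that touches the end of the buffer is retried after more data has been read.

```python
import io
import re

def split_stream(pattern, stream, chunk_size=65536):
    """Yield the pieces of a binary stream separated by matches of pattern.

    Assumptions (not checked): the pattern never matches the empty string,
    and a match that ends before the end of the buffer is final.
    """
    regex = re.compile(pattern)
    buf = b""
    eof = False
    while not eof:
        chunk = stream.read(chunk_size)
        eof = not chunk            # an empty read signals end of input
        buf += chunk
        pos = 0
        for m in regex.finditer(buf):
            if m.end() == len(buf) and not eof:
                break              # might continue into the next chunk
            yield buf[pos:m.start()]
            pos = m.end()
        buf = buf[pos:]            # keep only the unfinished tail
    yield buf                      # whatever follows the last separator

# A tiny chunk size forces separators to straddle chunk boundaries.
print(list(split_stream(rb"X+", io.BytesIO(b"aXbXXc"), chunk_size=2)))
```

Memory use is bounded by the longest piece plus one chunk rather than by the file size, at the cost of rescanning the retained tail on each read. Patterns with optional trailing parts or lookahead can violate the first assumption and would need a larger safety margin.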

Re: regex over files

2005-04-25 Thread Gerald Klix
Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. HTH, Gerald Robin Becker schrieb: Is there any way to get regexes to work on non-string/unicode objects. I would like to split large files by regex and it seems relatively hard to do so
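A minimal sketch of the mmap suggestion, written for current Python 3 rather than the version in use in 2005: `search_file` is a hypothetical helper, and the pattern must be a bytes pattern because the mapping exposes bytes. The mapping is read-only, and a zero-length file cannot be mapped, so the sketch assumes a non-empty file.

```python
import mmap
import os
import re
import tempfile

def search_file(path, pattern):
    """Return the (start, end) span of the first match in the file, or None."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            match = re.search(pattern, mapped)
            return match.span() if match else None

# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hay needle hay")
print(search_file(tmp.name, rb"needle"))
os.remove(tmp.name)
```

The regex engine reads the mapping directly, so the operating system pages in only the parts of the file that the search actually touches.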

Re: regex over files

2005-04-25 Thread Robin Becker
Gerald Klix wrote: Map the file into RAM by using the mmap module. The file's contents are then available as a searchable string. that's a good idea, but I wonder if it actually saves on memory? I just tried regexing through a 25Mb file and end up with 40Mb as working set (it rose linearly as the