Robin Becker wrote:
#sscan1.py thanks to Skip
import sys, time, mmap, os, re
fn = sys.argv[1]
# os.O_BINARY exists only on Windows; default to 0 elsewhere
fh = os.open(fn, getattr(os, 'O_BINARY', 0) | os.O_RDONLY)
s = mmap.mmap(fh, 0, access=mmap.ACCESS_READ)
l = n = 0
t0 = time.time()
for mat in re.split(X, s):  # X is the split pattern (not shown in this excerpt)
re.split() returns a list, not a generator, and this list is built in full before the loop starts
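A lazier variant of the split (a sketch of mine, not code from the thread) walks the separator matches with re.finditer() and yields one fragment at a time, which avoids materialising the full list:

```python
import re

# A sketch (not from the thread): emulate re.split() lazily by iterating
# separator matches with finditer() and yielding the text between them.
# Works on any bytes-like object, including an mmap.
def isplit(pattern, data):
    pos = 0
    for m in re.finditer(pattern, data):
        yield data[pos:m.start()]
        pos = m.end()
    yield data[pos:]

chunks = list(isplit(rb",\s*", b"a, b,c,  d"))
print(chunks)  # [b'a', b'b', b'c', b'd']
```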
Skip Montanaro wrote:
..
I'm not sure why the mmap() solution is so much slower for you. Perhaps on
some systems files opened for reading are mmap'd under the covers. I'm sure
it's highly platform-dependent. (My results on MacOSX - see below - are
somewhat better.)
I'll have a go at doing
Bengt To be fairer, I think you'd want to hoist the re compilation out
Bengt of the loop.
The re module compiles and caches regular expressions, so I doubt it would
affect the runtime of either version.
Bengt But also to be fairer, maybe include the overhead of splitting
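Peter's caching claim is easy to check; in this sketch (mine), a module-level re.search() with a string pattern runs at roughly the same speed as a pre-compiled pattern object, because re caches compiled patterns internally:

```python
import re
import timeit

text = "x" * 1000 + "needle"
pat = re.compile(r"needle")

# re.search() with a string pattern hits re's internal compiled-pattern
# cache after the first call, so hoisting re.compile() out of a loop mostly
# saves only the per-call cache lookup, not recompilation.
t_module   = timeit.timeit(lambda: re.search(r"needle", text), number=20000)
t_compiled = timeit.timeit(lambda: pat.search(text), number=20000)
print(t_module, t_compiled)   # typically the same order of magnitude
```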
Jeremy Bowers wrote:
As you try to understand mmap, make sure your mental model can take into
account the fact that it is easy and quite common to mmap a file several
times larger than your physical memory, and it does not even *try* to read
the whole thing in at any given time. You may
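Jeremy's point shows up in a small demo (mine; the 10 MB file stands in for one larger than RAM): only the pages actually touched are read in.

```python
import mmap
import os
import tempfile

# Demo (mine) of lazy paging: map a file, touch only its first and last
# bytes, and the OS reads in just those pages, not the middle.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"header:" + b"\0" * (10 * 1024 * 1024) + b":trailer")
    path = f.name

fd = os.open(path, os.O_RDONLY)
m = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
head, tail = m[:7], m[-8:]      # only these pages are actually paged in
print(head, tail)               # b'header:' b':trailer'
m.close()
os.close(fd)
os.unlink(path)
```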
Skip Montanaro wrote:
.
Let me return to your original problem though, doing regex operations on
files. I modified your two scripts slightly:
.
Skip
I'm sure my results are dependent on something other than the coding style
I suspect file/disk cache and paging operates here. Note that we
On Thu, 28 Apr 2005 20:35:43 +, Robin Becker [EMAIL PROTECTED] wrote:
Jeremy Bowers wrote:
...
Jeremy Bowers wrote:
On Tue, 26 Apr 2005 20:54:53 +, Robin Becker wrote:
Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory. As I access
bits of the file's contents, it will page
Robin I implemented a simple scanning algorithm in two ways. First
buffered scan
Robin tscan0.py; second mmapped scan tscan1.py.
...
Robin C:\code\reportlab\demos\gadflypaper\tmp\tscan0.py dingo.dat
Robin len=139583265 w=103 time=110.91
On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro [EMAIL PROTECTED] wrote:
...
Richard Brodie wrote:
Robin Becker [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just
Steve Holden wrote:
..
I seem to remember that the Medusa code contains a fairly good
overlapped search for a terminator string, if you want to chunk the file.
Take a look at the handle_read() method of class async_chat in the
standard library's asynchat.py.
.
thanks I'll give it a
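The asynchat trick boils down to an overlapped search. A minimal standalone sketch (mine, not the Medusa code) for a fixed terminator string:

```python
import io

# A sketch (mine, not the Medusa/asynchat code) of an overlapped chunked
# search for a fixed terminator: keep the last len(terminator)-1 bytes of
# each buffer so a terminator split across two reads is still found.
def find_terminator(fileobj, terminator, chunksize=1 << 16):
    overlap = len(terminator) - 1
    offset = 0                      # absolute file offset of buf[0]
    buf = b""
    while True:
        chunk = fileobj.read(chunksize)
        if not chunk:
            return -1               # terminator not found
        buf += chunk
        i = buf.find(terminator)
        if i != -1:
            return offset + i
        if overlap:
            offset += len(buf) - overlap
            buf = buf[-overlap:]    # tail that could start a split match
        else:
            offset += len(buf)
            buf = b""

# "EF" straddles the 5-byte chunk boundary here and is still found:
print(find_terminator(io.BytesIO(b"abcdEFghij"), b"EF", chunksize=5))  # 4
```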
Steve Holden wrote:
.
thanks I'll give it a whirl
Whoops, I don't think it's a regex search :-(
You should be able to adapt the logic fairly easily, I hope.
The buffering logic is half the problem; doing it quickly is the other half.
The third half of the problem is getting re to
Robin So we avoid dirty page writes etc etc. However, I still think I
Robin could get away with a small window into the file which would be
Robin more efficient.
It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating
On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:
Skip Montanaro wrote:
...
It's hard to imagine how sliding a small window onto a file within Python
would be more efficient than the operating system's paging system. ;-)
Robin well it might be if I only want to scan forward through the file
Robin (think lexical analysis). Most lexical analyzers use a
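A forward-only sliding window along those lines might look like this sketch (mine, with stated assumptions: no match is longer than `overlap` bytes, and a match that ends clear of the window boundary cannot be extended by later input, which holds for simple token/delimiter patterns but not every regex):

```python
import io
import re

# A sketch (mine) of a forward-only sliding-window regex scan that never
# holds more than roughly chunksize + overlap bytes in memory.
def window_scan(fileobj, pattern, chunksize=1 << 16, overlap=256):
    pat = re.compile(pattern)
    buf = b""
    base = 0                            # absolute file offset of buf[0]
    while True:
        chunk = fileobj.read(chunksize)
        buf += chunk
        at_eof = not chunk
        keep_from = 0 if at_eof else max(len(buf) - overlap, 0)
        for m in pat.finditer(buf):
            if not at_eof and m.end() > keep_from:
                # may straddle the boundary; rescan it with more data
                keep_from = min(keep_from, m.start())
                break
            yield base + m.start(), m.group()
        if at_eof:
            return
        base += keep_from               # slide the window forward
        buf = buf[keep_from:]

out = list(window_scan(io.BytesIO(b"aa1bb22cc333dd"), rb"\d+",
                       chunksize=4, overlap=4))
print(out)  # [(2, b'1'), (5, b'22'), (9, b'333')]
```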
Skip Montanaro wrote:
...
If I mmap() a file, it's not slurped into main memory immediately, though as
you pointed out, it's charged to my process's virtual memory. As I access
bits of the file's contents, it will page in only what's necessary. If I
mmap() a huge file, then print out a few bytes
On Mon, 25 Apr 2005 16:01:45 +0100, Robin Becker [EMAIL PROTECTED] wrote:
Is there any way to get regexes to work on non-string/unicode objects? I would
like to split large files by regex and it seems relatively hard to do so without
having the whole file in memory. Even with buffers it seems hard to get regexes
to indicate that they failed because of buffer
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
HTH,
Gerald
Robin Becker schrieb:
Is there any way to get regexes to work on non-string/unicode objects? I
would like to split large files by regex and it seems relatively hard to
do so
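In current Python, Gerald's suggestion works directly: re accepts any bytes-like object, so an mmap can be handed straight to finditer() (file contents and pattern below are illustrative):

```python
import mmap
import os
import re
import tempfile

# Demo (mine) of Gerald's suggestion: mmap the file and let re search the
# mapping directly; re accepts any object exposing the buffer protocol.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"alpha\nbeta\ngamma\n")
    path = f.name

with open(path, "rb") as fp, mmap.mmap(fp.fileno(), 0,
                                       access=mmap.ACCESS_READ) as m:
    words = [mt.group() for mt in re.finditer(rb"[a-z]+", m)]
print(words)   # [b'alpha', b'beta', b'gamma']
os.unlink(path)
```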
Gerald Klix wrote:
Map the file into RAM by using the mmap module.
The file's contents are then available as a searchable string.
that's a good idea, but I wonder if it actually saves on memory? I just tried
regexing through a 25Mb file and ended up with 40Mb as working set (it rose
linearly as the