On 08Oct2018 10:56, Ram Rachum <r...@rachum.com> wrote:
> That's incredibly interesting. I've never used mmap before.
> However, there's a problem.
> I did a few experiments with mmap now; this is the latest:
>
> import mmap
> import pathlib
> import re
>
> path = pathlib.Path(r'P:\huge_file')
>
> with path.open('rb') as file:
>     mmap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)

Just a remark: don't tromp on the "mmap" name. Maybe "mapped"?
Shadowing the module makes any later mmap.mmap() or mmap.ACCESS_READ call fail.

>     for match in re.finditer(b'.', mmap):
>         pass
>
> The file is 338GB in size, and it seems that Python is trying to load it
> into memory. The process is now taking 4GB of RAM and growing. I saw the
> same behavior when searching for a pattern that has no match.
>
> Should I open a Python bug for this?

Probably not. First figure out what is going on. BTW, how much RAM have you got?
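
If you want to separate what your process itself is holding from what the OS is merely caching on its behalf, a rough sketch like this may help; it assumes the third-party psutil package is installed, which is an assumption on my part:

    # Rough sketch: needs the third-party "psutil" package.
    import psutil

    info = psutil.Process().memory_info()
    # rss = resident set size: pages currently in physical RAM, which
    # includes clean mapped pages the OS could drop at any moment.
    # vms = total virtual address space, which will be huge for a
    # 338GB mapping and is not a problem in itself.
    print('rss:', info.rss, 'vms:', info.vms)

A large rss alongside a stable process footprint would fit the "OS is just caching pages" picture rather than a leak.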

As you access the mapped file the OS will try to keep it in memory in case you need it again. In the absence of competition, most other stuff will get paged out to accommodate it. That's normal. All the data are "clean" (unmodified), so the OS can simply release the older pages instantly if something else needs the RAM.
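
If you want the kernel to be less eager about keeping those pages around, newer Pythons let you hint at your access pattern. A sketch, assuming Python 3.8+ on a POSIX system; madvise() isn't available on Windows, so this wouldn't help on your P: drive as-is, and the path below is a placeholder:

    # Sketch only: mmap.madvise() requires Python 3.8+ and a platform
    # with the madvise() system call; the MADV_* constants are only
    # defined where the platform supports them.
    import mmap
    import pathlib
    import re

    path = pathlib.Path('/path/to/huge_file')  # placeholder path
    with path.open('rb') as file:
        mapped = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
        # Hint that we'll read sequentially, so the kernel can read
        # ahead and discard pages behind the scan more readily.
        mapped.madvise(mmap.MADV_SEQUENTIAL)
        for match in re.finditer(b'.', mapped):
            pass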

However, another possibility is that the regexp itself is consuming lots of memory.

The regexp seems simple enough (b'.'), so I doubt it is leaking memory like mad; I'm guessing you're just seeing the OS page in as much of the file as it can.

Also, does the loop iterate? I.e., does it find multiple matches as the memory gets consumed, or is the first iteration blocking and consuming gobs of memory before the first match comes back? A print() call will tell you that.
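
Something along these lines would show it (reusing the "mapped" name from my earlier remark; the reporting interval is arbitrary):

    # Instrumented loop: report progress periodically so we can see
    # whether matches arrive while the memory use grows.
    count = 0
    for match in re.finditer(b'.', mapped):
        count += 1
        if count % 1000000 == 0:
            print(count, 'matches so far, latest at offset', match.start())
    print('total matches:', count)

If the counts tick over steadily while the RAM figure climbs, you're just watching the OS cache the file, not a stall in the regexp engine.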

Cheers,
Cameron Simpson <c...@cskk.id.au>