On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote:
>> I'd like to use the re module to parse a long text file, 1GB in size.
>> I wish that the re module could parse a stream, so I wouldn't have to
>> load the whole thing into memory. I'd like to iterate over matches
>> from the stream without keeping the old matches and input in RAM.
>>
>> What do you think?

> This has frustrated me too.
>
> The case where I've encountered this is parsing HTTP/1.1. We have data
> coming in incrementally over the network, and we want to find the end
> of the headers. To do this, we're looking for the first occurrence of
> b"\r\n\r\n" OR b"\n\n".
>
> So our requirements are:
>
> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"

I believe that re is both overkill and slow for this particular problem.
For an O(n) scan, search forward for '\n' with str.index('\n') (or .find,
which returns -1 instead of raising; bytes and bytearray have the same
methods). [I assume that this searches forward faster than

    for i, c in enumerate(s):
        if c == '\n': break

because the scan runs in C rather than in the bytecode loop, but I leave
you to test this.]
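
One way to test that claim (a hypothetical micro-benchmark I am adding
here for illustration, not something from the original post):

    import timeit

    data = 'x' * 100_000 + '\n'

    def loop_find(s):
        # Pure-Python scan, character by character.
        for i, c in enumerate(s):
            if c == '\n':
                return i

    print(timeit.timeit(lambda: data.index('\n'), number=1000))
    print(timeit.timeit(lambda: loop_find(data), number=1000))

On CPython the C-level .index typically wins by a large factor.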

If not found, continue with the next chunk of data.
If found, look back for '\r' to decide which terminator you may be
completing (b'\n\n' or b'\r\n\r\n'), then look forward for the closing
'\n' or '\r\n', whenever there is enough data to do so.
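
Concretely, that search could look like the sketch below
(find_end_of_headers is a hypothetical name; it assumes buf holds all
of the bytes seen so far):

    def find_end_of_headers(buf, start=0):
        # Return the index just past the first b'\r\n\r\n' or b'\n\n'
        # found at or after `start`, or -1 if no complete terminator
        # is present yet.
        i = buf.find(b'\n', start)
        while i != -1:
            if i > 0 and buf[i-1:i] == b'\r':
                # This '\n' closes b'\r\n'; a match needs a second
                # b'\r\n' immediately after it.
                if buf[i+1:i+3] == b'\r\n':
                    return i + 3
            else:
                # A bare b'\n'; a match needs a second b'\n'
                # immediately after it.
                if buf[i+1:i+2] == b'\n':
                    return i + 2
            i = buf.find(b'\n', i + 1)
        return -1

Note that -1 can also mean "not enough data yet": if the buffer ends
mid-terminator, the caller has to re-check those trailing bytes once
more data arrives.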

> 2. If there's no match yet, wait for more data to arrive and try again
> 3. When more data arrives, start searching again *where the last
> search left off*

s.index (and .find) has an optional start parameter, so each new search
can resume where the last one left off. And keep the chunks in a list
until you have a match, so you can join them all at once instead of
re-joining on every arrival.
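
Putting the pieces together, a minimal sketch of the accumulate-and-resume
loop (HeaderScanner is a hypothetical name, and for brevity it appends to
a bytearray, which amortizes the copying, instead of joining a list of
chunks at the end):

    class HeaderScanner:
        def __init__(self):
            self._buf = bytearray()
            self._start = 0   # where the next search should resume

        def feed(self, chunk):
            # Add newly received bytes; return (headers, leftover)
            # once the end of the headers is seen, else None.
            self._buf += chunk
            end = find_end_of_headers(self._buf, self._start)
            if end != -1:
                return bytes(self._buf[:end]), bytes(self._buf[end:])
            # Back up three bytes so a terminator that straddles this
            # chunk and the next is still found on the next call.
            self._start = max(0, len(self._buf) - 3)
            return None

The caller just feeds it each chunk as it arrives from the socket and
stops as soon as feed() returns a (headers, leftover) pair.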


--
Terry Jan Reedy

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
