On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote:
> I'd like to use the re module to parse a long text file, 1GB in size. I wish
> that the re module could parse a stream, so I wouldn't have to load the
> whole thing into memory. I'd like to iterate over matches from the stream
> without keeping the old matches and input in RAM.
>
> What do you think?

This has frustrated me too.

The case where I've encountered this is parsing HTTP/1.1. We have data
coming in incrementally over the network, and we want to find the end
of the headers. To do this, we're looking for the first occurrence of
b"\r\n\r\n" OR b"\n\n".

So our requirements are:

1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last
search left off*

The last requirement is subtle, but important. The naive approach
would be to rescan your whole receive buffer after each new packet
arrives:

import re

end_of_headers = re.compile(b"\r\n\r\n|\n\n")

while True:
    # rescan the *entire* receive buffer from the start on every pass
    m = end_of_headers.search(receive_buffer)
    if m is None:
        receive_buffer += await get_more_data_from_network()
        # loop around and try again
    else:
        break

But the code above is quadratic! If the headers are N bytes long, then
on each pass through the loop we perform an O(N) regex search, and we
do O(N) passes through the loop, so the whole thing is O(N**2). That
means your HTTP client or server can be trivially DoSed by a peer who
sends their headers broken into lots of small fragments.
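
To put rough numbers on it: 50 KB of headers delivered one byte at a
time means ~50,000 passes scanning ~25 KB each on average, i.e. on the
order of a billion bytes examined in total. A quick-and-dirty way to
see the blowup (just a sketch – exact timings will vary):

import re
import time

end_of_headers = re.compile(b"\r\n\r\n|\n\n")
data = b"x" * 50_000 + b"\r\n\r\n"   # one long "header", match only at the end

receive_buffer = bytearray()
start = time.perf_counter()
for i in range(len(data)):
    receive_buffer += data[i:i + 1]   # simulate 1-byte packets
    if end_of_headers.search(receive_buffer):
        break
print("naive rescan:", time.perf_counter() - start, "seconds")

Double the header size and the time roughly quadruples.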

Fortunately, there's an elegant and natural solution: Just save the
regex engine's internal state when it hits the end of the string, and
then when more data arrives, use the saved state to pick up the search
where we left off. Theoretically, any regex engine *could* support
this – it's especially obvious for DFA-based matchers, but even
backtrackers like Python's re could support it, basically by making
the matching engine a coroutine that can suspend itself when it hits
the end of the input and resume when new input arrives. Like, if
you asked Knuth for the theoretically optimal design for this parser,
I'm pretty sure this is what he'd tell you to use, and it's what
people do when writing high-performance HTTP parsers in C.
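
Here's roughly what that looks like if you hand-roll it for this one
specific pattern (just a sketch to show the idea – emphatically not
something re offers today, and the names are made up): a little state
machine that remembers where it is between chunks, so every byte gets
looked at exactly once no matter how the data is fragmented.

class EndOfHeadersScanner:
    # states: 0 = start, 1 = saw b"\r", 2 = saw b"\r\n",
    #         3 = saw b"\r\n\r", 4 = saw b"\n"
    def __init__(self):
        self.state = 0
        self.offset = 0   # total bytes consumed so far

    def feed(self, chunk):
        """Scan one chunk; return the offset just past the match, or None."""
        for byte in chunk:
            self.offset += 1
            if byte == 0x0D:                  # b"\r"
                self.state = 3 if self.state == 2 else 1
            elif byte == 0x0A:                # b"\n"
                if self.state in (2, 3, 4):   # completes b"\n\n" or b"\r\n\r\n"
                    return self.offset
                self.state = 2 if self.state == 1 else 4
            else:
                self.state = 0
        return None                           # no match yet, keep feeding

scanner = EndOfHeadersScanner()
receive_buffer = bytearray()
end_of_headers_at = None
while end_of_headers_at is None:
    chunk = await get_more_data_from_network()
    receive_buffer += chunk                   # still keep the data around...
    end_of_headers_at = scanner.feed(chunk)   # ...but never re-scan it

(Byte-at-a-time loops in pure Python are of course slow; the point is
just how little state you actually need to carry across chunks.)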

But unfortunately, in reality, re *doesn't* support this kind of
pause/resume functionality, and you can't write efficient
character-by-character algorithms in Python, so you have to use really
awkward hacks instead. For the HTTP header case, the best I've been
able to come up with is to manually analyze the regex to figure out
the maximum size string it could match (in this case, 4 bytes), and
then write a loop that tracks how long the buffer was before the last
time we appended new data, and on each iteration searches only the
substring receive_buffer[old_length - 4 + 1:]. This is super finicky,
and especially annoying if you want to offer this as a generic API for
using regexes to deconstruct network streams. (There are a lot of
Python network libraries that have accidentally-quadratic parsers in
them.)
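
Concretely, that workaround loop ends up looking something like this
(a sketch – the 4 is the hand-computed maximum match length for this
particular regex):

import re

end_of_headers = re.compile(b"\r\n\r\n|\n\n")
MAX_MATCH_LEN = 4   # worked out by hand from the regex

receive_buffer = bytearray()
while True:
    old_length = len(receive_buffer)
    receive_buffer += await get_more_data_from_network()
    # a new match must end in the new data, so it can't start more than
    # MAX_MATCH_LEN - 1 bytes before the old end of the buffer
    search_start = max(0, old_length - MAX_MATCH_LEN + 1)
    m = end_of_headers.search(receive_buffer[search_start:])
    if m is not None:
        end_of_headers_at = search_start + m.end()  # back to buffer offsets
        break

Get that arithmetic slightly wrong and you either silently re-introduce
the quadratic behavior or miss matches that straddle a chunk boundary.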

In practice I suspect retrofitting this functionality into 're' would
be a lot of work. But it's definitely frustrating that we have 90% of
the machinery we'd need to do things the natural/efficient way, but
then are thwarted by this arbitrary API limitation.

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
