On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote:
> I'd like to use the re module to parse a long text file, 1GB in size. I
> wish that the re module could parse a stream, so I wouldn't have to load
> the whole thing into memory. I'd like to iterate over matches from the
> stream without keeping the old matches and input in RAM.
>
> What do you think?
This has frustrated me too. The case where I've encountered this is parsing HTTP/1.1. We have data coming in incrementally over the network, and we want to find the end of the headers. To do this, we're looking for the first occurrence of b"\r\n\r\n" OR b"\n\n".

So our requirements are:

1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
2. If there's no match yet, wait for more data to arrive and try again
3. When more data arrives, start searching again *where the last search
   left off*

The last requirement is subtle, but important. The naive approach would be to rescan your whole receive buffer after each new packet arrives:

    end_of_headers = re.compile(b"\r\n\r\n|\n\n")
    while True:
        m = end_of_headers.search(receive_buffer)
        if m is None:
            receive_buffer += await get_more_data_from_network()
            # loop around and try again
        else:
            break

But the code above is quadratic! If the headers are N bytes long, then on each pass through the loop we perform an O(N) regex search, and we do O(N) passes through the loop, so the whole thing is O(N**2). That means your HTTP client-or-server can be trivially DoSed by a peer who sends their headers broken into lots of small fragments.

Fortunately, there's an elegant and natural solution: just save the regex engine's internal state when it hits the end of the string, and then when more data arrives, use the saved state to pick up the search where we left off. Theoretically, any regex engine *could* support this – it's especially obvious for DFA-based matchers, but even backtrackers like Python's re could support it, basically by making the matching engine a coroutine that can suspend itself when it hits the end of the input, then resume when new input arrives. Like, if you asked Knuth for the theoretically optimal design for this parser, I'm pretty sure this is what he'd tell you to use, and it's what people do when writing high-performance HTTP parsers in C.
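To make the pause/resume idea concrete, here's a hypothetical sketch of a resumable matcher for this one specific pattern, written as a tiny hand-rolled state machine instead of re (the class name and API are my own invention, and this is exactly the kind of character-by-character code that's too slow in pure Python, which is why we'd want the engine to do it for us):

```python
class EndOfHeadersScanner:
    r"""Hypothetical resumable scanner for b"\r\n\r\n|\n\n".

    The match state survives across feed() calls, so each input byte is
    examined exactly once: O(N) total, no matter how the data is split
    into chunks.
    """
    # States: 0 = nothing, 1 = seen "\r", 2 = seen "\r\n",
    #         3 = seen "\r\n\r", 5 = seen a bare "\n"
    def __init__(self):
        self.state = 0

    def feed(self, data):
        """Scan a new chunk; return the offset just past the terminator
        within this chunk, or None meaning "feed me more data"."""
        for i, byte in enumerate(data):
            if byte == 0x0D:  # "\r"
                self.state = 3 if self.state == 2 else 1
            elif byte == 0x0A:  # "\n"
                if self.state in (2, 3, 5):  # completes "\n\n" or "\r\n\r\n"
                    self.state = 0
                    return i + 1
                self.state = 2 if self.state == 1 else 5
            else:
                self.state = 0
        return None  # paused at end of input; resumable via the saved state
```

The terminator is found even when it straddles a chunk boundary, because the partial-match state is kept between feed() calls rather than being thrown away.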
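And here's a sketch of the linear-time workaround using plain re (find_end_of_headers and MAX_MATCH_LEN are hypothetical names of mine): since this pattern can match at most 4 bytes, any match that wasn't there before new data arrived must start within the last 3 old bytes, so we can hand re a search start position instead of rescanning the whole buffer:

```python
import re

end_of_headers = re.compile(b"\r\n\r\n|\n\n")
MAX_MATCH_LEN = 4  # longest string the pattern can possibly match

def find_end_of_headers(chunks):
    """Feed chunks in order; return (headers, leftover) once the
    terminator appears, or None if it never does."""
    receive_buffer = bytearray()
    for chunk in chunks:
        # A new match must start in the last MAX_MATCH_LEN - 1 old bytes
        # or in the new chunk, so skip everything before that.
        start = max(0, len(receive_buffer) - (MAX_MATCH_LEN - 1))
        receive_buffer += chunk
        m = end_of_headers.search(receive_buffer, start)
        if m is not None:
            return (bytes(receive_buffer[:m.end()]),
                    bytes(receive_buffer[m.end():]))
    return None
```

Now each byte of old data is rescanned at most a constant number of times, so the total work is O(N) instead of O(N**2) – but only because we analyzed this particular regex by hand.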
But unfortunately, in reality, re *doesn't* support this kind of pause/resume functionality, and you can't write efficient character-by-character algorithms in pure Python, so you have to use really awkward hacks instead.

For the HTTP header case, the best I've been able to come up with is to manually analyze the regex to figure out the maximum size string it could match (in this case, 4 bytes), and then write a loop that tracks how long the buffer was before the last time we appended new data, and on each iteration searches the substring receive_buffer[old_length - 4 + 1:]. This is super finicky, and especially annoying if you want to offer this as a generic API for using regexes to deconstruct network streams. (There are a lot of Python network libraries that have accidentally-quadratic parsers in them.)

In practice I suspect retrofitting this functionality into 're' would be a lot of work. But it's definitely frustrating that we have 90% of the machinery we'd need to do things the natural/efficient way, but are then thwarted by this arbitrary API limitation.

-n

--
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/