On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <n...@pobox.com> wrote: > > On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote: > > I'd like to use the re module to parse a long text file, 1GB in size. I wish > > that the re module could parse a stream, so I wouldn't have to load the > > whole thing into memory. I'd like to iterate over matches from the stream > > without keeping the old matches and input in RAM. > > > > What do you think? > > This has frustrated me too. > > The case where I've encountered this is parsing HTTP/1.1. We have data > coming in incrementally over the network, and we want to find the end > of the headers. To do this, we're looking for the first occurrence of > b"\r\n\r\n" OR b"\n\n". > > So our requirements are: > > 1. Search a bytearray for the regex b"\r\n\r\n|\n\n" > 2. If there's no match yet, wait for more data to arrive and try again > 3. When more data arrives, start searching again *where the last > search left off* > > The last requirement is subtle, but important. The naive approach > would be to rescan your whole receive buffer after each new packet > arrives: > > end_of_headers = re.compile(b"\r\n\r\n|\n\n") > while True: > m = end_of_headers.search(receive_buffer) > if m is None: > receive_buffer += await get_more_data_from_network() > # loop around and try again > else: > break > > But the code above is quadratic! If the headers are N bytes long, then > on each pass through the loop we perform an O(N) regex search, and we > do O(N) passes through the loop, so the whole thing is O(N**2). That > means your HTTP client-or-server can be trivially DoSed by a peer who > sends their headers broken into lots of small fragments.
Quadratic in the size of the headers only, so you could just cap it - if the receive buffer is too large, just reject it. Sometimes, the simplest approach is the best. ChrisA _______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/