On Sun, Oct 7, 2018 at 9:54 AM Nathaniel Smith <n...@pobox.com> wrote:
>
> On Sat, Oct 6, 2018 at 2:04 PM, Chris Angelico <ros...@gmail.com> wrote:
> > On Sun, Oct 7, 2018 at 8:01 AM Nathaniel Smith <n...@pobox.com> wrote:
> >>
> >> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote:
> >> > I'd like to use the re module to parse a long text file, 1GB in size. I
> >> > wish that the re module could parse a stream, so I wouldn't have to load
> >> > the whole thing into memory. I'd like to iterate over matches from the
> >> > stream without keeping the old matches and input in RAM.
> >> >
> >> > What do you think?
> >>
> >> This has frustrated me too.
> >>
> >> The case where I've encountered this is parsing HTTP/1.1. We have data
> >> coming in incrementally over the network, and we want to find the end
> >> of the headers. To do this, we're looking for the first occurrence of
> >> b"\r\n\r\n" OR b"\n\n".
> >>
> >> So our requirements are:
> >>
> >> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
> >> 2. If there's no match yet, wait for more data to arrive and try again
> >> 3. When more data arrives, start searching again *where the last search left off*
> >>
> >> The last requirement is subtle, but important. The naive approach
> >> would be to rescan your whole receive buffer after each new packet
> >> arrives:
> >>
> >> end_of_headers = re.compile(b"\r\n\r\n|\n\n")
> >> while True:
> >>     m = end_of_headers.search(receive_buffer)
> >>     if m is None:
> >>         receive_buffer += await get_more_data_from_network()
> >>         # loop around and try again
> >>     else:
> >>         break
> >>
> >> But the code above is quadratic! If the headers are N bytes long, then
> >> on each pass through the loop we perform an O(N) regex search, and we
> >> do O(N) passes through the loop, so the whole thing is O(N**2). That
> >> means your HTTP client-or-server can be trivially DoSed by a peer who
> >> sends their headers broken into lots of small fragments.
> >
> > Quadratic in the size of the headers only, so you could just cap it -
> > if the receive buffer is too large, just reject it. Sometimes, the
> > simplest approach is the best.
>
> But OTOH, every problem has a solution that's simple, obvious, and wrong :-).
>
> Of course you set a cap on the header size, to prevent other kinds of
> DoS (e.g. memory exhaustion). But it turns out people stuff a lot of
> data into HTTP headers [1], so if the cap is large enough to support
> non-malicious usage, then it's also large enough to let people DoS the
> naive O(N**2) algorithm. Production-quality HTTP/1.1 parsers really do
> have to use an O(N) algorithm here.
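For this specific pattern, whose matches are at most four bytes long, requirement
3 can be approximated with re as it stands today: remember how far the previous
search got and rescan only the new data plus a three-byte overlap. A rough,
untested sketch (the "chunks" iterable just stands in for data arriving from the
network; this trick does not generalize to patterns whose matches can be
arbitrarily long):

    import re

    END_OF_HEADERS = re.compile(b"\r\n\r\n|\n\n")
    # The longest possible match is 4 bytes (b"\r\n\r\n"), so a new match can
    # reach back at most 3 bytes into data that has already been scanned.
    OVERLAP = 3

    def find_end_of_headers(chunks):
        receive_buffer = bytearray()
        scanned = 0  # receive_buffer[:scanned] is known to contain no match
        for chunk in chunks:
            receive_buffer += chunk
            m = END_OF_HEADERS.search(receive_buffer, max(0, scanned - OVERLAP))
            if m is not None:
                return m.end()  # offset just past the end of the headers
            scanned = len(receive_buffer)
        return None  # stream ended before the headers did

    # Each pass scans only the new chunk plus a 3-byte overlap, so the total
    # work is O(N) even when the headers arrive as many tiny fragments.
    print(find_end_of_headers([b"Host: example", b".com\r\n", b"\r\nrest"]))  # -> 21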
Fair enough! You could instead use a simple linear parser rather than
searching for the end of headers first. That way, all you're looking for
is a \n, no regex needed. Sometimes the VERY simplest solution doesn't
work, but I still do like to keep things as simple as possible :)

ChrisA
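A rough, untested sketch of the kind of linear parser described above: just look
for b"\n", no regex, and stop at the first blank line. The function name and the
"chunks" iterable are made up for illustration and not taken from any real
library:

    def read_headers(chunks):
        headers = []
        partial = bytearray()  # bytes of the current, not-yet-terminated line
        for chunk in chunks:
            start = 0
            while True:
                nl = chunk.find(b"\n", start)
                if nl == -1:
                    partial += chunk[start:]  # keep the incomplete tail
                    break
                line = (bytes(partial) + chunk[start:nl]).rstrip(b"\r")
                partial.clear()
                start = nl + 1
                if not line:  # blank line: end of the headers
                    return headers
                name, _, value = line.partition(b":")
                headers.append((name.strip(), value.strip()))
        raise ValueError("stream ended before the end of the headers")

    # Every byte is examined a bounded number of times, so this stays linear
    # no matter how the peer fragments the headers.
    print(read_headers([b"Host: exam", b"ple.com\r\nAccept: */*\r\n", b"\r\nbody"]))
    # -> [(b'Host', b'example.com'), (b'Accept', b'*/*')]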