On Sun, Oct 7, 2018 at 5:09 PM, Terry Reedy <tjre...@udel.edu> wrote:
> On 10/6/2018 5:00 PM, Nathaniel Smith wrote:
>>
>> On Sat, Oct 6, 2018 at 12:22 AM, Ram Rachum <r...@rachum.com> wrote:
>>>
>>> I'd like to use the re module to parse a long text file, 1GB in size. I
>>> wish that the re module could parse a stream, so I wouldn't have to load
>>> the whole thing into memory. I'd like to iterate over matches from the
>>> stream without keeping the old matches and input in RAM.
>>>
>>> What do you think?
>>
>>
>> This has frustrated me too.
>>
>> The case where I've encountered this is parsing HTTP/1.1. We have data
>> coming in incrementally over the network, and we want to find the end
>> of the headers. To do this, we're looking for the first occurrence of
>> b"\r\n\r\n" OR b"\n\n".
>>
>> So our requirements are:
>>
>> 1. Search a bytearray for the regex b"\r\n\r\n|\n\n"
>
>
> I believe that re is both overkill and slow for this particular problem.
> For O(n), search forward for \n with str.index('\n') (or .find)
> [I assume that this searches forward faster than
> for i, c in enumerate(s):
>     if c == '\n': break
> and leave you to test this.]
>
> If not found, continue with next chunk of data.
> If found, look back for \r to determine whether to look forward for \n or
> \r\n *whenever there is enough data to do so.

Are you imagining something roughly like this? (Ignoring chunk
boundary handling for the moment.)

def find_double_line_end(buf):
    start = 0
    while True:
        next_idx = buf.index(b"\n", start)
        if (buf[next_idx - 1:next_idx + 1] == b"\n\n"
                or buf[next_idx - 3:next_idx] == b"\r\n\r"):
            return next_idx
        start = next_idx + 1

That's much more complicated than using re.search, and on some random
HTTP headers I have lying around it benchmarks ~70% slower too. Which
makes sense, since we're basically trying to replicate the re engine's
work by hand in a slower language.

BTW, if we only want to find a fixed string like b"\r\n\r\n", then
re.search and bytearray.index are almost identical in speed. If you
have a problem that can be expressed as a regular expression, then
regular expression engines are actually pretty good at solving those
:-)

-n

-- 
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
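
For comparison, the re.search version that the message argues for might
look roughly like the sketch below. It is not code from the thread: the
names _HEADER_END and find_header_end and the sample request are
hypothetical, and chunk boundary handling is ignored here too. It only
shows the regex from the quoted requirements applied to a bytearray.

```python
import re

# Hypothetical name. The pattern is the one quoted in the thread:
# either CRLF CRLF or LF LF marks the end of the headers.
_HEADER_END = re.compile(rb"\r\n\r\n|\n\n")


def find_header_end(buf):
    """Return the index just past the blank line that ends the headers.

    Returns -1 if no terminator is present in buf yet. buf may be any
    bytes-like object, such as bytes or bytearray.
    """
    match = _HEADER_END.search(buf)
    return match.end() if match else -1


# Made-up sample data for illustration only.
sample = bytearray(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\nbody")
end = find_header_end(sample)
headers, rest = sample[:end], sample[end:]
```

Unlike the hand-written loop, this returns -1 instead of raising
ValueError when the terminator has not arrived yet, which is one
possible convention for "keep reading and try again".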