On 10/6/18 7:25 AM, Ram Rachum wrote:
"This is a regular expression problem, rather than a Python problem."

Do you have evidence for this assertion, beyond the fact that other regex implementations have this limitation? Is there a regex specification somewhere that specifies that streams aren't supported? Is there a fundamental reason that streams aren't supported?


"Can the lexing be done on a line-by-line basis?"

For my use case, it unfortunately can't.

You mentioned earlier that your use case doesn't have to worry about the "a.*b" problem.  Can you tell us more about your scenario?  How would the stream know it had read enough to match or not match? Perhaps that same logic can be used to feed the data in chunks?
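
For instance, if you knew that no match could ever span more than some fixed number of characters, you could feed the stream in overlapping chunks. Untested sketch (finditer_chunked, chunk_size, and overlap are made-up names here, and the length bound is an assumption about your data):

    import re

    def finditer_chunked(pattern, f, chunk_size=1 << 20, overlap=1 << 10):
        # Read chunk_size characters at a time.  Assumes no match is
        # longer than overlap characters, and that a match never needs
        # to grow by consuming text beyond the chunk boundary.
        tail = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            data = tail + chunk
            last_end = 0
            for m in re.finditer(pattern, data):
                yield m  # positions are relative to the current buffer
                last_end = m.end()
            # Carry the unmatched tail forward, in case a match
            # straddles the chunk boundary.
            tail = data[max(last_end, len(data) - overlap):]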

--Ned.


On Sat, Oct 6, 2018 at 1:53 PM Jonathan Fine <jfine2...@gmail.com> wrote:

    Hi Ram

    You wrote:

    > I'd like to use the re module to parse a long text file, 1GB in
    > size. I wish that the re module could parse a stream, so I
    > wouldn't have to load the whole thing into memory. I'd like to
    > iterate over matches from the stream without keeping the old
    > matches and input in RAM.

    This is a regular expression problem, rather than a Python
    problem. A search for
        regular expression large file
    brings up some URLs that might help you, starting with
    
https://stackoverflow.com/questions/23773669/grep-pattern-match-between-very-large-files-is-way-too-slow

    This might also be helpful
    https://svn.boost.org/trac10/ticket/11776

    What will work for your problem depends on the nature of the
    problem you have. The simplest thing that might work is to iterate
    over the file line-by-line, and use a regular expression to extract
    matches from each line.

    In other words, something like this (not tested):

        import re

        pattern = re.compile(r'...')  # whatever you are matching

        def helper(lines):
            for line in lines:
                yield from re.finditer(pattern, line)

        with open('my-big-file.txt') as lines:
            for match in helper(lines):
                ...  # Do your stuff here

    Parsing is not the same as lexing; see
    https://en.wikipedia.org/wiki/Lexical_analysis

    I suggest you use regular expressions ONLY for the lexing phase. If
    you'd like further help, perhaps first ask yourself this: can the
    lexing be done on a line-by-line basis? And if not, why not?

    If line-by-line is not possible, then you'll have to modify the
    helper. At the end of each line there'll be a residue / remainder,
    which you'll have to carry into the next line. In other words, the
    helper will have to record (and update) the state that exists at
    the end of each line. A bit like the 'carry' that is used when
    doing long addition.
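
    For example (again not tested; max_match_len here is an assumed
    upper bound on the length of any single match, which you'd have to
    know for your data):

        import re

        def helper(lines, pattern, max_match_len=1000):
            # Assumes no single match spans more than max_match_len
            # characters, i.e. every match fits inside the carried
            # residue plus the current line.
            carry = ''
            for line in lines:
                text = carry + line
                last_end = 0
                for m in re.finditer(pattern, text):
                    yield m
                    last_end = m.end()
                # The residue: trailing text that might be the start
                # of a match continuing on the next line.
                carry = text[max(last_end, len(text) - max_match_len):]

    Note that match positions will then be relative to the local
    carry-plus-line buffer, not to the file as a whole.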

    I hope this helps.

-- Jonathan



_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
