Re: streaming a file object through re.finditer
I did try to see if I could get that to work, but I couldn't figure it out. I'll see if I can play around more with that API. So say I did investigate a little more to see how much work it would take to adapt the re module to accept an iterator (while leaving the current string API as another code path). Depending on how complicated a change this would be, how much interest would there be from other people in using this feature? From what I understand about regular expressions, they're essentially stream processing and don't need backtracking, so reading from an iterator should work too (right?). Thanks, -e -- http://mail.python.org/mailman/listinfo/python-list
Re: streaming a file object through re.finditer
On Wed, 2 Feb 2005 22:22:27 -0500, rumours say that Daniel Bickett <[EMAIL PROTECTED]> might have written:

>Erick wrote:
>> True, but it doesn't work with multiline regular expressions :(
>If your intent is for the expression to traverse multiple lines (and
>possibly match *across* multiple lines,) then, as far as I know, you
>have no choice but to load the whole file into memory.

*If* the OP knows that their multiline re won't match more than, say, 4 lines at a time, the code attached at the end of this post could be useful. Usage:

for group_of_lines in line_groups(fileobj, line_count=4):
    # bla bla

The OP should take care to ignore duplicate matches as the n-line window scans through the input file; e.g. if your re searches for '3\n4', it will match 3 times in the first example of my code.

|import collections
|
|def line_groups(fileobj, line_count=2):
|    iterator = iter(fileobj)
|    group = collections.deque()
|    joiner = ''.join
|
|    try:
|        while len(group) < line_count:
|            group.append(iterator.next())
|    except StopIteration:
|        # fewer lines than one full window: yield what we have
|        yield joiner(group)
|        return
|
|    yield joiner(group)
|
|    for line in iterator:
|        group.append(line)
|        del group[0]
|        yield joiner(group)
|
|if __name__ == "__main__":
|    import os, tempfile
|
|    # create two temp files for 4-line groups
|
|    # write n+3 lines in first file
|    testname1 = tempfile.mktemp()  # deprecated & insecure, but OK for this test
|    testfile = open(testname1, "w")
|    testfile.write('\n'.join(map(str, range(7))))
|    testfile.close()
|
|    # write n-2 lines in second file
|    testname2 = tempfile.mktemp()
|    testfile = open(testname2, "w")
|    testfile.write('\n'.join(map(str, range(2))))
|    testfile.close()
|
|    # now iterate over four-line groups
|
|    for bunch_o_lines in line_groups(open(testname1), line_count=4):
|        print repr(bunch_o_lines),
|    print
|
|    for bunch_o_lines in line_groups(open(testname2), line_count=4):
|        print repr(bunch_o_lines),
|    print
|
|    os.remove(testname1); os.remove(testname2)

--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving."
(from RFC1958) I really should keep that in mind when talking with people, actually... -- http://mail.python.org/mailman/listinfo/python-list
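[Editor's sketch] The duplicate-match caveat above can be handled by reporting a match only from the window whose first line contains its start. A hedged sketch in modern Python 3 (`finditer_windowed` is my name, not from the thread; it assumes no single match spans more than `window` lines):

```python
import collections
import io
import re

def finditer_windowed(pattern, lines, window=4):
    """Scan an iterable of lines through a sliding window of `window`
    lines, yielding each match text exactly once.  A match is reported
    only from the window whose first line it starts in, which filters
    the duplicate hits that overlapping windows otherwise produce."""
    regex = re.compile(pattern)
    group = collections.deque(maxlen=window)  # auto-drops oldest line
    for line in lines:
        group.append(line)
        if len(group) == window:
            text = "".join(group)
            for m in regex.finditer(text):
                if m.start() < len(group[0]):  # starts in first line
                    yield m.group()
    if len(group) < window:
        # Input shorter than one window: scan it all in one pass.
        for m in regex.finditer("".join(group)):
            yield m.group()
        return
    # Trailing lines never appeared as a window's first line; drain them.
    while len(group) > 1:
        group.popleft()
        text = "".join(group)
        for m in regex.finditer(text):
            if m.start() < len(group[0]):
                yield m.group()

# The '3\n4' example from the post: reported once instead of 3 times.
hits = list(finditer_windowed("3\n4", io.StringIO("0\n1\n2\n3\n4\n5\n6")))
print(hits)  # ['3\n4']
```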
Re: streaming a file object through re.finditer
Erick wrote:
> Hello, I've been looking for a while for an answer, but so far I
> haven't been able to turn anything up yet. Basically, what I'd like to
> do is to use re.finditer to search a large file (or a file stream),
> but I haven't figured out how to get finditer to work without loading
> the entire file into memory, or just reading one line at a time (or
> more complicated buffering).

Can you use mmap? http://docs.python.org/lib/module-mmap.html

"You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file."

Seems applicable, and it should keep your memory use down, but I'm not very experienced with it...

Steve

-- http://mail.python.org/mailman/listinfo/python-list
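[Editor's sketch] A minimal, self-contained version of the mmap suggestion in modern Python 3 (the temp-file setup is only there to make it runnable; note that in Python 3 searching an mmap requires a bytes pattern):

```python
import mmap
import os
import re
import tempfile

# Create a small file so the example is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("a b c\n")

with open(path, "rb") as f:
    # Map the whole file read-only; re searches the mapped buffer
    # without the file being copied into a Python string.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    words = [m.group().decode() for m in re.finditer(rb"\w+", mm)]
    mm.close()

os.remove(path)
print(words)  # ['a', 'b', 'c']
```

The OS pages the file in and out on demand, so memory use stays low even for large files, and multiline patterns work because re sees the whole mapped buffer.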
Re: streaming a file object through re.finditer
Erick wrote: > True, but it doesn't work with multiline regular expressions :( If your intent is for the expression to traverse multiple lines (and possibly match *across* multiple lines,) then, as far as I know, you have no choice but to load the whole file into memory. -- Daniel Bickett dbickett at gmail.com http://heureusement.org/ -- http://mail.python.org/mailman/listinfo/python-list
Re: streaming a file object through re.finditer
Is it not possible to wrap your loop below within a loop doing file.read([size]) (or readline() or readlines([size])), reading the file a chunk at a time, then running your re on a per-chunk basis?

-ej

"Erick" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> Ack, typo. What I meant was this:
>
> cat a b c > blah
>
> >>> import re
> >>> for m in re.finditer('\w+', file('blah')):
> ...     print m.group()
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: buffer object expected
>
> Of course, this works fine, but it loads the file completely into
> memory (right?):
>
> >>> for m in re.finditer('\w+', file('blah').read()):
> ...     print m.group()
> ...
> a
> b
> c

-- http://mail.python.org/mailman/listinfo/python-list
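[Editor's sketch] The chunk-at-a-time idea can be made safe against matches that straddle a chunk boundary by holding back any match that ends exactly at the end of the buffer. A hedged sketch in modern Python 3 (`finditer_stream` and the hold-back rule are my additions, not from the thread; it suits patterns like \w+ whose matches don't always run to the end of the input, and a real version would cap the buffer when nothing matches):

```python
import io
import re

def finditer_stream(pattern, fileobj, chunk_size=1 << 16):
    """Yield matches of `pattern` from `fileobj`, reading one chunk at
    a time.  A match that ends exactly at the buffer boundary is held
    back, since it might continue into the next chunk."""
    regex = re.compile(pattern)
    buf = ""
    eof = False
    while not eof:
        chunk = fileobj.read(chunk_size)
        eof = not chunk
        buf += chunk
        last_end = 0
        for m in regex.finditer(buf):
            if m.end() == len(buf) and not eof:
                break  # may extend into the next chunk; retry later
            yield m.group()
            last_end = m.end()
        buf = buf[last_end:]  # keep only the unconsumed tail

# A tiny chunk_size just to exercise the boundary handling.
words = list(finditer_stream(r"\w+", io.StringIO("a b c"), chunk_size=2))
print(words)  # ['a', 'b', 'c']
```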
Re: streaming a file object through re.finditer
True, but it doesn't work with multiline regular expressions :( -e -- http://mail.python.org/mailman/listinfo/python-list
Re: streaming a file object through re.finditer
Ack, typo. What I meant was this:

cat a b c > blah

>>> import re
>>> for m in re.finditer('\w+', file('blah')):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into memory (right?):

>>> for m in re.finditer('\w+', file('blah').read()):
...     print m.group()
...
a
b
c

-- http://mail.python.org/mailman/listinfo/python-list
Re: streaming a file object through re.finditer
The following example loads the file into memory only one line at a time, so it should suit your purposes:

>>> data = file( "important.dat" , "w" )
>>> data.write("this\nis\nimportant\ndata")
>>> data.close()

now read it

>>> import re
>>> data = file( "important.dat" , "r" )
>>> line = data.readline()
>>> while line:
...     for x in re.finditer( "\w+" , line):
...         print x.group()
...     line = data.readline()
...
this
is
important
data
>>>

--
Daniel Bickett
dbickett at gmail.com
http://heureusement.org/

-- http://mail.python.org/mailman/listinfo/python-list
streaming a file object through re.finditer
Hello, I've been looking for a while for an answer, but so far I haven't been able to turn anything up yet. Basically, what I'd like to do is to use re.finditer to search a large file (or a file stream), but I haven't figured out how to get finditer to work without loading the entire file into memory, or just reading one line at a time (or more complicated buffering). For example, say I do this:

cat a b c > blah

Then run this python script:

>>> import re
>>> for m in re.finditer('\w+', buffer(file('blah'))):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into memory (right?):

>>> for m in re.finditer('\w+', buffer(file('blah').read())):
...     print m.group()
...
a
b
c

So, is there any way to do this? Thanks, -e

-- http://mail.python.org/mailman/listinfo/python-list