Re: streaming a file object through re.finditer

2005-02-03 Thread Erick
I did try to see if I could get that to work, but I couldn't figure it
out. I'll see if I can play around with that API some more.

So say I did investigate a little more to see how much work it would
take to adapt the re module to accept an iterator (while leaving the
current string API as a separate code path). Depending on how complicated
a change this would be, how much interest would there be in other
people using this feature? From what I understand about regular
expressions, they're essentially stream processing and don't need
backtracking, so reading from an iterator should work too (right?).

Thanks,

-e

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: streaming a file object through re.finditer

2005-02-03 Thread TZOTZIOY
On Wed, 2 Feb 2005 22:22:27 -0500, rumours say that Daniel Bickett
<[EMAIL PROTECTED]> might have written:

>Erick wrote:
>> True, but it doesn't work with multiline regular expressions :(

>If your intent is for the expression to traverse multiple lines (and
>possibly match *across* multiple lines,) then, as far as I know, you
>have no choice but to load the whole file into memory.

*If* the OP knows that their multiline re won't match more than, say, 4 lines at
a time, the code attached at the end of this post could be useful.  Usage:

for group_of_lines in line_groups(fileobj, line_count=4):
    # bla bla

The OP should take care to ignore multiple matches as the n-line window scans
through the input file; e.g. if your re searches for '3\n4', it will match 3
times in the first example of my code.

|import collections
|
|def line_groups(fileobj, line_count=2):
|    iterator = iter(fileobj)
|    group = collections.deque()
|    joiner = ''.join
|
|    try:
|        while len(group) < line_count:
|            group.append(iterator.next())
|    except StopIteration:
|        # fewer than line_count lines in total: yield what we have
|        yield joiner(group)
|        return
|
|    yield joiner(group)  # the first full window
|    for line in iterator:
|        group.append(line)
|        del group[0]
|        yield joiner(group)
|
|if __name__ == "__main__":
|    import os, tempfile
|
|    # create two temp files for 4-line groups
|
|    # write n+3 lines in the first file
|    testname1 = tempfile.mktemp()  # deprecated & insecure, but ok for this test
|    testfile = open(testname1, "w")
|    testfile.write('\n'.join(map(str, range(7))))
|    testfile.close()
|
|    # write n-2 lines in the second file
|    testname2 = tempfile.mktemp()
|    testfile = open(testname2, "w")
|    testfile.write('\n'.join(map(str, range(2))))
|    testfile.close()
|
|    # now iterate over four-line groups
|
|    for bunch_o_lines in line_groups(open(testname1), line_count=4):
|        print repr(bunch_o_lines),
|    print
|
|    for bunch_o_lines in line_groups(open(testname2), line_count=4):
|        print repr(bunch_o_lines),
|    print
|
|    os.remove(testname1)
|    os.remove(testname2)
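A sketch of that deduplication, in modern Python syntax (the helper name unique_matches and the offset-set approach are mine, not from the thread): each match is keyed by its absolute character offset in the stream, so a hit that appears in several overlapping windows is reported only once. One caveat: a greedy match can be reported before later lines would have extended it.

```python
import io
import re
from collections import deque

def unique_matches(lines, pattern, line_count=4):
    """Slide a window of line_count lines over the stream and yield
    (absolute_offset, text) once per match, deduplicated by offset."""
    regex = re.compile(pattern)
    window = deque()
    base = 0      # absolute offset of window[0] in the stream
    seen = set()  # absolute start offsets already reported
    for line in lines:
        window.append(line)
        if len(window) > line_count:
            base += len(window.popleft())
        for m in regex.finditer(''.join(window)):
            start = base + m.start()
            if start not in seen:
                seen.add(start)
                yield start, m.group()

# Seven numbered lines, as in the example above: '3\n4' occurs in three
# overlapping windows but is reported only once.
stream = io.StringIO('\n'.join(map(str, range(7))) + '\n')
hits = list(unique_matches(stream, '3\n4', line_count=4))
print(hits)  # [(6, '3\n4')]
```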

-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...


Re: streaming a file object through re.finditer

2005-02-02 Thread Steven Bethard
Erick wrote:
> Hello,
> I've been looking for a while for an answer, but so far I haven't been
> able to turn anything up yet. Basically, what I'd like to do is to use
> re.finditer to search a large file (or a file stream), but I haven't
> figured out how to get finditer to work without loading the entire file
> into memory, or just reading one line at a time (or more complicated
> buffering).
Can you use mmap?
http://docs.python.org/lib/module-mmap.html
"You can use mmap objects in most places where strings are expected; for 
example, you can use the re module to search through a memory-mapped file."

Seems applicable, and it should keep your memory use down, but I'm not 
very experienced with it...
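For concreteness, Steve's suggestion might look like this in modern Python (a hedged sketch: when matching against a mmap the pattern must be bytes, and the temp file below just stands in for the OP's 'blah'):

```python
import mmap
import os
import re
import tempfile

# A small stand-in for the OP's 'blah' file.
fd, path = tempfile.mkstemp()
os.write(fd, b'a\nb\nc\n')
os.close(fd)

with open(path, 'rb') as f:
    # length 0 maps the whole file; ACCESS_READ keeps the mapping read-only
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # re accepts buffer-like objects such as mmap, so the OS pages the
        # file in as the scan proceeds instead of read()-ing it all at once
        words = [m.group() for m in re.finditer(rb'\w+', mm)]

os.remove(path)
print(words)  # [b'a', b'b', b'c']
```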

Steve


Re: streaming a file object through re.finditer

2005-02-02 Thread Daniel Bickett
Erick wrote:
> True, but it doesn't work with multiline regular expressions :(

If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines,) then, as far as I know, you
have no choice but to load the whole file into memory.

-- 
Daniel Bickett
dbickett at gmail.com
http://heureusement.org/


Re: streaming a file object through re.finditer

2005-02-02 Thread Erik Johnson

Is it not possible to wrap your loop below within a loop doing
file.read([size]) (or readline(), or readlines([size])), reading the
file a chunk at a time and then running your re on a per-chunk basis?
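The catch with plain chunking is a match that straddles a chunk boundary. One way around it, sketched in modern Python (finditer_stream is a hypothetical name, not a stdlib function): keep an `overlap`-sized tail of the buffer, and defer any match that runs into that tail until the next read, so boundary matches are neither cut in half nor reported twice.

```python
import io
import re

def finditer_stream(fileobj, pattern, chunk_size=8192, overlap=64):
    """Yield (absolute_offset, text) for each match, reading the stream
    chunk by chunk.  Matches reaching into the last `overlap` characters
    of the buffer are rescanned after the next read."""
    regex = re.compile(pattern)
    buf = ''
    base = 0  # absolute offset of buf[0] in the stream
    while True:
        chunk = fileobj.read(chunk_size)
        buf += chunk
        # once the stream is exhausted, the whole buffer is final
        cutoff = len(buf) if not chunk else max(len(buf) - overlap, 0)
        keep_from = cutoff
        for m in regex.finditer(buf):
            if chunk and m.end() > cutoff:
                # too close to the boundary: it may continue into the
                # next chunk, so keep it buffered and retry next round
                keep_from = min(keep_from, m.start())
                break
            yield base + m.start(), m.group()
        base += keep_from
        buf = buf[keep_from:]
        if not chunk:
            return

# Tiny chunks on purpose, to force matches across chunk boundaries.
out = list(finditer_stream(io.StringIO("a bb ccc dddd\n"), r'\w+',
                           chunk_size=4, overlap=2))
print(out)  # [(0, 'a'), (2, 'bb'), (5, 'ccc'), (9, 'dddd')]
```

An in-progress match keeps its text buffered, so memory stays bounded only if no single match grows without limit.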

-ej


"Erick" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Ack, typo. What I meant was this:
> cat a b c > blah
>
> >>> import re
> >>> for m in re.finditer('\w+', file('blah')):
>
> ...   print m.group()
> ...
> Traceback (most recent call last):
> File "", line 1, in ?
> TypeError: buffer object expected
>
> Of course, this works fine, but it loads the file completely into
> memory (right?):
> >>> for m in re.finditer('\w+', file('blah').read()):
> ...   print m.group()
> ...
> a
> b
> c
>




Re: streaming a file object through re.finditer

2005-02-02 Thread Erick
True, but it doesn't work with multiline regular expressions :(

-e



Re: streaming a file object through re.finditer

2005-02-02 Thread Erick
Ack, typo. What I meant was this:
cat a b c > blah

>>> import re
>>> for m in re.finditer('\w+', file('blah')):

...   print m.group()
...
Traceback (most recent call last):
File "", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):
>>> for m in re.finditer('\w+', file('blah').read()):
...   print m.group()
...
a
b
c



Re: streaming a file object through re.finditer

2005-02-02 Thread Daniel Bickett
The following example loads the file into memory only one line at a
time, so it should suit your purposes:

>>> data = file( "important.dat" , "w" )
>>> data.write("this\nis\nimportant\ndata")
>>> data.close()

now read it

>>> import re
>>> data = file( "important.dat" , "r" )
>>> line = data.readline()
>>> while line:
...     for x in re.finditer( "\w+" , line):
...         print x.group()
...     line = data.readline()
... 
this
is
important
data
>>> 


-- 
Daniel Bickett
dbickett at gmail.com
http://heureusement.org/


streaming a file object through re.finditer

2005-02-02 Thread Erick
Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

For example, say I do this:
cat a b c > blah

Then run this python script:
>>> import re
>>> for m in re.finditer('\w+', buffer(file('blah'))):
...   print m.group()
...
Traceback (most recent call last):
File "", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):
>>> for m in re.finditer('\w+', buffer(file('blah').read())):
...   print m.group()
...
a
b
c

So, is there any way to do this?

Thanks,

-e
