On Aug 27, 10:45 am, Roy Smith <[email protected]> wrote:
> In article <[email protected]>,
> Steven D'Aprano <[email protected]> wrote:
>
> > open("file.txt") # opens the file
> > .read() # reads the contents of the file
> > .split("\n\n") # splits the text on double-newlines.
>
> The biggest problem with this code is that read() slurps the entire file
> into a string. That's fine for moderately sized files, but will fail
> (or at least be grossly inefficient) for very large files.
>
> It's always annoyed me a little that while it's easy to iterate over the
> lines of a file, it's more complicated to iterate over a file character
> by character. You could write your own generator to do that:
>
> for c in getchar(open("file.txt")):
>     whatever
>
> def getchar(f):
>     for line in f:
>         for c in line:
>             yield c
>
> but that's annoyingly verbose (and probably not hugely efficient).
read() takes an optional size argument, so f.read(1) is another
option...
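For instance, with the two-argument form of iter() and a little
chars() helper of my own (just a sketch -- single-byte reads will be
much slower than reading a line at a time):

from functools import partial

def chars(f):
    # iter(callable, sentinel) calls f.read(1) repeatedly, stopping
    # when it returns '' at end-of-file
    return iter(partial(f.read, 1), '')

for c in chars(open('file.txt')):
    pass  # do something with c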
>
> Of course, the next problem for the specific problem at hand is that
> even with an iterator over the characters of a file, split() only works
> on strings. It would be nice to have a version of split which took an
> iterable and returned an iterator over the split components. Maybe
> there is such a thing and I'm just missing it?
I don't know of such a thing, but for the OP's problem you could read
the file in chunks, e.g.:
def readgroup(f, delim, buffsize=8192):
    # Carry any partial group across buffer boundaries
    tail = ''
    while True:
        s = f.read(buffsize)
        if not s:
            # End of file: whatever is left over is the final group
            yield tail
            break
        groups = (tail + s).split(delim)
        # The last piece may be incomplete, so save it for next time
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

for group in readgroup(open('file.txt'), '\n\n'):
    pass  # do something with each group
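Though if you did want a split() over an arbitrary iterable of
characters, something along these lines might do -- a quick isplit()
sketch of my own, which mirrors str.split()'s left-to-right,
non-overlapping behaviour but hasn't been tested exhaustively:

def isplit(chars, delim):
    # Accumulate characters; whenever the last len(delim) characters
    # equal the delimiter, emit everything before it as one group.
    group = []
    n = len(delim)
    for c in chars:
        group.append(c)
        if group[-n:] == list(delim):
            yield ''.join(group[:-n])
            group = []
    # Whatever follows the last delimiter is the final group
    yield ''.join(group)

With Roy's getchar() above you could then write
isplit(getchar(open('file.txt')), '\n\n'), although the chunked
readgroup() version should be a good deal faster.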
Cheers - Chas