> Dave Angel <[email protected]> wrote (and I agreed with):
>> I'd suggest you open the file twice, and get two file objects. Then you
>> can iterate over them independently.
On Sep 18, 2013, at 9:09 AM, Oscar Benjamin wrote:
> There's no need to use OS resources by opening the file twice or to
> screw up the IO caching with seek().
There's no reason NOT to use OS resources; that's what the OS is there for: to
make life easier on application programmers. Opening a file twice costs almost
nothing. File descriptors are almost as cheap as whitespace.
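For concreteness, here's a sketch of what the two-file-object approach can look
like (the filename, the "*" marker, and the three-line lookahead are borrowed
from the example quoted below; scan_with_two_handles is just an illustrative
name). It uses readline() rather than iteration so that tell() remains legal on
a Python 3 text file:

```python
def scan_with_two_handles(path, marker="*", lookahead=3):
    # Open the same file twice: two independent file objects, two
    # cheap file descriptors.  readline() (not a for-loop) keeps
    # tell() usable on a Python 3 text file object.
    out = []
    with open(path) as f, open(path) as g:
        while True:
            line = f.readline()
            if not line:
                break
            out.append(line)
            if marker in line:
                g.seek(f.tell())  # jump the second handle just past f
                out.extend("    " + g.readline()
                           for _ in range(lookahead))
    return out
```

Because only the second handle is advanced during the peek, the main loop sees
the peeked lines again afterwards, which matches the tee() version's behaviour.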
> Peter's version holds just as many lines as is necessary in an
> internal Python buffer and performs the minimum possible
> amount of IO.
I believe by "Peter's version", you're talking about:
> from itertools import islice, tee
>
> with open("tmp.txt") as f:
>     while True:
>         for outer in f:
>             print outer,
>             if "*" in outer:
>                 f, g = tee(f)
>                 for inner in islice(g, 3):
>                     print " ", inner,
>                 break
>         else:
>             break
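(For anyone reading this on Python 3: a sketch of the same pattern, wrapped in
a function that collects lines instead of printing them, purely so it's
self-contained and testable — the tee() chaining is unchanged, and
scan_with_tee is just an illustrative name.)

```python
from itertools import islice, tee

def scan_with_tee(path, marker="*", lookahead=3):
    # On a marker line, tee the iterator, pull the next few lines
    # from one branch, then resume the outer loop on the other
    # branch -- which re-yields those same lines.
    out = []
    with open(path) as f:
        while True:
            for outer in f:
                out.append(outer)
                if marker in outer:
                    f, g = tee(f)
                    out.extend("    " + inner
                               for inner in islice(g, lookahead))
                    break
            else:
                break
    return out
```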
There's this note from
http://docs.python.org/2.7/library/itertools.html#itertools.tee:
> This itertool may require significant auxiliary storage (depending on how
> much temporary data needs to be stored). In general, if one iterator uses
> most or all of the data before another iterator starts, it is faster to use
> list() instead of tee().
I have no idea how that interacts with the pattern above where you call tee()
serially. You're basically doing
with open("my_file") as f:
    while True:
        f, g = tee(f)
Are all of those g's just hanging around, eating up memory, while waiting to be
garbage collected? I have no idea. But I do know that no such problems exist
with the two-file-descriptor version.
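For what it's worth, the docs' list() suggestion maps onto this problem quite
naturally: the lookahead lines are consumed immediately and in full, so
islice() straight off the file object does the job with no tee() at all. (A
sketch, with scan_with_list as an illustrative name; note that it consumes the
peeked lines from f, so — unlike the tee() version — the outer loop does not
see them a second time.)

```python
from itertools import islice

def scan_with_list(path, marker="*", lookahead=3):
    # Pull the lookahead into a plain list instead of tee()-ing.
    # This advances f past the peeked lines, so each peeked line
    # appears once (indented), not twice as in the tee() version.
    out = []
    with open(path) as f:
        for line in f:
            out.append(line)
            if marker in line:
                out.extend("    " + peek
                           for peek in islice(f, lookahead))
    return out
```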
> I would expect this to be more
> efficient as well as less error-prone on Windows.
>
>
> Oscar
>
---
Roy Smith
[email protected]
--
https://mail.python.org/mailman/listinfo/python-list