[issue37036] Iterating a text file by line should not implicitly disable tell

2019-05-27 Thread Jeffrey Kintscher


Change by Jeffrey Kintscher :


--
nosy: +Jeffrey.Kintscher

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37036] Iterating a text file by line should not implicitly disable tell

2019-05-24 Thread Josh Rosenberg


Josh Rosenberg  added the comment:

Left a dangling sentence in there:

"I used two arg iter in both cases to keep the code paths as similar as 
possible so the `telling`."

should read:

"I used iter(f.readline, '') in both cases to keep the code paths as similar as 
possible so the `telling` optimization was tested in isolation."

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37036] Iterating a text file by line should not implicitly disable tell

2019-05-24 Thread Josh Rosenberg

New submission from Josh Rosenberg :

TextIOWrapper explicitly sets the `telling` flag to 0 when .__next__ ( 
textiowrapper_iternext ) is called ( 
https://github.com/python/cpython/blob/3.7/Modules/_io/textio.c#L2974 ), e.g. 
during standard for loops over the file of this form, trying to call tell 
raises an exception:

with open(filename) as myfile:
for line in myfile:
myfile.tell()

which raises:

OSError: telling position disabled by next() call

while the effectively equivalent:

with open(filename) as myfile:
for line in iter(myfile.readline, ''):
myfile.tell()

works fine.

The implementation of __next__ and readline is almost identical (__next__ is 
calling readline and handling the EOF sentinel per the iterator protocol, 
that's all). Given they're implemented identically, I see no reason aside from 
nannying (discouraging slow operations like seek/tell during iteration by 
forbidding them on the most convenient means of iterating) to forbid tell after 
beginning iteration, but not after readline. Given the general Python 
philosophy of "we're all adults here", I don't see nannying as a good reason, 
which leaves the performance benefit of avoiding snapshotting as the only 
compelling reason to do this.

But the performance benefit is trivial; in local tests, the savings from 
avoiding that work is barely noticeable, per ipython microbenchmarks (on 3.7.2 
Linux x86-64):

>>> %%timeit -r5 from collections import deque; f = 
open('American-english.txt'); consume = deque(maxlen=0).extend
... f.seek(0)
... consume(iter(f.readline, ''))
...
...
15.8 ms ± 38.4 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

>>> %%timeit -r5 from collections import deque; f = 
open('American-english.txt'); consume = deque(maxlen=0).extend
... f.seek(0)
... next(f)  # Triggers optimization for all future read_chunk calls
... consume(iter(f.readline, ''))  # Otherwise iterated identically
...
...
15.7 ms ± 98.5 μs per loop (mean ± std. dev. of 5 runs, 100 loops each)

The two blocks are identical except that the second one explicitly advances the 
file one line at the beginning with next(f) to flip `telling` to 0 so future 
calls to readline don't involve the snapshotting code in 
textiowrapper_read_chunk.

Calling consume(f) would drop the time to 9.86 ms, but that's saying more about 
the optimization of the raw iterator protocol over method calls than it is 
about the `telling` optimization; I used two arg iter in both cases to keep the 
code paths as similar as possible so the `telling`.

For reference, the input file was 931708 bytes (931467 characters thanks to a 
few scattered non-ASCII UTF-8 characters), 98569 lines long.

Presumably, the speed difference of 0.1 ms can be chalked up to the telling 
optimization, so removing it would increase the cost of normal iteration from 
9.86 ms to 9.96 ms. That seems de minimis to my mind, in the context of text 
oriented I/O.

Given that, it seems like triggering this optimization via __next__ should be 
dropped; it's a microoptimization at best, that's mostly irrelevant compared to 
the overhead of text-oriented I/O, and it introduces undocumented limitations 
on the use of TextIOWrapper.

The changes would be to remove all use of the `telling` variable, and (if we 
want to keep the optimization for unseekable files, where at least no 
functionality is lost by having it), change the two tests in 
textiowrapper_read_chunk to test `seekable` in its place. Or we drop the 
optimization entirely and save 50+ lines of code that provide a fairly tiny 
benefit in any event.

--
components: IO, Library (Lib)
messages: 343402
nosy: josh.r
priority: normal
severity: normal
status: open
title: Iterating a text file by line should not implicitly disable tell
versions: Python 3.7, Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com