[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2017-11-09 Thread Serhiy Storchaka

Change by Serhiy Storchaka :


--
status: pending -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2017-09-20 Thread Serhiy Storchaka

Changes by Serhiy Storchaka :


--
status: open -> pending




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-16 Thread Martin Panter

Martin Panter added the comment:

I think changing the TextIOBase API would be hard to do if you want to keep 
compatibility with existing code. I agree that encoding the position to a 
number and back seems like a bad design, but I doubt it is worth changing it at 
this point.

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-16 Thread Марк Коренберг

Марк Коренберг added the comment:

Well, 03e61104f7a2 adds a good description, but why not enforce checks instead 
of saying that some values are unsupported?

Another idea: tell() could return a special object instance that encapsulates 
the byte offset, and seek() would accept only such objects or zero.
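
A minimal sketch of that idea, assuming a thin wrapper around an existing text stream. The names Position and CheckedText are hypothetical and are not part of the io module; the sketch only illustrates the proposed check, not how CPython would implement it.

```python
import io


class Position:
    """Hypothetical opaque bookmark that hides the value returned by tell()."""

    __slots__ = ("_cookie",)

    def __init__(self, cookie):
        self._cookie = cookie


class CheckedText:
    """Hypothetical wrapper: seek() accepts only a Position or an offset of 0."""

    def __init__(self, stream):
        self._stream = stream

    def tell(self):
        # Wrap the underlying cookie so callers cannot do arithmetic on it.
        return Position(self._stream.tell())

    def seek(self, target, whence=io.SEEK_SET):
        if isinstance(target, Position):
            if whence != io.SEEK_SET:
                raise ValueError("a Position can only be used with SEEK_SET")
            self._stream.seek(target._cookie)
        elif target == 0:
            # Rewind, stay in place, or jump to the end, depending on whence.
            self._stream.seek(0, whence)
        else:
            raise ValueError("seek() needs 0 or a Position obtained from tell()")

    def read(self, size=-1):
        return self._stream.read(size)

    def write(self, text):
        return self._stream.write(text)


checked = CheckedText(io.TextIOWrapper(io.BytesIO(), encoding="utf-8"))
checked.write("привет")
checked.seek(0)
checked.read(1)
mark = checked.tell()
rest = checked.read()
checked.seek(mark)          # allowed: the bookmark came from tell()
rest_again = checked.read()
```

With this wrapper, checked.seek(3) raises ValueError instead of silently landing inside a character.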

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-15 Thread STINNER Victor

STINNER Victor added the comment:

> If the “slow reconstruction algorithm” was clarified or removed, ...

I wrote this algorithm, or I helped to write it, I don't recall.

The problem is readahead: TextIOWrapper reads more bytes than requested, for 
performance. But when tell() is called, the user expects to get the current 
file position, not the "read ahead" file position. So we have to go backward. 
Problem: TextIOWrapper uses text (Unicode) whereas all files are bytes on the 
disk. We need to compute the size of the readahead buffer in bytes from a 
buffer in characters.

The bad performance comes from multibyte codecs, which require a heuristic to 
first guess the number of bytes and then actually encode back to bytes to find 
the exact size.
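
A small sketch of the readahead effect, assuming CPython's TextIOWrapper on top of an in-memory BytesIO with UTF-8 data. The variable names are only for illustration. It compares what tell() reports on the text layer with how far the underlying byte stream has really been consumed.

```python
import io

data = "привет мир".encode("utf-8")   # 10 characters, 19 bytes
raw = io.BytesIO(data)
text = io.TextIOWrapper(raw, encoding="utf-8")

first = text.read(1)        # the caller asks for a single character
logical_pos = text.tell()   # position just after that character
physical_pos = raw.tell()   # how far the byte stream was actually read

print(first, logical_pos, physical_pos)
```

The text layer has to work backward from the physical position to the logical one, and that is the step that becomes expensive for variable-width codecs.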

See _pyio.TextIOWrapper.tell() for the Python implementation.

# Fast search for an acceptable start point, close to our
# current pos.
# Rationale: calling decoder.decode() has a large overhead
# regardless of chunk size; we want the number of such calls to
# be O(1) in most situations (common decoders, non-crazy input).
# Actually, it will be exactly 1 for fixed-size codecs (all
# 8-bit codecs, also UTF-16 and UTF-32).

(Incomplete) history of the Python implementation of the tell() method:

* changeset 7c6972f37fe3 (2007)
* changeset 28bc7ed26574: More efficient implementation
* changeset b5a2e753b682: use the new getstate/setstate decoder API
* changeset 04050373d799 (2008): fix for stateful decoders
* changeset 39a4f4393ef1: additional fixes to the handling of 'limit'
* (Lib/io.py moved to Lib/_pyio.py)
* changeset 4b6052320e98 (Issue #4): optimization

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-15 Thread Antoine Pitrou

Antoine Pitrou added the comment:

I don't understand what the complaint is. If you think seek()/tell() are not 
useful, just don't use them.

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread Martin Panter

Martin Panter added the comment:

You might be right about the “reconstruction algorithm”. This text was added in 
revision 0bba533c0959; maybe Antoine can comment whether we should clarify or 
remove it.

I think the text added for TextIOBase.seek() in revision 03e61104f7a2 (Issue 
12922) is closer to the truth. Though I would probably drop the bit about 
tell() not usually returning a byte position; for many codecs it does seem to.

This illustrates the only four cases of seeking I understand are allowed for 
text streams:

>>> from io import BytesIO, TextIOWrapper, SEEK_CUR, SEEK_END
>>> text = TextIOWrapper(BytesIO(), "utf-7")
>>> text.write("привет")
6
>>> text.seek(0)  # 1: Rewind to start
0
>>> text.read(1)
'п'
>>> saved = text.tell()
>>> text.read()
'ривет'
>>> text.seek(saved)  # 2: Seek to saved offset
340282368347045388720132684115559317504
>>> text.read(1)
'р'
>>> text.seek(0, SEEK_CUR)  # 3: No movement
680564735267983852183507291547327528960
>>> text.read(1)
'и'
>>> text.seek(0, SEEK_END)  # 4: Seek to end
18
>>> text.read()  # EOF
''

If the “slow reconstruction algorithm” was clarified or removed, and the 
documentation explained that you cannot seek to arbitrary characters without 
having previously called tell(), would that work?

--
nosy: +martin.panter, pitrou




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread Martin Panter

Martin Panter added the comment:

I’m starting to understand that there might be a “reconstruction algorithm” 
needed. When reading, TextIOWrapper buffers decoded characters. If you call 
tell() and there is unread but decoded data, it is not enough to return the 
incremental decoder state. You have to handle the unread buffered data as well. 
Looking at the _pyio tell() implementation, it tries to wind the decoder 
backwards to minimize the state.
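
To illustrate what "incremental decoder state" means here, a minimal sketch using the codecs module directly, assuming UTF-8. It feeds only the first byte of a two-byte character and then inspects getstate(); this is an illustration of the state involved, not the tell() implementation itself.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
encoded = "п".encode("utf-8")            # two bytes

decoded = decoder.decode(encoded[:1])    # feed only the first byte
pending, flags = decoder.getstate()      # bytes held back, plus codec-specific flags

decoded_rest = decoder.decode(encoded[1:])   # the second byte completes the character

print(repr(decoded), pending, repr(decoded_rest))
```

A position taken while bytes are still pending is not a clean restart point, which is why tell() looks for a point where the decoder holds no buffered bytes.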

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread Марк Коренберг

Марк Коренберг added the comment:

First, it seems that there is no real "reconstruction algorithm" at all. Seek 
is allowed to point to any byte position, even to a place "inside" a character 
in multibyte encodings such as UTF-8.

Second, about performance: I am talking about the implementation mentioned in 
the first message. If it is not used (and will not be used), we may forget 
about that sentence.

Next, once again:

I consider it a bug to allow seeking to invalid byte offsets in text files. 
Since we cannot easily calculate which offsets are valid (for example, seeking 
past the end of the file, or to places inside a character), just disallow seek. 
In real applications, no one will seek/peek to places other than

* the beginning of the file
* the current byte offset
* the end of the file

so these seeks/peeks must be allowed.

This applies only to variable-width multibyte encodings (such as UTF-8).

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread Марк Коренберг

Марк Коренберг added the comment:

s/peek/tell/

--
status: closed -> open




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread Марк Коренберг

Марк Коренберг added the comment:

Also, can you provide a case where such random seeks can be used on text 
files? It would be a programmer error to seek to places other than those I 
mentioned, wouldn't it?

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-14 Thread R. David Murray

R. David Murray added the comment:

I think you haven't quite gotten what "opaque token" means in this context.  
The way you use tell/seek with text files is: you have read to some certain 
point in the file.  You call 'tell' and get back an opaque token.  Later you 
can call seek with that token to get back to the place in the file that you 
"bookmarked".  It will never be between characters, because tell won't return 
such a pointer.  If you decide to call seek with something (other than 0) that 
you didn't get from tell, then you are on your own.
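
The bookmark pattern described above could look like the following sketch, assuming a UTF-8 text stream over an in-memory buffer. The sample text and variable names are only for illustration.

```python
import io

payload = "один\nдва\nтри\n".encode("utf-8")
f = io.TextIOWrapper(io.BytesIO(payload), encoding="utf-8")

first = f.readline()
bookmark = f.tell()      # opaque token: store it, do not do arithmetic on it
second = f.readline()

f.seek(bookmark)         # return to the bookmarked place
again = f.readline()     # reads the same line as `second`

print(repr(second), repr(again))
```

The token is only ever passed back to seek() on the same stream; its numeric value is never interpreted by the caller.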

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-13 Thread Марк Коренберг

Марк Коренберг added the comment:

https://docs.python.org/3.5/library/io.html?highlight=stringio#id3 :

Also, TextIOWrapper.tell() and TextIOWrapper.seek() are both quite slow due to 
the reconstruction algorithm used.

What is the reconstruction algorithm? Experiments show that seek() and tell() 
work with byte counts (not letter counts).


#!/usr/bin/python3.5
import tempfile

with tempfile.TemporaryFile(mode='r+t') as f:
    l = f.write('привет')
    print(l, f.tell()) # "6 12"
    f.seek(3)
    f.write('прекол42')
    f.seek(0)
    print(f.read()) # raises UnicodeDecodeError

So, please reopen. Issue is still here.

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-13 Thread R. David Murray

R. David Murray added the comment:

I'm still not seeing a bug.

If you have a performance enhancement or functional enhancement you'd like us 
to consider, please attach a patch, with benchmark results.

Since you say "are quite slow because of the reconstruction algorithm", what 
makes you say this?  I'd think the "algorithm" was just using the underlying 
bytes tell/seek value, which then becomes a black box token because it does not 
have a one-to-one relationship to the character count.
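
A quick illustration of why a byte position and a character count do not line up, assuming UTF-8. The sample strings are arbitrary.

```python
cyrillic = "привет"
ascii_word = "hello"

cyrillic_chars = len(cyrillic)                    # number of code points
cyrillic_bytes = len(cyrillic.encode("utf-8"))    # two bytes per Cyrillic letter
ascii_chars = len(ascii_word)
ascii_bytes = len(ascii_word.encode("utf-8"))     # one byte per ASCII letter

print(cyrillic_chars, cyrillic_bytes, ascii_chars, ascii_bytes)
```

The bytes-per-character ratio depends on the content, so a character count cannot be converted to a byte offset without looking at the data.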

--




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-12 Thread R. David Murray

R. David Murray added the comment:

As mentioned in those issues, currently the peek/seek token is a black box.  
That doesn't mean it isn't useful.  Those issues are talking about potential 
ways to make it more useful, so any discussion should occur there.

--
nosy: +r.david.murray
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue25849] files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes

2015-12-12 Thread Марк Коренберг

New submission from Марк Коренберг:

It seems that we should deprecate .seek() on files opened in text mode, 
since it is not possible to seek to a position between symbols. Yes, it is 
possible to decode UTF-8 (or another charset) starting from the beginning of 
the file and count symbols, but that is EXTREMELY SLOW and is not what the user 
expects. If so, seeking from the end of the file back toward the beginning 
would have to be implemented in an even harder and more error-prone way.

Moreover, I consider that we should disallow seek in text files except seek() 
to the beginning of the file (position 0) or to the end of the file 
(seek(0, SEEK_END)).

See also issue25190 for something related.
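
For illustration, the "decode from the beginning and count symbols" approach could be sketched as follows. The function byte_offset_of_char is hypothetical and not part of any library; it assumes a seekable binary stream and a codec such as UTF-8 that yields at most one character per byte fed. It has to touch every byte before the target, which is the cost being objected to.

```python
import codecs
import io


def byte_offset_of_char(raw, char_index, encoding="utf-8"):
    """Return the byte offset at which character number char_index starts.

    Hypothetical helper: decodes from the start of the stream, one byte at
    a time, so the work grows linearly with the requested position.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    raw.seek(0)
    chars_seen = 0
    offset = 0
    while chars_seen < char_index:
        byte = raw.read(1)
        if not byte:
            raise ValueError("stream has fewer characters than requested")
        offset += 1
        # decode() returns '' until a complete character has been fed.
        chars_seen += len(decoder.decode(byte))
    return offset


stream = io.BytesIO("привет".encode("utf-8"))
offset_of_fourth = byte_offset_of_char(stream, 3)   # where 'в' begins
print(offset_of_fourth)
```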

--
components: IO, Library (Lib), Unicode
messages: 256291
nosy: ezio.melotti, haypo, mmarkk
priority: normal
severity: normal
status: open
title: files, opened in unicode (text): write() returns symbols count, but 
seek() expect offset in bytes
versions: Python 3.3, Python 3.4, Python 3.5, Python 3.6
