On Wed, May 25, 2011 at 3:52 PM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> What do you mean by "may include the decoder state in its return value"?
>
> It does make sense that the values returned from tell() won't be in the
> middle of an encoded sequence of bytes.

If you take a look at the source code, tell() returns a long that
includes decoder state data in the upper bytes. For example:

>>> data = b' ' + '\u0302a'.encode('utf-16')
>>> data
b' \xff\xfe\x02\x03a\x00'
>>> f = open('test.txt', 'wb')
>>> f.write(data)
7
>>> f.close()
>>> f = open('test.txt', 'r', encoding='utf-16')
>>> f.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

The problem of course is the initial space, throwing off the decoder.
We can try to seek past it:

>>> f.seek(1)
1
>>> f.read()
'\ufeff\u0302a'

But notice that since we're not reading from the beginning of the
file, the BOM has now been interpreted as data. However:

>>> f.seek(1 + (2 << 65))
73786976294838206465
>>> f.read()
'\u0302a'

And you can see that instead of reading from position
73786976294838206465 it has read from position 1 starting in the "read
a BOM" state. Note that I wouldn't recommend doing anything remotely
like this in production code, not least because the value that I
passed into seek() is platform-dependent. This is just a
demonstration of how the seek() value can include decoder state.

Cheers,
Ian
--
http://mail.python.org/mailman/listinfo/python-list
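
To pick apart a seek()/tell() value like the one used above, here is a minimal sketch of the unpacking. It assumes the field layout used by the pure-Python TextIOWrapper in CPython's _pyio.py: byte position in the low 64 bits, then decoder flags, bytes to feed, characters to skip, and an EOF flag, 64 bits each. The name unpack_cookie is hypothetical, and the layout is an implementation detail rather than a documented interface, so treat this as illustration only.

```python
def unpack_cookie(cookie):
    """Split a TextIOWrapper tell()/seek() value into its fields.

    Assumes the layout used by CPython's _pyio.py: five fields packed
    into one int, 64 bits each (an implementation detail, not a public API).
    """
    rest, position = divmod(cookie, 1 << 64)
    rest, dec_flags = divmod(rest, 1 << 64)
    rest, bytes_to_feed = divmod(rest, 1 << 64)
    need_eof, chars_to_skip = divmod(rest, 1 << 64)
    return position, dec_flags, bytes_to_feed, bool(need_eof), chars_to_skip


# The value passed to seek() in the example above.
position, dec_flags, *_ = unpack_cookie(1 + (2 << 65))
print(position, dec_flags)  # 1 4
```

The byte position comes out as 1, and the nonzero decoder flags are what put the UTF-16 decoder back into its BOM-detecting state. As far as I can tell, the flags field is 4 rather than 2 because the newline-translating decoder that wraps the codec shifts the codec's state left by one bit to make room for its own pending-CR flag.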