"Johannes Bauer" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
John Machin wrote:
On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
So UTF-16 has an explicit EOF marker within the text? I cannot find one
in the original file, only some kind of starting sequence, I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a, a
simple \r\n line ending.
Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16. The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.
Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):
Traceback (most recent call last):
  File "./modify.py", line 12, in <module>
    a = AddressBook("2008_12_05_Handy_Backup.txt")
  File "./modify.py", line 7, in __init__
    line = f.readline()
  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
    while self._read_chunk():
  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
    return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data
File size is 1630 bytes - so this clearly cannot be.
How about posting your code? The first file is incorrect. It contains an
extra 0x00 byte at the end of the file, but is otherwise correctly encoded
with a big-endian UTF16 BOM and data. The second file is a correct UTF16-BE
file as well.
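For anyone who wants to repeat that check on their own files, here is a
minimal sketch (Python 3) of the two things looked at above: which
byte-order mark the data starts with, and whether the byte count is odd.
The helper name inspect_utf16 and the sample bytes are made up for
illustration; they are not taken from the backup files.

```python
import codecs

def inspect_utf16(raw):
    """Return (byte_order, has_odd_length) for a bytes object.

    Hypothetical helper: it only looks at the BOM and the length,
    it does not validate the rest of the data.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):      # b'\xfe\xff'
        order = "big-endian"
    elif raw.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
        order = "little-endian"
    else:
        order = "no BOM"
    # UTF-16 code units are two bytes each, so an odd length is suspicious.
    return order, len(raw) % 2 == 1

# Synthetic stand-in for the first file: BOM, "A\r\n", plus a stray 0x00.
sample = codecs.BOM_UTF16_BE + "A\r\n".encode("utf-16-be") + b"\x00"
print(inspect_utf16(sample))
```

For a real file, read it in binary mode (open(path, 'rb').read()) and
pass the resulting bytes to the helper.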
This code (Python 2.6) decodes the first file, removing the trailing extra
byte:
raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
data = raw[:-1].decode('utf16')
and this code (Python 2.6) decodes the second:
raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
data = raw.decode('utf16')
Python 3.0 also has no problem decoding the second file, and it gives an
accurate error message for the first:
data = open('2008_12_05_Handy_Backup.txt',encoding='utf16').read()
data = open('2008_11_05_Handy_Backup.txt',encoding='utf16').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\dev\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\dev\python30\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
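
If trimming a stray trailing byte is acceptable, the workaround could be
wrapped up roughly like this in Python 3. It is only a sketch, assuming
the sole damage is a single extra byte at the end of the data;
decode_utf16_lenient is a hypothetical helper name and the sample bytes
are synthetic.

```python
def decode_utf16_lenient(raw):
    """Decode UTF-16 bytes, dropping one stray trailing byte if present.

    Assumption: an odd length means exactly one junk byte at the end.
    Damage anywhere else will still raise UnicodeDecodeError.
    """
    if len(raw) % 2:
        raw = raw[:-1]
    # 'utf-16' reads the BOM to pick the byte order and strips it.
    return raw.decode("utf-16")

# Synthetic data: big-endian BOM, "A\r\n", then an extra 0x00.
sample = b"\xfe\xff" + "A\r\n".encode("utf-16-be") + b"\x00"
text = decode_utf16_lenient(sample)
print(repr(text))
```

With a real file, read the bytes first with open(path, 'rb').read() and
pass them in. Silently dropping a byte can hide a truncated download, so
it is probably worth logging when the trim happens.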
-Mark