"Johannes Bauer" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
John Machin wrote:
On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:
So UTF-16 has an explicit EOF marker within the text? I cannot find one
in the original file, only some kind of starting sequence I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.

Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16.  The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.

Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

Traceback (most recent call last):
 File "./modify.py", line 12, in <module>
   a = AddressBook("2008_12_05_Handy_Backup.txt")
 File "./modify.py", line 7, in __init__
   line = f.readline()
 File "/usr/local/lib/python3.0/io.py", line 1807, in readline
   while self._read_chunk():
 File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
   self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
 File "/usr/local/lib/python3.0/io.py", line 1293, in decode
   output = self.decoder.decode(input, final=final)
 File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
 File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
   return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

File size is 1630 bytes - so this clearly cannot be.

How about posting your code? The first file is incorrect. It contains an extra 0x00 byte at the end of the file, but is otherwise correctly encoded with a big-endian UTF16 BOM and data. The second file is a correct UTF16-BE file as well.
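
If you want to check that kind of claim yourself, here is a minimal sketch (Python 3) that looks at the BOM and at the byte count. The helper name `describe_utf16` and the sample bytes are made up for illustration; the sample only imitates the shape of the first file (big-endian BOM, text, one stray 0x00). An even length does not prove a file is valid UTF-16, but an odd length proves it is not.

```python
import codecs

def describe_utf16(raw: bytes) -> dict:
    """Report the BOM-implied byte order and whether the length is even.

    Hypothetical helper, not part of any library.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        order = "big-endian"
    elif raw.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        order = "little-endian"
    else:
        order = "unknown (no BOM)"
    return {
        "byte_order": order,
        "length": len(raw),
        # UTF-16 code units are 2 bytes each, so a valid file has even length.
        "even_length": len(raw) % 2 == 0,
    }

# Synthetic stand-in for the broken file: BOM + "A\r\n" + one extra 0x00 byte.
sample = codecs.BOM_UTF16_BE + "A\r\n".encode("utf-16-be") + b"\x00"
print(describe_utf16(sample))
```

For a real file you would pass in `open(filename, 'rb').read()` instead of the synthetic sample.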

This code (Python 2.6) decodes the first file, removing the trailing extra byte:

   raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()  # binary mode: no newline translation
   data = raw[:-1].decode('utf16')  # drop the stray trailing 0x00 byte

and this code (Python 2.6) decodes the second:

   raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()  # binary mode: no newline translation
   data = raw.decode('utf16')

Python 3.0 also decodes the second file without trouble, and gives an accurate error message for the first:

>>> data = open('2008_12_05_Handy_Backup.txt',encoding='utf16').read()
>>> data = open('2008_11_05_Handy_Backup.txt',encoding='utf16').read()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\dev\python30\lib\io.py", line 1724, in read
   decoder.decode(self.buffer.read(), final=True))
 File "C:\dev\python30\lib\io.py", line 1295, in decode
   output = self.decoder.decode(input, final=final)
 File "C:\dev\python30\lib\codecs.py", line 300, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
 File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
   codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
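
For a Python 3 counterpart to the 2.6 workaround, a rough sketch might look like the following. The function name `decode_utf16_lenient` is hypothetical, and the sample bytes are synthetic rather than taken from either backup file. It assumes the only defect is a single dangling byte at the end; anything else wrong with the data still raises UnicodeDecodeError.

```python
def decode_utf16_lenient(raw: bytes) -> str:
    """Decode UTF-16 bytes, dropping one dangling trailing byte if present.

    Hypothetical helper: assumes the only damage is an odd byte at the end.
    """
    if len(raw) % 2:          # odd length: the last byte cannot be a full code unit
        raw = raw[:-1]
    # The 'utf-16' codec reads the BOM to pick the byte order and strips it.
    return raw.decode("utf-16")

# Synthetic example: big-endian BOM, "Name\r\n", then a stray 0x00 byte.
broken = b"\xfe\xff" + "Name\r\n".encode("utf-16-be") + b"\x00"
text = decode_utf16_lenient(broken)
print(repr(text))
```

With a real file you would read the bytes with `open(filename, 'rb').read()` and pass them in; a strict `raw.decode('utf-16')` on the same odd-length data raises the "truncated data" error shown above.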

-Mark

