Re: Python 3.0 automatic decoding of UTF16

Mark Tolonen Sat, 06 Dec 2008 11:22:02 -0800

"Johannes Bauer" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

John Machin schrieb:

On Dec 6, 5:36 am, Johannes Bauer <[EMAIL PROTECTED]> wrote:

So UTF-16 has an explicit EOF marker within the text? I cannot find one
in original file, only some kind of starting sequence I suppose
(0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
simple \r\n line ending.


Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which shouldn't happen with utf16.  The file is
stuffed. Python 3.0 has a bug; it should give a meaningful error
message.


Yes, you are right. I fixed the file, yet another error pops up
(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):

Traceback (most recent call last):
 File "./modify.py", line 12, in <module>
   a = AddressBook("2008_12_05_Handy_Backup.txt")
 File "./modify.py", line 7, in __init__
   line = f.readline()
 File "/usr/local/lib/python3.0/io.py", line 1807, in readline
   while self._read_chunk():
 File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
   self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
 File "/usr/local/lib/python3.0/io.py", line 1293, in decode
   output = self.decoder.decode(input, final=final)
 File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
 File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
   return self.decoder(input, self.errors, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data

File size is 1630 bytes - so this clearly cannot be.

How about posting your code? The first file is incorrect. It contains an extra 0x00 byte at the end of the file, but is otherwise correctly encoded with a big-endian UTF16 BOM and data. The second file is a correct UTF16-BE file as well.
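A quick way to check both points (which BOM is present, and whether the byte count is odd) is sketched below. The helper name `describe_utf16` and the sample bytes are made up for illustration; they are not taken from the posted files.

```python
import codecs

def describe_utf16(raw):
    """Return (BOM kind or None, True if the byte count is odd)."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        bom = 'big-endian'
    elif raw.startswith(codecs.BOM_UTF16_LE):
        bom = 'little-endian'
    else:
        bom = None
    # UTF-16 code units are 2 bytes each, so an odd length means the
    # data is truncated or has a stray byte.
    return bom, len(raw) % 2 == 1

# Hypothetical stand-in for the first file: BE BOM, "A\r\n", stray 0x00.
sample = codecs.BOM_UTF16_BE + 'A\r\n'.encode('utf-16-be') + b'\x00'
print(describe_utf16(sample))
```

For a real file, pass in the result of `open(path, 'rb').read()`.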

This code (Python 2.6) decodes the first file, removing the trailing extra byte:


   # Open in binary mode ('rb') so text-mode handling cannot alter the bytes.
   raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
   data = raw[:-1].decode('utf16')

and this code (Python 2.6) decodes the second:

   # Open in binary mode ('rb') so text-mode handling cannot alter the bytes.
   raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
   data = raw.decode('utf16')

Python 3.0 also decodes the second file without problems and gives an accurate error message for the first:

   data = open('2008_12_05_Handy_Backup.txt', encoding='utf16').read()
   data = open('2008_11_05_Handy_Backup.txt', encoding='utf16').read()

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\dev\python30\lib\io.py", line 1724, in read
   decoder.decode(self.buffer.read(), final=True))
 File "C:\dev\python30\lib\io.py", line 1295, in decode
   output = self.decoder.decode(input, final=final)
 File "C:\dev\python30\lib\codecs.py", line 300, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
 File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
   codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
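If a file with a stray trailing byte has to be read under Python 3 anyway, one option is to read it as bytes and trim it before decoding. This is only a sketch: the function name `decode_utf16_dropping_odd_byte` and the sample bytes are invented for the example, and it assumes the only defect is a single extra byte at the end.

```python
import codecs

def decode_utf16_dropping_odd_byte(raw):
    """Decode UTF-16 bytes, discarding one stray trailing byte if present."""
    if len(raw) % 2 == 1:
        # An odd length cannot be valid UTF-16; drop the last byte.
        raw = raw[:-1]
    # The 'utf16' codec reads the BOM to pick the byte order and strips it.
    return raw.decode('utf16')

# Hypothetical stand-in for the first file: BE BOM, one line, stray 0x00.
sample = codecs.BOM_UTF16_BE + 'Name\r\n'.encode('utf-16-be') + b'\x00'
text = decode_utf16_dropping_odd_byte(sample)
```

With a real file the input would come from `open(path, 'rb').read()`. Silently trimming hides the fact that the file is damaged, so it may be worth logging when the odd-length branch is taken.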

-Mark


