Re: Python 3.0 automatic decoding of UTF16

John Machin Sat, 06 Dec 2008 14:35:54 -0800

On Dec 7, 9:01 am, David Bolen <[EMAIL PROTECTED]> wrote:
> Johannes Bauer <[EMAIL PROTECTED]> writes:
> > This is very strange - when using "utf16", endianness should be detected
> > automatically. When I simply truncate the trailing zero byte, I receive:
>
> Any chance that whatever you used to "simply truncate the trailing
> zero byte" also removed the BOM at the start of the file?  Without it,
> utf16 wouldn't be able to detect endianness and would, I believe, fall
> back to native order.


When I read this, I thought "Oh no, surely not!". It seems that you are
correct:
[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'
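
For anyone following along on Python 3, here is a rough equivalent of the
session above. It is only a sketch: the names nobom, decoded and native are
mine, and the check against sys.byteorder is there so the script does not
assume a little-endian machine.

```python
import sys

# Encode with an explicit big-endian codec, which writes no BOM.
nobom = "abcde".encode("utf_16_be")

# Decode with the generic codec; with no BOM it has nothing to detect.
decoded = nobom.decode("utf_16")

# The result should match a decode in the machine's native byte order.
native = "utf_16_le" if sys.byteorder == "little" else "utf_16_be"
assert decoded == nobom.decode(native)

print(ascii(nobom))
print(ascii(decoded))
```

On a little-endian box such as the Windows XP machine above, the second
line printed should be the same byte-swapped '\u6100\u6200\u6300\u6400\u6500'.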

This may well explain one of the Python 3.0 problems that the OP's two
files exhibit: under some conditions the data appears to have been
byte-swapped. One possibility: the file is being read a chunk at a time,
with the utf_16 codec applied independently to each chunk. Only the
first chunk would have a BOM, so later chunks would fall back to native
order.
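
To illustrate that guess (and it is only a guess about what 3.0 is doing;
I have not checked the io code), the sketch below decodes the same
BOM-prefixed data two ways: chunk by chunk with a fresh decode each time,
and through one incremental decoder that remembers the byte order. The
chunks() helper, the 8-byte chunk size and the sample text are all made up
for the demonstration, and the data is deliberately written in the
non-native byte order so the effect shows on any machine.

```python
import codecs
import sys

text = "abcde" * 4

# Build data in the NON-native byte order, with a BOM at the very start.
if sys.byteorder == "little":
    bom, codec = codecs.BOM_UTF16_BE, "utf_16_be"
else:
    bom, codec = codecs.BOM_UTF16_LE, "utf_16_le"
data = bom + text.encode(codec)

def chunks(buf, size):
    """Yield successive slices of buf, each at most size bytes long."""
    for i in range(0, len(buf), size):
        yield buf[i:i + size]

# Hypothesised bug: every chunk is decoded on its own, so only the
# first chunk sees the BOM; the rest fall back to native order.
naive = "".join(chunk.decode("utf_16") for chunk in chunks(data, 8))

# One stateful decoder for the whole stream keeps the detected order.
decoder = codecs.getincrementaldecoder("utf_16")()
incremental = "".join(decoder.decode(chunk) for chunk in chunks(data, 8))
incremental += decoder.decode(b"", final=True)

print(ascii(naive))
print(ascii(incremental))
```

The naive result starts out correctly and then turns into byte-swapped
characters from the second chunk onward, which is the kind of symptom the
OP's files show; the incremental result matches the original text.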

--
http://mail.python.org/mailman/listinfo/python-list
