On Dec 7, 9:01 am, David Bolen <[EMAIL PROTECTED]> wrote:
> Johannes Bauer <[EMAIL PROTECTED]> writes:
>
> > This is very strange - when using "utf16", endianness should be detected
> > automatically. When I simply truncate the trailing zero byte, I receive:
>
> Any chance that whatever you used to "simply truncate the trailing
> zero byte" also removed the BOM at the start of the file? Without it,
> utf16 wouldn't be able to detect endianness and would, I believe, fall
> back to native order.

When I read this, I thought "O no, surely not!". Seems that you are correct:

[Python 2.5.2, Windows XP]
| >>> nobom = u'abcde'.encode('utf_16_be')
| >>> nobom
| '\x00a\x00b\x00c\x00d\x00e'
| >>> nobom.decode('utf16')
| u'\u6100\u6200\u6300\u6400\u6500'

This may well explain one of the Python 3.0 problems that the OP's 2 files exhibit: data appears to have been byte-swapped under some conditions. Possibility: it is reading the file a chunk at a time and applying the utf_16 encoding independently to each chunk -- only the first chunk will have a BOM.
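
To make the chunk-at-a-time possibility concrete, here is a rough sketch (not a reconstruction of what Python 3.0 actually does internally; that part is still only a guess). The sample text, the 6-byte split point and the variable names are all made up for illustration. It compares decoding each chunk independently with 'utf_16' against using a single incremental decoder, which remembers the byte order it learned from the BOM:

```python
import codecs
import sys

text = u'abcde'
# Big-endian UTF-16 data with a BOM in front (hypothetical sample data).
data = codecs.BOM_UTF16_BE + text.encode('utf_16_be')

# Decoding the whole buffer at once: the BOM is seen and honoured.
whole = data.decode('utf_16')

# Hypothetical chunking: split on a 2-byte boundary after 6 bytes.
chunks = [data[:6], data[6:]]

# Naive approach: each chunk is decoded on its own, so only the first
# chunk has a BOM; later chunks fall back to native byte order.
naive = u''.join(chunk.decode('utf_16') for chunk in chunks)

# Incremental decoder: one decoder object keeps the byte order across chunks.
decoder = codecs.getincrementaldecoder('utf_16')()
incremental = u''.join(decoder.decode(chunk) for chunk in chunks)
incremental += decoder.decode(b'', final=True)

print(repr(whole))
print(repr(naive))        # byte-swapped tail on a little-endian machine
print(repr(incremental))
print(sys.byteorder)
```

On a little-endian box the naive version should come out with the second chunk byte-swapped, which is the same symptom as in the session above; on a big-endian box the fallback happens to match the data and the problem would be hidden.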