On Dec 7, 9:34 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 7, 9:01 am, David Bolen <[EMAIL PROTECTED]> wrote:
>
> > Johannes Bauer <[EMAIL PROTECTED]> writes:
> > > This is very strange - when using "utf16", endianness should be detected
> > > automatically. When I simply truncate the trailing zero byte, I receive:
>
> > Any chance that whatever you used to "simply truncate the trailing
> > zero byte" also removed the BOM at the start of the file? Without it,
> > utf16 wouldn't be able to detect endianness and would, I believe, fall
> > back to native order.
>
> When I read this, I thought "O no, surely not!". Seems that you are
> correct:
> [Python 2.5.2, Windows XP]
> | >>> nobom = u'abcde'.encode('utf_16_be')
> | >>> nobom
> | '\x00a\x00b\x00c\x00d\x00e'
> | >>> nobom.decode('utf16')
> | u'\u6100\u6200\u6300\u6400\u6500'
>
> This may well explain one of the Python 3.0 problems that the OP's 2
> files exhibit: data appears to have been byte-swapped under some
> conditions. Possibility: it is reading the file a chunk at a time and
> applying the utf_16 encoding independently to each chunk -- only the
> first chunk will have a BOM.
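The same fallback is easy to reproduce in Python 3; a quick sketch (the swapped result assumes a little-endian host, matching the Windows session quoted above):

```python
import sys

# Big-endian UTF-16 encoding writes no BOM:
nobom = 'abcde'.encode('utf_16_be')
# nobom == b'\x00a\x00b\x00c\x00d\x00e'

# With no BOM to inspect, the generic "utf-16" codec falls back to
# native byte order, so on a little-endian machine every byte pair
# is read swapped:
swapped = nobom.decode('utf-16')
if sys.byteorder == 'little':
    print(swapped == '\u6100\u6200\u6300\u6400\u6500')  # True
```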
Well, no, on further investigation, we're not byte-swapped; we've
tricked ourselves into decoding on odd-byte boundaries.

Here's the scoop: it's a bug in the newline handling (in io.py, class
IncrementalNewlineDecoder, method decode). It reads text files in
128-byte chunks. Converting CR LF to \n requires special-case handling
when '\r' is detected at the end of the decoded chunk n, in case
there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
to the chunk n+1 bytes and decode that -- suddenly, with a
2-bytes-per-char encoding like UTF-16, we are 1 byte out of whack.
Better (IMVH[1]O) solution: prepend '\r' to the result of decoding the
chunk n+1 bytes.

Each of the OP's files has a \r on a 64-character boundary. Note: they
would exhibit the same symptoms if encoded in UTF-16LE instead of
UTF-16BE. With the better solution applied, the first file [the
truncated one] gave the expected error, and the second file [the
apparently OK one] gave sensible-looking output.

[1] I thought it best to be Very Humble given what you see when you do:
import io
print(io.__author__)

Hope my surge protector can cope with this :-) ^%!//()

NO CARRIER
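A minimal sketch of the odd-byte-boundary failure, using made-up chunk
contents rather than the OP's actual files:

```python
import codecs

enc = 'utf-16-be'  # same symptoms with utf-16-le
# Chunk n ended in '\r'; chunk n+1 starts with the LF of a CR LF pair:
chunk2 = '\ndef'.encode(enc)   # b'\x00\n\x00d\x00e\x00f'

# Buggy approach: prepend the raw byte b'\r' to chunk n+1's *bytes*.
# That one stray byte shifts every 2-byte code unit out of alignment.
dec = codecs.getincrementaldecoder(enc)()
buggy = dec.decode(b'\r' + chunk2)
# The bytes pair up as 0x0D00, 0x0A00, 0x6400, 0x6500
# (the trailing b'f' stays buffered, awaiting its other half).

# Better approach: decode chunk n+1 first, then prepend '\r' as *text*.
fixed = '\r' + chunk2.decode(enc)   # '\r\ndef', as intended
```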