Re: [Python-Dev] Unicode byte order mark decoding

Nicholas Bastin Wed, 06 Apr 2005 20:09:48 -0700


On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:

> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP 100 says:

'utf-16': 16-bit variable length encoding (little/big endian)

and:

Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least with respect to the current Unicode Standard. 'utf-16', as defined by the standard, is big-endian in the absence of a BOM:

--- 3.10.D42: UTF-16 encoding scheme: ... * The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian. ---
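
The rule in D42 can be sketched in a few lines. This is only an illustration, assuming a modern Python 3 and a hypothetical helper name, decode_utf16_per_standard, which is not part of the codecs module:

```python
import codecs


def decode_utf16_per_standard(data: bytes) -> str:
    """Hypothetical helper: decode UTF-16 bytes, honouring a BOM if present.

    When there is no BOM, fall back to big-endian, as D42 describes for the
    case where no higher-level protocol specifies the byte order.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM (FE FF): strip it and decode as big-endian.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM (FF FE): strip it and decode as little-endian.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: assume big-endian.
    return data.decode("utf-16-be")
```

With this sketch, the BOM-less bytes 00 68 00 69 decode to "hi", the same as the BOM-prefixed forms in either byte order.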

The current implementation of the utf-16 codecs forces some irritating gymnastics: if a file contains no BOM, you have to write one into it before you can read it, which looks very much like a bug in the codec. I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.
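
One way to avoid touching the file at all is to peek at the first two bytes and pick the codec name accordingly. The following is a rough sketch, not the codec's actual behaviour; it assumes a modern Python 3 with the io module, a seekable binary stream, and a hypothetical function name open_utf16_stream:

```python
import codecs
import io


def open_utf16_stream(raw):
    """Hypothetical helper: wrap a seekable binary stream as UTF-16 text.

    If the stream starts with a BOM, let the 'utf-16' codec consume it and
    pick the byte order. Otherwise assume big-endian instead of rewriting
    the underlying file.
    """
    start = raw.tell()
    head = raw.read(2)
    raw.seek(start)  # rewind so the text wrapper sees the whole stream
    if head in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        encoding = "utf-16"
    else:
        encoding = "utf-16-be"
    return io.TextIOWrapper(raw, encoding=encoding)
```

For example, open_utf16_stream(io.BytesIO(b"\x00h\x00i")).read() gives "hi" without any BOM having been written.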

--
Nick
