Evan Jones wrote:
> I recently rediscovered this strange behaviour in Python's Unicode
> handling. I *think* it is a bug, but before I go and try to hack
> together a patch, I figure I should run it by the experts here on
> Python-Dev. If you understand Unicode, please let me know if there are
> problems with making these minor changes.
>
>>>> import codecs
>>>> codecs.BOM_UTF8.decode( "utf8" )
> u'\ufeff'
>>>> codecs.BOM_UTF16.decode( "utf16" )
> u''
>
> Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
> turns it into a character?

The BOM (byte order mark) was a non-standard Microsoft invention to
detect Unicode text data as such (MS always uses UTF-16-LE for Unicode
text files). It is not needed for UTF-8, because that format doesn't
depend on byte order, and a BOM character at the beginning of a stream
is a legitimate ZWNBSP (zero-width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the other two,
"utf-16-le" (little-endian byte order) and "utf-16-be" (big-endian byte
order), don't.
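To make the difference concrete, here is a quick Python 2.x sketch; the
expected values are simply the results quoted in this thread:

    import codecs

    # "utf-16" consumes the BOM and uses it to pick the byte order:
    assert codecs.BOM_UTF16.decode( "utf16" ) == u''

    # "utf-16-le" / "utf-16-be" assume a fixed byte order and pass the
    # BOM through as an ordinary U+FEFF (ZWNBSP) character:
    assert codecs.BOM_UTF16_LE.decode( "utf-16le" ) == u'\ufeff'
    assert codecs.BOM_UTF16_BE.decode( "utf-16be" ) == u'\ufeff'

    # "utf8" likewise leaves the (unneeded) BOM in place as U+FEFF:
    assert codecs.BOM_UTF8.decode( "utf8" ) == u'\ufeff'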
> The UTF-16 decoder contains logic to correctly handle the BOM. It even
> handles byte swapping, if necessary. I propose that the UTF-8 decoder
> should have the same logic: it should remove the BOM if it is detected
> at the beginning of a string.

-1; there's no standard for UTF-8 BOMs - adding BOM_UTF8 to the codecs
module was probably a mistake to begin with. You usually only get UTF-8
files with a BOM as the result of recoding UTF-16 files into UTF-8.

> This will remove a bit of manual work for Python programs that deal
> with UTF-8 files created on Windows, which frequently have the BOM at
> the beginning. The Unicode standard is unclear about how it should be
> handled (version 4, section 15.9):
>
>> Although there are never any questions of byte order with UTF-8 text,
>> this sequence can serve as a signature for UTF-8 encoded text where
>> the character set is unmarked. [...] Systems that use the byte order
>> mark must recognize when an initial U+FEFF signals the byte order. In
>> those cases, it is not part of the textual content and should be
>> removed before processing, because otherwise it may be mistaken for a
>> legitimate zero width no-break space.
>
> At the very least, it would be nice to add a note about this to the
> documentation, and possibly add this example function that implements
> the "UTF-8 or ASCII?" logic:
>
> def autodecode( s ):
>     if s.startswith( codecs.BOM_UTF8 ):
>         # The byte string s is UTF-8
>         out = s.decode( "utf8" )
>         return out[1:]
>     else:
>         return s.decode( "ascii" )

Well, I'd say that's a very English way of dealing with encoded
text ;-)

BTW, how do you know that s came from the start of a file and not from
slicing some already loaded file somewhere in the middle?
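One way to sidestep that ambiguity is to sniff for the signature only at
the file level, where "start of the data" is well defined. A rough
Python 2.x sketch (decode_file_auto is a made-up name, and the ASCII
fallback just mirrors Evan's example; this is not a proposal for the
codecs module):

    import codecs

    def decode_file_auto( path ):
        # Check for the UTF-8 signature once, at the real start of the
        # file, so there is no question of slicing into the middle.
        data = open( path, 'rb' ).read()
        if data.startswith( codecs.BOM_UTF8 ):
            # Drop the signature bytes before decoding as UTF-8.
            return data[len( codecs.BOM_UTF8 ):].decode( "utf8" )
        return data.decode( "ascii" )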
> As a second issue, the UTF-16LE and UTF-16BE decoders almost do the
> right thing: they turn the BOM into a character, just like the Unicode
> specification says they should.
>
>>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
> u'\ufeff'
>>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
> u'\ufeff'
>
> However, they also *incorrectly* handle the reversed byte order mark:
>
>>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
> u'\ufffe'
>
> This is *not* a valid Unicode character. The Unicode specification
> (version 4, section 15.8) says the following about non-characters:
>
>> Applications are free to use any of these noncharacter code points
>> internally but should never attempt to exchange them. If a
>> noncharacter is received in open interchange, an application is not
>> required to interpret it in any way. It is good practice, however, to
>> recognize it as a noncharacter and to take appropriate action, such as
>> removing it from the text. Note that Unicode conformance freely allows
>> the removal of these characters. (See C10 in Section 3.2, Conformance
>> Requirements.)
>
> My interpretation of the specification is that Python should silently
> remove the character, resulting in a zero-length Unicode string.
> Similarly, both of the following lines should also result in a
> zero-length Unicode string:
>
>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> u'\ufffe'
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> u'\uffff'

Hmm, wouldn't it be better to raise an error? After all, a reversed BOM
in the stream looks a lot like you're trying to decode a UTF-16 stream
assuming the wrong byte order.

Other than that: +1 on fixing this case.

--
Marc-Andre Lemburg
eGenix.com
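For reference, a rough sketch of what the "raise an error instead"
option could look like at the application level (Python 2.x;
decode_utf16_strict is a made-up helper, not an existing codec feature):

    def decode_utf16_strict( data ):
        # Decode with the stock "utf-16" codec, then reject a leading
        # reversed BOM instead of passing it through silently.
        text = data.decode( "utf-16" )
        if text[:1] == u'\ufffe':
            raise UnicodeError( "reversed BOM: data probably uses the "
                                "opposite byte order" )
        return text

With the example above, decode_utf16_strict( '\xff\xfe\xfe\xff' ) would
raise instead of returning u'\ufffe'.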