Re: [Python-Dev] Unicode byte order mark decoding

Evan Jones Fri, 01 Apr 2005 19:03:37 -0800

On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:

The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).

Well, its origins do not really matter, since at this point the BOM is firmly encoded in the Unicode standard. It seems to me that it is in everyone's best interest to support it.

It is not needed for the UTF-8 because that format doesn't rely on
the byte order and the BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width non breakable space) code point.

You are correct: it is a legitimate character. However, its use as a ZWNBSP character has been deprecated:

The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics is intended.

Also, the Unicode specification is ambiguous on what an implementation should do about a leading ZWNBSP that is encoded in UTF-16. Like I mentioned, if you look at the Unicode standard, version 4, section 15.9, it says:

2. Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.

This seems to indicate that it is permitted to strip the BOM from the beginning of UTF-8 text.

-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with BOM marks as the result of recoding
UTF-16 files into UTF-8.

This is clearly incorrect. The UTF-8 BOM is specified in the Unicode standard version 4, section 15.9:

In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.
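As a quick check of that byte sequence, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part (this thread predates it), and the variable names are only illustrative:

```python
import codecs

# U+FEFF encoded as UTF-8 yields the three-byte signature quoted above.
bom_bytes = "\ufeff".encode("utf-8")

# The codecs module exposes the same bytes as a constant.
matches_constant = (bom_bytes == codecs.BOM_UTF8)

print(bom_bytes.hex())
print(matches_constant)
```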

I regularly see files with UTF-8 BOMs produced by Windows applications when a text file is saved as UTF-8. I think that Notepad or WordPad does this, for example. I think UltraEdit also does the same thing. I know that Scintilla definitely does.

At the very least, it would be nice to add a note about this to the
documentation, and possibly add this example function that implements
the "UTF-8 or ASCII?" logic.

Well, I'd say that's a very English way of dealing with encoded
text ;-)

Please note I am saying only that something like this may want to be considered for addition to the documentation, and not to the Python standard library. This example function more closely replicates the logic that is used by those Windows applications when opening ".txt" files. It uses the default locale if there is no BOM:

import codecs

def autodecode( s ):
        if s.startswith( codecs.BOM_UTF8 ):
                # The byte string s is UTF-8: decode it, then drop the BOM character
                out = s.decode( "utf8" )
                return out[1:]
        else:
                # No BOM: fall back to the default encoding
                return s.decode()
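For readers on Python 3, the same idea might look roughly like the sketch below. The name autodecode_py3 is hypothetical, and using locale.getpreferredencoding() for the "default locale" fallback is my assumption about what that fallback should mean; it is not part of the proposal itself:

```python
import codecs
import locale

def autodecode_py3(data):
    """Hypothetical Python 3 variant of autodecode(); data is a bytes object."""
    if data.startswith(codecs.BOM_UTF8):
        # Signature present: skip its three bytes and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the locale's preferred encoding (assumption).
    return data.decode(locale.getpreferredencoding(False))

print(autodecode_py3(codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")))
```

Stripping the signature at the byte level avoids having to slice the decoded string afterwards, but the result is the same as in the function above.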

BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle ?

Well, the same argument could be applied to the UTF-16 decoder: how does it know that the string came from the start of a file, and not from slicing some already loaded file? The standard states that:

In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.

So it is perfectly permissible to perform this type of processing if you consider a string to be equivalent to a stream.
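To illustrate the UTF-16 behaviour being compared against, here is a small sketch in Python 3 syntax (an assumption, as above). The generic "utf-16" codec treats a leading U+FEFF as a byte order signal and consumes it, while the explicit "utf-16-le" codec keeps it as a character:

```python
import codecs

# A little-endian UTF-16 byte string that starts with a BOM.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic codec reads the BOM to pick the byte order and drops it.
with_bom_consumed = data.decode("utf-16")

# The explicit little-endian codec leaves U+FEFF in the decoded text.
with_bom_kept = data.decode("utf-16-le")

print(repr(with_bom_consumed))
print(repr(with_bom_kept))
```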

My interpretation of the specification means that Python should silently remove the character, resulting in a zero length Unicode string.

Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!

Well, either one is possible. However, the Unicode standard suggests, but does not require, silently removing them:

It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters.

I would prefer that the str.decode() function silently ignore them, since I believe in "be strict in what you emit, but liberal in what you accept." I think that this only applies to str.decode(). Any other attempt to create non-characters, such as unichr( 0xffff ), *should* raise an exception because clearly the programmer is making a mistake.
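To make the "silently remove" option concrete, here is a rough sketch of a post-decoding filter. It is written in Python 3 syntax, the helper names are hypothetical, and it is not how any existing decoder behaves. It treats U+FDD0..U+FDEF and the last two code points of every plane as noncharacters:

```python
def is_noncharacter(ch):
    """Return True if ch is one of the Unicode noncharacters."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus U+xFFFE and U+xFFFF in every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def strip_noncharacters(text):
    """Hypothetical helper: drop noncharacters from already-decoded text."""
    return "".join(ch for ch in text if not is_noncharacter(ch))

# A lone reversed BOM (U+FFFE) reduces to a zero length string.
print(repr(strip_noncharacters("\ufffe")))
```

Note that U+FEFF itself is not a noncharacter, so this filter leaves a real BOM alone.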

Other than that: +1 on fixing this case.


Cool!

Evan Jones
