Stephen J. Turnbull wrote:
>>>>>>"MAL" == M <[EMAIL PROTECTED]> writes:
>
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
>
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.

Is that a MS application ? AFAIK, notepad, wordpad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".

> MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
> MAL> codecs module was probably a mistake to begin with. You
> MAL> usually only get UTF-8 files with BOM marks as the result of
> MAL> recoding UTF-16 files into UTF-8.
>
> There is a standard for UTF-8 _signatures_, however. I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard. Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.

Ok, as signature the BOM does make some sense - whether to strip
signatures from a document is a good idea or not is a different
matter, though.

Here's the Unicode Cons. FAQ on the subject:

    http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data since it
can break applications and protocols that do not expect such a
signature.

> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it. While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
>
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

Right.

> MAL> BTW, how do you know that s came from the start of a file and
> MAL> not from slicing some already loaded file somewhere in the
> MAL> middle ?
>
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.

The UTF-16 stream codecs implement this logic.
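
As a rough illustration of "this logic" (a BOM is treated as a byte-order
signature only at the very start of a stream), here is a minimal sketch in
present-day Python 3, not the Python 2 of this thread. It assumes that the
standard library's incremental UTF-16 decoder is a fair stand-in for the
stream codecs discussed here; the variable names are only illustrative.

```python
import codecs

# One decoder object plays the role of the stream: it keeps state between
# calls, so it knows whether the start of the data has already been seen.
decoder = codecs.getincrementaldecoder("utf-16")()

# First chunk: the bytes FF FE sit at the stream start, so they are read as
# a little-endian BOM and consumed rather than returned as text.
first = decoder.decode(b"\xff\xfeh\x00")

# Later chunk: the same two bytes are no longer at the start, so they are
# decoded as an ordinary U+FEFF character and kept in the output.
second = decoder.decode(b"\xff\xfei\x00")

print(repr(first))
print(repr(second))
```

The design point is that the decision to drop the BOM depends on stream
position, which only a stateful decoder can know.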

The UTF-16 encode and decode functions will however always strip
the BOM mark from the beginning of a string. If the application
doesn't want this stripping to happen, it should use the UTF-16-LE
or -BE codec resp.

> MAL> Evan Jones wrote:
>
> >> This is *not* a valid Unicode character. The Unicode
> >> specification (version 4, section 15.8) says the following
> >> about non-characters:
> >>
> >>> Applications are free to use any of these noncharacter code
> >>> points internally but should never attempt to exchange
> >>> them. If a noncharacter is received in open interchange, an
> >>> application is not required to interpret it in any way. It is
> >>> good practice, however, to recognize it as a noncharacter and
> >>> to take appropriate action, such as removing it from the
> >>> text. Note that Unicode conformance freely allows the removal
> >>> of these characters. (See C10 in Section 3.2, Conformance
> >>> Requirements.)
> >>
> >> My interpretation of the specification means that Python should
>
> The specification _permits_ silent removal; it does not recommend.
>
> >> silently remove the character, resulting in a zero length
> >> Unicode string. Similarly, both of the following lines should
> >> also result in a zero length Unicode string:
>
> >>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> > u'\ufffe'
> >>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> > u'\uffff'
>
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
>
> MAL> Hmm, wouldn't it be better to raise an error ? After all, a
> MAL> reversed BOM mark in the stream looks a lot like you're
> MAL> trying to decode a UTF-16 stream assuming the wrong byte
> MAL> order ?!
>
> +1 on (optionally) raising an error.
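
To make the earlier distinction between the plain UTF-16 codec and the
UTF-16-LE/-BE codecs concrete, the following is a small sketch in
present-day Python 3 (the thread itself uses Python 2 string literals). It
only shows what a current CPython does with these byte strings and is not
meant as a statement about the 2005 implementation; the variable names are
illustrative.

```python
data = b"\xff\xfeh\x00"  # FF FE followed by "h" in little-endian UTF-16

# Generic codec: the leading FF FE is taken as a BOM, used to pick the
# byte order, and dropped from the result.
generic = data.decode("utf-16")

# Explicit byte order: no BOM handling, so U+FEFF stays in the text.
explicit = data.decode("utf-16-le")

# The two byte strings quoted above: after the BOM is consumed, the
# noncharacters are passed through unchanged instead of being removed.
nonchar_fffe = b"\xff\xfe\xfe\xff".decode("utf-16")
nonchar_ffff = b"\xff\xfe\xff\xff".decode("utf-16")

print(repr(generic), repr(explicit), repr(nonchar_fffe), repr(nonchar_ffff))
```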

The advantage of raising an error is that the application can deal
with the situation in whatever way seems fit (by registering a
special error handler or by simply using "ignore" or "replace").

I agree that much of this lies outside the scope of codecs and
should be handled at an application or protocol level.

> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself). It's far too easy for software
> to generate broken Unicode streams[1], and the choice of how to deal
> with those should be with the application, not with the implementation
> language.
>
> Footnotes:
> [1] An egregious example was the Outlook Express distributed with
> early Win2k betas, which produced MIME bodies with apparent
> Content-Type: text/html; charset=utf-16, but the HTML tags and
> newlines were 7-bit ASCII!
>

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Apr 05 2005)
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
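
The error-handler options mentioned in this message can be sketched as
follows, again in present-day Python 3. Since the thread does not settle
which inputs should raise, the example uses a truncated UTF-16 byte string
(an assumption, chosen only because it is rejected in strict mode), and the
handler name "hypothetical_mark" is made up for illustration.

```python
import codecs

broken = b"\xff\xfeh\x00i"  # BOM, "h", then a dangling odd byte

# Default "strict" handling: the application sees the error and decides.
try:
    broken.decode("utf-16")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Built-in handlers: silently drop the bad input, or mark it with U+FFFD.
ignored = broken.decode("utf-16", "ignore")
replaced = broken.decode("utf-16", "replace")


def mark_with_question(exc):
    """Hypothetical custom handler: substitute "?" and resume after the bad bytes."""
    if isinstance(exc, UnicodeDecodeError):
        return ("?", exc.end)
    raise exc


# The registered name is arbitrary; "hypothetical_mark" is not a standard handler.
codecs.register_error("hypothetical_mark", mark_with_question)
custom = broken.decode("utf-16", "hypothetical_mark")

print(strict_failed, repr(ignored), repr(replaced), repr(custom))
```

In all three non-strict cases the policy is chosen by the calling
application, which is the division of responsibility argued for above.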