Re: Is this a bug? BOM decoded with UTF8
> What are you talking about? The BOM and UTF-16 go hand in hand. Without
> a Byte Order Mark, you can't unambiguously determine whether big- or
> little-endian UTF-16 was used.

In the old days, UCS-2 was *implicitly* big-endian. It was only when
Microsoft got that wrong that the little-endian version of UCS-2 came
along. So while the BOM is now part of all relevant specifications, it is
still "Microsoft crap".

> For more details, see:
> http://www.unicode.org/faq/utf_bom.html#BOM

"some higher level protocols", "can be useful" - not "an inherent part of
all byte-level encodings".

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
Re: Is this a bug? BOM decoded with UTF8
> They say that it makes no sense as a byte-order indicator but they
> indicate that it can be used as a file signature.
>
> And I'm not sure what you mean about decoding a UTF-8 string given any
> 8-bit encoding. Of course the encoder must be known:

What I mean is that every utf-8 string can be decoded in any byte-sized
encoding. Does it make sense? No. But does it fail (as decoding arbitrary
bytes as utf-8 frequently does)? No.

So if you are in a situation where you _don't_ know the encoding, a
decoding can only be based on a heuristic. And a utf-8 BOM can be part of
that heuristic - but it still is only a hint. Besides that, lots of tools
don't produce it. E.g. everything that produces/consumes xml doesn't need
it.

> >>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
> ... .encode('utf-8').decode('latin1').encode('latin1')
> 'T\xc3\xbcr'

If the encoding is known, using the BOM becomes obsolete.

> I can assure you that most Germans can differentiate between "Tür" and
> "TÃ¼r".

Oh, Germans can. Computers otoh can't. You could try and use common words
like "für" and so on for a heuristic. But that is no guarantee.

> Using a BOM with UTF-8 makes it easy to identify it as such AND it
> shouldn't break any properly written Unicode-aware tools.

As the FAQ states, that can very well happen.

--
Regards,
Diez B. Roggisch
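[Editorial note: Diez's claim, that any UTF-8 byte string decodes successfully, if wrongly, under an 8-bit encoding such as latin1, can be sketched in modern Python 3 terms:]

```python
# latin-1 maps every byte value 0x00-0xFF to a code point, so decoding
# any byte sequence with it always succeeds -- it just may be nonsense.
utf8_bytes = "Tür".encode("utf-8")
assert utf8_bytes == b"T\xc3\xbcr"

mojibake = utf8_bytes.decode("latin-1")   # succeeds, but wrong
assert mojibake == "TÃ¼r"

# Decoding arbitrary non-UTF-8 bytes as UTF-8, by contrast, often fails:
try:
    b"T\xfcr".decode("utf-8")             # 'Tür' encoded as latin-1
    raise AssertionError("expected a UnicodeDecodeError")
except UnicodeDecodeError:
    pass
```

This asymmetry is why a successful UTF-8 decode is weak evidence but a successful latin-1 decode is no evidence at all.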
Re: Is this a bug? BOM decoded with UTF8
Diez B. Roggisch wrote:
> I'm well aware of the need of a BOM for fixed-size multibyte characters
> like utf16. But I don't see the need for that on an utf-8 byte sequence,
> and I first encountered that in MS tool output - can't remember when and
> what exactly that was. And I have to confess that I attributed that as a
> stupidity from MS. But according to the FAQ you mentioned, it is
> apparently legal in utf-8 too. Nevertheless the FAQ states:
> [snipped]
> So they admit that it makes no sense - especially as decoding a utf-8
> string given any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoder must be known:

>>> u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
... .encode('utf-8').decode('latin1').encode('latin1')
'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

Cheers,
Brian
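[Editorial note: Brian's "file signature" use of the BOM can be sketched as a small sniffer. The `sniff` helper below is hypothetical, not a standard API, but the `codecs` BOM constants it uses are real:]

```python
import codecs

def sniff(data: bytes) -> str:
    """Hypothetical helper: guess the encoding from a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):        # b'\xef\xbb\xbf'
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):    # b'\xfe\xff'
        return "utf-16-be"
    return "unknown"

assert sniff(b"\xef\xbb\xbfaa") == "utf-8"      # pekka's my.utf8
assert sniff(b"\xfe\xffaa") == "utf-16-be"      # pekka's my.utf16
assert sniff(b"aa") == "unknown"                # no BOM: only a heuristic
```

As Diez notes elsewhere in the thread, the absence of a BOM proves nothing, so this can only ever be one input to a heuristic.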
Re: Is this a bug? BOM decoded with UTF8
Diez B. Roggisch wrote:
> So they admit that it makes no sense - especially as decoding a utf-8
> string given any 8-bit encoding like latin1 will succeed.
>
> So in the end, I stand corrected. But I still think it's crap - but not
> MS crap. :)

Oh, good. I'm not the only person who went "A BOM in UTF-8 data? WTF do
you need a byte order marker for when you have 8-bit data?"

It also clarifies Martin's comment about the UTF-8 codec ignoring the
existence of this piece of silliness :)

Cheers,
Nick.

--
Nick Coghlan | [EMAIL PROTECTED] | Brisbane, Australia
---
http://boredomandlaziness.skystorm.net
Re: Is this a bug? BOM decoded with UTF8
> What are you talking about? The BOM and UTF-16 go hand in hand. Without
> a Byte Order Mark, you can't unambiguously determine whether big- or
> little-endian UTF-16 was used. If, for example, you came across a UTF-16
> text file containing this hexadecimal data:
>
> 2200
>
> what would you assume? That it is a quote character in little-endian
> format or that it is a for-all symbol in big-endian format?

I'm well aware of the need of a BOM for fixed-size multibyte characters
like utf16. But I don't see the need for that on an utf-8 byte sequence,
and I first encountered that in MS tool output - can't remember when and
what exactly that was. And I have to confess that I attributed that as a
stupidity from MS.

But according to the FAQ you mentioned, it is apparently legal in utf-8
too. Nevertheless the FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?

A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to
the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is only used as a signature - an indication that an
otherwise unmarked text file is in UTF-8. Note that some recipients of
UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently
in 8-bit environments, the use of a BOM will interfere with any protocol
or file format that expects specific ASCII characters at the beginning,
such as the use of "#!" at the beginning of Unix shell scripts. [AF] &
[MD]
"""

So they admit that it makes no sense - especially as decoding a utf-8
string given any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - but not
MS crap. :)

--
Regards,
Diez B. Roggisch
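[Editorial note: the "#!" interference the FAQ mentions is easy to demonstrate. An exec loader checks the first raw bytes of the file for the literal magic `#!`; a BOM in front pushes that magic to offset 3. A minimal Python 3 sketch:]

```python
import codecs

script = b"#!/bin/sh\necho hello\n"
with_bom = codecs.BOM_UTF8 + script

# The shebang check looks at the very first bytes of the file;
# with a BOM prepended, the file no longer begins with '#!'.
assert script.startswith(b"#!")
assert not with_bom.startswith(b"#!")
assert with_bom[:3] == b"\xef\xbb\xbf"
```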
Re: Is this a bug? BOM decoded with UTF8
Diez B. Roggisch wrote:
>> I know it's easy (string.replace()) but why does UTF-16 do it on its
>> own then? Is that according to Unicode standard or just Python
>> convention?
> BOM is microsoft-proprietary crap. UTF-16 is defined in the unicode
> standard.

What are you talking about? The BOM and UTF-16 go hand in hand. Without a
Byte Order Mark, you can't unambiguously determine whether big- or
little-endian UTF-16 was used. If, for example, you came across a UTF-16
text file containing this hexadecimal data:

2200

what would you assume? That it is a quote character in little-endian
format or that it is a for-all symbol in big-endian format?

For more details, see:
http://www.unicode.org/faq/utf_bom.html#BOM

Cheers,
Brian
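[Editorial note: Brian's `2200` example can be checked directly, using Python 3's explicit-endian codec names:]

```python
data = bytes([0x22, 0x00])

# Without a BOM, the same two bytes are valid under either byte order:
assert data.decode("utf-16-le") == '"'        # U+0022 QUOTATION MARK
assert data.decode("utf-16-be") == "\u2200"   # U+2200 FOR ALL
```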
Re: Is this a bug? BOM decoded with UTF8
Diez B. Roggisch wrote:
>> I know it's easy (string.replace()) but why does UTF-16 do it on its
>> own then? Is that according to Unicode standard or just Python
>> convention?
> BOM is microsoft-proprietary crap.

Uh, no. BOM is part of the Unicode standard. The intent is to allow
consumers of Unicode text files to disambiguate UTF-8, big-endian UTF-16
and little-endian UTF-16.

See http://www.unicode.org/faq/utf_bom.html#BOM

Kent
Re: Is this a bug? BOM decoded with UTF8
> I know it's easy (string.replace()) but why does UTF-16 do it on its own
> then? Is that according to Unicode standard or just Python convention?

BOM is microsoft-proprietary crap. UTF-16 is defined in the unicode
standard.

--
Regards,
Diez B. Roggisch
Re: Is this a bug? BOM decoded with UTF8
pekka niiranen wrote:
>>> I have two files "my.utf8" and "my.utf16" which both contain BOM and
>>> two "a" characters.
>>> Contents of "my.utf8" in HEX: EFBBBF6161
>>> Contents of "my.utf16" in HEX: FEFF6161
>> This is not true: this byte string does not denote two "a" characters.
>> Instead, it is a single character, U+6161.
> Correct, I used a hex editor to create those files.
>>> Is there a trick to read UTF8 encoded file with BOM not decoded?
>> It's very easy: just drop the first character if it is the BOM.
> I know it's easy (string.replace()) but why does UTF-16 do it on its own
> then? Is that according to Unicode standard or just Python convention?
>> The UTF-8 codec will never do this on its own.
> Never? Hmm, so that is not going to change in future versions?

Regards,
Martin
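[Editorial note: Martin's "just drop the first character if it is the BOM" translates to a couple of lines, sketched here in Python 3 using the byte contents from the thread:]

```python
raw = b"\xef\xbb\xbfaa"          # EF BB BF 61 61, pekka's my.utf8
text = raw.decode("utf-8")       # the plain utf-8 codec keeps the BOM
assert text == "\ufeffaa"

if text[:1] == "\ufeff":         # drop a leading BOM if present
    text = text[1:]
assert text == "aa"
```

Slicing off the first *character* (not byte) after decoding is safer than a blanket `replace`, which would also strip U+FEFF occurring later in the text.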
Re: Is this a bug? BOM decoded with UTF8
pekka niiranen wrote:
> I have two files "my.utf8" and "my.utf16" which both contain BOM and
> two "a" characters.
> Contents of "my.utf8" in HEX: EFBBBF6161
> Contents of "my.utf16" in HEX: FEFF6161

This is not true: this byte string does not denote two "a" characters.
Instead, it is a single character, U+6161.

> Is there a trick to read UTF8 encoded file with BOM not decoded?

It's very easy: just drop the first character if it is the BOM. The UTF-8
codec will never do this on its own.

Regards,
Martin
Is this a bug? BOM decoded with UTF8
Hi there,

I have two files "my.utf8" and "my.utf16" which both contain a BOM and
two "a" characters.

Contents of "my.utf8" in HEX: EFBBBF6161
Contents of "my.utf16" in HEX: FEFF6161

For some reason Python 2.4 decodes the BOM for UTF8 but not for UTF16.
See below:

>>> fh = codecs.open("my.utf8", "rb", "utf8")
>>> fh.readlines()
[u'\ufeffaa'] # BOM is decoded, why?
>>> fh.close()
>>> fh = codecs.open("my.utf16", "rb", "utf16")
>>> fh.readlines()
[u'\u6161'] # No BOM here
>>> fh.close()

Is there a trick to read a UTF8-encoded file with the BOM not decoded?

-pekka-
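[Editorial note: Python later gained a codec for exactly this question. From Python 2.5 on, 'utf-8-sig' strips a leading BOM on decode and writes one on encode, while plain 'utf-8' leaves it alone. In Python 3 terms:]

```python
raw = b"\xef\xbb\xbfaa"                    # EF BB BF 61 61 from the post

assert raw.decode("utf-8") == "\ufeffaa"   # plain utf-8 keeps the BOM
assert raw.decode("utf-8-sig") == "aa"     # utf-8-sig strips it

# It also round-trips: encoding with utf-8-sig prepends the BOM again.
assert "aa".encode("utf-8-sig") == raw
```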