Re: Is this a bug? BOM decoded with UTF8

2005-02-12 Thread "Martin v. Löwis"
> What are you talking about? The BOM and UTF-16 go hand-in-hand. Without
> a Byte Order Mark, you can't unambiguously determine whether big or
> little endian UTF-16 was used.

In the old days, UCS-2 was *implicitly* big-endian. It was only
when Microsoft got that wrong that a little-endian version of UCS-2
came along. So while the BOM is now part of all relevant specifications,
it is still "Microsoft crap".

> For more details, see:
> http://www.unicode.org/faq/utf_bom.html#BOM

"some higher level protocols", "can be useful" - not
"is an inherent part of all byte-level encodings".

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Diez B. Roggisch
> They say that it makes no sense as a byte-order indicator but they
> indicate that it can be used as a file signature.
> 
> And I'm not sure what you mean about decoding a UTF-8 string given any
> 8-bit encoding. Of course the encoding must be known:

I mean that every UTF-8 string can be decoded with any byte-sized encoding.
Does the result make sense? No. But does the decoding fail (as decoding as
UTF-8 frequently does)? No.
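
The asymmetry described here can be checked directly. The following is a minimal sketch in Python 3 syntax (the thread itself used Python 2.4); the variable names are illustrative only and not from the original posts.

```python
# Latin-1 assigns a character to every byte value 0x00-0xFF, so decoding
# arbitrary bytes as Latin-1 never raises; the result may be mojibake.
utf8_bytes = "T\N{LATIN SMALL LETTER U WITH DIAERESIS}r".encode("utf-8")
as_latin1 = utf8_bytes.decode("latin-1")
print(repr(as_latin1))  # 'TÃ¼r'

# The reverse direction is not safe: Latin-1 bytes are often invalid UTF-8.
latin1_bytes = "T\N{LATIN SMALL LETTER U WITH DIAERESIS}r".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)  # True
```

So a successful decode under an 8-bit encoding says nothing about whether that encoding was the right one.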

So if you are in a situation where you _don't_ know the encoding, a decoding
can only be based on a heuristic. And a utf-8 BOM can be part of that
heuristic - but it still is only a hint. Besides that, lots of tools don't
produce it. E.g. everything that produces/consumes xml doesn't need it.

>  >>> (u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
> ...   .encode('utf-8').decode('latin1').encode('latin1'))
> 'T\xc3\xbcr'

If the encoding has to be known anyway, the BOM becomes redundant.

> I can assure you that most Germans can differentiate between "Tür" and
> "TÃ¼r".

Oh, Germans can. Computers, OTOH, can't. You could try to use common words
like "für" and so on for a heuristic. But that is no guarantee.

> Using a BOM with UTF-8 makes it easy to identify it as such AND it
> shouldn't break any properly written Unicode-aware tools.

As the FAQ states, that can very well happen.

-- 
Regards,

Diez B. Roggisch


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Brian Quinlan
Diez B. Roggisch wrote:
> I'm well aware of the need of a bom for fixed-size multibyte-characters like
> utf16.
>
> But I don't see the need for that on an utf-8 byte sequence, and I first
> encountered that in MS tool output - can't remember when and what exactly
> that was. And I have to confess that I attributed that as a stupidity from
> MS. But according to the FAQ you mentioned, it is apparently legal in utf-8
> too. Nevertheless the FAQ states:

[snipped]

> So they admit that it makes no sense - especially as decoding a utf-8 string
> given any 8-bit encoding like latin1 will succeed.

They say that it makes no sense as a byte-order indicator but they
indicate that it can be used as a file signature.

And I'm not sure what you mean about decoding a UTF-8 string given any
8-bit encoding. Of course the encoding must be known:

>>> (u'T\N{LATIN SMALL LETTER U WITH DIAERESIS}r'
...   .encode('utf-8').decode('latin1').encode('latin1'))
'T\xc3\xbcr'

I can assure you that most Germans can differentiate between "Tür" and
"TÃ¼r".

Using a BOM with UTF-8 makes it easy to identify it as such AND it
shouldn't break any properly written Unicode-aware tools.

Cheers,
Brian


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Nick Coghlan
Diez B. Roggisch wrote:
> So they admit that it makes no sense - especially as decoding a utf-8 string
> given any 8-bit encoding like latin1 will succeed.
>
> So in the end, I stand corrected. But I still think it's crap - But not MS
> crap. :)

Oh, good. I'm not the only person who went "A BOM in UTF-8 data? WTF do you need
a byte order marker for when you have 8-bit data?"

It also clarifies Martin's comment about the UTF-8 codec ignoring the existence 
of this piece of silliness :)

Cheers,
Nick.
--
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
http://boredomandlaziness.skystorm.net


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Diez B. Roggisch
> What are you talking about? The BOM and UTF-16 go hand-in-hand.
> Without a Byte Order Mark, you can't unambiguously determine whether big
> or little endian UTF-16 was used. If, for example, you came across a
> UTF-16 text file containing this hexadecimal data: 2200
>
> what would you assume? That it is a quote character in little-endian
> format or that it is a for-all symbol in big-endian format?

I'm well aware of the need of a bom for fixed-size multibyte-characters like
utf16.

But I don't see the need for that on an utf-8 byte sequence, and I first
encountered that in MS tool output - can't remember when and what exactly
that was. And I have to confess that I attributed that as a stupidity from
MS. But according to the FAQ you mentioned, it is apparently legal in utf-8
too. Nevertheless the FAQ states:

"""
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?


A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the
endianness of the byte stream. UTF-8 always has the same byte order. An
initial BOM is only used as a signature - an indication that an otherwise
unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded
data do not expect a BOM. Where UTF-8 is used transparently in 8-bit
environments, the use of a BOM will interfere with any protocol or file
format that expects specific ASCII characters at the beginning, such as the
use of "#!" at the beginning of Unix shell scripts. [AF] & [MD]
"""

So they admit that it makes no sense - especially as decoding a utf-8 string
given any 8-bit encoding like latin1 will succeed.

So in the end, I stand corrected. But I still think it's crap - But not MS
crap. :)

-- 
Regards,

Diez B. Roggisch


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Brian Quinlan
Diez B. Roggisch wrote:
>> I know it's easy (string.replace()) but why does UTF-16 do
>> it on its own then? Is that according to the Unicode standard or just
>> Python convention?
>
> BOM is microsoft-proprietary crap. UTF-16 is defined in the unicode
> standard.

What are you talking about? The BOM and UTF-16 go hand-in-hand.
Without a Byte Order Mark, you can't unambiguously determine whether big
or little endian UTF-16 was used. If, for example, you came across a
UTF-16 text file containing this hexadecimal data: 2200

what would you assume? That it is a quote character in little-endian
format or that it is a for-all symbol in big-endian format?
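
The ambiguity can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax (the thread predates it); `data`, `little` and `big` are illustrative names, not part of the original post.

```python
import unicodedata

# The two bytes 0x22 0x00 with no BOM in front of them.
data = bytes.fromhex("2200")

# Force each byte order explicitly; both decodes succeed.
little = data.decode("utf-16-le")
big = data.decode("utf-16-be")

print(unicodedata.name(little))  # QUOTATION MARK
print(unicodedata.name(big))     # FOR ALL
```

Both results are valid text, so nothing in the bytes themselves says which reading was intended.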

For more details, see:
http://www.unicode.org/faq/utf_bom.html#BOM
Cheers,
Brian


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Kent Johnson
Diez B. Roggisch wrote:
>> I know it's easy (string.replace()) but why does UTF-16 do
>> it on its own then? Is that according to the Unicode standard or just
>> Python convention?
>
> BOM is microsoft-proprietary crap.

Uh, no. BOM is part of the Unicode standard. The intent is to allow consumers of Unicode text files
to disambiguate UTF-8, big-endian UTF-16 and little-endian UTF-16.
See http://www.unicode.org/faq/utf_bom.html#BOM
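
As a rough illustration of that disambiguation, here is a minimal sketch in Python 3 syntax. The helper name `sniff_bom` is hypothetical; only the `codecs.BOM_*` constants come from the standard library. It deliberately covers just the three cases named above: UTF-32 is not handled, and data without a BOM yields no guess at all.

```python
import codecs

def sniff_bom(data):
    """Guess (encoding_name, bom_length) from a leading BOM, else (None, 0)."""
    candidates = (
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The two files from the original question, as raw bytes.
print(sniff_bom(bytes.fromhex("EFBBBF6161")))  # ('utf-8', 3)
print(sniff_bom(bytes.fromhex("FEFF6161")))    # ('utf-16-be', 2)
print(sniff_bom(b"aa"))                        # (None, 0)
```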

Kent


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread Diez B. Roggisch
> I know it's easy (string.replace()) but why does UTF-16 do
> it on its own then? Is that according to the Unicode standard or just
> Python convention?

BOM is microsoft-proprietary crap. UTF-16 is defined in the unicode
standard.
-- 
Regards,

Diez B. Roggisch


Re: Is this a bug? BOM decoded with UTF8

2005-02-11 Thread pekka niiranen
> pekka niiranen wrote:
>> I have two files "my.utf8" and "my.utf16" which
>> both contain BOM and two "a" characters.
>> Contents of "my.utf8" in HEX:
>> EFBBBF6161
>> Contents of "my.utf16" in HEX:
>> FEFF6161
>
> This is not true: this byte string does not denote
> two "a" characters. Instead, it is a single character
> U+6161.

Correct, I used a hex editor to create those files.

>> Is there a trick to read UTF8 encoded file with BOM not decoded?
>
> It's very easy: just drop the first character if it is the BOM.

I know it's easy (string.replace()) but why does UTF-16 do
it on its own then? Is that according to the Unicode standard or just
Python convention?

> The UTF-8 codec will never do this on its own.

Never? Hmm, so that is not going to change in future versions?

> Regards,
> Martin


Re: Is this a bug? BOM decoded with UTF8

2005-02-10 Thread "Martin v. Löwis"
pekka niiranen wrote:
> I have two files "my.utf8" and "my.utf16" which
> both contain BOM and two "a" characters.
> Contents of "my.utf8" in HEX:
> EFBBBF6161
> Contents of "my.utf16" in HEX:
> FEFF6161

This is not true: this byte string does not denote
two "a" characters. Instead, it is a single character
U+6161.
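
This is easy to confirm. A minimal sketch in Python 3 syntax, with illustrative variable names: after the two-byte BOM, UTF-16 reads the remaining bytes in 16-bit units, so 61 61 is one code unit, and two "a" characters would need 00 61 00 61.

```python
# The bytes from "my.utf16": BOM FE FF followed by 61 61.
data = bytes.fromhex("FEFF6161")
text = data.decode("utf-16")      # the BOM selects big-endian and is consumed
print(len(text), hex(ord(text)))  # 1 0x6161

# What a big-endian UTF-16 file holding "aa" would actually contain.
two_a = bytes.fromhex("FEFF00610061").decode("utf-16")
print(two_a)  # aa
```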

> Is there a trick to read UTF8 encoded file with BOM not decoded?

It's very easy: just drop the first character if it is the BOM.
The UTF-8 codec will never do this on its own.
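
A minimal sketch of that approach, written in Python 3 syntax rather than the Python 2.4 of this thread. The function name `decode_utf8_drop_bom` is hypothetical. It checks only the first character, since U+FEFF elsewhere in the text is not a signature and should be left alone.

```python
def decode_utf8_drop_bom(data):
    """Decode UTF-8 bytes and drop one leading U+FEFF, if present."""
    text = data.decode("utf-8")
    if text.startswith("\ufeff"):
        text = text[1:]
    return text

print(decode_utf8_drop_bom(bytes.fromhex("EFBBBF6161")))  # aa
print(decode_utf8_drop_bom(b"aa"))                        # aa
```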
Regards,
Martin


Is this a bug? BOM decoded with UTF8

2005-02-10 Thread pekka niiranen
Hi there,
I have two files "my.utf8" and "my.utf16" which
both contain BOM and two "a" characters.
Contents of "my.utf8" in HEX:
EFBBBF6161
Contents of "my.utf16" in HEX:
FEFF6161
For some reason Python 2.4 decodes the BOM for UTF-8
but not for UTF-16. See below:
>>> import codecs
>>> fh = codecs.open("my.utf8", "rb", "utf8")
>>> fh.readlines()
[u'\ufeffaa']   # BOM is decoded, why?
>>> fh.close()
>>> fh = codecs.open("my.utf16", "rb", "utf16")
>>> fh.readlines()
[u'\u6161'] # No BOM here
>>> fh.close()
Is there a trick to read UTF8 encoded file with BOM not decoded?
-pekka-
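
For readers on later versions: Python 2.5 and newer include a `utf-8-sig` codec that skips a leading BOM when decoding. It was not available in the Python 2.4 used in this post. A minimal sketch in Python 3 syntax, using the same bytes as "my.utf8"; the variable names are illustrative.

```python
# Contents of "my.utf8" from the post: BOM EF BB BF followed by "aa".
data = bytes.fromhex("EFBBBF6161")

plain = data.decode("utf-8")       # keeps the BOM as U+FEFF
skipped = data.decode("utf-8-sig") # drops a leading BOM if one is present
print(repr(plain))    # '\ufeffaa'
print(repr(skipped))  # 'aa'

# Input without a BOM decodes the same way under utf-8-sig.
no_bom = b"aa".decode("utf-8-sig")
print(repr(no_bom))   # 'aa'
```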