Re: why isn't Unicode the default encoding?

2006-03-21 Thread Jon Ribbens
In article <[EMAIL PROTECTED]>, Martin v. Löwis wrote:
> In any case, it doesn't matter what encoding the document is in:
> read(2) always returns two bytes.

It returns *up to* two bytes. Sorry to be picky but I think it's
relevant to the topic because it illustrates how it's difficult
to change the definition of file.read() to return characters
instead of bytes (if the file is ready to read, there will always
be one or more bytes available (or EOF), but there won't always
be one or more characters available).
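
A minimal sketch of that situation, assuming Python 3 (where bytes and
text are separate types). The pipe and the choice of UTF-8 are
illustrative assumptions, standing in for any file that is "ready to
read":

```python
import codecs
import os

# A pipe stands in for any descriptor that becomes "ready to read".
r, w = os.pipe()
decoder = codecs.getincrementaldecoder("utf-8")()

encoded = "\u00e9".encode("utf-8")   # e-acute: two bytes, 0xC3 0xA9

# Only the first byte has arrived so far.
os.write(w, encoded[:1])
chunk = os.read(r, 2)                # asks for up to 2 bytes, gets 1
partial = decoder.decode(chunk)      # no complete character yet

# The second byte arrives later and completes the character.
os.write(w, encoded[1:])
rest = decoder.decode(os.read(r, 2))

os.close(r)
os.close(w)

print(len(chunk), len(partial), len(rest))
```

The first read has a byte to return but no character; only the second
read completes one.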
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
John Salerno wrote:
> Interesting. So then the read() method, if given a numeric argument for 
> bytes to read, would act differently depending on if you were using 
> Unicode or not?

The read method currently returns a byte string, not a Unicode string.
It's not clear to me how the numeric argument should be interpreted when
it returns characters some day; it might be best to take the number as
counting characters, then. However, not supporting a numeric argument
at all might also be reasonable.
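
For comparison, a small sketch of the "count characters" option,
assuming Python 3's io module (which postdates this thread). There,
the argument counts bytes on a binary stream and characters on a text
stream:

```python
import io

raw = "\u65e5\u672c\u8a9e".encode("utf-8")   # 3 characters, 9 bytes

# Binary stream: the argument counts bytes.
two_bytes = io.BytesIO(raw).read(2)

# Text stream: the argument counts characters.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
two_chars = text_stream.read(2)

print(len(raw), len(two_bytes), len(two_chars))
```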

> As it is now, it seems to equate the bytes with number 
> of characters, but if the document was written using Unicode characters, 
> is it possible that read(2) might only pull out one character?

Unicode isn't a character coding (*all* documents in the world are
"written in Unicode", including those encoded with ASCII or
Latin-1).

In any case, it doesn't matter what encoding the document is in:
read(2) always returns two bytes. How many characters that constitutes
depends on the encoding - but read() doesn't return a character
string.

It might be that these two bytes are only part of a character,
e.g. if you need three bytes to encode a character, or it might
be that they are parts of two characters, e.g. when you get the
second byte of the first character and the first byte of the
second one. In some encodings (e.g. ISO-2022), these bytes
may indicate *no* character, e.g. when the bytes just indicate
an in-stream change of character set.
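
The three cases can be sketched as follows, assuming Python 3 syntax.
The sample characters and the helper decodes_cleanly() are made up for
illustration:

```python
# Case 1: two bytes that are only part of one character.
euro = "\u20ac".encode("utf-8")          # 3 bytes: e2 82 ac
part_of_one = euro[:2]

# Case 2: two bytes that straddle two characters.
two_e = "\u00e9\u00e9".encode("utf-8")   # c3 a9 c3 a9
straddling = two_e[1:3]                  # a9 c3

# Case 3: bytes that encode no character at all. ISO-2022-JP output
# starts with an escape sequence that only switches the character set.
jp = "\u65e5\u672c".encode("iso2022_jp")
escape_prefix = jp[:3]


def decodes_cleanly(data, encoding):
    """Return True if data is a complete, valid byte sequence."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_cleanly(part_of_one, "utf-8"))
print(decodes_cleanly(straddling, "utf-8"))
print(escape_prefix)
```

In the third case the first two bytes read would belong to the
three-byte escape sequence and carry no character of their own.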

Regards,
Martin


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Matt Goodall
John Salerno wrote:
> Martin v. Löwis wrote:
> 
>> The real problem is that the Python string type is used to represent
>> two very different concepts: bytes, and characters. You can't just drop
>> the current Python string type, and use the Unicode type instead - then
>> you would have no good way to represent sequences of bytes anymore.
>> Byte sequences occur more often than you might think: a ZIP file, a
>> MS Word file, a PDF file, and even an HTTP conversation are represented
>> through byte sequences.
>>
>> So for a byte sequence, internal representation is important; for a
>> character string, it is not. Now, for historical reasons, the Python
>> string literals create byte strings, not character strings. Since we
>> cannot know whether a certain string literal is meant to denote bytes
>> or characters, we can't just change the interpretation.
> 
> Interesting. So then the read() method, if given a numeric argument for 
> bytes to read, would act differently depending on if you were using 
> Unicode or not? As it is now, it seems to equate the bytes with number 
> of characters, but if the document was written using Unicode characters, 
> is it possible that read(2) might only pull out one character?

Exactly. read(2) might pull out one character, or only half a character.
It all depends on the encoding of the data you're reading.

If you're reading or writing text to a file (or anywhere, for that
matter) you need to know the character encoding of the file's content
to read it correctly.

Fortunately, the codecs module makes the whole process relatively painless:

>>> import codecs
>>> f = open("a_utf8_encoded_file.txt", "rb")
>>> stream = codecs.getreader('utf-8')(f)
>>> c = stream.read(1)

The 'stream' works on unicode characters so 'c' is a unicode instance,
i.e. a whole textual character.
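
A self-contained variant of the same idea, as a sketch in Python 3
syntax. The in-memory byte stream and the sample text are assumptions
standing in for the file on disk:

```python
import codecs
import io

# An in-memory byte stream stands in for the UTF-8 encoded file.
raw = io.BytesIO("\u20acuro".encode("utf-8"))
stream = codecs.getreader("utf-8")(raw)

c = stream.read(1)     # one whole character, although it is 3 bytes in UTF-8
rest = stream.read()   # the remaining characters

print(len(c), len(rest))
```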

- Matt

-- 
 __
/  \__ Matt Goodall, Pollenation Internet Ltd
\__/  \w: http://www.pollenation.net
  __/  \__/e: [EMAIL PROTECTED]
 /  \__/  \t: +44 (0)113 2252500
 \__/  \__/
 /  \  Any views expressed are my own and do not necessarily
 \__/  reflect the views of my employer.

Re: why isn't Unicode the default encoding?

2006-03-20 Thread and-google
John Salerno wrote:

> So as it turns out, Unicode and UTF-8 are not the same thing?

Well, yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character
encoding lists to mean UTF-16LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However,
UTF-16LE is not any more definitively 'Unicode' than UTF-8 is.

Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:

  - Unicode characters stored natively in 16-bit units (using two
    16-bit code units, a surrogate pair, to represent characters
    outside of the Basic Multilingual Plane)

  - Either of the 8-bit encodings UTF-16LE and UTF-16BE, detected
    automatically using a Byte Order Mark when loaded, or chosen
    arbitrarily when saving

Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the 8-bit
encodings UTF-32LE and UTF-32BE. Phew!
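
The two meanings of 'UTF-16' can be seen in a short sketch, assuming
Python 3 and its standard codec names "utf-16", "utf-16-le" and
"utf-16-be". The sample character is an arbitrary choice from outside
the Basic Multilingual Plane:

```python
import codecs
import sys

clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP

# Explicit byte orders: no BOM, one surrogate pair (2 x 16 bits).
le = clef.encode("utf-16-le")
be = clef.encode("utf-16-be")

# Plain "utf-16" prepends a BOM in the platform's native byte order.
with_bom = "a".encode("utf-16")
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

# When decoding, the BOM selects the byte order and is then dropped.
decoded = (codecs.BOM_UTF16_BE + "a".encode("utf-16-be")).decode("utf-16")

print(len(le), len(be), len(with_bom), decoded)
```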

To summarise: confusion.

> Am I right to say that UTF-8 stores the first 128 Unicode code points
> in a single byte, and then stores higher code points in however many
> bytes they may need?

That is correct.
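
A quick check of this, as a sketch in Python 3 syntax with arbitrarily
chosen sample characters:

```python
# One sample from each UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s): %s" % (ord(ch), len(encoded), encoded.hex()))

lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every code point below 128 takes exactly one byte.
first_128_single = all(len(chr(i).encode("utf-8")) == 1 for i in range(128))
```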

To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However as Unicode text
manipulation becomes a more common event than byte string processing,
it makes sense to change the default kind of string you get when you
type a literal.

Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.

-- 
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/



Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Martin v. Löwis wrote:

> The real problem is that the Python string type is used to represent
> two very different concepts: bytes, and characters. You can't just drop
> the current Python string type, and use the Unicode type instead - then
> you would have no good way to represent sequences of bytes anymore.
> Byte sequences occur more often than you might think: a ZIP file, a
> MS Word file, a PDF file, and even an HTTP conversation are represented
> through byte sequences.
> 
> So for a byte sequence, internal representation is important; for a
> character string, it is not. Now, for historical reasons, the Python
> string literals create byte strings, not character strings. Since we
> cannot know whether a certain string literal is meant to denote bytes
> or characters, we can't just change the interpretation.

Interesting. So then the read() method, if given a numeric argument for 
bytes to read, would act differently depending on if you were using 
Unicode or not? As it is now, it seems to equate the bytes with number 
of characters, but if the document was written using Unicode characters, 
is it possible that read(2) might only pull out one character?

Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Martin v. Löwis wrote:
> John Salerno wrote:
>> Robert Kern wrote:
>>
>>>   http://www.joelonsoftware.com/articles/Unicode.html
>>
>> That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 
>> are not the same thing? Am I right to say that UTF-8 stores the first 
>> 128 Unicode code points in a single byte, and then stores higher code 
>> points in however many bytes they may need? If so, I guess I had been 
>> misled by the '8' in the name, thinking that UTF-8 was another way of 
>> storing characters in one byte (which would make it no different than 
>> Latin-1, I suppose).
> 
> That's all correct, except for the last parenthetical remark: using
> a single-byte character set isn't the same as using Latin-1. There
> are various single-byte character sets; they have names like Latin-2,
> Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.
> 
> Regards,
> Martin

Oh, I just meant that Latin-1 was an example of a one-byte character 
set, right? So UTF-8 would be identical to it if it worked how I used to 
think it did.

Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
> I figured this might have something to do with it, but then again I 
> thought that Unicode was created as a subset of ASCII and Latin-1 so 
> that they would be compatible...but I guess it's never that easy. :)

The real problem is that the Python string type is used to represent
two very different concepts: bytes, and characters. You can't just drop
the current Python string type, and use the Unicode type instead - then
you would have no good way to represent sequences of bytes anymore.
Byte sequences occur more often than you might think: a ZIP file, a
MS Word file, a PDF file, and even an HTTP conversation are represented
through byte sequences.

So for a byte sequence, internal representation is important; for a
character string, it is not. Now, for historical reasons, the Python
string literals create byte strings, not character strings. Since we
cannot know whether a certain string literal is meant to denote bytes
or characters, we can't just change the interpretation.

Unicode is a superset of ASCII and Latin-1, but not of byte sequences.
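
A small sketch of that last sentence, assuming Python 3 (where bytes
and text are separate types). The byte string binary_blob is a made-up
example of non-text data:

```python
all_bytes = bytes(range(256))

# Latin-1 maps every byte value to the code point with the same number.
as_latin1 = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(as_latin1, all_bytes))

# ASCII covers only the first 128 of those values.
ascii_ok = all_bytes[:128].decode("ascii") == as_latin1[:128]

# An arbitrary byte sequence is not necessarily text in a given encoding.
binary_blob = b"PK\x03\x04\xff\xfe"   # hypothetical binary header
try:
    binary_blob.decode("utf-8")
    is_utf8_text = True
except UnicodeDecodeError:
    is_utf8_text = False

print(same_numbers, ascii_ok, is_utf8_text)
```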

Regards,
Martin


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Martin v. Löwis
John Salerno wrote:
> Robert Kern wrote:
> 
>>   http://www.joelonsoftware.com/articles/Unicode.html
> 
> That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 
> are not the same thing? Am I right to say that UTF-8 stores the first 
> 128 Unicode code points in a single byte, and then stores higher code 
> points in however many bytes they may need? If so, I guess I had been 
> misled by the '8' in the name, thinking that UTF-8 was another way of 
> storing characters in one byte (which would make it no different than 
> Latin-1, I suppose).

That's all correct, except for the last parenthetical remark: using
a single-byte character set isn't the same as using Latin-1. There
are various single-byte character sets; they have names like Latin-2,
Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.
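
To illustrate, a sketch in Python 3 syntax that decodes the same byte
under several single-byte character sets. The codec names are the
spellings Python's standard library uses, and the byte 0xE4 is an
arbitrary choice:

```python
raw = b"\xe4"   # one byte, several possible meanings

# Codec names as spelled in Python's standard library.
encodings = ["latin-1", "iso8859_2", "iso8859_15", "koi8_r", "cp437", "cp1252"]
decoded = {name: raw.decode(name) for name in encodings}

for name, ch in decoded.items():
    print("%-10s -> U+%04X" % (name, ord(ch)))

# UTF-8 is not a single-byte encoding: a lone 0xE4 is an incomplete sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

The byte does not map to the same code point in every one of these
character sets, and on its own it is not valid UTF-8 at all.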

Regards,
Martin


Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote:

>   http://www.joelonsoftware.com/articles/Unicode.html

That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 
are not the same thing? Am I right to say that UTF-8 stores the first 
128 Unicode code points in a single byte, and then stores higher code 
points in however many bytes they may need? If so, I guess I had been 
misled by the '8' in the name, thinking that UTF-8 was another way of 
storing characters in one byte (which would make it no different than 
Latin-1, I suppose).


Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote:

>> I figured this might have something to do with it, but then again I 
>> thought that Unicode was created as a subset of ASCII and Latin-1 so 
>> that they would be compatible...but I guess it's never that easy. :)
> 
> No, it isn't. You seem to be somewhat confused about Unicode. At least you are
> misusing terminology quite a bit. You may want to read the following articles:

I meant to say 'superset'


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Jan Niklas Fingerle
Robert Kern <[EMAIL PROTECTED]> wrote:
> > I see UTF-8 a lot, but this particular book also mentions that UTF-16 is 
> > the most common. Is that true?
> 
> I think it unlikely, but I have no numbers to give. And I'll bet that that
> book doesn't either.

I haven't got any numbers, but my guess would be that many of the Chinese
will add their share to the UTF-16 numbers. I don't know about other
Asian languages, though.

Cheers,
  --Jan Niklas


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Robert Kern
John Salerno wrote:
> Robert Kern wrote:
> 
>>Well, *I* use UTF-8, but that's neither here nor there.
> 
> I see UTF-8 a lot, but this particular book also mentions that UTF-16 is 
> the most common. Is that true?

I think it unlikely, but I have no numbers to give. And I'll bet that that book
doesn't either.

>>>Why can't Unicode replace them so we no longer need the 'u' 
>>>prefix or the encoding tricks?
>>
>>It would break a hell of a lot of code. Try using the -U command line argument
>>to the Python interpreter. That makes unicode strings default.
> 
> I figured this might have something to do with it, but then again I 
> thought that Unicode was created as a subset of ASCII and Latin-1 so 
> that they would be compatible...but I guess it's never that easy. :)

No, it isn't. You seem to be somewhat confused about Unicode. At least you are
misusing terminology quite a bit. You may want to read the following articles:

  http://www.joelonsoftware.com/articles/Unicode.html
  http://effbot.org/zone/unicode-objects.htm

-- 
Robert Kern
[EMAIL PROTECTED]

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco



Re: why isn't Unicode the default encoding?

2006-03-20 Thread Jan Niklas Fingerle
John Salerno <[EMAIL PROTECTED]> wrote:
> to convert back and forth. But why isn't Unicode considered a regular 
> string by now? Is it for historical reasons that we still use ASCII and 
> Latin-1? 

The point is that, with a regular string, you don't know its encoding
or whether it has an encoding at all - it might as well be just a byte
buffer. The best thing would be to have a byte buffer type and a unicode
string type, but this can't happen as long as you don't want to break existing
code.
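
A short sketch of why the encoding matters, in Python 3 syntax (where
the byte buffer type and the text type are separate). The sample bytes
are made up for illustration:

```python
data = b"caf\xc3\xa9"   # just bytes; nothing here records an encoding

as_utf8 = data.decode("utf-8")       # 4 characters, ends in e-acute
as_latin1 = data.decode("latin-1")   # 5 characters, ends in two other letters

print(len(data), len(as_utf8), len(as_latin1))
```

Both decodes succeed, so the bytes alone cannot say which reading was
intended.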

> Why can't Unicode replace them so we no longer need the 'u' 
> prefix or the encoding tricks?

It's proposed for Python 3000 (http://www.python.org/doc/peps/pep-3000/)
and I think it will make it into the language. 

Cheers,
  --Jan Niklas


Re: why isn't Unicode the default encoding?

2006-03-20 Thread John Salerno
Robert Kern wrote:

> Well, *I* use UTF-8, but that's neither here nor there.

I see UTF-8 a lot, but this particular book also mentions that UTF-16 is 
the most common. Is that true?

>> Why can't Unicode replace them so we no longer need the 'u' 
>> prefix or the encoding tricks?
> 
> It would break a hell of a lot of code. Try using the -U command line argument
> to the Python interpreter. That makes unicode strings default.

I figured this might have something to do with it, but then again I 
thought that Unicode was created as a subset of ASCII and Latin-1 so 
that they would be compatible...but I guess it's never that easy. :)


Re: why isn't Unicode the default encoding?

2006-03-20 Thread Robert Kern
John Salerno wrote:
> Forgive my newbieness, but I don't quite understand why Unicode is still 
> something that needs special treatment in Python (and perhaps 
> elsewhere). I'm reading Dive Into Python right now, and it constantly 
> refers to a 'regular string' versus a 'Unicode string' and how you need 
> to convert back and forth. But why isn't Unicode considered a regular 
> string by now? Is it for historical reasons that we still use ASCII and 
> Latin-1?

Well, *I* use UTF-8, but that's neither here nor there.

> Why can't Unicode replace them so we no longer need the 'u' 
> prefix or the encoding tricks?

It would break a hell of a lot of code. Try using the -U command line argument
to the Python interpreter. That makes unicode strings default.

[~]$ python -U
Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 'foo'
u'foo'
>>>

Python tries very hard to remain backwards compatible. Python 3.0 is the
designated "break compatibility so we can remove all of the cruft that's built
up" release. It is still several years away although Guido is starting to work
on it now.

-- 
Robert Kern
[EMAIL PROTECTED]

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
