Re: why isn't Unicode the default encoding?
In article <[EMAIL PROTECTED]>, Martin v. Löwis wrote:
> In any case, it doesn't matter what encoding the document is in:
> read(2) always returns two bytes.

It returns *up to* two bytes. Sorry to be picky, but I think it's relevant to the topic: it illustrates why it's difficult to change the definition of file.read() to return characters instead of bytes. If the file is ready to read, there will always be one or more bytes available (or EOF), but there won't always be one or more characters available.
-- http://mail.python.org/mailman/listinfo/python-list
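A minimal sketch of that distinction, written for modern Python 3 rather than the Python 2.4 discussed in this thread. It assumes an in-memory io.BytesIO stream as a stand-in for a real file, and the sample data and variable names are made up for illustration. It shows a short read, and a decoder that sometimes has bytes in hand but no complete character yet.

```python
import codecs
import io

# Stand-in for a file: three bytes, so the second read(2) comes up short.
stream = io.BytesIO(b"abc")
reads = [stream.read(2), stream.read(2), stream.read(2)]
# The last read returns an empty bytes object, which signals EOF.

# "a", e-acute and the euro sign take 1, 2 and 3 bytes in UTF-8.
data = "a\u00e9\u20ac".encode("utf-8")
decoder = codecs.getincrementaldecoder("utf-8")()

# Feed the decoder one byte at a time, as a slow pipe or socket might.
# Some steps yield "" because the current character is still incomplete.
chars_per_byte = [decoder.decode(data[i:i + 1]) for i in range(len(data))]

for byte_value, chars in zip(data, chars_per_byte):
    print(hex(byte_value), repr(chars))
```

Every step delivers exactly one byte, yet half of the steps deliver no character, which is the difficulty with redefining read() in terms of characters.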
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> Interesting. So then the read() method, if given a numeric argument for
> bytes to read, would act differently depending on if you were using
> Unicode or not?

The read method currently returns a byte string, not a Unicode string. It's not clear to me how the numeric argument should be interpreted when it returns characters some day; it might be best to take the number as counting characters, then. However, not supporting a numeric argument at all might also be reasonable.

> As it is now, it seems to equate the bytes with number
> of characters, but if the document was written using Unicode characters,
> is it possible that read(2) might only pull out one character?

Unicode isn't a character coding (*all* documents in the world are "written in Unicode", including those encoded with ASCII or Latin-1).

In any case, it doesn't matter what encoding the document is in: read(2) always returns two bytes. How many characters that constitutes depends on the encoding - but read() doesn't return a character string.

It might be that these two bytes are only part of a character, e.g. if you need three bytes to encode a character, or it might be that they are parts of two characters, e.g. when you get the second byte of the first character and the first byte of the second one. In some encodings (e.g. ISO-2022), these bytes may indicate *no* character, e.g. when the bytes just indicate an in-stream change of character set.

Regards,
Martin
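The three cases described above can be sketched in Python 3 as follows. This is only an illustration: the helper try_decode and the sample strings are hypothetical, and the ISO-2022 part assumes Python's iso2022_jp codec, whose encoded output starts with a three-byte escape sequence that selects the character set.

```python
import codecs

raw = "\u20ac\u20ac".encode("utf-8")   # two euro signs, three bytes each

def try_decode(chunk):
    """Return the decoded text, or None if the chunk is not valid UTF-8 by itself."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None

part_of_one = try_decode(raw[:2])     # first two bytes of a three-byte character
parts_of_two = try_decode(raw[2:4])   # last byte of one character, first byte of the next
whole = try_decode(raw[:3])           # all three bytes of the first character

# ISO-2022-JP: the leading escape sequence only switches character sets.
encoded = "\u3042".encode("iso2022_jp")         # HIRAGANA LETTER A
jp_decoder = codecs.getincrementaldecoder("iso2022_jp")()
from_escape = jp_decoder.decode(encoded[:3])    # escape sequence, no character
from_payload = jp_decoder.decode(encoded[3:5])  # the two bytes of the character

print(part_of_one, parts_of_two, repr(whole))
print(repr(from_escape), repr(from_payload))
```

In the first two cases the bytes cannot be turned into text on their own; in the ISO-2022 case the bytes are perfectly valid but still produce an empty string.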
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> Martin v. Löwis wrote:
>
>> The real problem is that the Python string type is used to represent
>> two very different concepts: bytes, and characters. You can't just drop
>> the current Python string type, and use the Unicode type instead - then
>> you would have no good way to represent sequences of bytes anymore.
>> Byte sequences occur more often than you might think: a ZIP file, a
>> MS Word file, a PDF file, and even an HTTP conversation are represented
>> through byte sequences.
>>
>> So for a byte sequence, internal representation is important; for a
>> character string, it is not. Now, for historical reasons, the Python
>> string literals create byte strings, not character strings. Since we
>> cannot know whether a certain string literal is meant to denote bytes
>> or characters, we can't just change the interpretation.
>
> Interesting. So then the read() method, if given a numeric argument for
> bytes to read, would act differently depending on if you were using
> Unicode or not? As it is now, it seems to equate the bytes with number
> of characters, but if the document was written using Unicode characters,
> is it possible that read(2) might only pull out one character?

Exactly. read(2) might pull out one character, or only half a character. It all depends on the encoding of the data you're reading.

If you're reading or writing text to a file (or anywhere, for that matter) you need to know the encoding of the file's content to read it correctly. Fortunately, the codecs module makes the whole process relatively painless:

    >>> import codecs
    >>> f = open("a_utf8_encoded_file.txt", "rb")  # read raw bytes
    >>> stream = codecs.getreader('utf-8')(f)      # decode them as UTF-8
    >>> c = stream.read(1)

The 'stream' works on unicode characters so 'c' is a unicode instance, i.e. a whole textual character.

- Matt

--
Matt Goodall, Pollenation Internet Ltd
w: http://www.pollenation.net
e: [EMAIL PROTECTED]
t: +44 (0)113 2252500
Any views expressed are my own and do not necessarily reflect the views of my employer.
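For anyone who wants to try the codecs approach without a file on disk, here is a small self-contained sketch for Python 3. The io.BytesIO object is an assumed stand-in for the opened file, and the sample content is invented. It also shows io.TextIOWrapper, which is one alternative way to get a character-oriented stream in Python 3.

```python
import codecs
import io

# In-memory stand-in for a UTF-8 encoded file opened in binary mode.
raw_file = io.BytesIO("\u20ac5".encode("utf-8"))   # euro sign, then "5"

# Same pattern as in the message above: wrap the byte stream in a reader.
reader = codecs.getreader("utf-8")(raw_file)
first = reader.read(1)   # the whole euro sign, although it spans three bytes

# Alternative: rewind and wrap the same bytes in a text stream.
raw_file.seek(0)
text_stream = io.TextIOWrapper(raw_file, encoding="utf-8")
first_again = text_stream.read(1)

print(repr(first), repr(first_again))
```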
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> So as it turns out, Unicode and UTF-8 are not the same thing?

Well yes. UTF-8 is one scheme in which the whole Unicode character repertoire can be represented as bytes.

Confusion arises because Windows uses the name 'Unicode' in character encoding lists to mean UTF-16LE, which is another encoding that can store the whole Unicode character repertoire as bytes. However, UTF-16LE is not any more definitively 'Unicode' than UTF-8 is.

Further confusion arises because the encoding 'UTF-16' can actually mean two things that are deceptively different:

- Unicode characters stored natively in 16-bit units (using two 16-bit units, a surrogate pair, to represent characters outside of the Basic Multilingual Plane)
- Either of the 8-bit encodings UTF-16LE and UTF-16BE, detected automatically using a Byte Order Mark when loaded, or chosen arbitrarily when saving

Yet more confusion arises because UTF-32 (which can reference any Unicode character directly) has the same problem. And though wide-unicode builds of Python understand the first meaning (unicode() strings are stored natively as UTF-32), they don't support the 8-bit encodings UTF-32LE and UTF-32BE.

Phew! To summarise: confusion.

> Am I right to say that UTF-8 stores the first 128 Unicode code points
> in a single byte, and then stores higher code points in however many
> bytes they may need?

That is correct.

To answer the original question, we're always going to need byte strings. They're a fundamental part of computing and the need to process them isn't going to go away. However, as Unicode text manipulation becomes more common than byte string processing, it makes sense to change the default kind of string you get when you type a literal.

Personally I would like to see byte strings available under an easy syntax like b'...' and UTF-32 strings available as w'...', or something like that - currently having u'...' mean either UTF-16 or UTF-32 depending on compile-time options is very, very annoying to the few kinds of programs that really do need to know the difference.

But whatever is chosen, it's all tasty Python 3000 future-soup and not worth worrying about for the moment.

--
And Clover
mailto:[EMAIL PROTECTED]
http://www.doxdesk.com/
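The byte counts and byte-order variants mentioned above can be checked with a short Python 3 sketch. The sample characters are arbitrary choices for illustration; the codec names "utf-16", "utf-16-le" and "utf-16-be" and the constants codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE are the ones Python ships with.

```python
import codecs

# One sample each from the 1-, 2-, 3- and 4-byte ranges of UTF-8.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# The explicit byte-order variants write no Byte Order Mark.
utf16_le = "A".encode("utf-16-le")   # low byte first
utf16_be = "A".encode("utf-16-be")   # high byte first

# Plain "utf-16" prepends a BOM and uses the machine's native byte order.
with_bom = "A".encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A character outside the Basic Multilingual Plane needs two 16-bit units.
units_for_clef = len("\U0001d11e".encode("utf-16-le")) // 2

print(utf8_lengths, utf16_le, utf16_be, has_bom, units_for_clef)
```

Decoding with "utf-16" reads the BOM to pick the byte order, which is the automatic detection described in the second bullet.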
Re: why isn't Unicode the default encoding?
Martin v. Löwis wrote:
> The real problem is that the Python string type is used to represent
> two very different concepts: bytes, and characters. You can't just drop
> the current Python string type, and use the Unicode type instead - then
> you would have no good way to represent sequences of bytes anymore.
> Byte sequences occur more often than you might think: a ZIP file, a
> MS Word file, a PDF file, and even an HTTP conversation are represented
> through byte sequences.
>
> So for a byte sequence, internal representation is important; for a
> character string, it is not. Now, for historical reasons, the Python
> string literals create byte strings, not character strings. Since we
> cannot know whether a certain string literal is meant to denote bytes
> or characters, we can't just change the interpretation.

Interesting. So then the read() method, if given a numeric argument for bytes to read, would act differently depending on if you were using Unicode or not? As it is now, it seems to equate the bytes with number of characters, but if the document was written using Unicode characters, is it possible that read(2) might only pull out one character?
Re: why isn't Unicode the default encoding?
Martin v. Löwis wrote:
> John Salerno wrote:
>> Robert Kern wrote:
>>
>>> http://www.joelonsoftware.com/articles/Unicode.html
>>
>> That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
>> are not the same thing? Am I right to say that UTF-8 stores the first
>> 128 Unicode code points in a single byte, and then stores higher code
>> points in however many bytes they may need? If so, I guess I had been
>> misled by the '8' in the name, thinking that UTF-8 was another way of
>> storing characters in one byte (which would make it no different than
>> Latin-1, I suppose).
>
> That's all correct, except for the last parenthetical remark: using
> a single-byte character set isn't the same as using Latin-1. There
> are various single-byte character sets; they have names like Latin-2,
> Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.
>
> Regards,
> Martin

Oh, I just meant that Latin-1 was an example of a one-byte character set, right? So UTF-8 would be identical to it if it worked how I used to think it did.
Re: why isn't Unicode the default encoding?
> I figured this might have something to do with it, but then again I
> thought that Unicode was created as a subset of ASCII and Latin-1 so
> that they would be compatible...but I guess it's never that easy. :)

The real problem is that the Python string type is used to represent two very different concepts: bytes, and characters. You can't just drop the current Python string type, and use the Unicode type instead - then you would have no good way to represent sequences of bytes anymore. Byte sequences occur more often than you might think: a ZIP file, an MS Word file, a PDF file, and even an HTTP conversation are represented through byte sequences.

So for a byte sequence, internal representation is important; for a character string, it is not. Now, for historical reasons, the Python string literals create byte strings, not character strings. Since we cannot know whether a certain string literal is meant to denote bytes or characters, we can't just change the interpretation.

Unicode is a superset of ASCII and Latin-1, but not of byte sequences.

Regards,
Martin
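One way to read that last sentence, sketched in Python 3: every Latin-1 byte value corresponds to the Unicode code point with the same number, and ASCII text has the same bytes in ASCII, Latin-1 and UTF-8, but an arbitrary byte sequence need not be text at all. The sample bytes below (the first bytes of a JPEG file) are just an illustrative choice.

```python
# Every possible byte value, decoded as Latin-1.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_latin1]   # 0, 1, 2, ... 255

# ASCII text is encoded identically in ASCII, Latin-1 and UTF-8.
encodings_agree = (
    "hello".encode("ascii")
    == "hello".encode("latin-1")
    == "hello".encode("utf-8")
)

# Binary data is just bytes; asking for characters can fail outright.
binary = b"\xff\xd8\xff"   # start of a JPEG file, not text
try:
    binary.decode("utf-8")
    binary_is_text = True
except UnicodeDecodeError:
    binary_is_text = False

print(code_points[:5], encodings_agree, binary_is_text)
```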
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> Robert Kern wrote:
>
>> http://www.joelonsoftware.com/articles/Unicode.html
>
> That was fascinating. Thank you. So as it turns out, Unicode and UTF-8
> are not the same thing? Am I right to say that UTF-8 stores the first
> 128 Unicode code points in a single byte, and then stores higher code
> points in however many bytes they may need? If so, I guess I had been
> misled by the '8' in the name, thinking that UTF-8 was another way of
> storing characters in one byte (which would make it no different than
> Latin-1, I suppose).

That's all correct, except for the last parenthetical remark: using a single-byte character set isn't the same as using Latin-1. There are various single-byte character sets; they have names like Latin-2, Latin-5, Latin-15, KOI8-R, CP437, windows-1252, and so on.

Regards,
Martin
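To see why "single-byte" does not imply "Latin-1", here is a small Python 3 sketch that decodes the same byte under several of the character sets named above. The byte 0xA4 is an arbitrary pick; the codec names are the spellings Python's standard library uses for those character sets.

```python
# Python codec names for some of the single-byte character sets listed above.
codec_names = ["latin-1", "iso8859_2", "iso8859_15", "koi8_r", "cp437", "cp1252"]

one_byte = b"\xa4"
decoded = {name: one_byte.decode(name) for name in codec_names}

for name, ch in decoded.items():
    # Print the code point so the differences are visible on any terminal.
    print(f"{name:12} U+{ord(ch):04X}")

# The same byte is the currency sign in Latin-1 but the euro sign in Latin-15.
latin1_char = decoded["latin-1"]
latin15_char = decoded["iso8859_15"]
```

Each codec maps one byte to one character, yet they disagree about which character it is, so the encoding still has to be known.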
Re: why isn't Unicode the default encoding?
Robert Kern wrote:
> http://www.joelonsoftware.com/articles/Unicode.html

That was fascinating. Thank you. So as it turns out, Unicode and UTF-8 are not the same thing? Am I right to say that UTF-8 stores the first 128 Unicode code points in a single byte, and then stores higher code points in however many bytes they may need? If so, I guess I had been misled by the '8' in the name, thinking that UTF-8 was another way of storing characters in one byte (which would make it no different than Latin-1, I suppose).
Re: why isn't Unicode the default encoding?
Robert Kern wrote:
>> I figured this might have something to do with it, but then again I
>> thought that Unicode was created as a subset of ASCII and Latin-1 so
>> that they would be compatible...but I guess it's never that easy. :)
>
> No, it isn't. You seem to be somewhat confused about Unicode. At least you are
> misusing terminology quite a bit. You may want to read the following articles:

I meant to say 'superset'
Re: why isn't Unicode the default encoding?
Robert Kern <[EMAIL PROTECTED]> wrote:
> > I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
> > the most common. Is that true?
>
> I think it unlikely, but I have no numbers to give. And I'll bet that that
> book doesn't either.

I haven't got any numbers, but my guess would be that many of the Chinese will add their share to the UTF-16 numbers. I don't know about other Asian languages, though.

Cheers,
--Jan Niklas
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> Robert Kern wrote:
>
>>Well, *I* use UTF-8, but that's neither here nor there.
>
> I see UTF-8 a lot, but this particular book also mentions that UTF-16 is
> the most common. Is that true?

I think it unlikely, but I have no numbers to give. And I'll bet that that book doesn't either.

>>>Why can't Unicode replace them so we no longer need the 'u'
>>>prefix or the encoding tricks?
>>
>>It would break a hell of a lot of code. Try using the -U command line argument
>>to the Python interpreter. That makes unicode strings default.
>
> I figured this might have something to do with it, but then again I
> thought that Unicode was created as a subset of ASCII and Latin-1 so
> that they would be compatible...but I guess it's never that easy. :)

No, it isn't. You seem to be somewhat confused about Unicode. At least you are misusing terminology quite a bit. You may want to read the following articles:

http://www.joelonsoftware.com/articles/Unicode.html
http://effbot.org/zone/unicode-objects.htm

--
Robert Kern
[EMAIL PROTECTED]

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
 -- Umberto Eco
Re: why isn't Unicode the default encoding?
John Salerno <[EMAIL PROTECTED]> wrote:
> to convert back and forth. But why isn't Unicode considered a regular
> string by now? Is it for historical reasons that we still use ASCII and
> Latin-1?

The point is that, with a regular string, you don't know its encoding or whether it has an encoding at all - it might as well be just a byte buffer. The best thing would be to have a byte buffer type and a unicode string type, but this can't happen as long as you don't want to break existing code.

> Why can't Unicode replace them so we no longer need the 'u'
> prefix or the encoding tricks?

It's proposed for Python 3000 (http://www.python.org/doc/peps/pep-3000/) and I think it will make it into the language.

Cheers,
--Jan Niklas
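For readers coming to this thread later: the split described above is roughly what Python 3 ended up with, where str holds characters and bytes holds raw bytes. A brief sketch under that assumption (the sample word and variable names are only illustrative):

```python
text = "caf\u00e9"            # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: a sequence of integers 0..255

# Four characters, but five bytes, because e-acute takes two bytes in UTF-8.
lengths = (len(text), len(data))

# Plain literals are character strings; the b prefix gives a byte string.
literal_types = (type("foo"), type(b"foo"))

# The two kinds no longer mix silently: the conversion has to be explicit.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(lengths, literal_types, mixing_allowed)
```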
Re: why isn't Unicode the default encoding?
Robert Kern wrote:
> Well, *I* use UTF-8, but that's neither here nor there.

I see UTF-8 a lot, but this particular book also mentions that UTF-16 is the most common. Is that true?

>> Why can't Unicode replace them so we no longer need the 'u'
>> prefix or the encoding tricks?
>
> It would break a hell of a lot of code. Try using the -U command line argument
> to the Python interpreter. That makes unicode strings default.

I figured this might have something to do with it, but then again I thought that Unicode was created as a subset of ASCII and Latin-1 so that they would be compatible...but I guess it's never that easy. :)
Re: why isn't Unicode the default encoding?
John Salerno wrote:
> Forgive my newbieness, but I don't quite understand why Unicode is still
> something that needs special treatment in Python (and perhaps
> elsewhere). I'm reading Dive Into Python right now, and it constantly
> refers to a 'regular string' versus a 'Unicode string' and how you need
> to convert back and forth. But why isn't Unicode considered a regular
> string by now? Is it for historical reasons that we still use ASCII and
> Latin-1?

Well, *I* use UTF-8, but that's neither here nor there.

> Why can't Unicode replace them so we no longer need the 'u'
> prefix or the encoding tricks?

It would break a hell of a lot of code. Try using the -U command line argument to the Python interpreter. That makes unicode strings default.

    [~]$ python -U
    Python 2.4.1 (#2, Mar 31 2005, 00:05:10)
    [GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 'foo'
    u'foo'
    >>>

Python tries very hard to remain backwards compatible. Python 3.0 is the designated "break compatibility so we can remove all of the cruft that's built up" release. It is still several years away, although Guido is starting to work on it now.

--
Robert Kern
[EMAIL PROTECTED]

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
 -- Umberto Eco