Dave Angel wrote:
¯º¿Â wrote:
On 3 Aug, 18:41, Dave Angel <da...@ieee.org> wrote:
Different encodings are just different ways of storing the data on the
media, correct?
Exactly. The file is a stream of bytes, and Unicode has more than 256
possible characters. Further, even the subset of characters that *do*
take one byte differs from one encoding to another. So you need to tell
the editor which encoding you want to use.
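Roughly, in Python 3 (the file names below are just placeholders), saving the same text under two encodings gives different bytes on disk, and reading it back means naming the encoding again:

    text = 'αβγ'                       # three Greek letters
    with open('greek-8859-7.txt', 'w', encoding='iso-8859-7') as f:
        f.write(text)                  # 3 bytes on disk
    with open('greek-utf8.txt', 'w', encoding='utf-8') as f:
        f.write(text)                  # 6 bytes on disk
    with open('greek-8859-7.txt', encoding='iso-8859-7') as f:
        assert f.read() == text       # must name the same encoding to read it back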

For example, an 'a' char in iso-8859-1 is stored differently than an 'a'
char in iso-8859-7 and an 'a' char in utf-8?


Nope, the ASCII subset is identical. It's the codes between 0x80 and 0xFF that differ, and of course not all of those. Further, the characters that are one byte in the 8859 encodings but lie outside ASCII take two bytes in utf-8.
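A quick Python 3 sketch of all three points (easy to verify in an interpreter):

    # ASCII subset: identical in all three encodings.
    assert 'a'.encode('iso-8859-1') == 'a'.encode('iso-8859-7') == 'a'.encode('utf-8') == b'a'

    # 0x80-0xff: the same byte is a different character in each 8859 variant.
    assert b'\xe9'.decode('iso-8859-1') == 'é'    # e with acute accent
    assert b'\xe9'.decode('iso-8859-7') == 'ι'    # Greek small iota

    # One byte in iso-8859-7, two bytes in utf-8.
    assert 'ι'.encode('iso-8859-7') == b'\xe9'
    assert 'ι'.encode('utf-8') == b'\xce\xb9'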

You *could* just decide that you're going to hardwire the assumption that you'll be dealing with a single character set that does fit in 8 bits, and most of this complexity goes away. But if you do that, do *NOT* use utf-8.
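The reason utf-8 is the wrong choice for that shortcut, as a small sketch: iso-8859-1 maps every possible byte to some character, but arbitrary bytes are usually not valid utf-8.

    raw = bytes(range(256))       # every possible byte value

    raw.decode('iso-8859-1')      # works: every byte maps to a character
    try:
        raw.decode('utf-8')       # fails: a lone byte >= 0x80 is illegal utf-8
    except UnicodeDecodeError as e:
        print(e)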

But if you do want to be able to handle more than 256 characters, or more than one encoding, read on.

Many people confuse encoding and decoding. A Unicode character is an abstraction, identified by a number called a code point. For convenience, the first 128 code points map directly onto the 7-bit encoding called ASCII. But before Unicode there were several incompatible 8-bit extensions of ASCII to 256 characters. For example, a byte that was a European character in one such encoding might be a katakana character in another. Each encoding was 8 bits, but it was difficult for a single program to handle more than one of them at once.
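For instance, the same single byte decodes to a European letter or a katakana character depending on which legacy encoding you assume (Shift-JIS is used here just as a convenient Japanese example; the post doesn't name one):

    b = b'\xc0'
    print(b.decode('iso-8859-1'))    # 'À'  (Latin capital A with grave)
    print(b.decode('iso-8859-7'))    # 'ΐ'  (a Greek letter)
    print(b.decode('shift_jis'))     # 'ﾀ'  (halfwidth katakana TA)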

One encoding might be ASCII + accented Latin, another ASCII + Greek,
another ASCII + Cyrillic, etc. If you wanted ASCII + accented Latin +
Greek then you'd need more than 1 byte per character.

If you're working with multiple alphabets it gets very messy, which is
where Unicode comes in. It contains all those characters, and UTF-8 can
encode all of them in a straightforward manner.
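A small sketch of that (the sample string is just an illustration): a string mixing accented Latin and Greek can't be encoded in either single-byte encoding, but utf-8 takes it without complaint.

    text = 'Ärger, ελληνικά'          # accented Latin plus Greek in one string

    for enc in ('iso-8859-1', 'iso-8859-7', 'utf-8'):
        try:
            data = text.encode(enc)
            print(enc, 'ok:', len(data), 'bytes')
        except UnicodeEncodeError as e:
            print(enc, 'cannot represent', e.object[e.start])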

So along comes Unicode, which is typically implemented in 16- or 32-bit code units. And it has an 8-bit encoding called utf-8, which uses one byte for the first 128 code points, two bytes for a couple of thousand more, and three or four bytes beyond that.

[snip]
In UTF-8 the first 128 codepoints are encoded to 1 byte.
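Concretely, in Python 3 (the euro sign and the emoji are just sample characters from the higher ranges):

    for ch in ('a', 'ά', '€', '😀'):   # U+0061, U+03AC, U+20AC, U+1F600
        print(hex(ord(ch)), '->', len(ch.encode('utf-8')), 'byte(s)')
    # 1 byte up to U+007F (ASCII), 2 bytes up to U+07FF,
    # 3 bytes up to U+FFFF, 4 bytes beyond that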