On Tue, Oct 19, 2010 at 1:31 PM, Tobiah <t...@rcsreg.com> wrote:
>> There is no such thing as "plain Unicode representation". The closest
>> thing would be an abstract sequence of Unicode codepoints (ala Python's
>> `unicode` type), but this is way too abstract to be used for
>> sharing/interchange, because storing anything in a file or sending it
>> over a network ultimately involves serialization to binary, which is not
>> directly defined for such an abstract representation (Indeed, this is
>> exactly what encodings are: mappings between abstract codepoints and
>> concrete binary; the problem is, there's more than one of them).
>
> Ok, so the encoding is just the binary representation scheme for
> a conceptual list of unicode points. So why so many? I get that
> someone might want big-endian, and I see the various virtues of
> the UTF strains, but why isn't a handful of these representations
> enough? Languages may vary widely but as far as I know, computers
> really don't that much. big/little endian is the only problem I
> can think of. A byte is a byte. So why so many encoding schemes?
> Do some provide advantages to certain human languages?

UTF-8 has the virtue of being backward-compatible with ASCII: any
pure-ASCII text is already valid UTF-8, byte for byte.

UTF-16 has all codepoints in the Basic Multilingual Plane take up
exactly 2 bytes; all others take up 4 bytes. The Unicode people
originally thought they would only include modern scripts, so 2 bytes
would be enough to encode all characters. However, they later broadened
their scope, and so the complication of "surrogate pairs" was introduced
to reach the codepoints beyond the BMP.

UTF-32 has *all* Unicode codepoints take up exactly 4 bytes. This
slightly simplifies processing, but wastes a lot of space for e.g.
English texts.

And then there are a whole bunch of national encodings defined for
backward compatibility, but they typically only encode a portion of
all the Unicode codepoints.

More info: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Cheers,
Chris
--
Essentially, blame backward compatibility and finite storage space.
http://blog.rebertia.com
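
P.S. If it helps to see the size trade-offs concretely, here is a rough
sketch that encodes the same text three ways and counts the bytes. It
assumes Python 3 (where `str` plays the role of the abstract codepoint
sequence); the sample strings and the `encoded_sizes` helper are my own
illustrative choices. The "-le" codec variants are used so that no
byte-order mark is added to the counts.

```python
# Illustrative sample strings (assumptions for this sketch only).
samples = {
    "ascii": "hello",
    "bmp": "\u65e5\u672c\u8a9e",   # three CJK codepoints inside the BMP
    "astral": "\U0001D11E",        # MUSICAL SYMBOL G CLEF, outside the BMP
}

def encoded_sizes(text):
    """Return the byte length of text in each UTF encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}

for name, text in samples.items():
    # len(text) counts codepoints; the dict shows bytes per encoding.
    print(name, len(text), encoded_sizes(text))

# ASCII text encodes to identical bytes under ASCII and UTF-8.
same_bytes = "hello".encode("ascii") == "hello".encode("utf-8")
print("ASCII-compatible:", same_bytes)
```

For the English sample, UTF-8 needs 5 bytes against 10 for UTF-16 and 20
for UTF-32. For the CJK sample, UTF-16 is the smallest (6 bytes versus 9
for UTF-8 and 12 for UTF-32), which is one way an encoding can favor
certain human languages. The single codepoint outside the BMP takes 4
bytes in all three; in UTF-16 that is one surrogate pair.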