On 20/06/2013 17:27, wxjmfa...@gmail.com wrote:
Le jeudi 20 juin 2013 13:43:28 UTC+2, MRAB a écrit :
On 20/06/2013 07:26, Steven D'Aprano wrote:

On Wed, 19 Jun 2013 18:46:59 -0700, Rick Johnson wrote:



On Thursday, June 13, 2013 2:11:08 AM UTC-5, Steven D'Aprano wrote:



Gah! That's twice I've screwed that up. Sorry about that!



Yeah, and your difficulty explaining the Unicode implementation reminds me of a passage from the Python zen:

  "If the implementation is hard to explain, it's a bad idea."



The *implementation* is easy to explain. It's the names of the encodings which I get tangled up in.



You're off by one below!



ASCII: Supports exactly 127 code points, each of which takes up exactly 7 bits. Each code point represents a character.



128 codepoints.
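Right -- the boundary is easy to demonstrate in Python 3 (an illustrative sketch, not from the original posts):

```python
# ASCII has 128 code points, 0 through 127 inclusive.
assert chr(127).encode("ascii") == b"\x7f"   # the last valid code point

try:
    chr(128).encode("ascii")                 # one past the end
except UnicodeEncodeError:
    print("code point 128 is outside ASCII")
```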



Latin-1, Latin-2, MacRoman, MacGreek, ISO-8859-7, Big5, Windows-1251, and about a gazillion other legacy charsets, all of which are mutually incompatible: supports anything from 127 to 65535 different code points, usually under 256.



128 to 65536 codepoints.
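The mutual incompatibility is easy to see in Python 3 -- the same byte decodes to a different character under each legacy charset (codec names as found in the stdlib; a sketch added for illustration):

```python
data = bytes([0xE9])              # a single byte, value 233
print(data.decode("latin-1"))     # 'é'  (U+00E9)
print(data.decode("cp1251"))      # 'й'  (Cyrillic, U+0439)
print(data.decode("iso-8859-7"))  # 'ι'  (Greek, U+03B9)
```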



UCS-2: Supports exactly 65535 code points, each of which takes up exactly two bytes. That's fewer than required, so it is obsoleted by:



65536 codepoints.



etc.



UTF-16: Supports all 1114111 code points in the Unicode charset, using a variable-width system where the most popular characters use exactly two bytes and the remaining ones use a pair of two-byte surrogates (four bytes in all).



UCS-4: Supports exactly 4294967295 code points, each of which takes up exactly four bytes. That is more than needed for the Unicode charset, so this is obsoleted by:



UTF-32: Supports all 1114111 code points, using exactly four bytes each. Code points outside of the range 0 through 1114111 inclusive are an error.



UTF-8: Supports all 1114111 code points, using a variable-width system where popular ASCII characters require 1 byte, and others use 2, 3 or 4 bytes as needed.
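All four length classes can be seen in one loop in Python 3 (example added for illustration):

```python
# One code point from each UTF-8 length class.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F40D"):
    print(hex(ord(ch)), len(ch.encode("utf-8")))
# encoded lengths: 1, 2, 3 and 4 bytes respectively
```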





Ignoring the legacy charsets, only UTF-16 is a terribly complicated implementation, due to the surrogate pairs. But even that is not too bad. The real complication comes from the interactions between systems which use different encodings, and that's nothing to do with Unicode.
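The classic form of that cross-system complication is mojibake -- one side encodes, the other side guesses the wrong charset (sketch added for illustration):

```python
wire = "café".encode("utf-8")   # sender uses UTF-8: b'caf\xc3\xa9'
print(wire.decode("cp1252"))    # receiver guesses cp1252: prints 'cafÃ©'
```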





And all these coding schemes have something in common: they all work with a single set of code points, more precisely a single set of encoded code points (not the set of implemented code points (bytes)).

That is just what the flexible string representation does not do: it artificially divides Unicode into subsets and tries to handle each subset differently.

On the other hand, it is precisely because it is impossible to work properly with multiple sets of encoded code points that all these coding schemes exist today. There is simply no other way.
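For readers wondering what jmf is referring to: CPython 3.3's PEP 393 strings pick a per-string width of 1, 2 or 4 bytes based on the largest code point present, which sys.getsizeof makes visible (a neutral sketch of the mechanism, not an endorsement of either side of the argument):

```python
import sys

# One string per internal width class: latin-1, UCS-2 and UCS-4 storage.
for ch in ("a", "\u20ac", "\U0001F40D"):
    s = ch * 1000
    print(hex(ord(ch)), sys.getsizeof(s))
# memory use grows at roughly 1, 2 and 4 bytes per character
```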

Even "exotic" schemes like the "CID-fonts" used in PDF are based on that principle.

jmf


I entirely agree with the viewpoints of jmfauth, Nick the Greek, rr, Xah Lee and Ilias Lazaridis, on the grounds that disagreeing and stating my beliefs ends up with the Python Mailing List police standing on my back doorstep. Give me the NSA or GCHQ any day of the week :(

--
"Steve is going for the pink ball - and for those of you who are watching in black and white, the pink is next to the green." Snooker commentator 'Whispering' Ted Lowe.

Mark Lawrence

--
http://mail.python.org/mailman/listinfo/python-list
