On 16/07/18 20:40, Marko Rauhamaa wrote:
Terry Reedy <tjre...@udel.edu>:

On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
if your new system used Python3's UTF-32 strings as a foundation,
Since 3.3, Python's strings are not (always) UTF-32 strings.
You are right. Python's strings are a superset of UTF-32. More
accurately, Python's strings are UTF-32 plus surrogate characters.
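
The "plus surrogate characters" part can be checked directly. The following is only a minimal sketch; the variable names are made up for illustration. It shows that a Python str may hold a lone surrogate, that the strict UTF-32 codec refuses to encode it, and that the standard "surrogatepass" error handler lets it through.

```python
# A lone high surrogate is a legal element of a Python str ...
lone = "\ud800"
print(len(lone))

# ... but it is not a Unicode scalar value, so the strict codec rejects it.
try:
    lone.encode("utf-32")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict UTF-32 encode succeeded:", strict_ok)

# The "surrogatepass" error handler encodes it anyway.
# "utf-32-le" fixes the byte order, so no BOM is prepended.
raw = lone.encode("utf-32-le", errors="surrogatepass")
print(raw)
```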

Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
always Latin-1 or ASCII strings. Python's Flexible String
Representation uses the narrowest possible internal code for any
particular string. This is all transparent to the user except for
memory size.
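
The memory-size effect can be observed with sys.getsizeof. This is a rough sketch, assuming CPython 3.3 or later; the exact byte counts are an implementation detail and vary between versions, so only the ordering is meaningful. The sample strings are arbitrary choices for illustration.

```python
import sys

# Four strings of equal length; only the widest code point differs.
samples = {
    "ascii": "a" * 100,            # U+0061, 1 byte per code point
    "latin1": "\xe9" * 100,        # U+00E9, still 1 byte per code point
    "bmp": "\u20ac" * 100,         # U+20AC, 2 bytes per code point
    "astral": "\U0001f600" * 100,  # U+1F600, 4 bytes per code point
}

# Object size in bytes as reported by the interpreter.
sizes = {name: sys.getsizeof(s) for name, s in samples.items()}

for name, s in samples.items():
    print(f"{name:7} len={len(s)} bytes={sizes[name]}")
```

In every case len() reports 100, so the choice of internal width is invisible apart from the reported size.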
How CPython chooses to represent its strings internally is not what I'm
talking about.

UTF-32, after all, is a variable-width encoding.
Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.

Perhaps you should ask more questions before pontificating.
You mean each code point is one code point wide. But that's rather an
irrelevant thing to state. The main point is that UTF-32 (aka Unicode)
uses one or more code points to represent what people would consider an
individual character.

UTF-32 != Unicode, but that's a separate esoteric argument.

The problem everyone is having with you, Marko, is that you are using the terminology incorrectly. When you say that more than one codepoint can be used to represent what people would consider an individual character, you are correct (and would be more correct if you called "what people would consider an individual character" a "glyph"). When you call UTF-32 a variable-width encoding, you are incorrect.
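
To make the distinction concrete, here is a small sketch using only the standard library (the sample strings are my own choice). It shows that UTF-32 always spends 4 bytes per code point, that UTF-16 does not spend a constant number of bytes per code point, and that one perceived character may still consist of two code points.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: two code points, one perceived character.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Explicit byte order ("-le") so that no BOM is prepended to the output.
for label, s in (("decomposed", decomposed), ("composed", composed)):
    print(label, "code points:", len(s), "UTF-32 bytes:", len(s.encode("utf-32-le")))

# Fixed width: every code point costs 4 bytes in UTF-32 ...
bmp_char, astral_char = "a", "\U0001f600"
utf32_widths = [len(c.encode("utf-32-le")) for c in (bmp_char, astral_char)]
# ... whereas UTF-16 needs 2 bytes for one and 4 (a surrogate pair) for the other.
utf16_widths = [len(c.encode("utf-16-le")) for c in (bmp_char, astral_char)]
print("UTF-32 widths:", utf32_widths)
print("UTF-16 widths:", utf16_widths)
```

The width of an encoding is a property of how it encodes each code point. How many code points make up a perceived character is a separate question.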

You are of course welcome to use whatever terminology you personally like, like Humpty Dumpty. However, when you point to a duck and say "That's a gnu," people are likely to stop taking you seriously.

--
Rhodri James *-* Kynesim Ltd
--
https://mail.python.org/mailman/listinfo/python-list