On Mon, 16 Jul 2018 22:40:13 +0300, Marko Rauhamaa wrote:

> Terry Reedy <tjre...@udel.edu>:
>
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
>
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.

The first thing you are doing wrong is conflating the semantics of the
data type with one possible implementation of that data type. UTF-32
is implementation, not semantics: it specifies how to represent
Unicode code points as bytes in memory, not what Unicode code points
are. Python 3 strings are sequences of abstract characters ("code
points") with no mandatory implementation. In CPython, some string
objects are encoded in Latin-1, some in UTF-16, and some in UTF-32.
Some implementations (MicroPython) use UTF-8.

Your second error is a more minor point: it isn't clear (at least not
to me) that "Unicode plus surrogates" is a superset of Unicode.
Surrogates are part of Unicode. The only extension here is that Python
strings are not necessarily well-formed, surrogate-free Unicode
strings, but they're still Unicode strings.
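To make the first point concrete: the Flexible String Representation
is observable from pure Python, without poking at the C internals,
because sys.getsizeof reports a per-character cost of roughly 1, 2 or
4 bytes depending on the widest code point in the string. A quick
sketch (the exact byte counts are CPython-specific and vary by
version, so treat the numbers as illustrative):

import sys

# CPython picks the narrowest internal code wide enough for every
# code point in the string: Latin-1 (1 byte/char), UCS-2
# (2 bytes/char) or UCS-4 (4 bytes/char).
ascii_s  = 'a' * 1000            # all code points < 256    -> 1 byte each
bmp_s    = '\u0416' * 1000       # max code point < 0x10000 -> 2 bytes each
astral_s = '\U0001F600' * 1000   # beyond the BMP           -> 4 bytes each

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))

All three strings have the same length, 1000 code points, but their
memory footprints are roughly one, two and four times the length,
plus a fixed per-object overhead.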
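And to make the surrogate point concrete: a lone surrogate is a
perfectly legal Python string, it just isn't well-formed Unicode text,
so the strict UTF-8 codec refuses it unless you opt in with the
'surrogatepass' error handler. Another quick sketch:

# U+D800 is a high surrogate: a real Unicode code point, but one
# that is not supposed to appear on its own in well-formed text.
s = '\ud800'
print(len(s), hex(ord(s)))    # 1 0xd800

try:
    s.encode('utf-8')
except UnicodeEncodeError as err:
    print('strict encoding fails:', err)

# The 'surrogatepass' error handler encodes it anyway, and the
# round-trip gives back the original string.
data = s.encode('utf-8', errors='surrogatepass')
assert data.decode('utf-8', errors='surrogatepass') == s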
>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
>
> How CPython chooses to represent its strings internally is not what
> I'm talking about.

Then why do you repeatedly talk about the internal storage
representation? UTF-32 is not a character set, it is an encoding: it
specifies how to implement a sequence of Unicode abstract characters
as bytes.

>>> UTF-32, after all, is a variable-width encoding.
>>
>> Nope. It is a fixed-width (32 bits, 4 bytes) encoding.
>>
>> Perhaps you should ask more questions before pontificating.
>
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state.

No, he means that each code point is one code unit wide.

> The main point is that UTF-32 (aka Unicode)

UTF-32 is not a synonym for Unicode. Many legacy encodings don't
distinguish between the character set and the mapping between bytes
and characters, but Unicode is not one of those.

> uses one or more code points to represent what people would consider
> an individual character.

That's a reasonable observation to make, but it's not what fixed- and
variable-width refer to. ASCII does the same thing, and in both cases
it is irrelevant, since the term of art defines fixed- and
variable-width in terms of *code points*, not human-meaningful
characters.

"Character" is context- and language-dependent, and frequently
ambiguous. "LL" or "CH" (for example) could be a single character or a
double character, depending on context and language. Even in ASCII
English, something as large as "ough" might be considered a single
unit of language, which some people might choose to call a character
(but not a single letter, naturally). If you don't like that example,
"qu" is probably a better one: aside from acronyms and loan words, no
modern English word fails to follow a Q with a U.

> Code points are about as interesting as individual bytes in UTF-8.

That's your opinion. I see no justification for it.

--
Steven D'Aprano

"Ever since I learned about confirmation bias, I've been seeing it
everywhere." -- Jon Ronson

--
https://mail.python.org/mailman/listinfo/python-list