I think that we're more or less in broad agreement, but I wanted to 
comment on this:

On Sun, Oct 27, 2019 at 09:41:00PM -0700, Andrew Barnert wrote:

> Yes, that’s the whole point of the message you were responding to: 
> extended grapheme clusters are the Unicode approximation of 
> characters; code units are not.

I don't think that's quite correct. See:

http://www.unicode.org/glossary/#abstract_character

http://www.unicode.org/glossary/#character

http://www.unicode.org/glossary/#extended_grapheme_cluster

http://www.unicode.org/glossary/#code_point

From the glossary definition of code point: "A value, or position, for a 
character, in any coded character set." In other words, the code point 
is a numeric code such as U+0041 that represents a character such as "A". 
(Except when it is a numeric code that represents a non-character.)
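
As a rough illustration of both cases, here is a small sketch using only 
the standard unicodedata module. The choice of U+FFFE as the 
noncharacter example is mine, not something taken from the glossary.

```python
import unicodedata

# U+0041 is the code point; "A" is the character it encodes.
code_point = ord("A")
label = f"U+{code_point:04X}"
name = unicodedata.name("A")
print(label, name)

# U+FFFE is a noncharacter: a valid code point with no character
# assigned to it, so it has no name and its general category is Cn.
nonchar = chr(0xFFFE)
nonchar_name = unicodedata.name(nonchar, None)
nonchar_category = unicodedata.category(nonchar)
print(nonchar_name, nonchar_category)
```

Python's str is a sequence of code points, which is why ord() and chr() 
convert between the two without any notion of "character" in the 
linguistic sense.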

And from definitions D60 and D61 here:

http://www.unicode.org/versions/Unicode12.1.0/ch03.pdf

"Grapheme clusters and extended grapheme clusters may not have any 
particular linguistic significance"

"The grapheme cluster represents a horizontally segmentable unit of 
text, consisting of some grapheme base (which may consist of a Korean 
SYLLABLE) together with any number of nonspacing marks applied to it."
[Emphasis added.]

"A grapheme cluster is similar, but not identical to a combining 
character sequence."

So it is much more complicated than just "code point != character, 
extended grapheme cluster = character". Lots of code points are 
characters; lots of graphemes aren't characters but syllables or some 
other linguistic entity, or no linguistic entity at all; and lots of 
things that are characters aren't graphemes, such as combining 
character sequences.

And none of this mentions what to do with variation selectors, flags 
etc. The whole thing is very complicated and I don't pretend to 
understand all the details. (Until now, I thought that combining 
character sequences were grapheme clusters. Apparently they aren't.)
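
To make the mismatch concrete, here is a minimal sketch of how a few of 
these cases look from Python, where len() counts code points. The sample 
strings are my own picks for illustration. Note that the standard 
library has no grapheme cluster segmentation, so this only shows the 
code point side of the story.

```python
import unicodedata

# Each sample is typically rendered as one user-perceived unit,
# but they differ in how many code points they contain.
samples = {
    "combining sequence": "e\u0301",       # e + COMBINING ACUTE ACCENT
    "precomposed": "\u00e9",               # the same letter as one code point
    "flag": "\U0001F1E6\U0001F1FA",        # two regional indicator symbols
    "variation selector": "\u2764\ufe0f",  # HEAVY BLACK HEART + VS16
}

for label, text in samples.items():
    names = [unicodedata.name(ch, "<unnamed>") for ch in text]
    print(f"{label}: {len(text)} code point(s): {names}")

# NFC normalization composes this particular sequence into a single
# code point, but that is not possible for every combining sequence.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len(composed))
```

So even before getting into the grapheme cluster rules, the number of 
code points tells you very little about how many "characters" a reader 
would see.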


-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/EWQL4T7QGVSSPBYTAM7BSLFVZ2WSB5SO/
Code of Conduct: http://python.org/psf/codeofconduct/