> On Jul 17, 2018, at 3:44 AM, Steven D'Aprano 
> <steve+comp.lang.pyt...@pearwood.info> wrote:
> 
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
> 
>>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>>> <steve+comp.lang.pyt...@pearwood.info> wrote:
>>> 
>>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>> 
>>>> You are defining a variable/fixed width codepoint set. Many others
>>>> want to deal with CHARACTER sets.
>>> 
>>> Good luck coming up with a universal, objective, language-neutral,
>>> consistent definition for a character.
>>> 
>> Who says there needs to be one. A good engineer will use the definition
>> that is most appropriate to the task at hand. Some things need very
>> solid definitions, and some things don’t.
> 
> Then the problem is solved: we have a perfectly good de facto definition 
> of character: it is a synonym for "code point", and every single one of 
> Marko's objections disappears.
> 
Which is a ‘changed’ definition! Do you agree that the concept of a variable 
width encoding vastly predates the creation of Unicode? Can you find any use of 
the word ‘codepoint’ that predates the development of Unicode? 
Code points and code words are an invention of the Unicode consortium, and as 
such should really only be used when talking about Unicode itself, not about 
other encodings. I believe Unicode also introduced the idea of storing a 
composed character as a series of codepoints, rather than composition being 
done in the input routine and the character set having to define a character 
code for every composed character it needs.
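
For example, in Python (a quick sketch of my own, nothing library-specific):

    # One character to the reader, stored two different ways:
    precomposed = '\u00e9'     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed  = 'e\u0301'    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(precomposed, decomposed)             # both display as 'é'
    print(len(precomposed), len(decomposed))   # 1 vs. 2 code points
    print(precomposed == decomposed)           # False: same character, different code point sequences

An older character set simply had to assign one code to every accented letter 
it supported; Unicode lets you (and sometimes makes you) carry the pieces 
around separately.
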
> 
>> This goes back to my original point, where I said some people consider
>> UTF-32 as a variable width encoding. For very many things, practically,
>> the ‘codepoint’ isn’t the important thing, 
> 
> Ah, is this another one of those "let's pick a definition that nobody 
> else uses, and state it as a fact" like UTF-32 being variable width?
> 
> If by "very many things", you mean "not very many things", I agree with 
> you. In my experience, dealing with code points is "good enough", 
> especially if you use Western European alphabets, and even more so if 
> you're willing to do a normalization step before processing text.
> 
Ah, that is the rub: you only deal with the parts of Unicode that are simple 
and regular. That is EXACTLY what you criticize the people who want to stay 
with ASCII or code pages for doing; this is just the next step in the same 
evolution.

One problem with normalization is that for Western European characters it can 
usually convert every ‘character’ to a single code point, but in some corner 
cases, especially in other languages, it can’t. I am not just talking about 
digraphs like ‘ch’ that have been mentioned, but real composed characters: a 
base glyph with marks above, below, or embedded in it. Unicode represents many 
of these with a single code point, but nowhere near all of them.
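
A quick sketch of what I mean, using the standard unicodedata module:

    import unicodedata

    # 'e' + combining acute has a precomposed code point, so NFC collapses it to one:
    print(len(unicodedata.normalize('NFC', 'e\u0301')))   # 1  (U+00E9)

    # 'q' + combining dot below has no precomposed code point, so even after NFC
    # this "character" is still a sequence of two code points:
    print(len(unicodedata.normalize('NFC', 'q\u0323')))   # 2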

If you actually read the Unicode documents, they do talk about characters, and 
admit that a character is not necessarily a single codepoint, so if you really 
want to talk about a CHARACTER set, then Unicode, even as UTF-32, sometimes 
needs to be treated as variable width. 
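
Put concretely (another small sketch, with examples of my own choosing):

    # Each of these is one character as displayed, but more than one code point,
    # and therefore more than one fixed-width UTF-32 code unit:
    samples = ['x\u0302',                 # x with a circumflex: no precomposed form exists
               '\U0001F1E6\U0001F1FA']    # two regional-indicator code points shown as one flag
    for s in samples:
        print(s, len(s), len(s.encode('utf-32-le')) // 4)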

> But of course other people's experience may vary. I'm interested in 
> learning about the library you use to process graphemes in your software.
> 
> 
>> so the fact that every UTF-32
>> code point takes the same number of bytes or code words isn’t that
>> important. They are dealing with something that needs to be rendered and
>> preserving larger units, like the grapheme is important.
> 
> If you're writing a text widget or a shell, you need to worry about 
> rendering glyphs. Everyone else just delegates to their text widget, GUI 
> framework, or shell.
> 
But someone needs to write that text widget, and the one you delegate to might 
not do exactly what you want, say, wrapping the text around obstacles already 
placed on the screen or page.

And try using that text widget to find the ‘middle’ (as displayed) of a text 
string, other than by iterating over it with repeated calls until you find it.

Unicode made the processing of codepoints simpler, but it made the processing 
of actual rendered text much more complicated if you want to handle everything 
correctly. 
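
For instance, the naive "middle by code point index" can land inside a 
character. The second half of this sketch assumes the third-party regex module 
(not in the standard library), which is the closest thing to grapheme support 
I know of from Python:

    s = 'ne\u0301e'                # 'née' stored with a combining acute accent
    mid = len(s) // 2
    print(s[:mid], '|', s[mid:])   # splits the é from its accent

    # A grapheme-aware split needs outside help, e.g. regex's \X pattern:
    import regex
    clusters = regex.findall(r'\X', s)    # ['n', 'é', 'e']
    m = len(clusters) // 2
    print(''.join(clusters[:m]), '|', ''.join(clusters[m:]))

And even that only gives the middle by character count, not the middle as 
rendered on the screen, which is something only the text widget can answer.
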
> 
>>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>>> the magical cure that some were hoping for.
>>> 
>>> Nobody ever claimed it was, except for the people railing that since it
>>> isn't a magically system we ought to go back to the Good Old Days of
>>> code page hell, or even further back when everyone just used ASCII.
>>> 
>> Sometimes ASCII is good enough, especially on a small machine with
>> limited resources.
> 
> I doubt that there are many general purpose computers with resources 
> *that* limited. Even MicroPython supports Unicode, and that runs on 
> embedded devices with memory measured in kilobytes. 8K is considered the 
> smallest amount of memory usable with MicroPython, although 128K is more 
> realistic as the *practical* lower limit.
> 
> In the mid 1980s, I was using computers with 128K of RAM, and they were 
> still able to deal with more than just ASCII. I think the "limited 
> resources" argument is bogus.
> 
I regularly use processors with 8k of RAM and 32k of flash. I will admit that I 
wouldn’t think of using Python there, as the overhead would be excessive. Yes, 
if I needed to I could put a bigger processor in, but it would cost space, 
dollars, and power, so I don’t. The applications there can get by with just 
ASCII, so that is what I use. On such a processor, really processing Unicode 
would be out of reach: even as simple a function as isdigit wouldn’t fit if you 
wanted a proper Unicode definition, and tolower would be out of the question.
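
Just to give a sense of the scale (in desktop CPython, obviously, not on that 
8k part):

    # An ASCII isdigit is a single range check; a Unicode one is table-driven:
    print('3'.isdigit())         # True  - ASCII digit
    print('\u0663'.isdigit())    # True  - ARABIC-INDIC DIGIT THREE
    print('\u00b2'.isdigit())    # True  - SUPERSCRIPT TWO (a digit, though int() rejects it)
    print('\u00bd'.isnumeric())  # True  - VULGAR FRACTION ONE HALF, numeric but not a digit

    # Case mapping is worse still; it is not even one-to-one:
    print('\u0130'.lower())      # dotted capital I lowercases to TWO code points
    print('\u00df'.upper())      # 'ß' uppercases to 'SS'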

> 
> -- 
> Steven D'Aprano
> "Ever since I learned about confirmation bias, I've been seeing
> it everywhere." -- Jon Ronson
> 

-- 
https://mail.python.org/mailman/listinfo/python-list
