On Thursday, 6 September 2018 at 10:44:45 UTC, Joakim wrote:
[snip]
You're not being fair here, Chris. I just saw this SO question
that I think exemplifies how most programmers react to Unicode:
"Trying to understand the subtleties of modern Unicode is
making my head hurt. In particular, the distinction between
code points, characters, glyphs and graphemes - concepts which
in the simplest case, when dealing with English text using
ASCII characters, all have a one-to-one relationship with each
other - is causing me trouble.
Seeing how these terms get used in documents like Mathias
Bynens' JavaScript has a Unicode problem or Wikipedia's piece
on Han unification, I've gathered that these concepts are not
the same thing and that it's dangerous to conflate them, but
I'm kind of struggling to grasp what each term means.
The Unicode Consortium offers a glossary to explain this stuff,
but it's full of "definitions" like this:
Abstract Character. A unit of information used for the
organization, control, or representation of textual data. ...
...
Character. ... (2) Synonym for abstract character. (3) The
basic unit of encoding for the Unicode character encoding. ...
...
Glyph. (1) An abstract form that represents one or more glyph
images. (2) A synonym for glyph image. In displaying Unicode
character data, one or more glyphs may be selected to depict a
particular character.
...
Grapheme. (1) A minimally distinctive unit of writing in the
context of a particular writing system. ...
Most of these definitions possess the quality of sounding very
academic and formal, but lack the quality of meaning anything,
or else defer the problem of definition to yet another glossary
entry or section of the standard.
So I seek the arcane wisdom of those more learned than I. How
exactly do each of these concepts differ from each other, and
in what circumstances would they not have a one-to-one
relationship with each other?"
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
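For what it's worth, a small D sketch (plain Phobos, nothing
exotic) makes the non-one-to-one relationship the question asks
about concrete: the same short string has a different length
depending on whether you count UTF-8 code units, code points, or
graphemes.

```d
import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" with the 'ë' spelled as 'e' + U+0308 (combining diaeresis)
    string s = "noe\u0308l";

    writeln(s.length);                 // 6 UTF-8 code units
    writeln(s.walkLength);             // 5 code points (strings auto-decode to dchar)
    writeln(s.byGrapheme.walkLength);  // 4 graphemes, i.e. what a reader sees
}
```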
Honestly, Unicode is a mess, and I believe we will all have to
dump the Unicode standard and start over one day. Until that
fine day, there is no neat solution for handling it, no matter
how much you'd like to think there is. Much of the complexity
actually comes from the complexity of the various languages'
writing systems, so no standard can wave that away; but Unicode
certainly adds unneeded complexity on top of it, which is why it
should be dumped.
One problem, imo, is that they mixed the terms up: "Grapheme: A
minimally distinctive unit of writing in the context of a
particular writing system." In linguistics a grapheme is not
necessarily a single character like "á" or "g". It can also be a
combination of characters, like the English digraph <sh> ("s" +
"h") that maps to a single phoneme (e.g. ship, shut, shadow). In
German that sound is usually written <sch>, as in "Schiff"
(ship), but not always, cf. the "s" in "Stange".
Since Unicode is such a difficult beast to deal with, I'd say D
(or any PL for that matter) needs, first and foremost, a clear
policy about what the default behavior is - not ad hoc patches.
Then maybe a strategy for how the default behavior can be turned
on and off, say for performance reasons. One way _could_ be a
compiler switch that toggles the default behavior (-unicode or
-uni or -utf8 or whatever), or, maybe better, a library solution
like `ustring` (rough sketch below).
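Just to make the idea concrete, here's a rough sketch of what
such a `ustring` wrapper could look like. The name and API are
hypothetical, not an existing Phobos type; the point is only
that grapheme-correct behavior is the default and the raw bytes
are an explicit opt-out.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical `ustring`: stores ordinary UTF-8, but the default view
// is grapheme-level; raw code units are an explicit escape hatch.
struct ustring
{
    private string data;

    this(string s) { data = s; }

    // what a reader would call the number of "characters"
    size_t count() { return data.byGrapheme.walkLength; }

    // opt-out for fast paths that want the raw UTF-8
    string raw() { return data; }
}

void main()
{
    import std.stdio : writeln;

    auto u = ustring("noe\u0308l");
    writeln(u.count);       // 4 graphemes
    writeln(u.raw.length);  // 6 code units
}
```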
If you need high performance and correctness checks are mostly a
non-issue (web crawling, data harvesting, etc.), get rid of
autodecoding. Once you need character/grapheme correctness (e.g.
translation tools), make it available through something like
`to!ustring`. Whichever way: be clear about it. But don't let
the unsuspecting user use `string` and get bitten by it.
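For what it's worth, the pieces for both paths already exist in
Phobos as explicit opt-ins, which is roughly the split being
asked for; a `to!ustring`-style conversion could then be layered
on top of the grapheme path:

```d
import std.stdio : writeln;
import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "Stra\u00dfe";  // "Straße": 7 UTF-8 code units, 6 graphemes

    // Performance path: opt out of autodecoding, walk raw code units.
    writeln(s.byCodeUnit.walkLength);  // 7

    // Correctness path: ask for graphemes explicitly when it matters.
    writeln(s.byGrapheme.walkLength);  // 6
}
```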