On Thursday, 6 September 2018 at 10:44:45 UTC, Joakim wrote:
[snip]
You're not being fair here, Chris. I just saw this SO question
that I think exemplifies how most programmers react to Unicode:
"Trying to understand the subtleties of modern Unicode is
making my head hurt. In particular, the distinction between
code points, characters, glyphs and graphemes - concepts which
in the simplest case, when dealing with English text using
ASCII characters, all have a one-to-one relationship with each
other - is causing me trouble.
Seeing how these terms get used in documents like Mathias
Bynens' JavaScript has a Unicode problem or Wikipedia's piece
on Han unification, I've gathered that these concepts are not
the same thing and that it's dangerous to conflate them, but
I'm kind of struggling to grasp what each term means.
The Unicode Consortium offers a glossary to explain this stuff,
but it's full of "definitions" like this:
Abstract Character. A unit of information used for the
organization, control, or representation of textual data. ...
...
Character. ... (2) Synonym for abstract character. (3) The
basic unit of encoding for the Unicode character encoding. ...
...
Glyph. (1) An abstract form that represents one or more glyph
images. (2) A synonym for glyph image. In displaying Unicode
character data, one or more glyphs may be selected to depict a
particular character.
...
Grapheme. (1) A minimally distinctive unit of writing in the
context of a particular writing system. ...
Most of these definitions possess the quality of sounding very
academic and formal, but lack the quality of meaning anything,
or else defer the problem of definition to yet another glossary
entry or section of the standard.
So I seek the arcane wisdom of those more learned than I. How
exactly do each of these concepts differ from each other, and
in what circumstances would they not have a one-to-one
relationship with each other?"
https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme
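For what it's worth, a small D sketch (plain Phobos, nothing
exotic) makes the non-one-to-one relationship the question asks
about concrete: the same short string has a different length
depending on whether you count UTF-8 code units, code points, or
graphemes.

```d
import std.stdio : writeln;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "noël" with the 'ë' spelled as 'e' + U+0308 (combining diaeresis)
    string s = "noe\u0308l";

    writeln(s.length);                 // 6 UTF-8 code units
    writeln(s.walkLength);             // 5 code points (strings auto-decode to dchar)
    writeln(s.byGrapheme.walkLength);  // 4 graphemes, i.e. what a reader sees
}
```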
Honestly, Unicode is a mess, and I believe we will all have to
dump the Unicode standard and start over one day. Until that
fine day, there is no neat solution for handling it, no matter
how much you'd like to think there is. Much of the complexity
actually comes from the complexity of the various languages'
writing systems, so no standard can wave that away; but Unicode
certainly adds unneeded complexity on top of it, which is why it
should be dumped.
One problem, imo, is that they mixed the terms up: "Grapheme: A
minimally distinctive unit of writing in the context of a
particular writing system." In linguistics a grapheme is not
necessarily a single character like "á" or "g". It can also be a
combination of characters, like the English digraph <sh> ("s" +
"h") that maps to a single phoneme (e.g. ship, shut, shadow). In
German that sound is usually written <sch>, as in "Schiff"
(ship), but not always, cf. the "s" in "Stange".
Since Unicode is such a difficult beast to deal with, I'd say D
(or any PL for that matter) needs, first and foremost, a clear
policy about what the default behavior is - not ad hoc patches.
Then maybe a strategy for how the default behavior can be turned
on and off, say for performance reasons. One way _could_ be a
compiler switch that toggles the default behavior (-unicode or
-uni or -utf8 or whatever), or, maybe better, a library solution
like `ustring` (rough sketch below).
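Just to make the idea concrete, here's a rough sketch of what
such a `ustring` wrapper could look like. The name and API are
hypothetical, not an existing Phobos type; the point is only
that grapheme-correct behavior is the default and the raw bytes
are an explicit opt-out.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical `ustring`: stores ordinary UTF-8, but the default view
// is grapheme-level; raw code units are an explicit escape hatch.
struct ustring
{
    private string data;

    this(string s) { data = s; }

    // what a reader would call the number of "characters"
    size_t count() { return data.byGrapheme.walkLength; }

    // opt-out for fast paths that want the raw UTF-8
    string raw() { return data; }
}

void main()
{
    import std.stdio : writeln;

    auto u = ustring("noe\u0308l");
    writeln(u.count);       // 4 graphemes
    writeln(u.raw.length);  // 6 code units
}
```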
If you need high performance and correctness checks are mostly a
non-issue (web crawling, data harvesting, etc.), get rid of
autodecoding. Once you need character/grapheme correctness (e.g.
translation tools), make it available through something like
`to!ustring`. Whichever way: be clear about it. But don't let
the unsuspecting user use `string` and get bitten by it.
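For what it's worth, the pieces for both paths already exist in
Phobos as explicit opt-ins, which is roughly the split being
asked for; a `to!ustring`-style conversion could then be layered
on top of the grapheme path:

```d
import std.stdio : writeln;
import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "Stra\u00dfe";  // "Straße": 7 UTF-8 code units, 6 graphemes

    // Performance path: opt out of autodecoding, walk raw code units.
    writeln(s.byCodeUnit.walkLength);  // 7

    // Correctness path: ask for graphemes explicitly when it matters.
    writeln(s.byGrapheme.walkLength);  // 6
}
```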