On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:
> It's not a hard concept, except that these different letters have
> lookalike forms with completely unrelated letters. Again:
> - Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
> cursive form. In some font renderings the two are IDENTICAL glyphs, in
> spite of being completely different, unrelated letters. However, in
> non-cursive form, Cyrillic lowercase т is visually distinct.
> - Similarly, lowercase Cyrillic П in cursive font looks like lowercase
> Latin n, and in some fonts they are identical glyphs. Again,
> completely unrelated letters, yet they have the SAME VISUAL
> REPRESENTATION. However, in non-cursive font, lowercase Cyrillic П is
> п, which is visually distinct from Latin n.
> - These aren't the only ones, either. Other Cyrillic false friends
> include cursive Д, which in some fonts looks like lowercase Latin g.
> But in non-cursive font, it's д.
> Just given the above, it should be clear that going by visual
> representation is NOT enough to disambiguate between these different
> letters.
It works for books. Unicode invented a problem, and came up with a thoroughly
wretched "solution" that we'll be stuck with for generations. One of those bad
solutions is to have the reader not know what a glyph actually is without pulling
back the cover to read the codepoint. It's madness.
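
For anyone who wants to see what "pulling back the cover" looks like in practice, here is a minimal Python sketch (not from the original thread; the `describe` helper and the `LOOKALIKES` list are names made up for illustration). It only prints the code point and Unicode name of each lookalike pair mentioned in the quoted text:

```python
import unicodedata

# Latin / Cyrillic pairs that the quoted text says can share a glyph in
# cursive fonts. Escapes are used so the source itself is unambiguous.
LOOKALIKES = [
    ("m", "\u0442"),  # Latin m vs Cyrillic te
    ("n", "\u043f"),  # Latin n vs Cyrillic pe
    ("g", "\u0434"),  # Latin g vs Cyrillic de
]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in LOOKALIKES:
    print(f"{describe(latin)}  vs  {describe(cyrillic)}")
```

Whatever the font shows, each pair compares unequal as strings, because the two members are different code points.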
> By your argument, since lowercase Cyrillic Т is, visually,
> just m, it should be encoded the same way as lowercase Latin m. But this
> is untenable, because the letterform changes with a different font. So
> you end up with the unworkable idea of a font-dependent encoding.
Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode
codepoint decisions.
> Or, to use an example closer to home, uppercase Latin O and the digit 0
> are visually identical. Should they be encoded as a single code point or
> two? Worse, in some fonts, the digit 0 is rendered like Ø (to
> differentiate it from uppercase O). Does that mean that it should be
> encoded the same way as the Danish letter Ø? Obviously not, but
> according to your "visual representation" idea, the answer should be
> yes.
Don't confuse fonts with code points. It'd be adequate if Unicode defined a
canonical glyph for each code point, and let the font makers do what they wish.
>> The notion of 'case' should not be part of Unicode, as that is
>> semantic information that is beyond the scope of Unicode.
> But what should "i".toUpper return?
Not relevant to my point that Unicode shouldn't decide what "upper case" for all
languages means, any more than Unicode should specify a font. Now when you argue
that Unicode should make such decisions, note what a spectacularly hopeless job
of it they've done.
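
To make the `"i".toUpper` question concrete, here is a small Python sketch (Python rather than D, and not from the original thread; `case_report` is a made-up helper name). It assumes only that Python's `str.upper()` and `str.lower()` apply Unicode's default, locale-independent case mappings, which is why the Turkish letters come out the way they do:

```python
def case_report(ch):
    """Return (upper, lower) under Unicode's default case mappings."""
    return ch.upper(), ch.lower()

# Plain Latin i: upper-cases to 'I', which is wrong for Turkish text,
# where the upper case of 'i' is U+0130 (capital I with dot above).
print(case_report("i"))

# U+0131 dotless i also upper-cases to plain 'I', so the round trip
# i -> I -> i loses the distinction between the two Turkish letters.
print(case_report("\u0131"))

# U+0130 lower-cases to two code points: 'i' plus U+0307 combining dot above.
print([f"U+{ord(c):04X}" for c in "\u0130".lower()])

# U+00DF sharp s upper-cases to the two-letter string 'SS'.
print(case_report("\u00df"))
```

So a single character can change length under case conversion, and the "right" answer for `i` depends on the language of the text, which no locale-free function can know.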