Re: The Case Against Autodecode

Walter Bright via Digitalmars-d Fri, 03 Jun 2016 20:06:47 -0700

On 6/3/2016 6:08 PM, H. S. Teoh via Digitalmars-d wrote:

It's not a hard concept, except that these different letters have
lookalike forms with completely unrelated letters. Again:


- Lowercase Latin m looks visually the same as lowercase Cyrillic Т in
  cursive form. In some font renderings the two are IDENTICAL glyphs, in
  spite of being completely different, unrelated letters.  However, in
  non-cursive form, Cyrillic lowercase т is visually distinct.

- Similarly, lowercase Cyrillic П in cursive font looks like lowercase
  Latin n, and in some fonts they are identical glyphs. Again,
  completely unrelated letters, yet they have the SAME VISUAL
  REPRESENTATION.  However, in non-cursive font, lowercase Cyrillic П is
  п, which is visually distinct from Latin n.

- These aren't the only ones, either.  Other Cyrillic false friends
  include cursive Д, which in some fonts looks like lowercase Latin g.
  But in non-cursive font, it's д.

Just given the above, it should be clear that going by visual
representation is NOT enough to disambiguate between these different
letters.
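To make the lookalike pairs above concrete, here is a minimal Python sketch (not from the original thread) that prints the code point and Unicode name for each pair. The pairing list and the `describe` helper are illustrative assumptions; only the standard `unicodedata` module is used.

```python
import unicodedata

# Illustrative pairs from the examples above: a Latin letter and the
# Cyrillic letter whose cursive form can resemble it in some fonts.
LOOKALIKES = [("m", "\u0442"), ("n", "\u043f"), ("g", "\u0434")]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in LOOKALIKES:
    print(describe(latin), "vs", describe(cyrillic))
```

Whatever a given font draws, the two members of each pair compare unequal as strings, because they are distinct code points.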

It works for books. Unicode invented a problem, and came up with a thoroughly wretched "solution" that we'll be stuck with for generations. One of those bad solutions is to have the reader not know what a glyph actually is without pulling back the cover to read the codepoint. It's madness.

By your argument, since lowercase Cyrillic Т is, visually,
just m, it should be encoded the same way as lowercase Latin m. But this
is untenable, because the letterform changes with a different font. So
you end up with the unworkable idea of a font-dependent encoding.

Oh rubbish. Let go of the idea that choosing bad fonts should drive Unicode codepoint decisions.

Or, to use an example closer to home, uppercase Latin O and the digit 0
are visually identical. Should they be encoded as a single code point or
two?  Worse, in some fonts, the digit 0 is rendered like Ø (to
differentiate it from uppercase O). Does that mean that it should be
encoded the same way as the Danish letter Ø?  Obviously not, but
according to your "visual representation" idea, the answer should be
yes.

Don't confuse fonts with code points. It'd be adequate if Unicode defined a canonical glyph for each code point, and let the font makers do what they wish.

The notion of 'case' should not be part of Unicode, as that is
semantic information that is beyond the scope of Unicode.

But what should "i".toUpper return?

Not relevant to my point that Unicode shouldn't decide what "upper case" for all languages means, any more than Unicode should specify a font. Now when you argue that Unicode should make such decisions, note what a spectacularly hopeless job of it they've done.
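
For readers who want to see what the "i".toUpper question looks like in practice, here is a small Python sketch (an illustration only, not D's std.uni and not anything proposed in this thread). It uses Python's built-in, locale-independent `str.upper` and `str.lower`; the `codepoints` helper and the choice of sample characters are assumptions made for the example.

```python
# Sample characters chosen for illustration:
#   "i"      Latin small letter i
#   "\u0131" Latin small letter dotless i (used in Turkish)
#   "\u00df" Latin small letter sharp s (German)
SAMPLES = ["i", "\u0131", "\u00df"]

def codepoints(s):
    """Return the code points of s as 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]

for src in SAMPLES:
    print(codepoints(src), "->", codepoints(src.upper()))

# Lowercasing capital I with dot above (U+0130) yields two code points.
print(codepoints("\u0130".lower()))
```

With these default mappings, both "i" and dotless "ı" uppercase to plain "I", sharp s expands to two letters, and lowercasing U+0130 produces "i" followed by a combining dot. A Turkish reader would expect "i" to uppercase to "İ" instead, which is why the answer depends on language and not only on the code point.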
