On Fri, Jun 03, 2016 at 11:43:07AM -0700, Walter Bright via Digitalmars-d wrote:
> On 6/3/2016 9:28 AM, H. S. Teoh via Digitalmars-d wrote:
> > Eventually you have no choice but to encode by logical meaning
> > rather than by appearance, since there are many lookalikes between
> > different languages that actually mean something completely
> > different, and often behaves completely differently.
>
> It's almost as if printed documents and books have never existed!

But if we were to encode appearance instead of logical meaning, that
would mean the *same* lowercase Cyrillic ь would have multiple,
different encodings depending on which font was in use. That doesn't
seem like the right solution either. Do we really want Unicode strings
to encode font information too??

'Cos by that argument, serif and sans serif letters should have
different encodings, because in languages like Hebrew, a tiny little
serif could mean the difference between two completely different
letters.

And what of the Arabic and Indic scripts? They would need to encode the
same letter multiple times, once for each variation of the physical
form, which changes depending on the surrounding context. Even the
Greek sigma has two forms depending on whether it's at the end of a
word or not -- so should it be two code points or one? If you say two,
then you have a problem with how to search for sigma in Greek text,
because you'd have to search for both medial sigma and final sigma. But
if you say one, then you have a problem with a single code point having
two different letterforms.

Besides, that still doesn't solve the problem of what "i".uppercase()
should return. In most languages, it should return "I", but in Turkish
it should not. And if we really went the route of encoding Cyrillic
letters the same as their Latin lookalikes, we'd have a problem with
what "m".uppercase() should return, because now it depends on which
font is in effect (if it's a Cyrillic cursive font, the correct answer
is "Т"; if it's a Latin font, the correct answer is "M" -- the other
combinations: who knows). That sounds far worse than what we have
today.


T

-- 
Let's eat some disquits while we format the biskettes.
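
To make the case-mapping argument above concrete, here is a minimal sketch in Python (not D, and not any particular library's locale API). It relies only on the standard `unicodedata` module and the built-in `str.upper()` and `str.casefold()`. The helper names `describe` and `upper_for_lang`, and the `TURKIC_LANGS` set, are made up for illustration; real locale-aware casing covers more rules than the single Turkic one shown here.

```python
import unicodedata

# Lookalikes are distinct code points, so uppercasing never has to ask
# which font is in effect.
LATIN_M = "m"            # U+006D LATIN SMALL LETTER M
CYRILLIC_TE = "\u0442"   # U+0442; its cursive form resembles a Latin "m"


def describe(ch):
    """Return the code point and Unicode character name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Hypothetical, deliberately incomplete set of languages that use the
# dotted/dotless i distinction.
TURKIC_LANGS = {"tr", "az"}


def upper_for_lang(text, lang):
    """Uppercase text, applying the Turkic dotted-i rule when lang asks for it.

    Illustrative helper only: the language has to be supplied by the
    caller, because the string itself does not carry that information.
    """
    if lang in TURKIC_LANGS:
        # "i" uppercases to dotted capital I (U+0130) in Turkic languages.
        text = text.replace("i", "\u0130")
    return text.upper()


print(describe(LATIN_M), "->", describe(LATIN_M.upper()))
print(describe(CYRILLIC_TE), "->", describe(CYRILLIC_TE.upper()))
print(upper_for_lang("i", "en"), upper_for_lang("i", "tr"))

# Final sigma (U+03C2) and medial sigma (U+03C3) are two code points, but
# case folding maps both to the same one, which is one way to make a
# search match either form.
print("\u03c2".casefold() == "\u03c3".casefold())
```

The point of the sketch is that the uppercase of a letter follows from its code point (plus, for "i", the language), never from its rendered shape.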