On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:

> Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

In Unicode there are two different codepoints for lowercase sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3 (codepoint U+03A2 is unassigned). So your objection is not hypothetical; it is an actual issue for uppercase() and lowercase() functions. Another difficulty, besides the dotted and dotless i of the Turkic languages, is the digraphs used in the Latin transcription of Cyrillic text in eastern and southern Europe -- dž, lj, nj and dz -- which have both an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).
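Both points can be seen in Python's built-in case mappings (a sketch assuming Python 3.3 or later, whose str.lower() applies Unicode's Final_Sigma rule):

```python
# One uppercase Σ maps to two lowercase forms depending on position:
# medial σ inside a word, final ς at the end.
word = "ΟΔΥΣΣΕΥΣ"       # "Odysseus", all caps
print(word.lower())      # -> οδυσσευς : σσ in the middle, ς at the end

# Single-codepoint digraphs have three case forms: lower, upper and titlecase.
dz = "\u01C6"            # dž  LATIN SMALL LETTER DZ WITH CARON
print(dz.upper())        # DŽ  U+01C4
print(dz.title())        # Dž  U+01C5 (titlecase, as at the start of a word)
```

A search for "sigma" in lowercased Greek text therefore still has to account for both ς and σ, exactly as described above.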


> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.
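Indeed, the built-in case functions of most languages are locale-independent: in Python, "i".upper() is always "I". A Turkish-aware uppercase has to be done by hand (a minimal sketch; a real application would use a locale-aware library such as ICU):

```python
def turkish_upper(s: str) -> str:
    # Turkish cases the two i's separately:
    #   i (U+0069) -> İ (U+0130)   and   ı (U+0131) -> I (U+0049)
    # Map the dotted i explicitly, then use the default algorithm,
    # which already uppercases ı to plain I.
    return s.replace("i", "\u0130").upper()

print("istanbul".upper())         # ISTANBUL -- wrong for Turkish
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_upper("kırmızı"))   # KIRMIZI
```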

As an anecdote, I can tell the story of the accession of Romania and Bulgaria to the European Union in 2007. The issue was that several letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B, and two Cyrillic letters that I do not remember). As a replacement, the Romanians used the cedilla forms Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163), which look somewhat alike. When the Commission finally managed to force Microsoft to correct the fonts to include the comma-below letters, we could start to correct the data. The transition finished in 2012, and it was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of Turkish records in our databases). So: five years of ad hoc processing for the substitution of 4 codepoints. BTW, using combining diacritics was out of the question at that time, simply because Microsoft Word didn't support them and many documents we encountered still used only legacy codepages (one also has to remember that in a big institution like the EC, IT is always several years behind the open market, so when a product is at release X, the Institution may still be using a release from five years earlier).
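The correction itself is a plain codepoint substitution; in Python it could be sketched with str.translate (a hypothetical illustration, not the EC's actual tooling):

```python
# Map the cedilla stopgap letters to the correct Romanian
# comma-below codepoints.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

text = "\u0162ara \u015Fi ora\u015Ful"  # sample text using the wrong codepoints
print(text.translate(CEDILLA_TO_COMMA))
```

A blanket mapping like this is only safe because, as noted above, the databases contained almost no Turkish text -- Turkish legitimately uses the cedilla forms, which is exactly why the transition took five years of ad hoc processing rather than one script.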

