On Friday, 3 June 2016 at 20:53:32 UTC, H. S. Teoh wrote:

> Even the Greek sigma has two forms depending on whether it's at the end of a word or not -- so should it be two code points or one? If you say two, then you'd have a problem with how to search for sigma in Greek text, and you'd have to search for either medial sigma or final sigma. But if you say one, then you'd have a problem with having two different letterforms for a single codepoint.

In Unicode there are two different codepoints for lowercase sigma, ς U+03C2 and σ U+03C3, but only one uppercase sigma, Σ U+03A3 (codepoint U+03A2 is unassigned). So your objection is not hypothetical; it is an actual issue for uppercase() and lowercase() functions. Another difficulty, besides the dotted and dotless i of the Turkic languages, is the digraphs used in the Latin transcription of Cyrillic text in eastern and southern Europe -- dž, lj, nj and dz -- which have both an uppercase form (DŽ, LJ, NJ, DZ) and a titlecase form (Dž, Lj, Nj, Dz).
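Both points can be seen in Python's built-in case mappings (a sketch assuming Python 3.3 or later, whose str.lower() applies Unicode's Final_Sigma rule):

```python
# One uppercase Σ maps to two lowercase forms depending on position:
# medial σ inside a word, final ς at the end.
word = "ΟΔΥΣΣΕΥΣ"       # "Odysseus", all caps
print(word.lower())      # -> οδυσσευς : σσ in the middle, ς at the end

# Single-codepoint digraphs have three case forms: lower, upper and titlecase.
dz = "\u01C6"            # dž  LATIN SMALL LETTER DZ WITH CARON
print(dz.upper())        # DŽ  U+01C4
print(dz.title())        # Dž  U+01C5 (titlecase, as at the start of a word)
```

A search for "sigma" in lowercased Greek text therefore still has to account for both ς and σ, exactly as described above.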


> Besides, that still doesn't solve the problem of what "i".uppercase() should return. In most languages, it should return "I", but in Turkish it should not. And if we really went the route of encoding Cyrillic letters the same as their Latin lookalikes, we'd have a problem with what "m".uppercase() should return, because now it depends on which font is in effect (if it's a Cyrillic cursive font, the correct answer is "Т", if it's a Latin font, the correct answer is "M" -- the other combinations: who knows). That sounds far worse than what we have today.
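Indeed, the built-in case functions of most languages are locale-independent: in Python, "i".upper() is always "I". A Turkish-aware uppercase has to be done by hand (a minimal sketch; a real application would use a locale-aware library such as ICU):

```python
def turkish_upper(s: str) -> str:
    # Turkish cases the two i's separately:
    #   i (U+0069) -> İ (U+0130)   and   ı (U+0131) -> I (U+0049)
    # Map the dotted i explicitly, then use the default algorithm,
    # which already uppercases ı to plain I.
    return s.replace("i", "\u0130").upper()

print("istanbul".upper())         # ISTANBUL -- wrong for Turkish
print(turkish_upper("istanbul"))  # İSTANBUL
print(turkish_upper("kırmızı"))   # KIRMIZI
```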

As an anecdote, I can tell the story of the accession of Romania and Bulgaria to the European Union in 2007. The issue was that several letters used by Romanian and Bulgarian had been forgotten by the Unicode consortium (Ș U+0218, ș U+0219, Ț U+021A, ț U+021B, and two Cyrillic letters that I do not remember). As a replacement, the Romanians used the cedilla forms Ş, ş, Ţ and ţ (U+015E, U+015F, U+0162 and U+0163), which look somewhat alike. When the Commission finally managed to force Microsoft to correct the fonts to include the comma-below letters, we could start to correct the data. The transition finished in 2012, and it was only possible because no other language we deal with uses the "wrong" codepoints (Turkish does, but fortunately we only have a handful of Turkish records in our databases). So: five years of ad hoc processing for the substitution of 4 codepoints. BTW, using combining diacritics was out of the question at that time, simply because Microsoft Word didn't support them and many documents we encountered still used only legacy codepages (one also has to remember that in a big institution like the EC, IT is always several years behind the open market, so when a product is at release X, the Institution may still be using a release from five years earlier).
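The correction itself is a plain codepoint substitution; in Python it could be sketched with str.translate (a hypothetical illustration, not the EC's actual tooling):

```python
# Map the cedilla stopgap letters to the correct Romanian
# comma-below codepoints.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})

text = "\u0162ara \u015Fi ora\u015Ful"  # sample text using the wrong codepoints
print(text.translate(CEDILLA_TO_COMMA))
```

A blanket mapping like this is only safe because, as noted above, the databases contained almost no Turkish text -- Turkish legitimately uses the cedilla forms, which is exactly why the transition took five years of ad hoc processing rather than one script.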

