Kaixo!

On Thu, Nov 18, 2004 at 01:50:28PM +0100, Danilo Segan wrote:

> The issue is whether you would care about being able to differentiate
> in your text processor between languages or not.

It is indeed a good feature to do so; but the *smallest* unit for which
language information is useful is the *word*, not the character/letter.

> So, "jota" would
> still make sense in Spanish, whatever it was pronounced as, but not
> much sense in English (since it's not a word there). I think this is
> a good property to know.

No, it is useless. The letter "j", alone, is the same letter in all
languages using the Latin script. There is absolutely no gain in creating
differences based on language (plus, I know of no language where there is
a word consisting of the single letter "j").

"Disambiguating" letters depending on the language is a very bad idea,
because it destroys the interchangeability of documents.
You have problems doing Google searches in Serbian because a text can be
in two different scripts; with your idea of disambiguating letters, the
same problem would exist for almost all languages (minus the very few
that use a script of their own). It would be even worse, as the same
English text, for example, could be encoded in dozens of different ways
(e.g. in English-letters, Spanish-letters, Portuguese-letters,
French-letters, German-letters, Italian-letters, Indonesian-letters,
Polish-letters, Irish-letters, Welsh-letters, Danish-letters, ...).

> We must agree that these differences Unicode went after are
> glyph-based, rather than character-based.

They are character-based, with a character defined as an atomic element
of a script (there are of course a lot of exceptions due to historical
reasons, but that is the basic idea).

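To make the point about scripts and searching concrete, here is a minimal
sketch in Python using only the standard unicodedata module. The sample
words ("Srbija" and its Cyrillic spelling) and the helper script_prefix()
are chosen here for illustration and do not come from the thread; taking
the first word of a character's Unicode name is only a rough stand-in for
its script, since the standard library does not expose the script
property directly.

```python
import unicodedata


def script_prefix(ch: str) -> str:
    """Hypothetical helper: first word of the Unicode character name,
    used here as a rough approximation of the character's script."""
    return unicodedata.name(ch).split()[0]


latin_a = "a"            # U+0061
cyrillic_a = "\u0430"    # U+0430, looks the same in most fonts

# The same Serbian word written in the two scripts (illustrative sample).
latin_word = "Srbija"
cyrillic_word = "\u0421\u0440\u0431\u0438\u0458\u0430"

# A text written in Cyrillic only.
text = "\u0420\u0435\u043f\u0443\u0431\u043b\u0438\u043a\u0430 " + cyrillic_word

print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))
print(latin_a == cyrillic_a)      # distinct characters despite similar shape
print(latin_word in text)         # a Latin-script query misses Cyrillic text
print(cyrillic_word in text)
```

The two "a" characters compare as different and carry different names,
and the Latin-script query does not match the Cyrillic text. That is the
search problem described above; splitting every script further by
language would multiply it.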
So, Unicode is a collection of *scripts*; each script is separate and
independent of the others, and each script is a collection of characters
belonging to that script. (There are some special characters, like
generic punctuation and ASCII digits, that can be used with most scripts;
but apart from those shared characters, characters are exclusive to a
given script, even if in some cases they resemble characters of another
script.)

The basic unit for encoding writing is the script; that is so when
encoding text electronically simply because it is so when writing text
by hand or in print.

> I say that "a" and "а" are same characters in Serbian,

They are not. They may be the same *letter* in Serbian, but a letter is
not a character. In Spanish, "ch" is a letter (yes, I'm a
traditionalist), just as in Serbian "lj" and "nj" are letters; however,
the characters involved are "c", "h", "l", "n", "j".
Note also that in the Cyrillic script "љ" and "њ" are single characters,
while "лј" and "нј" are not.

Script changes are a bit like orthographic changes (extreme orthographic
changes). In French, the word for "acute" used to be written "aiguë";
now it is "aigüe". It is impossible to encode both in a single way: you
can attach as many language properties to characters as you like, but
you cannot handle orthographic changes that way.

You wrongly see the Latin and Cyrillic variants of Serbian as simple
differences in the shape of the same characters; that is not so. You
should instead look at them as two orthographic variants.

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://chanae.walon.org/pablo/    PGP Key available, key ID: 0xD9B85466
[you can write me in Walloon, Spanish, French, English, Catalan or Esperanto]
[min povas skribi en valona, esperanta, angla aux latinidaj lingvoj]