On 01/15/2011 08:51 PM, Steven Schveighoffer wrote:
> Moreover, even if you ignore Hebrew as a tiny insignificant minority,
> you cannot do the same for Arabic, which has over one *billion* people
> who use that language.
>
> I hope that the medium type works 'good enough' for those languages,
> with the high-level type needed for advanced usages.  At a minimum,
> comparison and substring should work for all languages.

Hello Steven,

How does an application know that a given text, which is supposedly written in a given natural language (as indicated, for instance, by an HTML header), does not also contain terms from other languages? There are various occasions for this: quotations, use of foreign words, pointers...
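As a rough illustration of why one cannot rely on a document-level language tag, here is a small sketch (Python, only because its standard library is handy for this; the function and the name-based heuristic are my own, not anything a library provides) that reports which scripts appear in a piece of text:

import unicodedata

def scripts_used(text):
    """Return the set of scripts (crudely inferred from Unicode character
    names) occurring in `text`.  Purely illustrative."""
    scripts = set()
    for ch in text:
        if ch.isspace() or unicodedata.category(ch).startswith("P"):
            continue  # skip whitespace and punctuation
        name = unicodedata.name(ch, "")
        # The first word of a character's name (LATIN, HEBREW, ARABIC, ...)
        # is used here as a stand-in for the real Script property.
        if name:
            scripts.add(name.split()[0])
    return scripts

# A nominally "English" sentence that still contains Hebrew and Arabic:
print(scripts_used("The word shalom (\u05e9\u05dc\u05d5\u05dd) and salaam (\u0633\u0644\u0627\u0645)."))
# prints something like {'LATIN', 'HEBREW', 'ARABIC'} (set order varies)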

A side issue is raised by precomposed codes for composite characters. For most languages of the world, I guess (but am unsure), all "official" characters have single-code-point representations. Good, but unfortunately this is not enforced by the standard (instead, the decomposed form can sensibly be considered the base form, but that is another topic). So even if one knows for sure that all characters of all texts an app will ever deal with can be mapped to single code points, to be safe one would have to normalise to NFC anyway (Normalization Form C, composed). Then, where is the actual gain? In fact, it is a loss, because NFC is more costly than NFD (Form D, decomposed): the standard NFC algorithm first decomposes to NFD to obtain a unique representation, which it can then (re)compose via simple mappings.
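To make the comparison/substring point concrete, here is a small sketch (again Python, simply because normalisation ships in its standard library; the same applies in any language) showing that a precomposed 'é' and its decomposed equivalent compare unequal as raw code points, and that normalising to either form restores a canonical representation:

import unicodedata

precomposed = "caf\u00e9"      # 'é' as a single code point, U+00E9
decomposed  = "cafe\u0301"     # 'e' followed by COMBINING ACUTE ACCENT, U+0301

# Naive code-point comparison sees two different strings:
print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 4 5

# After normalisation (either form), comparison and substring search behave:
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                        # True
print("f\u00e9" in nfc_b)                    # True

# NFD is the cheaper target form; NFC conceptually decomposes first,
# then recomposes via the canonical composition mappings.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])
# ['0x63', '0x61', '0x66', '0x65', '0x301']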

For further information:
Unicode's normalisation algos: http://unicode.org/reports/tr15/
list of technical reports: http://unicode.org/reports/
(Unicode's technical reports are far more readable than the standard itself, but unfortunately they often refer to it.)

Denis
_________________
vita es estrany
spir.wikidot.com
