On Sat, Feb 20, 2016 at 11:10 PM, Asmus Freytag (t) <asmus-...@ix.netcom.com> wrote:
> Unicode, even CLDR, doesn't nearly have enough data for the purpose.
> (and as a corollary of what Elias points out, it's likely to annoy users
> of every language, in that it would fold essential and non-essential
> distinctions indiscriminately).
>
> I've been working on this problem in the context of international
> top-level domain names, where the aim of the project is to identify labels
> that are seen as "the same" by users of a given script (but, in cases of
> identical appearance, we also include those seen as identical by users
> across scripts).
>
> None of the working groups in this project has felt like turning to CLDR
> for this purpose, and so far, each has approached the issue in a way that
> is not linked to sorting.
>
> Finally, none has seen folding of diacritics as useful; however, in the
> case of Arabic, where optional combining marks simply are not supported (so
> as to avoid having to define a folding).
>
> (see
> https://www.icann.org/sites/default/files/lgr/lgr-1-arabic-script-01dec15-en.html
> )
>

It depends on what the folding is being used for: there are many different
purposes. For some purposes, the goal of "is seen as the same" is
appropriate, while for others a broader scope is appropriate, typically
because someone wants a quick filter to get to a relatively small set of
strings which can then be processed in a more CPU-intensive fashion.

In any case, one can only get an approximation; the question is whether that
approximation is sufficient for the task at hand.

Mark
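
To make the "quick filter, then more CPU-intensive processing" idea concrete, here is a minimal Python sketch. It is only an illustration and not something proposed in this thread: the function names (fold, build_index, strict_match, lookup) are hypothetical, the folding shown (case folding plus dropping nonspacing marks) is one arbitrary and deliberately lossy choice, and strict_match is a cheap stand-in for whatever expensive comparison a real task would need.

```python
import unicodedata
from collections import defaultdict


def fold(text: str) -> str:
    """Coarse key: case-fold, decompose, and drop nonspacing marks (category Mn).

    This is intentionally lossy; it is only meant to narrow the search.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def build_index(strings):
    """Group strings by their folded key so lookups touch only a small bucket."""
    index = defaultdict(list)
    for s in strings:
        index[fold(s)].append(s)
    return index


def strict_match(a: str, b: str) -> bool:
    """Placeholder for the expensive stage (assumption): equality after NFC
    normalization and case folding, so diacritics still matter here."""
    na = unicodedata.normalize("NFC", a).casefold()
    nb = unicodedata.normalize("NFC", b).casefold()
    return na == nb


def lookup(index, query: str):
    """Stage 1: cheap bucket lookup. Stage 2: strict check on the candidates."""
    candidates = index.get(fold(query), [])
    confirmed = [c for c in candidates if strict_match(c, query)]
    return candidates, confirmed


index = build_index(["résumé", "resume", "Resume", "rosé"])
candidates, confirmed = lookup(index, "Résumé")
print(candidates)
print(confirmed)
```

In this sketch the folded key over-matches on purpose: "resume" and "résumé" fall into the same bucket, and only the second stage decides which candidates count as the same for the task at hand.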