It should also be noted that the kind of "folding" described/desired by Elias will likely fall short of his expectations, even when using the collation data in CLDR tailored per language.
Notably, this data, even if it is used at its weakest strength (the primary collation level only, discarding the differences found at higher strength levels), will most often not collate many digrams/trigrams that are frequently used in the locale for which the data is designed. The reason is that most of these digrams/trigrams (used in the orthography to note a single phoneme) are highly context-dependent and could in fact cover several distinct phonemes.

E.g. "on" in French is a digram for the nasal o. There may also be mute letters (consonants) following it in the same phoneme. But if the consonant is followed by a vowel, then there's a possible syllable break between "on" and the following consonant. However that vowel may also be mute (if it is a final "e"), in which case there's a single syllable. If "on" is followed by a vowel, it is no longer a digram and there's a syllable break between "o" and "n"; if "on" is followed by a mute vowel (final "e"), that syllable break disappears, but "on" is still two distinct phonemes. "on" may also be followed by another "n" and a vowel (possibly mute), in which case "on" is never a single phoneme.

There are similar issues with other digrams/trigrams in French such as "ein", "aint", "un". There are distinct difficulties with "gu", "ge" and "qu", more difficulties with "ch" (also in English and other languages), and different difficulties again with "ai"...

Determining which digrams/trigrams are a single phoneme requires parsing words for syllable breaks. But there are many exceptions (notably because languages borrow lots of words from other languages with their original orthography, and the phonetics are only slightly altered). There exist some algorithms that try to use those weak "equivalences", based on the apparent orthography, to infer some basic phonetics from it.
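The context rules for French "on" described above can be sketched as a small heuristic. This is only an illustration of why a context-free folding table cannot work: the function name on_is_nasal and the vowel list are my own assumptions, the rule looks only at the single letter after "on", and it ignores the many exceptions (for instance "monsieur", where "on" is not a nasal o although a consonant follows).

```python
# Hypothetical helper, not taken from CLDR or any library: a naive
# one-letter lookahead for the French "on" digram.
VOWELS = set("aeiouyàâäéèêëîïôöùûü")

def on_is_nasal(word: str, index: int) -> bool:
    """Guess whether word[index:index+2] == "on" notes the nasal o.

    Rule of thumb from the discussion above:
      - followed by a vowel (even a mute final "e"): "o" and "n" are
        two distinct phonemes;
      - followed by another "n": never a single phoneme;
      - otherwise (end of word, or another consonant): nasal o.
    """
    word = word.lower()
    if word[index:index + 2] != "on":
        raise ValueError('no "on" at the given index')
    following = word[index + 2:index + 3]  # "" at end of word
    if following == "":
        return True
    return following not in VOWELS and following != "n"

if __name__ == "__main__":
    for w in ("bon", "pont", "monde", "bonne", "sonore", "téléphone"):
        print(w, on_is_nasal(w, w.index("on")))
```

The same two letters are therefore folded differently in "bon" and in "bonne", and deciding which case applies already requires looking beyond the digram itself.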
Such algorithms are used for performing approximate searches in arbitrary plain text, even in cases where it may contain some orthographic typos. Look for example at the SOUNDEX function (for some implementations you'll first need to detect word breaks). Trying to use dictionary data to determine the syllable breaks may be useful, but you need a lot of data (and all dictionaries are incomplete). To disambiguate some cases, you will in fact need to determine the actual phonetics by using a phonetic dictionary (data resources for that are difficult to find; even serious linguistic dictionaries only include part of the phonetics, and ignore the variants for derived orthographic forms).

2016-02-20 22:43 GMT+01:00 Doug Ewell <d...@ewellic.org>:

> Eli Zaretskii wrote:
>
>> What about language-independent character-folding: where in the
>> Unicode database is the data for that?
>
> The OP kind of alluded to that: there is no such thing really as
> language-independent character folding.
>
> About the closest approximation you can get using Unicode data alone (not
> CLDR) is to normalize to NFD, then ignore the combining diacritics. But
> that still doesn't work for a character like ø, which doesn't decompose to
> o + anything, and more importantly, it still won't meet expectations
> because of the n/ñ and o/ö/ø language-dependency problems.
>
> As Mark and Philippe said, the real solution is to use CLDR, because that
> is where language-dependent information like this lives.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
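
For reference, the NFD approximation Doug describes in the quoted message can be sketched in a few lines of Python using only the standard unicodedata module. The function name strip_diacritics is my own, and this is a sketch of that approximation only, not of what CLDR does.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Normalize to NFD, then drop every combining mark.

    This uses Unicode data alone (no CLDR), so it is
    language-independent and therefore wrong for any locale where
    the accented letter is a distinct letter.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero canonical combining
    # class for combining marks and 0 for everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

if __name__ == "__main__":
    print(strip_diacritics("ñ"))  # n
    print(strip_diacritics("ö"))  # o
    print(strip_diacritics("ø"))  # ø (no canonical decomposition)
```

The last case shows the first limitation mentioned in the quote: ø is left untouched. The first two show the second one: ñ and ö are folded unconditionally, whatever the language of the text.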