Eli Zaretskii wrote:

About the closest approximation you can get using Unicode data alone
(not CLDR) is to normalize to NFD, then ignore the combining
diacritics.

This is what Emacs currently does, IIUC what you say.  The NFD
normalization uses the decomposition data included with
UnicodeData.txt.  Is this what you mean?

Yes, the sixth field from the left. For 00F1 this is 006E 0303, so you ignore the 0303 and fold 00F1 to 006E.
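The whole fold can be sketched in a few lines of Python, whose unicodedata module is built from the same UnicodeData.txt tables (a rough illustration, not what Emacs actually does internally):

```python
import unicodedata

def fold(s):
    # Normalize to NFD (canonical decomposition, applied recursively),
    # then drop the combining diacritics (general category Mn).
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(fold("\u00F1"))  # ñ folds to n
```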

Remember that the decompositions in UnicodeData.txt may contain other precomposed characters, so you have to apply this process iteratively:

1EA8 -> 00C2 0309
00C2 -> 0041 0302
so you fold 1EA8 to 0041.
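That iteration can be sketched with Python's unicodedata.decomposition(), which exposes the same sixth field (the helper name is mine, just for illustration):

```python
import unicodedata

def base_char(ch):
    # Follow the canonical decomposition chain until it bottoms out,
    # keeping only the first code point each step and ignoring the marks.
    # Compatibility decompositions are tagged like "<font>"; skip those.
    d = unicodedata.decomposition(ch)
    while d and not d.startswith("<"):
        ch = chr(int(d.split()[0], 16))
        d = unicodedata.decomposition(ch)
    return ch

print(base_char("\u1EA8"))  # Ẩ -> Â -> A
```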

But that still doesn't work for a character like ø, which doesn't
decompose to o + anything

Why doesn't it, btw?  Same question about ł.

I've heard an opinion that UnicodeData.txt only included
decompositions when the combining mark's glyphs don't overlap those of
the basic character.  Is that correct?

This sounds like a great question for Ken Whistler. ☺
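Whatever the historical reason, the data itself is unambiguous: both characters have an empty decomposition field, so NFD leaves them alone and the mark-stripping fold never reaches o or l. Easy to confirm (again using Python's unicodedata as a stand-in for reading UnicodeData.txt directly):

```python
import unicodedata

# ø (U+00F8) and ł (U+0142) carry no canonical decomposition at all.
print(repr(unicodedata.decomposition("\u00F8")))  # ''
print(repr(unicodedata.decomposition("\u0142")))  # ''
print(unicodedata.normalize("NFD", "\u00F8") == "\u00F8")  # True
```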

and more importantly, it still won't meet expectations because of
the n/ñ and o/ö/ø language-dependency problems.

Given that the feature can be turned off easily, do you think that it
will nonetheless be useful, even though language-dependent parts are
not available?

It's probably a lot better than no folding. Just be prepared for the inevitable complaints from speakers of language X. Users tend to expect features like this to be perfect, even when you warn them.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
