Eli Zaretskii wrote:

About the closest approximation you can get using Unicode data alone
(not CLDR) is to normalize to NFD, then ignore the combining
diacritics.

This is what Emacs currently does, IIUC what you say.  The NFD
normalization uses the decomposition data included with
UnicodeData.txt.  Is this what you mean?

Yes, the sixth field from the left. For 00F1 this is 006E 0303, so you ignore the 0303 and fold 00F1 to 006E.
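The whole fold can be sketched in a few lines of Python, whose unicodedata module is built from the same UnicodeData.txt tables (a rough illustration, not what Emacs actually does internally):

```python
import unicodedata

def fold(s):
    # Normalize to NFD (canonical decomposition, applied recursively),
    # then drop the combining diacritics (general category Mn).
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")

print(fold("\u00F1"))  # ñ folds to n
```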

Remember that the decompositions in UnicodeData.txt may contain other precomposed characters, so you have to apply this process iteratively:

1EA8 -> 00C2 0309
00C2 -> 0041 0302
so you fold 1EA8 to 0041.
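That iteration can be sketched with Python's unicodedata.decomposition(), which exposes the same sixth field (the helper name is mine, just for illustration):

```python
import unicodedata

def base_char(ch):
    # Follow the canonical decomposition chain until it bottoms out,
    # keeping only the first code point each step and ignoring the marks.
    # Compatibility decompositions are tagged like "<font>"; skip those.
    d = unicodedata.decomposition(ch)
    while d and not d.startswith("<"):
        ch = chr(int(d.split()[0], 16))
        d = unicodedata.decomposition(ch)
    return ch

print(base_char("\u1EA8"))  # Ẩ -> Â -> A
```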

But that still doesn't work for a character like ø, which doesn't
decompose to o + anything

Why doesn't it, btw?  Same question about ł.

I've heard an opinion that UnicodeData.txt only included
decompositions when the combining mark's glyphs don't overlap those of
the basic character.  Is that correct?

This sounds like a great question for Ken Whistler. ☺
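Whatever the historical reason, the data itself is unambiguous: both characters have an empty decomposition field, so NFD leaves them alone and the mark-stripping fold never reaches o or l. Easy to confirm (again using Python's unicodedata as a stand-in for reading UnicodeData.txt directly):

```python
import unicodedata

# ø (U+00F8) and ł (U+0142) carry no canonical decomposition at all.
print(repr(unicodedata.decomposition("\u00F8")))  # ''
print(repr(unicodedata.decomposition("\u0142")))  # ''
print(unicodedata.normalize("NFD", "\u00F8") == "\u00F8")  # True
```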

and more importantly, it still won't meet expectations because of
the n/ñ and o/ö/ø language-dependency problems.

Given that the feature can be turned off easily, do you think that it
will nonetheless be useful, even though language-dependent parts are
not available?

It's probably a lot better than no folding. Just be prepared for the inevitable complaints from speakers of language X. Users tend to expect features like this to be perfect, even when you warn them.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
