On Sun, 16 Aug 2015 18:53:52 +0200 Khaled Hosny <[email protected]> wrote:
> On Sun, Aug 16, 2015 at 09:31:25AM -0700, [email protected] wrote:
> > Now, the ä character has a precomposed form in Unicode, and if you
> > couple that with the NFC normalisation form, you'd get the above
> > expression to return 1.

> > So I'm not sure why the allowance was made for ä as well as other
> > certain characters, but not for other things (under-bar
> > characters) that face similar representation issues.

> It was encoded for compatibility of pre-existing character sets AFAIK.

Note that compatibility here means allowing the habit of treating
precomposed characters as single characters to continue. That habit
made for a simple transition, but it now causes confusion.

Most rules work better in NFD than in NFC. For string lengths in NFC,
you immediately lose the rule len(a + b) = len(a) + len(b). For NFC,
you don't even have len(a + b) <= len(a) + len(b).

However, do note that for the corresponding 'string' algebra, the
mathematical concept of a string no longer works, and this applies to
both NFC and NFD. Instead, you have to allow for pairs of characters
commuting, and so you get the concept of a 'trace'.

If all combinations of base character and non-spacing marks were
encoded, there'd be infinitely many. Polytonic Greek has 36
*precomposed* combinations of base character and 3 combining marks, and
some languages frequently use base characters with 4 combining marks;
unexceptional words with 5 combining marks are less frequent.

Richard.
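For anyone who wants to check the length rules and the Greek count, here is a minimal sketch in Python using the standard unicodedata module. The helper names (nfc_len, nfd_len) and the example strings are my own choices for illustration, "length" is taken to mean the number of code points after normalisation, and the exact count depends on the Unicode data shipped with your Python.

```python
import unicodedata

def nfc_len(s):
    # Length in code points after NFC normalisation.
    return len(unicodedata.normalize("NFC", s))

def nfd_len(s):
    # Length in code points after NFD normalisation.
    return len(unicodedata.normalize("NFD", s))

# Illustrative pairs (a, b); these are assumptions picked for the demo.
pairs = [
    ("a", "\u0308"),       # 'a' + COMBINING DIAERESIS composes to U+00E4
    ("\u01d6", "\u0323"),  # u-diaeresis-macron + COMBINING DOT BELOW:
                           # the dot below reorders next to the base and
                           # composes first, so the precomposed letter is
                           # broken up and the other marks stay separate
]

for a, b in pairs:
    print("NFC:", nfc_len(a + b), "vs", nfc_len(a) + nfc_len(b))
    print("NFD:", nfd_len(a + b), "vs", nfd_len(a) + nfd_len(b))

# Marks of different combining classes commute: both orders normalise
# to the same sequence (the 'trace' idea). Marks of the same class do not.
commute = (unicodedata.normalize("NFD", "a\u0323\u0302")
           == unicodedata.normalize("NFD", "a\u0302\u0323"))
no_commute = (unicodedata.normalize("NFD", "a\u0308\u0304")
              != unicodedata.normalize("NFD", "a\u0304\u0308"))
print("different classes commute:", commute)
print("same class keeps order:", no_commute)

# Count Greek Extended code points whose NFD form is a base character
# followed by three combining marks (four code points in total).
greek_count = sum(
    1 for cp in range(0x1F00, 0x2000)
    if len(unicodedata.normalize("NFD", chr(cp))) == 4
)
print("precomposed Greek with 3 marks:", greek_count)
```

The first pair shows NFC concatenation coming out shorter than the sum of the parts and the second shows it coming out longer, while the NFD lengths simply add up in both cases.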

