On Thu, Nov 17, 2011 at 9:29 AM, Nick Wellnhofer <[email protected]> wrote:
> On 17/11/2011 14:08, Robert Muir wrote:
>>
>> yeah, the problematic ones can be seen here:
>> http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt
>>
>> # Derived Property: FC_NFKC_Closure
>> # Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
>> # Then if (c != b) add the mapping from a to c to the set of
>> # mappings that constitute the FC_NFKC_Closure list
>>
>> So from what I can tell at a glance: with the utf8proc algorithm, if
>> you specify NFKC and casefolding, its not yet 'done'
>
> I just verified that the output utf8proc produces with the options STABLE,
> COMPOSE, COMPAT, and CASEFOLD really matches the FC_NFKC mapping. See the
> test program at https://gist.github.com/1373256
>
But I don't think the problem can be tested with single code points. I'm
pretty sure the issue has to do with contextual normalization and case
folding (neither of these is a 1-to-1 mapping), especially involving
things like Greek diacritics.

A simple test would generate lots of random Unicode strings, normalize
each one with these options, then normalize that result again and check
that the two outputs are the same.

--
lucidimagination.com
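
A rough sketch of the random-string test described above, in Python. It does not call utf8proc: as an assumption, the STABLE | COMPOSE | COMPAT | CASEFOLD options are approximated by case folding followed by NFKC from the standard unicodedata module, so it only illustrates the shape of the check. The function names, the code point range, and the string lengths are made up for illustration.

```python
import random
import unicodedata


def nfkc_casefold_once(s: str) -> str:
    # Assumption: stands in for one utf8proc pass with
    # STABLE | COMPOSE | COMPAT | CASEFOLD. This is Fold, then NFKC.
    return unicodedata.normalize("NFKC", s.casefold())


def is_stable(s: str) -> bool:
    # A string passes if normalizing the normalized result changes nothing.
    once = nfkc_casefold_once(s)
    return nfkc_casefold_once(once) == once


def random_unicode_string(rng: random.Random, max_len: int = 8) -> str:
    # Hypothetical generator: code points U+0020..U+2FFF, which covers
    # Latin, Greek and combining marks and stays below the surrogates.
    length = rng.randint(1, max_len)
    return "".join(chr(rng.randrange(0x20, 0x3000)) for _ in range(length))


def find_unstable(trials: int = 2000, seed: int = 0) -> list:
    # Collect every random input whose one-pass result is not a fixed point.
    rng = random.Random(seed)
    candidates = (random_unicode_string(rng) for _ in range(trials))
    return [s for s in candidates if not is_stable(s)]


if __name__ == "__main__":
    failures = find_unstable()
    print(len(failures), "unstable inputs")
    for s in failures[:5]:
        print(" ".join("U+%04X" % ord(c) for c in s))
```

To run the same check against utf8proc itself, only nfkc_casefold_once would need to be swapped for a binding to the library; the comparison logic stays the same.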
