Re: [lucy-dev] Unicode integration

Robert Muir Thu, 17 Nov 2011 06:52:23 -0800

On Thu, Nov 17, 2011 at 9:29 AM, Nick Wellnhofer <[email protected]> wrote:
> On 17/11/2011 14:08, Robert Muir wrote:
>>
>> yeah, the problematic ones can be seen here:
>> http://www.unicode.org/Public/5.0.0/ucd/DerivedNormalizationProps.txt
>>
>> # Derived Property: FC_NFKC_Closure
>> #  Generated from computing: b = NFKC(Fold(a)); c = NFKC(Fold(b));
>> #  Then if (c != b) add the mapping from a to c to the set of
>> #  mappings that constitute the FC_NFKC_Closure list
>>
>> So from what I can tell at a glance: with the utf8proc algorithm, if
>> you specify NFKC and casefolding, its not yet 'done'
>
> I just verified that the output utf8proc produces with the options STABLE,
> COMPOSE, COMPAT, and CASEFOLD really matches the FC_NFKC mapping. See the
> test program at https://gist.github.com/1373256
>


But I don't think the problem can be tested with single codepoints.
I'm pretty sure the issue has to do with contextual normalization and
casefolding (neither of these is 1-1), especially involving things
like Greek diacritics.

A simple test would just generate lots of random Unicode strings,
normalize each one with these options, then normalize that result
again and check that the two outputs are the same.
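
A rough sketch of that idempotence check, in Python rather than
against utf8proc itself. The transform here (casefold, then NFKC, in a
single pass) is only a stand-in built from the standard library and is
an assumption for illustration; it is not what utf8proc does
internally. The helper names are hypothetical. To test utf8proc, you
would pass a binding for it as `transform` instead.

```python
import random
import unicodedata


def fold_and_normalize(s):
    # Stand-in transform (an assumption, not utf8proc):
    # full case folding followed by a single NFKC pass.
    return unicodedata.normalize("NFKC", s.casefold())


def random_unicode_string(rng, max_len=8):
    # Build a short string of random code points, skipping surrogates.
    chars = []
    for _ in range(rng.randint(1, max_len)):
        cp = rng.randrange(0x110000)
        while 0xD800 <= cp <= 0xDFFF:
            cp = rng.randrange(0x110000)
        chars.append(chr(cp))
    return "".join(chars)


def find_unstable(transform, trials=10000, seed=0):
    # Apply the transform twice; collect every input for which the
    # second application still changes the result.
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        s = random_unicode_string(rng)
        once = transform(s)
        twice = transform(once)
        if once != twice:
            failures.append((s, once, twice))
    return failures


if __name__ == "__main__":
    for s, once, twice in find_unstable(fold_and_normalize):
        print(ascii(s), ascii(once), ascii(twice))
```

Uniformly random code points mostly land on unassigned values, so a
more useful generator would probably bias toward assigned ranges
(Greek, combining marks) or mix in known-problematic sequences.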

-- 
lucidimagination.com
