Re: [lucy-dev] Unicode integration

Robert Muir Thu, 17 Nov 2011 04:58:31 -0800

On Thu, Nov 17, 2011 at 7:45 AM, Nick Wellnhofer <[email protected]> wrote:
> On 17/11/2011 13:37, Robert Muir wrote:
>>
>> The point of the derived property is that there are sneaky
>> interactions between these.
>
> Having a look at the utf8proc code, the function utf8proc_decompose_char
> calls itself recursively when substituting characters. So it looks like it
> does support NFKC_Casefold properly.
>
> Nick
>


I don't think so: it seems to decompose only the 'output' of the case
folding mapping. That is not enough.

If I remember correctly, the problem is that normalization depends on
context, so the algorithm must be done as stated in the standard:

 toNFKC_Casefold(X): Map each character C in X to NFKC_Casefold(C) and then
normalize the resulting string to NFC

That is: do the per-character mappings first, then normalize the whole string.
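
To illustrate why the final whole-string pass matters, here is a minimal
Python sketch. It is only an approximation under stated assumptions: Python's
unicodedata module does not expose the NFKC_Casefold property, so the
hypothetical helper map_char stands in for NFKC_Casefold(C) by case folding a
single character and applying NFKC to the result. The names map_char,
map_only and to_nfkc_casefold_sketch are made up for this example and are not
from utf8proc or ICU.

```python
import unicodedata


def map_char(c: str) -> str:
    # Hypothetical stand-in for the NFKC_Casefold(C) property lookup.
    # Assumption: case folding followed by NFKC on the single character
    # is close enough for illustration; it is not the real data table.
    return unicodedata.normalize("NFKC", c.casefold())


def map_only(x: str) -> str:
    # Per-character mapping with no final pass over the whole string.
    return "".join(map_char(c) for c in x)


def to_nfkc_casefold_sketch(x: str) -> str:
    # Following the definition in the standard: map each character,
    # then normalize the resulting string to NFC.
    return unicodedata.normalize("NFC", map_only(x))


s = "E\u0301"  # LATIN CAPITAL LETTER E + COMBINING ACUTE ACCENT
print(ascii(map_only(s)))
print(ascii(to_nfkc_casefold_sketch(s)))
```

With the input above, the per-character result stays as "e" followed by
U+0301, while the whole-string NFC pass composes the pair into U+00E9. No
mapping that looks at one character at a time can produce that composition.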

In ICU this is instead done as an additional normalization form, so
it's single-pass/non-recursive there.

-- 
lucidimagination.com
