Re: [lucy-dev] Unicode integration

Nick Wellnhofer Thu, 17 Nov 2011 05:06:31 -0800

On 17/11/2011 06:09, Marvin Humphrey wrote:

I wonder: does either "common" or "simple" Unicode case folding preserve a
one-to-one relationship between num-code-points-in and num-code-points-out?


Yes, simple case folding does.

Because I believe that a case folding algorithm with that property would not
mess up the Highlighting data.

But then it looks like utf8proc only offers one CASEFOLD option.  I wonder
which one it is, or if it's configurable.


It only offers full case folding afaics.
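
To illustrate why full case folding is a problem for offsets, here is a small sketch in Python rather than utf8proc. It relies on `str.casefold()`, which applies full case folding; the helper name `folded_lengths` is made up for this example.

```python
# Full case folding can change the number of code points, so offsets
# computed on the folded text no longer line up with the original text.

def folded_lengths(text):
    """Return (code points in, code points out) under full case folding."""
    return len(text), len(text.casefold())

# "\u00df" is LATIN SMALL LETTER SHARP S, "\ufb03" is the ffi ligature.
for sample in ["Stra\u00dfe", "\ufb03", "HELLO"]:
    print(repr(sample), folded_lengths(sample))
```

The sharp s folds to "ss" and the ligature folds to "ffi", so both inputs grow, while plain ASCII keeps its length.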

Simple case folding would work before tokenization but I still don't like
the idea of allowing certain analyzers before tokenization if they don't add
or remove codepoints. There might even be some long term gains if we move
tokenization completely out of the analysis chain. The analyzers could work
directly on tokens instead of inversions and we could employ a token cache,
for example.
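
A rough sketch of that idea, in Python and not Lucy's actual API: tokenize the original text first, keep the offsets, and run a per-token analyzer behind a cache. The names `normalize` and `tokenize`, the `\w+` tokenizer, and the cache size are all assumptions made for illustration.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=4096)
def normalize(token_text):
    # Hypothetical per-token analyzer: full case folding.
    # Repeated tokens are served from the cache.
    return token_text.casefold()

def tokenize(text):
    """Yield (start, end, normalized) with offsets into the original text."""
    for match in re.finditer(r"\w+", text):
        yield match.start(), match.end(), normalize(match.group())

tokens = list(tokenize("Stra\u00dfe STRASSE Stra\u00dfe"))
for start, end, term in tokens:
    print(start, end, term)
```

Because the offsets are taken before normalization, the highlighting data stays valid even when folding changes the length of a token.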


Nick
