On 17/11/2011 06:09, Marvin Humphrey wrote:
> I wonder: does either "common" or "simple" Unicode case folding preserve a
> one-to-one relationship between num-code-points-in and num-code-points-out?

Yes, simple case folding does.

> Because I believe that a case folding algorithm with that property would not
> mess up the Highlighting data.
>
> But then it looks like utf8proc only offers one CASEFOLD option. I wonder
> which one it is, or if it's configurable.

It only offers full case folding afaics.
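
To make the difference concrete, here is a rough Python sketch (not Lucy or
utf8proc code). Python's str.casefold() performs full case folding, so it can
change the number of code points. The length_preserving_fold() helper is
hypothetical and only approximates simple case folding: it keeps a character
unchanged whenever its full folding would expand to more than one code point.
It does not use the real simple mappings from the Unicode case folding table.

```python
def full_fold(text: str) -> str:
    """Full case folding, as implemented by Python's str.casefold()."""
    return text.casefold()


def length_preserving_fold(text: str) -> str:
    """Hypothetical approximation of simple case folding.

    Folds each code point on its own and keeps the original character
    whenever the full folding would expand to several code points, so
    the output always has as many code points as the input.
    """
    out = []
    for ch in text:
        folded = ch.casefold()
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


word = "Straße"
full = full_fold(word)                      # "ß" expands to "ss"
simple_like = length_preserving_fold(word)  # "ß" is left alone

print(len(word), len(full), len(simple_like))
```

With full folding the offsets after the "ß" shift by one, which is the kind of
drift that would upset highlighting data computed against the original text.
With a one-to-one folding the offsets stay aligned.
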
Simple case folding would work before tokenization, but I still don't
like the idea of allowing certain analyzers before tokenization only on
the condition that they don't add or remove code points. There might even
be some long-term gains if we move tokenization completely out of the
analysis chain. The analyzers could then work directly on tokens instead
of inversions, and we could employ a token cache, for example.
Nick