that was just a suggestion as a quick hack...

it still won't really fix the problem because some character + accent
combinations don't have composed forms.

even if you added the entire Combining Diacritical Marks block to the jflex
grammar, it's still wrong... what really needs to be supported is the
\p{Word_Break = Extend} property, etc etc.
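To make the two points above concrete, here is a minimal sketch using the JDK's java.text.Normalizer (the class names and code points are mine, not from this thread): NFC composes "o" + combining acute into the single code point U+00F3, but a combination like "q" + combining tilde has no precomposed form, so it stays decomposed even after NFC:

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "Como" with a combining acute accent (U+0301) after the 'o':
        // decomposed form is 5 code points.
        String decomposed = "Co\u0301mo";
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // NFC composes 'o' + U+0301 into U+00F3, so the result is 4 code points.
        System.out.println(nfc.length());        // prints 4

        // 'q' + combining tilde (U+0303) has no precomposed character in
        // Unicode, so NFC leaves it as 2 code points.
        String qTilde = Normalizer.normalize("q\u0303", Normalizer.Form.NFC);
        System.out.println(qTilde.length());     // prints 2
    }
}
```

This is why NFC alone can't fully fix the tokenizer: any sequence with no composed form still reaches the grammar as a base character followed by a combining mark.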


On Sat, Feb 21, 2009 at 11:23 AM, Philip Puffinburger <
[email protected]> wrote:

> That's something we can try.   I don't know how much performance we'd
> lose doing that, as our custom filter has to decompose the tokens to do its
> operations.   So instead of 0..1 conversions we'd be doing 1..2 conversions
> during indexing and searching.
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Saturday, February 21, 2009 8:35 AM
> To: [email protected]
> Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
>
> normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
> will work...
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]
