that was just a suggestion as a quick hack...

it still won't really fix the problem because some character + accent
combinations don't have composed forms.

even if you added the entire Combining Diacritical Marks block to the jflex
grammar, it's still wrong... what really needs to be supported is the
\p{Word_Break = Extend} property, etc etc.
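To make the two points above concrete, here is a minimal sketch using the JDK's java.text.Normalizer (the class names and code points are mine, not from this thread): NFC composes "o" + combining acute into the single code point U+00F3, but a combination like "q" + combining tilde has no precomposed form, so it stays decomposed even after NFC:

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "Como" with a combining acute accent (U+0301) after the 'o':
        // decomposed form is 5 code points.
        String decomposed = "Co\u0301mo";
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // NFC composes 'o' + U+0301 into U+00F3, so the result is 4 code points.
        System.out.println(nfc.length());        // prints 4

        // 'q' + combining tilde (U+0303) has no precomposed character in
        // Unicode, so NFC leaves it as 2 code points.
        String qTilde = Normalizer.normalize("q\u0303", Normalizer.Form.NFC);
        System.out.println(qTilde.length());     // prints 2
    }
}
```

This is why NFC alone can't fully fix the tokenizer: any sequence with no composed form still reaches the grammar as a base character followed by a combining mark.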


On Sat, Feb 21, 2009 at 11:23 AM, Philip Puffinburger <
[email protected]> wrote:

> That's something we can try.   I don't know how much performance we'd
> lose doing that, as our custom filter has to decompose the tokens to do its
> operations.   So instead of 0..1 conversions we'd be doing 1..2 conversions
> during indexing and searching.
>
> -----Original Message-----
> From: Robert Muir [mailto:[email protected]]
> Sent: Saturday, February 21, 2009 8:35 AM
> To: [email protected]
> Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue
>
> normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and
> will work...
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>


-- 
Robert Muir
[email protected]
