2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-16 Thread Philip Puffinburger
We have our own Analyzer which has the following: public final TokenStream tokenStream(String fieldname, Reader reader) { TokenStream result = new StandardTokenizer(reader); result = new StandardFilter(result); result = new MyAccentFilter(result); result = new LowerCaseFilter(result)

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-19 Thread Philip Puffinburger
ffinburger [mailto:ppuffinbur...@tlcdelivers.com] Sent: Monday, February 16, 2009 7:19 PM To: java-user@lucene.apache.org Subject: 2.3.2 -> 2.4.0 StandardTokenizer issue We have our own Analyzer which has the following Public final TokenStream tokenStream(String fieldname, Reader reader) { Token

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Chris Hostetter
: In 2.3.2 if the token "Cómo" (here with the accent as a separate combining character) came through this it would get changed to : "como" by the time it made it through the filters. In 2.4.0 this isn't : the case. It treats this one token as two, so we get "co" and "mo". So : instead of searching "como" or "Cómo" to get all the hits we now have t
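The split described above can be reproduced with the JDK alone. This is an illustration of the underlying Unicode issue, not the actual jflex rules: in decomposed (NFD) text the accent is a separate combining code point, so a tokenizer grammar that only accepts letter characters sees "Co", a non-letter mark, then "mo", and emits two tokens.

```java
import java.text.Normalizer;

// Show that decomposed "Cómo" contains a combining mark (U+0301) between
// two runs of letters, which a letters-only tokenizer rule will split on.
public class DecomposedToken {
    public static void main(String[] args) {
        String nfd = Normalizer.normalize("Cómo", Normalizer.Form.NFD); // C o U+0301 m o
        nfd.codePoints().forEach(cp -> System.out.printf(
            "U+%04X %s%n", cp,
            Character.getType(cp) == Character.NON_SPACING_MARK
                ? "combining mark" : "letter"));
    }
}
```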

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Philip Puffinburger
> some changes were made to the StandardTokenizer.jflex grammar (you can svn diff the two URLs fairly trivially) to better deal with correctly identifying word characters, but from what i can tell that should have reduced the number of splits, not increased them. > it's hard to tell from you

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and will work... On Fri, Feb 20, 2009 at 11:16 PM, Philip Puffinburger < ppuffinbur...@tlcdelivers.com> wrote: > some changes were made to the StandardTokenizer.jflex grammar (you can svn > diff the two URLs fairly trivially
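Robert Muir's suggestion can be sketched with the JDK's java.text.Normalizer (the class name and call are standard; feeding the normalized string into the poster's analyzer is assumed): NFC composes "o" + combining acute (U+0301) into the single code point U+00F3, so the token reaches StandardTokenizer as four letters.

```java
import java.text.Normalizer;

// Normalize decomposed input to NFC before analysis, per the thread's advice.
public class NfcDemo {
    public static void main(String[] args) {
        // Decomposed (NFD): 'C' 'o' U+0301 'm' 'o' -- five code points.
        String nfd = "Co\u0301mo";
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);
        // NFC composes o + U+0301 into ó, giving \u0043 \u00F3 \u006D \u006F.
        System.out.println(nfc.length());            // 4
        System.out.println(nfc.equals("C\u00F3mo")); // true
    }
}
```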

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
ge- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Saturday, February 21, 2009 8:35 AM To: java-user@lucene.apache.org Subject: Re: 2.3.2 -> 2.4.0 StandardTokenizer issue normalize your text to NFC. then it will be \u0043 \u00F3 \u00

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
e tokens to do its > operations. So instead of 0..1 conversions we'd be doing 1..2 conversions > during indexing and searching. > > -Original Message- > From: Robert Muir [mailto:rcm...@gmail.com] > Sent: Saturday, February 21, 2009 8:35 AM > To: java-user@lu

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
2.3.2 -> 2.4.0 StandardTokenizer issue that was just a suggestion as a quick hack... it still won't really fix the problem because some character + accent combinations don't have composed forms. even if you added the entire Combining Diacritical Marks block to the jflex grammar, it's still wrong
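Muir's caveat is easy to verify: some base + combining-mark pairs have no precomposed code point, so NFC cannot collapse them and the tokenizer would still see a combining mark mid-token. A minimal check (the specific choice of "x" + acute is my example, not from the thread):

```java
import java.text.Normalizer;

// 'o' + U+0301 has a precomposed form (ó), but 'x' + U+0301 has none,
// so NFC leaves the latter as two code points.
public class NoComposedForm {
    public static void main(String[] args) {
        String oAcute = Normalizer.normalize("o\u0301", Normalizer.Form.NFC);
        String xAcute = Normalizer.normalize("x\u0301", Normalizer.Form.NFC);
        System.out.println(oAcute.length()); // 1 (U+00F3)
        System.out.println(xAcute.length()); // 2 (still decomposed)
    }
}
```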