RE: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
Muir [mailto:rcm...@gmail.com] Sent: Saturday, February 21, 2009 8:35 AM To: java-user@lucene.apache.org Subject: Re: 2.3.2 - 2.4.0 StandardTokenizer issue normalize your text to NFC. then it will be \u0043 \u00F3 \u006D \u006F and will work
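
For anyone following along, the NFC suggestion above can be sketched with the JDK's own java.text.Normalizer (Java 6 and later). This is only an illustration and not code from the thread: the class name NfcExample and the helper toNfc are made up here, and it assumes the input arrives as C, o, U+0301, m, o.

```java
import java.text.Normalizer;

public class NfcExample {
    // Hypothetical helper: compose base letter + combining mark pairs
    // into a single precomposed character wherever Unicode defines one.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Decomposed input: C, o, COMBINING ACUTE ACCENT, m, o
        String decomposed = "Co\u0301mo";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length());          // 5
        System.out.println(composed.length());            // 4
        System.out.println(composed.equals("C\u00F3mo")); // true
    }
}
```

The same normalization would have to run on both the indexed text and the query text before tokenizing, otherwise the two sides will not match.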

Re: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-21 Thread Robert Muir
instead of 0..1 conversions we'd be doing 1..2 conversions during indexing and searching. -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Saturday, February 21, 2009 8:35 AM To: java-user@lucene.apache.org Subject: Re: 2.3.2 - 2.4.0 StandardTokenizer issue normalize

RE: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-21 Thread Philip Puffinburger
- 2.4.0 StandardTokenizer issue that was just a suggestion as a quick hack... it still won't really fix the problem because some character + accent combinations don't have composed forms. even if you added entire combining diacritical marks block to the jflex grammar, it's still wrong... what needs
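
To make the "no composed form" point concrete, here is a rough sketch (again not from the thread) that checks whether a combining mark survives NFC. The class name NoComposedForm and the method hasCombiningMarkAfterNfc are invented for illustration; the q + U+0301 input is just one example of a pair with no precomposed character.

```java
import java.text.Normalizer;

public class NoComposedForm {
    // Hypothetical check: true if a non-spacing combining mark is still
    // present after NFC, i.e. some base + accent pair could not be composed.
    static boolean hasCombiningMarkAfterNfc(String text) {
        String nfc = Normalizer.normalize(text, Normalizer.Form.NFC);
        for (int i = 0; i < nfc.length(); ) {
            int cp = nfc.codePointAt(i);
            if (Character.getType(cp) == Character.NON_SPACING_MARK) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // o + acute composes to a single character, so no mark is left over.
        System.out.println(hasCombiningMarkAfterNfc("Co\u0301mo")); // false
        // q + acute has no precomposed form, so the mark remains.
        System.out.println(hasCombiningMarkAfterNfc("q\u0301"));    // true
    }
}
```

Any input for which this returns true would still present the tokenizer with a separate combining character, which is why NFC alone is only a partial workaround.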

Re: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-20 Thread Chris Hostetter
: In 2.3.2 if the token "Cómo" came through this it would get changed to : "como" by the time it made it through the filters. In 2.4.0 this isn't : the case. It treats this one token as two so we get "co" and "mo". So : instead of search "como" or "Cómo" to get all the hits we now have

RE: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-20 Thread Philip Puffinburger
some changes were made to the StandardTokenizer.jflex grammar (you can svn diff the two URLs fairly trivially) to better deal with correctly identifying word characters, but from what i can tell that should have reduced the number of splits, not increased them. it's hard to tell from your

RE: 2.3.2 - 2.4.0 StandardTokenizer issue

2009-02-19 Thread Philip Puffinburger
Actually, WhitespaceTokenizer won't work. Too many person names and it won't do anything with punctuation. Something had to have changed in StandardTokenizer, and we need some of the 2.4 fixes/features, so we are kind of stuck. -Original Message- From: Philip Puffinburger