From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Saturday, February 21, 2009 8:35 AM
To: java-user@lucene.apache.org
Subject: Re: 2.3.2 - 2.4.0 StandardTokenizer issue
Normalize your text to NFC; then it will be \u0043 \u00F3 \u006D \u006F and
will work.
Instead of 0..1 conversions we'd be doing 1..2 conversions
during indexing and searching.
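The NFC suggestion above can be tried directly with the JDK's java.text.Normalizer (available since Java 6); the class name below is just for the example. Normalizing the decomposed input composes the "o" + combining acute into the single code point \u00F3:

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "Cómo" in decomposed form: C, o, combining acute (U+0301), m, o
        String decomposed = "Co\u0301mo";                          // 5 code points
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // After NFC the pair is composed: \u0043 \u00F3 \u006D \u006F
        System.out.println(nfc.length());                          // prints 4
        System.out.println(Integer.toHexString(nfc.charAt(1)));    // prints f3
    }
}
```

Running the normalization in an analyzer (or before indexing) would make the composed and decomposed spellings index identically.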
That was just a suggestion as a quick hack...
It still won't really fix the problem, because some character + accent
combinations don't have composed forms.
Even if you added the entire Combining Diacritical Marks block to the jflex
grammar, it's still wrong... what needs
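The limitation described above is easy to demonstrate with java.text.Normalizer: some base + combining-mark pairs have a precomposed code point and compose under NFC, while others don't and stay decomposed. A minimal sketch (the pairs chosen here are just illustrative examples):

```java
import java.text.Normalizer;

public class NoComposedForm {
    public static void main(String[] args) {
        // "n" + combining acute has a precomposed form (U+0144, "ń"),
        // so NFC collapses it to a single code point.
        String nAcute = Normalizer.normalize("n\u0301", Normalizer.Form.NFC);
        System.out.println(nAcute.length()); // prints 1

        // "q" + combining acute has NO precomposed form in Unicode,
        // so NFC leaves it as two code points.
        String qAcute = Normalizer.normalize("q\u0301", Normalizer.Form.NFC);
        System.out.println(qAcute.length()); // prints 2
    }
}
```

So NFC alone can't guarantee every accented token becomes a single composed character; the combining mark can survive normalization.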
: In 2.3.2 if the token "Co\u0301mo" (a decomposed "Cómo", with a combining
: acute accent) came through this it would get changed to "como" by the time
: it made it through the filters. In 2.4.0 this isn't the case. It treats this
: one token as two, so we get "co" and "mo". So instead of searching "como"
: or "Cómo" to get all the hits we now have
Some changes were made to the StandardTokenizer.jflex grammar (you can svn
diff the two URLs fairly trivially) to better deal with correctly identifying
word characters, but from what I can tell that should have reduced the number
of splits, not increased them.
It's hard to tell from your
Actually, WhitespaceTokenizer won't work: we have too many person names, and
it won't do anything with punctuation. Something had to have changed in
StandardTokenizer, and we need some of the 2.4 fixes/features, so we are
kind of stuck.
-Original Message-
From: Philip Puffinburger