Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

Chris Hostetter Fri, 20 Feb 2009 17:11:31 -0800

: In 2.3.2 if the token "Co�mo" came through this it would get changed to
: "como" by the time it made it through the filters.    In 2.4.0 this isn't
: the case.   It treats this one token as two so we get "co" and "mo".    So
: instead of search "como" or "Co�mo" to get all the hits we now have to do
: them separately.


some changes were made to the StandardTokenizerImpl.jflex grammar (you can 
svn diff the two URLs fairly trivially) to better deal with correctly 
identifying word characters, but from what i can tell that should have 
reduced the number of splits, not increased them.

it's hard to tell from your email (because it was sent in the windows-1252 
charset) but what exactly are the unicode characters you are putting 
through the tokenizer (ie: "\u0030") ?  knowing where it's splitting would 
help figure out what's happening.
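
one way to get at that is to dump the term as escapes before it ever reaches 
the analyzer. a minimal sketch, assuming plain java with no lucene on the 
classpath (UnicodeDump and toEscapes are made-up names for illustration, not 
part of any lucene API):

```java
public class UnicodeDump {

    // Renders every UTF-16 code unit of the input as a backslash-u escape
    // with four lowercase hex digits, so invisible or mis-encoded
    // characters become unambiguous in a plain-text email.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the term is passed as the first argument. The console
        // encoding can itself mangle the input, so reading the term from
        // the same source the index is built from is safer.
        String term = args.length > 0 ? args[0] : "0";
        System.out.println(toEscapes(term));
    }
}
```

pasting that output into a reply sidesteps whatever charset the mail client 
picks.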

worst case scenario, you could probably use the StandardTokenizer from 
2.3.2 with the rest of the 2.4 code.

this will show you exactly what changed...
svn diff \
  http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex \
  http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex



-Hoss

