RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

Philip Puffinburger Fri, 20 Feb 2009 20:17:33 -0800

>some changes were made to the StandardTokenizer.jflex grammar (you can svn 
>diff the two URLs fairly trivially) to better deal with correctly identifying 
>word characters, but from what i can tell that should have reduced the number 
>of splits, not increased them.
>
>it's hard to tell from your email (because it was sent in the windows-1252
>charset) but what exactly are the unicode characters you are putting through 
>the tokenizer (ie: "\u0030") ?  knowing where it's splitting would help 
>figure out what's happening.


These are the characters that are going through:

\u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o

It's splitting at the \u0301.
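
For illustration, here is a minimal sketch of one possible workaround, not 
something confirmed in this thread: composing the text to NFC before it reaches 
the tokenizer, so that "o" + \u0301 becomes the single character \u00F3. It 
assumes Java 6 or later for java.text.Normalizer; the class and method names 
are hypothetical, and it does not touch any Lucene API.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Lucene.
public class CombiningMarkDemo {

    // Compose base letters and combining marks into precomposed
    // characters where Unicode defines one (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True if the character is in one of the Unicode mark categories
    // (Mn, Mc, Me), such as the combining acute accent \u0301.
    static boolean isCombiningMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    public static void main(String[] args) {
        String input = "Co\u0301mo"; // C o <Combining Acute Accent> m o

        // Print each code unit and whether it is a combining mark.
        for (char c : input.toCharArray()) {
            System.out.printf("\\u%04X mark=%b%n", (int) c, isCombiningMark(c));
        }

        // After composition the accent no longer appears as a separate
        // character, so there is nothing left to split on.
        String composed = compose(input);
        System.out.println(composed.length());
    }
}
```

Whether this is acceptable depends on the index: the query side would need the 
same normalization, and marks with no precomposed form would remain separate.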

>worst case scenario, you could probably use the StandardTokenizer from
>2.3.2 with the rest of the 2.4 code.

We've thought of that, but it would be the last thing we'd try to get it 
working again.

>this will show you exactly what changed...
>svn diff 
>http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

Thanks for the links. I've never dealt with JFlex, so I'll have to do some 
reading to understand what is going on in those files.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
