>some changes were made to the StandardTokenizer.jflex grammer (you can svn >diff the two URLs fairly trivially) to better deal with correctly >identifying >word characters, but from what i can tell that should have reduced the number >of splits, not increased them. > >it's hard to tell from your email (because it was sent in the windows-1252 >charset) but what exactly are the unicode characters you are putting through >the tokenizer (ie: "\u0030") ? knowing where it's splitting would >help >figure out what's happening.
These are the characters that of going through: \u0043 \u006F \u0301 \u006D \u006F - C o <Combining Acute Accent> m o It's splitting at the \u0301. >worst case scenerio, you could probably use the StandardTokenizer from >2.3.2 with the rest of the 2.4 code. We've thought of that, but would be the last thing we did to get it back to working. >this will show you exactly what changed... >svn diff >>http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_3/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex > >>http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex Thanks for the links. I've never dealt with JFlex, so I'll have to do some reading to know what is going on in those files. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org