On 7/14/2015 11:42 AM, Shawn Heisey wrote:
> So the problem might be with the rulefile, or with some strange
> combination of these analysis components. I did not build this
> rulefile myself. It was built by another, either Robert Muir or Steve
> Rowe if I remember right, when SOLR-4123 was underway. The normal
> settings for ICUTokenizer eliminate most of the things that WDF uses
> for making tokens, which is why I'm using this custom rulefile.
I found the place where I got that rulefile (named Latin-break-only-on-whitespace.rbbi). It's in the Lucene ICU source, in this directory:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation

The rbbi file I'm using was slightly different from the one in the branch_5x source, so I copied the source file over. It didn't change the behavior.

I'm using the ICU tokenizer with a custom rule file because I want tokenization on boundaries between different character sets (Chinese, Japanese, Cyrillic, etc.), but I want to handle internal punctuation with WordDelimiterFilter.

Thanks,
Shawn
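For anyone wanting to reproduce this kind of setup: a fieldType along these lines would wire the custom rbbi file into ICUTokenizerFactory and apply WordDelimiterFilter afterward. This is a sketch, not my actual schema; the field name and the WDF parameters shown here are illustrative choices. The rulefiles attribute takes "scriptcode:filename" pairs, and the rbbi file goes in the config directory alongside the schema.

```xml
<!-- Sketch only: fieldType name and filter parameters are illustrative.
     The custom rule file overrides ICU's default Latin-script breaking so
     that punctuation survives for WordDelimiterFilter to split on. -->
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Apply the custom rule file to Latin script only; other scripts
         (CJK, Cyrillic, etc.) keep ICU's default segmentation. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- WDF then handles internal punctuation and case transitions. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>
</fieldType>
```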