On 7/14/2015 11:42 AM, Shawn Heisey wrote:
> So the problem might be with the rulefile, or with some strange
> combination of these analysis components. I did not build this
> rulefile myself. It was built by someone else, either Robert Muir or
> Steve Rowe if I remember right, when SOLR-4123 was underway. The
> normal settings for ICUTokenizer eliminate most of the things that
> WDF uses for making tokens, which is why I'm using this custom
> rulefile.

I found the place where I got that rulefile (named
Latin-break-only-on-whitespace.rbbi).  It's in the Lucene ICU source, in
this directory:

lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation

The rbbi file that I'm using was slightly different from the one in the
branch_5x source, so I copied the source file over.  That didn't change
the behavior.
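
For anyone following along, the rulefile gets wired in through the
rulefiles parameter that SOLR-4123 added to ICUTokenizerFactory, which
takes comma-separated script:rulefile pairs (Latn is the ISO 15924 code
for the Latin script).  Roughly like this (the fieldType name and
surrounding config are just illustrative, not necessarily what's in my
actual schema):

  <fieldType name="text_icu" class="solr.TextField"
      positionIncrementGap="100">
    <analyzer>
      <!-- apply the custom break rules to Latin-script text only;
           all other scripts use the default ICU behavior -->
      <tokenizer class="solr.ICUTokenizerFactory"
          rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    </analyzer>
  </fieldType>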

I'm using the ICU tokenizer with a custom rule file because I want
tokenization on boundaries between different writing systems (Chinese,
Japanese, Cyrillic, etc.), but I want to handle internal punctuation
with WordDelimiterFilter.
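
Concretely, the chain is just that tokenizer followed by WDF.  A
sketch (these WDF parameters are typical examples, not necessarily my
exact settings):

  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"
        rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <!-- WDF handles the internal punctuation, case changes, and
         letter/number transitions that the rulefile leaves intact -->
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        splitOnCaseChange="1" preserveOriginal="1"/>
  </analyzer>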

Thanks,
Shawn
