Re: Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

Shawn Heisey Tue, 14 Jul 2015 10:42:44 -0700

On 7/14/2015 10:46 AM, Alessandro Benedetti wrote:
> Furthermore I was checking with Solr 5.1 to find the WDFilter factory
> actually to work in a proper way.
> Is it possible to know what was the conclusion for this issue ?
> Is there an issue in the WordDelimiter token filter in the current Solr
> version? Has it been fixed ?
> Any update ?


It appears that the problem is not with WDF alone ... something about
the combination of filters that I have chosen is causing this, but only
with certain kinds of input.

If I set up a minimal fieldType with the keyword tokenizer, then I
cannot get the problem to reproduce:

    <fieldType name="testType" class="solr.TextField"
sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>

I tried with inputs of "aaa-bbb ccc" and "aaa-bbb: ccc" and everything
worked as expected.

I then tried some other analysis combinations trying to find the minimal
problem fieldType, and I finally hit on the one that causes a problem. 
It's a combination of the ICUTokenizer with a custom rulefile, a pattern
replace filter that eats leading and trailing punctuation, and the WDF. 
That must be combined with input text that includes trailing
punctuation: "aaa-bbb: ccc"

    <fieldType name="testType" class="solr.TextField"
sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer
class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>

If the rulefile is not specified, then the problem doesn't occur,
because the trailing punctuation is missing by the time it makes it to
the PRF.  If the PRF isn't there, then the problem doesn't occur.
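To make the PRF step concrete, here is a small sketch of what that pattern and replacement do to the text of a single token, using plain java.util.regex outside of Solr. It assumes the custom rulefile really does break only on whitespace, so that "aaa-bbb: ccc" reaches the PRF as the two tokens "aaa-bbb:" and "ccc". The class and method names are made up for illustration. It only shows the change to the term text; it says nothing about offsets or about what WDF does afterwards.

```java
import java.util.regex.Pattern;

public class PunctStripSketch {
    // Same pattern string as in the PatternReplaceFilterFactory config above.
    private static final Pattern EDGE_PUNCT =
        Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Hypothetical helper: applies the replacement "$2" to one token's text,
    // keeping only what sits between leading and trailing punctuation.
    public static String strip(String token) {
        return EDGE_PUNCT.matcher(token).replaceAll("$2");
    }

    public static void main(String[] args) {
        // Assumed tokens for the input "aaa-bbb: ccc" when splitting on
        // whitespace only, plus the already-clean form for comparison.
        for (String token : new String[] {"aaa-bbb:", "ccc", "aaa-bbb"}) {
            System.out.println(token + " -> " + strip(token));
        }
    }
}
```

With these inputs, "aaa-bbb:" comes out as "aaa-bbb" while the inner hyphen is left alone, so the term handed to WDF is one character shorter than the text the tokenizer originally matched.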

So the problem might be with the rulefile, or with some strange
combination of these analysis components.  I did not build this rulefile
myself.  Someone else built it, either Robert Muir or Steve Rowe if I
remember right, when SOLR-4123 was underway.  The normal settings for
ICUTokenizer eliminate most of the things that WDF uses for making
tokens, which is why I'm using this custom rulefile.

https://issues.apache.org/jira/browse/SOLR-4123

Any advice would be appreciated.  I can make the .rbbi file available.

Thanks,
Shawn
