Difference in WordDelimiterFilter behavior between 4.7.2 and 4.9.1

Shawn Heisey Wed, 08 Jul 2015 07:46:19 -0700

I'm not sure if this is a bug, but it does break searches that work fine
in 4.7.2 if we put the same config and index on 4.9.1.


Here's a slightly redacted bit of text that's been sent to the index,
and is also used as a phrase query:

RRR-COLECCION: COLECCIÓN: Gracita Morales foobar

Here are the final positions and terms that 4.7.2 yields for this on
query analysis:

1 rrr-coleccion
1 rrr
2 coleccion
2 rrrcoleccion
3 coleccion
4 gracita
5 morales
6 foobar

This is what 4.9.1 does with it:

1 rrr-coleccion
2 rrr
2 coleccion
2 rrrcoleccion
3 coleccion
4 gracita
5 morales
6 foobar

In both versions, this is what the index analysis generates:

1 rrr
2 coleccion
3 coleccion
4 gracita
5 morales
6 bleh

Remember that it's a phrase query.  As you can see, only the query
analysis from 4.7.2 matches.  I'm not an expert, but the 4.9.1 WDF
position output seems wrong.
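
To make the mismatch concrete, here is a minimal Python sketch of positional
phrase matching against the token lists above. It is a simplified model and
not Lucene's actual implementation. It assumes that query terms sharing a
position are treated as alternatives and that every query position must line
up with an index position holding at least one of them. The function and
variable names are made up for illustration. Only positions 1 through 5 are
used, because the redacted final term differs between the listings.

```python
from collections import defaultdict


def phrase_matches(query_tokens, index_tokens):
    """Simplified phrase match over (position, term) pairs.

    Terms at the same query position are alternatives. The phrase matches
    if some shift aligns every query position with an index position that
    shares at least one term.
    """
    query = defaultdict(set)
    for pos, term in query_tokens:
        query[pos].add(term)
    index = defaultdict(set)
    for pos, term in index_tokens:
        index[pos].add(term)
    if not query:
        return False

    first = min(query)
    for start in index:
        shift = start - first
        # index.get avoids adding keys to the dict while iterating over it
        if all(query[p] & index.get(p + shift, set()) for p in query):
            return True
    return False


# Token lists copied from the analysis output above (positions 1 to 5).
INDEX = [(1, "rrr"), (2, "coleccion"), (3, "coleccion"),
         (4, "gracita"), (5, "morales")]
QUERY_472 = [(1, "rrr-coleccion"), (1, "rrr"), (2, "coleccion"),
             (2, "rrrcoleccion"), (3, "coleccion"), (4, "gracita"),
             (5, "morales")]
QUERY_491 = [(1, "rrr-coleccion"), (2, "rrr"), (2, "coleccion"),
             (2, "rrrcoleccion"), (3, "coleccion"), (4, "gracita"),
             (5, "morales")]

print(phrase_matches(QUERY_472, INDEX))  # True
print(phrase_matches(QUERY_491, INDEX))  # False
```

Under this model the 4.9.1 output fails because position 1 holds only
rrr-coleccion, a term the index never contains, so no alignment is possible.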

The difference in these positions happens on the WordDelimiterFilter
step.  I'm going to try my fieldType on the 5.2.1 example to see what
it does and whether the problem has already been fixed.
Unfortunately, due to a third-party component that has not been tested
with anything newer, I cannot upgrade beyond 4.9.1 at this time.

This is the fieldType present in both versions.  The 4.7 config has a
luceneMatchVersion of LUCENE_47; the 4.9.1 config has LUCENE_4_9.

    <fieldType name="genText" class="solr.TextField"
      sortMissingLast="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.ICUTokenizerFactory"
          rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
        <filter class="solr.LengthFilterFactory" min="1" max="512"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ICUTokenizerFactory"
          rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="0"
          catenateNumbers="0"
          catenateAll="0"
          preserveOriginal="0"
        />
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" outputUnigrams="false"/>
        <filter class="solr.LengthFilterFactory" min="1" max="512"/>
      </analyzer>
    </fieldType>

Thanks,
Shawn
