[jira] [Created] (LUCENE-6689) Odd analysis problem with WDF, appears to be triggered by preceding analysis components

Shawn Heisey (JIRA) Mon, 20 Jul 2015 08:39:25 -0700

Shawn Heisey created LUCENE-6689:
------------------------------------

             Summary: Odd analysis problem with WDF, appears to be triggered by preceding analysis components
                 Key: LUCENE-6689
                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 4.8
            Reporter: Shawn Heisey



This problem shows up for me in Solr, but I believe the issue is down at the 
Lucene level, so I've opened the issue in the LUCENE project.  We can move it 
if necessary.

I've boiled the problem down to this minimum Solr fieldType:

{noformat}
    <fieldType name="testType" class="solr.TextField"
               sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer
          class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
          rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>
{noformat}

On Solr 4.7, if this type is given the input "aaa-bbb: ccc", then aaa ends up
at term position 1 and bbb at term position 2, which seems perfectly reasonable
to me.  On Solr 4.9, both terms end up at position 2.  As a result, phrase
queries that used to work now return zero hits, even though the exact text of
the phrase query appears in the original documents, which do match on 4.7.
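
To illustrate why the position change matters, here is a minimal sketch of exact phrase matching, assuming a simplified model in which a phrase matches only when its terms sit at consecutive positions. The function name and the position tables are hypothetical; they contain only the two terms and positions reported above, and this is not Lucene's actual implementation.

```python
def phrase_matches(positions, phrase):
    """Return True if the phrase terms occur at consecutive positions.

    positions maps each term to the set of positions where it was indexed.
    This is a simplified stand-in for an exact (slop 0) phrase query.
    """
    first, rest = phrase[0], phrase[1:]
    for start in positions.get(first, ()):
        if all(start + i in positions.get(term, ())
               for i, term in enumerate(rest, 1)):
            return True
    return False


# Positions as reported in this issue (other emitted tokens omitted).
positions_47 = {"aaa": {1}, "bbb": {2}}  # Solr 4.7
positions_49 = {"aaa": {2}, "bbb": {2}}  # Solr 4.9

print(phrase_matches(positions_47, ["aaa", "bbb"]))
print(phrase_matches(positions_49, ["aaa", "bbb"]))
```

Under this model, the phrase is found with the 4.7 positions and not with the 4.9 positions, because aaa and bbb are no longer adjacent.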

If the custom rbbi file (which is included unmodified from the Lucene ICU
analysis source code) is not used, then the problem doesn't happen, because the
punctuation never reaches the PatternReplaceFilter (PRF).  Likewise, if the
PatternReplaceFilterFactory is not present, then the problem doesn't happen.
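
For reference, here is a rough sketch of what reaches the WDF for this input. It makes two assumptions: `str.split()` stands in for the whitespace-only ICU rule file, and Python's `string.punctuation` stands in for Java's `\p{Punct}`. The helper names are made up for illustration. It models only the tokenizer and the PRF, not the WDF itself.

```python
import re
import string

# ASCII punctuation class; assumed equivalent to Java's \p{Punct} here.
PUNCT = "[" + re.escape(string.punctuation) + "]"

# Same shape as the PatternReplaceFilterFactory pattern in the fieldType.
PRF_PATTERN = re.compile(r"^(" + PUNCT + r"*)(.*?)(" + PUNCT + r"*)$")


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to tokens."""
    return text.split()


def strip_outer_punct(token):
    """Keep only group 2, dropping leading and trailing punctuation."""
    return PRF_PATTERN.sub(r"\2", token)


raw = whitespace_tokens("aaa-bbb: ccc")
cleaned = [strip_outer_punct(t) for t in raw]
print(raw)      # tokens as the tokenizer would emit them in this model
print(cleaned)  # tokens after the PRF step, i.e. the input to the WDF
```

In this model the trailing colon is removed by the PRF, but the inner hyphen is kept, so the WDF receives "aaa-bbb" as a single token to split.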

I can work around the problem by setting luceneMatchVersion to 4.7, but I think 
the behavior is a bug, and I would rather not continue to use 4.7 analysis when 
I upgrade to 5.x, which I hope to do soon.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

