[jira] [Commented] (LUCENE-6689) Odd analysis problem with WDF, appears to be triggered by preceding analysis components

Shawn Heisey (JIRA) Fri, 21 Aug 2015 13:23:42 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707408#comment-14707408 ]


Shawn Heisey commented on LUCENE-6689:
--------------------------------------

I have simplified the analysis chain, removing the ICU tokenizer and replacing 
it with the whitespace tokenizer.  The root problem appears to be an 
interaction between PatternReplaceFilter and WordDelimiterFilter.

With the following Solr analysis chain, a phrase search for "aaa bbb" does not 
find an indexed value of "aaa-bbb: ccc", because the positions of the two 
query terms don't match the positions in the index.  The positions go wrong at 
the WordDelimiterFilter step.

{code}
    <fieldType name="genText2" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>
{code}

If I remove PatternReplaceFilterFactory (PRFF) from the above chain, the 
problem goes away.  This filter is in the chain so that leading and trailing 
punctuation is removed from all terms, leaving punctuation inside the term for 
WordDelimiterFilter (WDF) to handle.
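
To illustrate what the pattern is meant to do to each whitespace-separated 
term, here is a minimal sketch using plain java.util.regex rather than the 
Lucene filter itself.  The class and method names are hypothetical, and it 
assumes the filter's default behavior of replacing every match.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; this is not the Lucene PatternReplaceFilter.
public class PunctuationStripSketch {
    // Same pattern as in the fieldType: leading punctuation, a lazy middle
    // group, and trailing punctuation, anchored to the whole term.
    private static final Pattern EDGE_PUNCT =
            Pattern.compile("^(\\p{Punct}*)(.*?)(\\p{Punct}*)$");

    // Keep only group 2, i.e. the term without leading/trailing punctuation.
    static String strip(String term) {
        return EDGE_PUNCT.matcher(term).replaceAll("$2");
    }

    public static void main(String[] args) {
        // Terms as a whitespace tokenizer would emit them for "aaa-bbb: ccc",
        // plus one extra term with punctuation on both ends.
        for (String term : new String[] {"aaa-bbb:", "ccc", "(aaa-bbb)"}) {
            System.out.println(term + " -> " + strip(term));
        }
    }
}
```

The hyphen inside "aaa-bbb" survives, so the next filter in the chain still 
sees it and can split on it.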

An additional problem with the analysis quoted above is that the "aaabbb" term 
is indexed at position 2.  I believe it should be at position 1.  This problem 
is also fixed by removing PRFF.
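
To show why shifted positions break the phrase search, here is a simplified 
model of exact phrase matching (no slop, query terms at consecutive 
positions).  It is only a sketch, not Lucene's PhraseQuery implementation; the 
class and method names are hypothetical, and the positions are the ones from 
the issue description (aaa=1, bbb=2 on 4.7; both at 2 on 4.9).

```java
import java.util.List;
import java.util.Map;

// Hypothetical simplified model; not Lucene's actual phrase matching code.
public class PhrasePositionSketch {
    // Returns true if the phrase terms occur at consecutive index positions.
    // Assumes a non-empty phrase whose query positions increase by 1.
    static boolean phraseMatches(Map<String, List<Integer>> indexed,
                                 List<String> phrase) {
        List<Integer> starts = indexed.getOrDefault(phrase.get(0), List.of());
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < phrase.size(); i++) {
                List<Integer> positions =
                        indexed.getOrDefault(phrase.get(i), List.of());
                // The i-th term must sit exactly i positions after the first.
                if (!positions.contains(start + i)) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> phrase = List.of("aaa", "bbb");
        // Index positions reported for Solr 4.7 and Solr 4.9 respectively.
        Map<String, List<Integer>> solr47 =
                Map.of("aaa", List.of(1), "bbb", List.of(2));
        Map<String, List<Integer>> solr49 =
                Map.of("aaa", List.of(2), "bbb", List.of(2));
        System.out.println("4.7 positions match: " + phraseMatches(solr47, phrase));
        System.out.println("4.9 positions match: " + phraseMatches(solr49, phrase));
    }
}
```

Under this model the 4.7 positions satisfy the phrase and the 4.9 positions do 
not, which is consistent with the zero hits described in the issue.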


> Odd analysis problem with WDF, appears to be triggered by preceding analysis 
> components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the 
> Lucene level, so I've opened the issue in the LUCENE project.  We can move it 
> if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField" sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory" rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then index 
> analysis puts aaa at term position 1 and bbb at term position 2.  This seems 
> perfectly reasonable to me.  In Solr 4.9, both terms end up at position 2.  
> This causes phrase queries which used to work to return zero hits.  The exact 
> text of the phrase query is in the original documents that match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis 
> source code) is not used, then the problem doesn't happen, because the 
> punctuation doesn't make it to the PRF.  If the PatternReplaceFilterFactory 
> is not present, then the problem doesn't happen.
> I can work around the problem by setting luceneMatchVersion to 4.7, but I 
> think the behavior is a bug, and I would rather not continue to use 4.7 
> analysis when I upgrade to 5.x, which I hope to do soon.
> Whether luceneMatchVersion is LUCENE_47 or LUCENE_4_9, query analysis puts 
> aaa at term position 1 and bbb at term position 2.




