[ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705543#comment-14705543 ]
Shawn Heisey commented on LUCENE-6689:
--------------------------------------

I can work around the specific queries that caused the problem if I make index and query WDF analysis exactly the same ... but there's a problem even then.

As a test, I entirely removed the query analysis above and removed the "type" attribute from the index analysis so it applies to both. I put this fieldType into Solr 5.2.1 and went to the analysis screen. A phrase search for "aaa bbb" when the indexed value was "aaa-bbb: ccc" does not match, because the positions are wrong. I believe that it *should* match. A user would most likely expect it to match.

> Odd analysis problem with WDF, appears to be triggered by preceding analysis components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the
> Lucene level, so I've opened the issue in the LUCENE project. We can move it
> if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField"
>         sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer
>           class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer
>           class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then index
> analysis puts aaa at term position 1 and bbb at term position 2. This seems
> perfectly reasonable to me. In Solr 4.9, both terms end up at position 2.
> This causes phrase queries which used to work to return zero hits. The exact
> text of the phrase query is in the original documents that match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis
> source code) is not used, then the problem doesn't happen, because the
> punctuation doesn't make it to the PRF.
> If the PatternReplaceFilterFactory is not present, then the problem doesn't
> happen.
> I can work around the problem by setting luceneMatchVersion to 4.7, but I
> think the behavior is a bug, and I would rather not continue to use 4.7
> analysis when I upgrade to 5.x, which I hope to do soon.
> Whether luceneMatchVersion is LUCENE_47 or LUCENE_4_9, query analysis puts
> aaa at term position 1 and bbb at term position 2.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)