[ https://issues.apache.org/jira/browse/SOLR-7926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bjørn Hjelle updated SOLR-7926:
-------------------------------
    Description:
Hit highlighting marks the whole word, not just the part that matches the search term, when EdgeNGramFilterFactory is used in the field type.

In schema.xml I have the field type text_ngram:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--tokenizer class="solr.StandardTokenizerFactory"/-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:

LENGTF  text            luc         luce           lucen             lucene
        raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
        start           0           0              0                 0
        end             6           6              6                 6
        positionLength  1           1              1                 1
        type            word        word           word              word
        position        1           1              1                 1

Since the end offset is 6 for every gram in this case, the whole word ("lucene") is highlighted.

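To illustrate why the end offset decides how much text gets marked, here is a minimal Python sketch. It is a simplified model and not Solr's highlighter or Lucene's filter code: the function names `edge_ngrams` and `highlight` and the `per_gram_offsets` switch are hypothetical, and the only assumption is that a highlighter wraps the stored text between a matching token's start and end offsets.

```python
def edge_ngrams(token, start, min_gram=3, max_gram=20, per_gram_offsets=False):
    """Return (text, start_offset, end_offset) for each leading n-gram.

    per_gram_offsets=False models the reported behaviour: every gram keeps
    the offsets of the whole source token. True models offsets that cover
    only the characters of the gram itself.
    """
    grams = []
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        end = start + n if per_gram_offsets else start + len(token)
        grams.append((token[:n], start, end))
    return grams


def highlight(text, grams, query, pre="<em>", post="</em>"):
    """Wrap the span [start, end) of the first gram equal to the query."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + pre + text[start:end] + post + text[end:]
    return text  # no matching gram: leave the text unmarked


if __name__ == "__main__":
    word = "lucene"
    # Offsets of the whole token: the full word is wrapped.
    print(highlight(word, edge_ngrams(word, 0), "luc"))
    # Offsets of the gram only: just the matched prefix is wrapped.
    print(highlight(word, edge_ngrams(word, 0, per_gram_offsets=True), "luc"))
```

With whole-token offsets the sketch reproduces the start 0 / end 6 values from the table above for all four grams, which is why a query for "luc" ends up marking all of "lucene".
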
If I switch to NGramFilterFactory, it shows this (first three tokens only):

LENGTF  text            luc         uce            cen
        raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]
        start           0           1              2
        end             3           4              5
        positionLength  1           1              1
        type            word        word           word
        position        1           1              1

The end offsets are then correct, and the highlighter highlights only the search term. Note that I have specified luceneMatchVersion="4.3". Without it, the end offsets go back to 6 for NGramFilterFactory as well.

  was:
Hit highlighting marks the whole word, not just the part that matches the search term, when EdgeNGramFilterFactory is used in the field type.

In schema.xml I have the field type text_ngram:

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--tokenizer class="solr.StandardTokenizerFactory"/-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?"
            replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

And the dynamic field:

<dynamicField name="*_n" type="text_ngram" indexed="true" stored="true"/>

In the Solr Admin Analysis screen, with index value "lucene" and query value "luc", it shows this:

LENGTF  text            luc         luce           lucen             lucene
        raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
        start           0           0              0                 0
        end             6           6              6                 6
        positionLength  1           1              1                 1
        type            word        word           word              word
        position        1           1              1                 1

Since the end offset is 6 for every gram in this case, the whole word ("lucene") is highlighted.

If I switch to NGramFilterFactory, it shows this (first three tokens only):

LENGTF  text            luc         uce            cen
        raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]
        start           0           1              2
        end             3           4              5
        positionLength  1           1              1
        type            word        word           word
        position        1           1              1

The end offsets are then correct, and the highlighter highlights only the search term. Note that I have specified luceneMatchVersion="4.3". Without it, the end offsets go back to 6 for NGramFilterFactory as well.


> Hit highlighting with EdgeNGramFilterFactory
> --------------------------------------------
>
>                 Key: SOLR-7926
>                 URL: https://issues.apache.org/jira/browse/SOLR-7926
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 5.1, 5.2.1
>         Environment: CentOS 7 (5.2.1), OS X 10.10.5 (5.1)
>            Reporter: Bjørn Hjelle
>            Priority: Critical
>              Labels: EdgeNGramTokenFilter, highlighting
>
> Hit highlighting marks the whole word, not just the part that matches the
> search term, when EdgeNGramFilterFactory is used in the field type.
> In schema.xml I have the field type text_ngram:
> <fieldType name="text_ngram" class="solr.TextField">
>   <analyzer type="index">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!--tokenizer class="solr.StandardTokenizerFactory"/-->
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
>             catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="3" luceneMatchVersion="4.3"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>   </analyzer>
>   <analyzer type="query">
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0"
>             catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æ?~F?~E])" replacement="" replace="all"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
>   </analyzer>
> </fieldType>
> In the Solr Admin Analysis screen, with index value "lucene" and query value
> "luc", it shows this:
> LENGTF  text            luc         luce           lucen             lucene
>         raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]  [6c 75 63 65 6e 65]
>         start           0           0              0                 0
>         end             6           6              6                 6
>         positionLength  1           1              1                 1
>         type            word        word           word              word
>         position        1           1              1                 1
> Since the end offset is 6 for every gram in this case, the whole word
> ("lucene") is highlighted.
>
> If I switch to NGramFilterFactory, it shows this (first three tokens only):
> LENGTF  text            luc         uce            cen
>         raw_bytes       [6c 75 63]  [6c 75 63 65]  [6c 75 63 65 6e]
>         start           0           1              2
>         end             3           4              5
>         positionLength  1           1              1
>         type            word        word           word
>         position        1           1              1
> The end offsets are then correct, and the highlighter highlights only the
> search term. Note that I have specified luceneMatchVersion="4.3". Without
> it, the end offsets go back to 6 for NGramFilterFactory as well.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)