[
https://issues.apache.org/jira/browse/SOLR-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gunnlaugur Thor Briem updated SOLR-4851:
----------------------------------------
Description:
With original text {{Population 5.000 - 9.999}} indexed with {{termVectors}},
{{termPositions}} and {{termOffsets}}, the Highlighter produces snippets like
{{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}.
Note the duplicated {{5}} before the {{<em}}; that's the bug.
This does not happen when {{useFastVectorHighlighter=true}}.
It also does not happen in a field without {{termVectors}}, {{termPositions}}
and {{termOffsets}}.
To reproduce, field definitions:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
...
<field name="name" type="text" indexed="true" stored="true" />
<field name="descr" type="text" indexed="true" stored="true"
       termVectors="true" termOffsets="true" termPositions="true" />
{code}
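For context, here is a rough Python approximation of what the index-time {{generateNumberParts="1"}} and {{catenateNumbers="1"}} settings do to a whitespace token such as {{5.000}}. It is a simplified stand-in, not Solr's implementation: the function name is made up, it only looks at digit runs, and it says nothing about the order in which the real filter emits tokens. It may help explain why a query of {{5000}} can match a token whose offsets span the whole of {{5.000}}.

```python
import re


def number_parts_with_catenation(token):
    """Simplified stand-in for generateNumberParts=1 plus catenateNumbers=1.

    Returns (text, start_offset, end_offset) tuples for each digit run in
    `token`, followed by one catenated entry spanning all the runs.
    Assumption: the token holds only digits and delimiters between them;
    the real filter treats mixed word/number tokens more carefully.
    """
    parts = [(m.group(), m.start(), m.end()) for m in re.finditer(r"[0-9]+", token)]
    if len(parts) > 1:
        catenated = "".join(text for text, _, _ in parts)
        # The catenated token covers the first run's start to the last run's end.
        parts.append((catenated, parts[0][1], parts[-1][2]))
    return parts


if __name__ == "__main__":
    for text, start, end in number_parts_with_catenation("5.000"):
        print(text, start, end)
```

In this approximation the parts {{5}} and {{000}} overlap with the catenated {{5000}}, which is the kind of overlapping-offset situation the highlighter has to deal with here.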
All configured and explicit parameters, from {{echoParams=all}}:
{code:javascript}
{
"defType": "edismax",
"echoParams": "all",
"facet.mincount": "1",
"fl": "id",
"hl.fl": "id name tag cat descr dim dimvalue provider source_source text",
"hl.fragsize": "200",
"hl.mergeContiguous": "true",
"hl.simple.post": "</em>",
"hl.simple.pre": "<em class=\"match\">",
"hl.snippets": "4",
"hl.usePhraseHighlighter": "true",
"hl": "true",
"q.alt": "*:*",
"q": "5000",
"qf": " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2 provider^2 source_source^2 text^2 ",
"qt": "dismax",
"rows": "10",
"sort": "score desc"
}
{code}
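A minimal sketch of sending these parameters as a request, in case it helps others reproduce the problem. The endpoint URL is a placeholder (host, port and path are assumptions, not taken from this report), and the parameter set is trimmed to the ones that look relevant to highlighting, with {{qf}} and {{hl.fl}} reduced to the two fields discussed here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: host, port and path are assumptions, not from the report.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Trimmed subset of the reported parameters (qf and hl.fl reduced to two fields).
REPRO_PARAMS = {
    "defType": "edismax",
    "q": "5000",
    "qf": "name^6 descr^3",
    "hl": "true",
    "hl.fl": "name descr",
    "hl.fragsize": "200",
    "hl.snippets": "4",
    "hl.simple.pre": '<em class="match">',
    "hl.simple.post": "</em>",
    "hl.usePhraseHighlighter": "true",
}


def build_repro_url(base_url=SOLR_SELECT_URL, params=REPRO_PARAMS):
    """Return the select URL with all parameters percent-encoded."""
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_repro_url()
    print(url)
    # Round-trip check: decoding the query string gives back the raw values.
    print(parse_qs(urlsplit(url).query)["hl.simple.pre"])
```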
and a document containing numbers with thousand separators, e.g.:
{code:javascript}
{
"name": "Demographics and income: Income distribution: Number of HHs earning > US$5,000 p.a. (constant 2005 prices) by country",
"descr": "Number of households with disposable income of more than US$5,000 per annum at constant 2005 prices"
}
{code}
The highlight snippets I get:
{code:javascript}
{
name: [
"Demographics and income: Income distribution: Number of HHs earning > US$<em class=\"match\">5,000</em> p.a. (constant 2005 prices) by country"
],
descr: [
"Number of households with disposable income of more than US$5<em class=\"match\">5,000</em> per annum at constant 2005 prices"
]
}
{code}
Note that the {{5}} is duplicated only in the {{descr}} field snippet, not in
the {{name}} field snippet. The only difference between these fields is that
{{descr}} has {{termVectors}}, {{termPositions}} and {{termOffsets}} enabled,
so those settings are presumably relevant.
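One way to detect the symptom automatically, sketched in Python: strip the highlight markers from a snippet and check that what remains still occurs in the stored field value. The helper names are made up for illustration, and the default markers are the {{hl.simple.pre}} and {{hl.simple.post}} values from this report.

```python
def strip_highlight_tags(snippet, pre='<em class="match">', post="</em>"):
    """Remove the highlight markers, leaving only the snippet text."""
    return snippet.replace(pre, "").replace(post, "")


def snippet_is_faithful(snippet, original):
    """True if the snippet, minus its markers, is a substring of the original."""
    return strip_highlight_tags(snippet) in original


if __name__ == "__main__":
    descr = ("Number of households with disposable income of more than "
             "US$5,000 per annum at constant 2005 prices")
    bad_snippet = ("Number of households with disposable income of more than "
                   'US$5<em class="match">5,000</em> per annum at constant 2005 prices')
    # The duplicated 5 makes the stripped snippet differ from the stored text.
    print(snippet_is_faithful(bad_snippet, descr))
```

With the two snippets above, the {{name}} snippet passes this check and the {{descr}} snippet fails it, because the stripped text reads {{US$55,000}}.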
was:
With original text {{Population 5.000 - 9.999}} indexed with termVectors,
termPositions and termOffsets, the Highlighter produces snippets like
{{Population 5<em class="match">5.000</em> - 9.999}} for a query of {{5000}}.
Note the duplicated {{5}} before the {{<em}}; that's the bug.
This does not happen when {{useFastVectorHighlighter=true}}.
It also does not happen in a field without termVectors, termPositions and
termOffsets.
To reproduce, field definitions:
{code:xml}
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"
splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="name" type="text" indexed="true" stored="true" />
<field name="descr" type="text" indexed="true" stored="true"
termVectors="true" termOffsets="true" termPositions="true" />
{code}
All configured and explicit parameters, from {{echoParams=all}}:
{code:javascript}
{
defType: "edismax",
echoParams: "all",
facet.mincount: "1",
fl: "id",
hl.fl: "id name tag cat descr dim dimvalue provider source_source text",
hl.fragsize: "200",
hl.mergeContiguous: "true",
hl.simple.post: "</em>",
hl.simple.pre: "<em class="match">",
hl.snippets: "4",
hl.usePhraseHighlighter: "true",
hl: "true",
q.alt: "*:*",
q: "5000",
qf: " id_a^10.0 name^6 granularity_a^5 tag^4 cat^3 descr^3 dim^2 dimvalue^2
provider^2 source_source^2 text^2 ",
qt: "dismax",
rows: "10",
sort: "score desc"
}
{code}
and a piece of text containing numbers with thousand separators, e.g.
“Demographics and income: Income distribution: Number of HHs earning >
US$5,000 p.a. (constant 2005 prices) by country”
The highlighting response I get:
{code:javascript}
{
name: [
"Demographics and income: Income distribution: Number of HHs earning >
US$<em class="match">5,000</em> p.a. (constant 2005 prices) by country"
],
descr: [
"Number of households with disposable income of more than US$5<em
class="match">5,000</em> per annum at constant 2005 prices"
]
}
{code}
Note that the {{5}} gets duplicated only in the {{descr}} field snippet, not in
the {{name}} field snippet. The only difference between these fields is
termVectors, termPositions and termOffsets, so those settings are presumably
relevant.
> Highlighter duplicates numeric token in snippet when term
> vectors/positions/offsets on
> --------------------------------------------------------------------------------------
>
> Key: SOLR-4851
> URL: https://issues.apache.org/jira/browse/SOLR-4851
> Project: Solr
> Issue Type: Bug
> Components: highlighter
> Affects Versions: 3.6.2
> Reporter: Gunnlaugur Thor Briem