BooleanQuery’s extractTerms looks like this:

public void extractTerms(Set<Term> terms) {
  for (BooleanClause clause : clauses) {
    if (clause.isProhibited() == false) {
      clause.getQuery().extractTerms(terms);
    }
  }
}

That’s generally the method the Highlighter calls to decide which terms should 
be highlighted.  So even if a term didn’t match the document, the query that 
contained the term did match, and the Highlighter just blindly highlights all 
of that query’s terms (minus prohibited ones).  That at least explains the 
behavior you’re seeing, but it’s not ideal.  I’ve seen specialized highlighters 
that convert the query to spans, which are accurate to the exact matches within 
the document.  It’s been a while since I dug into the HighlightComponent, so 
maybe there are some other options available out of the box?
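
To make the effect concrete, here is a minimal self-contained sketch of the query from this thread, a OR (b AND c) OR d, run against a document that contains c and d. The classes below (ExtractTermsSketch, its nested TermQuery and BooleanQuery, and the extractMatchingTerms method) are simplified stand-ins invented for illustration, not Lucene's classes, and the "matching" variant is only a rough approximation of what a span-based highlighter achieves, not a span conversion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified stand-ins for illustration only; these are NOT Lucene's classes.
public class ExtractTermsSketch {

    enum Occur { MUST, SHOULD, MUST_NOT }

    interface Query {
        boolean matches(Set<String> docTerms);
        // Blind collection: every non-prohibited term, matched or not.
        void extractTerms(Set<String> terms);
        // Hypothetical alternative: only terms from clauses that matched.
        void extractMatchingTerms(Set<String> docTerms, Set<String> terms);
    }

    static final class TermQuery implements Query {
        private final String term;
        TermQuery(String term) { this.term = term; }
        public boolean matches(Set<String> docTerms) { return docTerms.contains(term); }
        public void extractTerms(Set<String> terms) { terms.add(term); }
        public void extractMatchingTerms(Set<String> docTerms, Set<String> terms) {
            if (matches(docTerms)) terms.add(term);
        }
    }

    static final class BooleanQuery implements Query {
        private final List<Query> queries = new ArrayList<>();
        private final List<Occur> occurs = new ArrayList<>();

        BooleanQuery add(Query query, Occur occur) {
            queries.add(query);
            occurs.add(occur);
            return this;
        }

        public boolean matches(Set<String> docTerms) {
            boolean hasMust = false;
            boolean anyShouldMatched = false;
            for (int i = 0; i < queries.size(); i++) {
                boolean hit = queries.get(i).matches(docTerms);
                Occur occur = occurs.get(i);
                if (occur == Occur.MUST) {
                    hasMust = true;
                    if (!hit) return false;
                } else if (occur == Occur.MUST_NOT) {
                    if (hit) return false;
                } else if (hit) {
                    anyShouldMatched = true;
                }
            }
            // Without required clauses, at least one optional clause must match.
            return hasMust || anyShouldMatched;
        }

        public void extractTerms(Set<String> terms) {
            for (int i = 0; i < queries.size(); i++) {
                if (occurs.get(i) != Occur.MUST_NOT) queries.get(i).extractTerms(terms);
            }
        }

        public void extractMatchingTerms(Set<String> docTerms, Set<String> terms) {
            if (!matches(docTerms)) return; // a failed sub-query contributes nothing
            for (int i = 0; i < queries.size(); i++) {
                if (occurs.get(i) != Occur.MUST_NOT) {
                    queries.get(i).extractMatchingTerms(docTerms, terms);
                }
            }
        }
    }

    // a OR (b AND c) OR d, as in the parsed query from this thread.
    static Query exampleQuery() {
        Query bAndC = new BooleanQuery()
                .add(new TermQuery("b"), Occur.MUST)
                .add(new TermQuery("c"), Occur.MUST);
        return new BooleanQuery()
                .add(new TermQuery("a"), Occur.SHOULD)
                .add(bAndC, Occur.SHOULD)
                .add(new TermQuery("d"), Occur.SHOULD);
    }

    // Terms a blind highlighter would mark: all extracted terms present in the doc.
    static Set<String> blindHighlights(Query query, Set<String> docTerms) {
        Set<String> terms = new TreeSet<>();
        query.extractTerms(terms);
        terms.retainAll(docTerms);
        return terms;
    }

    static Set<String> matchingHighlights(Query query, Set<String> docTerms) {
        Set<String> terms = new TreeSet<>();
        query.extractMatchingTerms(docTerms, terms);
        return terms;
    }

    public static void main(String[] args) {
        Set<String> doc = new TreeSet<>(Arrays.asList("c", "d", "x"));
        Query query = exampleQuery();
        System.out.println(blindHighlights(query, doc));    // prints [c, d]
        System.out.println(matchingHighlights(query, doc)); // prints [d]
    }
}
```

The document matches only through d, yet the blind variant also marks c, because c sits in a non-prohibited clause and happens to occur in the text. Skipping sub-queries that did not match is what drops c in the second variant.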

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com




> On Feb 24, 2015, at 3:16 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
> 
> Erick,
> 
> Our default operator is AND.
> 
> Both queries below parse the same:
> 
> a OR (b c) OR d
> a OR (b AND c) OR d
> 
> The parsed query:
> 
> <str name="parsedquery_toString">Contents:a (+Contents:b +Contents:c)
> Contents:d</str>
> 
> So this part is consistent with our expectation.
> 
> 
>>> I'm a bit puzzled by your statement that "c" didn't contribute to the score.
> What I meant was that the term c was not hit by the scorer: the explain
> section does not refer to it. I'm using made-up terms here, but the
> concept holds.
> 
> The code suggests that we could benefit from storing term offsets and
> positions:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.solr/solr-core/4.3.1/org/apache/solr/highlight/DefaultSolrHighlighter.java#470
> 
> Is that a correct assumption?
> 
> On Mon, Feb 23, 2015 at 8:29 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
> 
>> Highlighting is such a pain...
>> 
>> what does the parsed query look like? If the default operator is OR,
>> then this seems correct as both 'd' and 'c' appear in the doc. So
>> I'm a bit puzzled by your statement that "c" didn't contribute to the
>> score.
>> 
>> If the parsed query is, indeed
>> a +b +c d
>> 
>> then it does look like something with the highlighter. Whether other
>> highlighters are better for this case.. no clue ;(
>> 
>> Best,
>> Erick
>> 
>> On Mon, Feb 23, 2015 at 9:36 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
>>> Erick,
>>> 
>>> nope, we are using std lucene qparser with some customizations, that do not
>>> affect the boolean query parsing logic.
>>> 
>>> Should we try some other highlighter?
>>> 
>>> On Mon, Feb 23, 2015 at 6:57 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>>> Are you using edismax?
>>>> 
>>>> On Mon, Feb 23, 2015 at 3:28 AM, Dmitry Kan <solrexp...@gmail.com> wrote:
>>>>> Hello!
>>>>> 
>>>>> In Solr 4.3.1 there seems to be some inconsistency with the highlighting of
>>>>> the boolean query:
>>>>> 
>>>>> a OR (b c) OR d
>>>>> 
>>>>> This returns a proper hit, which shows that only d was included in the
>>>>> document score calculation.
>>>>> 
>>>>> But the highlighter returns both d and c in <em> tags.
>>>>> 
>>>>> Is this a known issue of the standard highlighter? Can it be mitigated?
>>>>> 
>>>>> 
>>>>> --
>>>>> Dmitry Kan
>>>>> Luke Toolbox: http://github.com/DmitryKey/luke
>>>>> Blog: http://dmitrykan.blogspot.com
>>>>> Twitter: http://twitter.com/dmitrykan
>>>>> SemanticAnalyzer: www.semanticanalyzer.info
