David Smiley created LUCENE-6375:
------------------------------------

             Summary: Inconsistent interpretation of maxDocCharsToAnalyze in 
Highlighter & WeightedSpanTermExtractor
                 Key: LUCENE-6375
                 URL: https://issues.apache.org/jira/browse/LUCENE-6375
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: David Smiley
            Priority: Minor


Way back in LUCENE-2939, the default/standard Highlighter's 
WeightedSpanTermExtractor (referenced by QueryScorer, used by Highlighter.java) 
got a performance feature maxDocCharsToAnalyze to set a limit on how much text 
to process when looking for phrase queries and wildcards (and some other 
advanced query types).  Highlighter itself also has a limit by the same name.  
They are not interpreted the same way!

Highlighter loops over tokens and halts early if a token's start offset >= 
maxDocCharsToAnalyze.  In effect, it's almost as if the input string were 
truncated to this length, extended a bit to the next tokenization 
boundary.  The PostingsHighlighter also has a configurable limit it calls 
"maxLength" (or contentLength) that is conceptually similar but implemented 
differently because it doesn't tokenize; but it does have the inverted start & 
end offsets to check if it's reached the end with respect to this configured 
limit.  FYI Solr's hl.maxAnalyzedChars is supplied as a configured input to 
both highlighters in this manner; the FastVectorHighlighter doesn't have a 
limit.
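
As a rough illustration of the start-offset cutoff described above, here is a 
minimal, Lucene-free sketch.  The Token class and termsWithinLimit method are 
hypothetical stand-ins, not Lucene APIs; the only point is that a token is kept 
whenever its start offset is below the limit, so the last kept token may extend 
past it.

```java
import java.util.ArrayList;
import java.util.List;

class StartOffsetCutoff {
    // Hypothetical stand-in for a token's term and offsets; not a Lucene class.
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;

        Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Collect terms until a token starts at or beyond the limit, then stop.
    static List<String> termsWithinLimit(List<Token> tokens, int maxDocCharsToAnalyze) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (t.startOffset >= maxDocCharsToAnalyze) {
                break;
            }
            kept.add(t.term);
        }
        return kept;
    }

    public static void main(String[] args) {
        // "quick brown foxes": offsets 0-5, 6-11, 12-17.
        List<Token> tokens = List.of(
            new Token("quick", 0, 5),
            new Token("brown", 6, 11),
            new Token("foxes", 12, 17));
        // A limit of 8 falls inside "brown", which starts at 6 and is therefore kept.
        System.out.println(termsWithinLimit(tokens, 8)); // prints [quick, brown]
    }
}
```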

Highlighter propagates its configured maxDocCharsToAnalyze to QueryScorer, 
which in turn propagates it to WeightedSpanTermExtractor (WSTE).  _WSTE doesn't 
interpret this the same way as Highlighter or PostingsHighlighter._  It uses an 
OffsetLimitTokenFilter, which accumulates the difference between the end & 
start offsets of each token it sees.  That is:
{code:java}
      int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
      offsetCount += offsetLength;
{code}

So if you've got analysis which produces a lot of posInc-0 tokens (as I do), 
WSTE will likely hit this limit earlier than Highlighter will, because stacked 
tokens share the same offsets and each one adds its length to the total again.  
Conversely, if you have very few tokens separated by tons of whitespace, the 
whitespace is never counted, so WSTE will index terms that lie beyond 
Highlighter's cutoff and will never be highlighted.  This isn't a big deal but 
it should be fixed.  This filter should simply check whether the startOffset is 
>= a configured limit and return false from its incrementToken if so.
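
To make the two failure modes concrete, here is a small, Lucene-free sketch 
that compares the accumulation quoted above with the proposed start-offset 
check.  The Token class and both counting methods are hypothetical; in 
particular, the assumption that the accumulating filter stops once its running 
total reaches the limit is my reading of the behavior, not a copy of the actual 
filter.

```java
import java.util.List;

class OffsetLimitComparison {
    // Hypothetical stand-in for a token's offsets; not a Lucene class.
    static final class Token {
        final int startOffset;
        final int endOffset;

        Token(int startOffset, int endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Assumed current behavior: sum token lengths, stop once the sum reaches the limit.
    static int countByAccumulatedLength(List<Token> tokens, int limit) {
        int offsetCount = 0;
        int emitted = 0;
        for (Token t : tokens) {
            if (offsetCount >= limit) {
                break;
            }
            offsetCount += t.endOffset - t.startOffset;
            emitted++;
        }
        return emitted;
    }

    // Proposed behavior: stop at the first token whose start offset reaches the limit.
    static int countByStartOffset(List<Token> tokens, int limit) {
        int emitted = 0;
        for (Token t : tokens) {
            if (t.startOffset >= limit) {
                break;
            }
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Three stacked (posInc-0) tokens at each of three positions.
        List<Token> stacked = List.of(
            new Token(0, 5), new Token(0, 5), new Token(0, 5),
            new Token(6, 11), new Token(6, 11), new Token(6, 11),
            new Token(12, 17), new Token(12, 17), new Token(12, 17));
        System.out.println(countByAccumulatedLength(stacked, 12)); // prints 3
        System.out.println(countByStartOffset(stacked, 12));       // prints 6

        // Three one-character tokens separated by long runs of whitespace.
        List<Token> sparse = List.of(
            new Token(0, 1), new Token(50, 51), new Token(100, 101));
        System.out.println(countByAccumulatedLength(sparse, 10));  // prints 3
        System.out.println(countByStartOffset(sparse, 10));        // prints 1
    }
}
```

With stacked tokens the accumulating count stops after the first position, 
while with sparse tokens it runs far past the point where a start-offset check 
would stop.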



