David Smiley created LUCENE-6375:
------------------------------------
Summary: Inconsistent interpretation of maxDocCharsToAnalyze in
Highlighter & WeightedSpanTermExtractor
Key: LUCENE-6375
URL: https://issues.apache.org/jira/browse/LUCENE-6375
Project: Lucene - Core
Issue Type: Improvement
Reporter: David Smiley
Priority: Minor
Way back in LUCENE-2939, the default/standard Highlighter's
WeightedSpanTermExtractor (referenced by QueryScorer, used by Highlighter.java)
got a performance feature, maxDocCharsToAnalyze, to set a limit on how much
text to process when looking for phrase queries, wildcards, and some other
advanced query types. Highlighter itself also has a limit by the same name.
They are not interpreted the same way!
Highlighter loops over tokens and halts early once a token's start offset >=
maxDocCharsToAnalyze. In this light, it's almost as if the input string were
truncated to this length, extended a bit to the next tokenization boundary.
The PostingsHighlighter also has a configurable limit it calls "maxLength" (or
contentLength) that is conceptually similar but implemented differently
because it doesn't tokenize; instead it checks the start & end offsets from
the postings (the inverted index) to decide whether it has reached this
configured limit. FYI, Solr's hl.maxAnalyzedChars is supplied as a configured
input to both highlighters in this manner; the FastVectorHighlighter has no
such limit.
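To make the Highlighter-style interpretation concrete, here is a minimal sketch in plain Java, with no Lucene dependencies; the Token record and the analyzeUpTo helper are hypothetical stand-ins for Lucene's OffsetAttribute and Highlighter's token loop, and the offsets are made up for illustration. The loop halts at the first token whose start offset has passed the limit, so processing effectively truncates at the next tokenization boundary.

```java
import java.util.ArrayList;
import java.util.List;

public class HighlighterCutoffSketch {
    // Hypothetical stand-in for a token's OffsetAttribute: character offsets
    // into the source text.
    record Token(int startOffset, int endOffset) {}

    // Highlighter-style limit: process tokens until one STARTS at or beyond
    // maxDocCharsToAnalyze, i.e. roughly "truncate the input at this length,
    // rounded up to the next token boundary".
    static List<Token> analyzeUpTo(List<Token> tokens, int maxDocCharsToAnalyze) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            if (t.startOffset() >= maxDocCharsToAnalyze) {
                break; // halt early, as Highlighter's token loop does
            }
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(new Token(0, 5), new Token(6, 10), new Token(11, 15));
        // With a limit of 8, the token starting at offset 6 is still processed
        // (it starts before the limit), but the one starting at 11 is not.
        System.out.println(analyzeUpTo(tokens, 8).size()); // prints 2
    }
}
```

Note the token straddling the limit (offsets 6..10) is still analyzed in full; the limit bounds where a token may start, not total characters consumed.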
Highlighter propagates its configured maxAnalyzedChars to QueryScorer, which
in turn propagates it to WeightedSpanTermExtractor. _WSTE doesn't interpret
this the same way as Highlighter or PostingsHighlighter._ It uses an
OffsetLimitTokenFilter, which accumulates the start-to-end offset length of
each token it sees. That is:
{code:java}
int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
offsetCount += offsetLength;
{code}
So if you've got analysis that produces a lot of posInc-0 tokens (as I do),
you will likely hit this limit earlier than Highlighter will. Conversely, if
you have very few tokens separated by tons of whitespace, then WSTE will index
terms that will never be highlighted. This isn't a big deal, but it should be
fixed: the filter should simply check whether the startOffset is >= a
configured limit and return false from its incrementToken if so.
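The divergence, and the proposed fix, can be sketched side by side in plain Java, again with no Lucene types; the Token record, both counting helpers, and the example offsets are hypothetical, and countByAccumulatedLength only approximates OffsetLimitTokenFilter's check-then-accumulate loop. With same-position (posInc-0) tokens, the accumulated length double-counts the same characters and trips the limit early, while a start-offset check matches Highlighter's interpretation.

```java
import java.util.List;

public class OffsetLimitSketch {
    // Hypothetical stand-in for a token's OffsetAttribute.
    record Token(int startOffset, int endOffset) {}

    // Approximation of current WSTE behavior (via OffsetLimitTokenFilter):
    // accumulate each token's end-start delta and stop once the running total
    // reaches the limit. Overlapping posInc-0 tokens are counted repeatedly.
    static int countByAccumulatedLength(List<Token> tokens, int limit) {
        int offsetCount = 0;
        int emitted = 0;
        for (Token t : tokens) {
            if (offsetCount >= limit) break;
            offsetCount += t.endOffset() - t.startOffset();
            emitted++;
        }
        return emitted;
    }

    // Proposed behavior: stop as soon as a token STARTS at or beyond the
    // limit, matching how Highlighter itself interprets maxDocCharsToAnalyze.
    static int countByStartOffset(List<Token> tokens, int limit) {
        int emitted = 0;
        for (Token t : tokens) {
            if (t.startOffset() >= limit) break; // incrementToken() would return false here
            emitted++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // A term plus a same-position synonym, then one more term: three
        // tokens, but only 10 characters of underlying text.
        List<Token> tokens = List.of(new Token(0, 5), new Token(0, 5), new Token(6, 10));
        System.out.println(countByAccumulatedLength(tokens, 10)); // prints 2: stops early
        System.out.println(countByStartOffset(tokens, 10));       // prints 3: sees all tokens
    }
}
```

The start-offset check also handles the sparse-token case: a token starting beyond the limit is dropped even if little text has been accumulated so far.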
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)