[
https://issues.apache.org/jira/browse/LUCENE-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385110#comment-14385110
]
David Smiley commented on LUCENE-6375:
--------------------------------------
Furthermore, a little bit of refactoring will simplify the arrangement going on
here. maxDocCharsToAnalyze in QueryScorer & WSTE can be backed out, and
Highlighter can insert a fixed OffsetLimitTokenFilter before it gets to either,
and then it needn't check for the condition in its token loop either. For
back-wards compatibility sake, QueryScorer (& WSTE) can keep the option but
ignore it and mark as deprecated.
> Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter &
> WeightedSpanTermExtractor
> ----------------------------------------------------------------------------------------------
>
> Key: LUCENE-6375
> URL: https://issues.apache.org/jira/browse/LUCENE-6375
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: David Smiley
> Priority: Minor
>
> Way back in LUCENE-2939, the default/standard Highlighter's
> WeightedSpanTermExtractor (referenced by QueryScorer, used by
> Highlighter.java) got a performance feature maxDocCharsToAnalyze to set a
> limit on how much text to process when looking for phrase queries and
> wildcards (and some other advanced query types). Highlighter itself also has
> a limit by the same name. They are not interpreted the same way!
> Highlighter loops over tokens and halts early if the token's start offset >=
> maxDocCharsToAnalyze. In this light, it's almost as if the input string was
> truncated to be this length, but a bit beyond to the next tokenization
> boundary. The PostingsHighlighter also has a configurable limit it calls
> "maxLength" (or contentLength) that is conceptually similar but implemented
> differently because it doesn't tokenize; but it does have the inverted start
> & end offsets to check if it's reached the end with respect to this
> configured limit. FYI Solr's hl.maxAnalyzedChars is supplied as a configured
> input to both highlighters in this manner; the FastVectorHighlighter doesn't
> have a limit.
> Highlighter propagates it's configured maxAnalyzedChars to QueryScorer which
> in turn propagates it to WeightedSpanTermExtractor. _WSTE doesn't interpret
> this the same way as Highlighter or PostingsHighlighter._ It uses an
> OffsetLimitTokenFilter which accumulates the deltas in start & end offsets of
> each token it sees. That is:
> {code:java}
> int offsetLength = offsetAttrib.endOffset() -
> offsetAttrib.startOffset();
> offsetCount += offsetLength;
> {code}
> So if you've got analysis which produces a lot of posInc-0 tokens (as I do),
> you will likely hit this limit earlier than when Highlighter will. Or if you
> have very few tokens with tons of whitespace then WSTE will index terms that
> will never be highlighted. This isn't a big deal but it should be fixed.
> This filter should simply examine if the startOffset is >= a configured limit
> and return false from it's incrementToken if so.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]