[jira] [Commented] (LUCENE-6375) Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter & WeightedSpanTermExtractor

David Smiley (JIRA) Fri, 27 Mar 2015 21:12:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385110#comment-14385110
 ]


David Smiley commented on LUCENE-6375:
--------------------------------------

Furthermore, a little refactoring would simplify the arrangement here.  
maxDocCharsToAnalyze can be backed out of QueryScorer & WSTE; Highlighter can 
instead insert a fixed OffsetLimitTokenFilter before the token stream reaches 
either, so it no longer needs to check for the condition in its own token loop.  
For backwards compatibility's sake, QueryScorer (& WSTE) can keep the option but 
ignore it and mark it as deprecated.
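
To illustrate the shape of that refactoring, here is a minimal plain-Java 
sketch. It does not use Lucene classes: Token, StartOffsetLimit and consume are 
hypothetical stand-ins for a token's offsets, the fixed limit filter, and any 
downstream consumer (WSTE or Highlighter's loop). It assumes tokens arrive in 
non-decreasing start-offset order.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-Java model of applying one start-offset limit upstream of every consumer; not Lucene code. */
final class SharedLimitSketch {

    /** Hypothetical token: only the start and end character offsets. */
    static final class Token {
        final int start;
        final int end;
        Token(int start, int end) { this.start = start; this.end = end; }
    }

    /** Ends the stream at the first token whose start offset is >= limit. */
    static final class StartOffsetLimit implements Iterator<Token> {
        private final Iterator<Token> input;
        private final int limit;
        private Token next;

        StartOffsetLimit(Iterator<Token> input, int limit) {
            this.input = input;
            this.limit = limit;
            advance();
        }

        // Pull one token from the input; leave 'next' null once the limit is hit or input ends.
        private void advance() {
            next = null;
            if (input.hasNext()) {
                Token t = input.next();
                if (t.start < limit) {
                    next = t;
                }
            }
        }

        @Override
        public boolean hasNext() { return next != null; }

        @Override
        public Token next() {
            if (next == null) throw new NoSuchElementException();
            Token t = next;
            advance();
            return t;
        }
    }

    /** Stand-in for a downstream consumer; note it carries no limit check of its own. */
    static int consume(Iterator<Token> tokens) {
        int n = 0;
        while (tokens.hasNext()) {
            tokens.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Token> doc = Arrays.asList(
                new Token(0, 5), new Token(0, 2), new Token(6, 12), new Token(13, 20));
        // Both consumers see the same truncated stream because the limit is applied once, upstream.
        System.out.println(consume(new StartOffsetLimit(doc.iterator(), 10)));
        System.out.println(consume(new StartOffsetLimit(doc.iterator(), 10)));
    }
}
```

Because the limit lives in one wrapper, every consumer agrees on where the 
document ends, which is the point of backing the option out of QueryScorer & 
WSTE.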

> Inconsistent interpretation of maxDocCharsToAnalyze in Highlighter & 
> WeightedSpanTermExtractor
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6375
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6375
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: David Smiley
>            Priority: Minor
>
> Way back in LUCENE-2939, the default/standard Highlighter's 
> WeightedSpanTermExtractor (referenced by QueryScorer, used by 
> Highlighter.java) got a performance feature maxDocCharsToAnalyze to set a 
> limit on how much text to process when looking for phrase queries and 
> wildcards (and some other advanced query types).  Highlighter itself also has 
> a limit by the same name.  They are not interpreted the same way!
> Highlighter loops over tokens and halts early if the token's start offset >= 
> maxDocCharsToAnalyze.  In effect, it's almost as if the input string were 
> truncated to this length, extended slightly to the next tokenization 
> boundary.  The PostingsHighlighter also has a configurable limit it calls 
> "maxLength" (or contentLength) that is conceptually similar but implemented 
> differently because it doesn't tokenize; but it does have the inverted start 
> & end offsets to check if it's reached the end with respect to this 
> configured limit.  FYI Solr's hl.maxAnalyzedChars is supplied as a configured 
> input to both highlighters in this manner; the FastVectorHighlighter doesn't 
> have a limit.
> Highlighter propagates its configured maxAnalyzedChars to QueryScorer which 
> in turn propagates it to WeightedSpanTermExtractor.  _WSTE doesn't interpret 
> this the same way as Highlighter or PostingsHighlighter._  It uses an 
> OffsetLimitTokenFilter which accumulates the deltas in start & end offsets of 
> each token it sees.  That is:
> {code:java}
>       int offsetLength = offsetAttrib.endOffset() - offsetAttrib.startOffset();
>       offsetCount += offsetLength;
> {code}
> So if you've got analysis which produces a lot of posInc-0 tokens (as I do), 
> you will likely hit this limit earlier than when Highlighter will.  Or if you 
> have very few tokens with tons of whitespace then WSTE will index terms that 
> will never be highlighted.  This isn't a big deal but it should be fixed.  
> This filter should simply check whether the startOffset is >= a configured 
> limit and return false from its incrementToken if so.
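
The difference between the two interpretations can be shown with a small 
plain-Java model. This is a sketch, not Lucene code: Token and both method names 
are hypothetical, and the stop condition for the accumulating variant (stop once 
the running sum reaches the limit) is an assumption based on the snippet quoted 
above.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the two limit interpretations; not Lucene code. */
final class OffsetLimitComparison {

    /** Hypothetical token: only the start and end character offsets. */
    static final class Token {
        final int start;
        final int end;
        Token(int start, int end) { this.start = start; this.end = end; }
    }

    /** Accumulating variant: sum (end - start) per token, stop once the sum reaches the limit. */
    static int passedByAccumulatedLength(List<Token> tokens, int limit) {
        int offsetCount = 0;
        int passed = 0;
        for (Token t : tokens) {
            if (offsetCount >= limit) {
                break;
            }
            offsetCount += t.end - t.start;
            passed++;
        }
        return passed;
    }

    /** Proposed variant: stop at the first token whose start offset is >= the limit. */
    static int passedByStartOffset(List<Token> tokens, int limit) {
        int passed = 0;
        for (Token t : tokens) {
            if (t.start >= limit) {
                break;
            }
            passed++;
        }
        return passed;
    }

    public static void main(String[] args) {
        // Stacked (posInc-0) tokens over the first word, then one more word starting at offset 6.
        List<Token> stacked = Arrays.asList(
                new Token(0, 5), new Token(0, 2), new Token(3, 5), new Token(0, 5), new Token(6, 12));
        // Three one-character tokens separated by long runs of whitespace.
        List<Token> sparse = Arrays.asList(
                new Token(0, 1), new Token(50, 51), new Token(100, 101));

        int limit = 10;
        System.out.println(passedByAccumulatedLength(stacked, limit)); // 4: stops before the token at offset 6
        System.out.println(passedByStartOffset(stacked, limit));       // 5: every start offset is < 10
        System.out.println(passedByAccumulatedLength(sparse, limit));  // 3: the sum of lengths is only 3
        System.out.println(passedByStartOffset(sparse, limit));        // 1: the second token starts at 50
    }
}
```

With stacked tokens the accumulating variant stops early, and with sparse tokens 
it runs well past the limit, matching the two cases described in the issue.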


