[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15628861#comment-15628861 ]
ASF GitHub Bot commented on LUCENE-7526:
----------------------------------------

Github user dsmiley commented on the issue:

    https://github.com/apache/lucene-solr/pull/105

_(I wrote this as you made your last comments, but rather than delete it I'll comment anyway)_

The documentation for `PostingsEnum.nextPosition()` states that calling it more than `freq()` times is undefined. Thus it's quite valid to throw an IllegalStateException.

> Btw, we've seen other needs for something like a CompositePostingsEnum that abstracts over a set of terms, but since this is still internal, dropping the house-keeping will also make this code do less.

I don't think I quite get what you're saying. By "other needs" do you mean Bloomberg internally? If so, how would that relate to this one inside the UH? Are you advocating a general purpose Multi-PostingsEnum? On the latter... a highlighter wouldn't be where to introduce such a thing. There is a `org.apache.lucene.index.MultiPostingsEnum` which I looked at while at the Lucene hackday code sprint as it piqued my curiosity. Unfortunately, it doesn't seem quite general purpose enough for us to use -- it demands a MultiTermsEnum parent. Perhaps that could be improved to be useful without demanding a MultiTermsEnum parent... but that seems like a separate issue.

> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
>                 Key: LUCENE-7526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7526
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex. Can only be used if the query distills down to terms and automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill the memory index and then, once consumed, a new one was generated by uninverting the MemoryIndex back into a TokenStream if there were automata (wildcard/mtq queries) involved. Now this is avoided, which should save memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/mtq queries. This should improve relevancy by providing unified metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy since it can adversely affect passage relevancy
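
To make the `nextPosition()` / `freq()` point from the comment concrete, here is a minimal, Lucene-free sketch of a composite that merges the sorted positions of several terms and throws an IllegalStateException once it has been called more than `freq()` times. Everything in it is hypothetical and for illustration only: `CompositePositionsSketch`, its `Cursor` helper, and the plain `int[]` inputs stand in for real `PostingsEnum` instances, and this is not the code from the pull request.

```java
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: merges per-term sorted positions for a single document. */
final class CompositePositionsSketch {

  /** Stand-in for one term's PostingsEnum positioned on the current document. */
  private static final class Cursor {
    final int[] positions; // sorted ascending
    int next = 0;          // index of the next unread position

    Cursor(int[] positions) {
      this.positions = positions;
    }
  }

  // Cursors ordered by their next unread position (smallest first).
  private final PriorityQueue<Cursor> queue =
      new PriorityQueue<Cursor>(
          (a, b) -> Integer.compare(a.positions[a.next], b.positions[b.next]));
  private final int totalFreq;
  private int consumed = 0;

  CompositePositionsSketch(List<int[]> perTermPositions) {
    int sum = 0;
    for (int[] positions : perTermPositions) {
      if (positions.length > 0) { // terms without positions never enter the queue
        queue.add(new Cursor(positions));
        sum += positions.length;
      }
    }
    totalFreq = sum;
  }

  /** Total number of positions across all underlying terms. */
  int freq() {
    return totalFreq;
  }

  /** Returns the next smallest position; fails loudly once freq() is exhausted. */
  int nextPosition() {
    if (consumed >= totalFreq) {
      throw new IllegalStateException("nextPosition() called more than freq() times");
    }
    Cursor top = queue.poll();
    int position = top.positions[top.next++];
    if (top.next < top.positions.length) {
      queue.add(top); // re-insert so it is re-ordered by its new next position
    }
    consumed++;
    return position;
  }
}
```

In this sketch the composite's `freq()` is the sum of the per-term frequencies, and positions come back in ascending order across all terms. Since the contract leaves the behaviour past `freq()` calls undefined, throwing is one permissible choice; dropping the `consumed` bookkeeping would be another, at the cost of a less helpful failure.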