[ https://issues.apache.org/jira/browse/LUCENE-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Smiley updated LUCENE-6034:
---------------------------------
    Attachment: LUCENE-6034.patch

Thanks for the review, Alan. I updated the patch to throw IAE if offsets are expected but not present in the term vector.

> MemoryIndex should be able to wrap TermVector Terms
> ---------------------------------------------------
>
>                 Key: LUCENE-6034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6034
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>         Attachments: LUCENE-6034.patch, LUCENE-6034.patch, LUCENE-6034.patch
>
>
> The default highlighter has a "WeightedSpanTermExtractor" that uses
> MemoryIndex for certain queries -- basically phrases, SpanQueries, and the
> like. For lots of text, this aspect of highlighting is time consuming and
> consumes a fair amount of memory. What also consumes memory is that it wraps
> the tokenStream in CachingTokenFilter in this case. But if the underlying
> TokenStream is actually from TokenSources (wrapping TermVector Terms), this
> is all needless! Furthermore, MemoryIndex doesn't support payloads.
> The patch here has 3 aspects to it:
> * Internal refactoring to MemoryIndex to simplify it by maintaining the
> fields in a sorted state using a TreeMap. The ramifications of this led to
> reduced LOC for this file, even with the other features I added. It also
> puts the FieldInfo on the Info, and thus there's one less data structure to
> keep around. I suppose if there are a huge variety of fields in MemoryIndex,
> the aggregated N*Log(N) field lookup could add up, but that seems very
> unlikely. I also brought in the MemoryIndexNormDocValues as a simple
> anonymous inner class - it's super-simple after all, not worth having in a
> separate file.
> * New MemoryIndex.addField(String fieldName, Terms) method.
> In this case, MemoryIndex is providing the supporting wrappers around the
> underlying Terms so that it appears as an Index. In so doing, MemoryIndex
> supports payloads for such fields.
> * WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and
> it supplies this to MemoryIndex.
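
For readers who want the first two aspects in concrete form, here is a minimal Java sketch. It is not the code from the patch: SortedFieldsSketch, TermVectorStub and Info are hypothetical stand-ins for MemoryIndex, Terms and the per-field Info, and replacing a field that is added twice is an assumption of the sketch. It shows only the two ideas described in the issue: fields kept in sorted order by a TreeMap, and an IllegalArgumentException when offsets are expected but the term vector has none.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: stand-in types, not Lucene's MemoryIndex or Terms.
class SortedFieldsSketch {

    /** Hypothetical stand-in for a term vector's Terms. */
    static final class TermVectorStub {
        final boolean hasOffsets;

        TermVectorStub(boolean hasOffsets) {
            this.hasOffsets = hasOffsets;
        }
    }

    /** Per-field state; field metadata lives here instead of in a second map. */
    static final class Info {
        final String fieldName;
        final TermVectorStub terms;

        Info(String fieldName, TermVectorStub terms) {
            this.fieldName = fieldName;
            this.terms = terms;
        }
    }

    private final boolean storeOffsets;

    // TreeMap keeps the fields sorted by name, so no separate sort step is needed.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    SortedFieldsSketch(boolean storeOffsets) {
        this.storeOffsets = storeOffsets;
    }

    /** Wraps an existing term vector rather than re-analyzing the text. */
    void addField(String fieldName, TermVectorStub terms) {
        if (storeOffsets && !terms.hasOffsets) {
            throw new IllegalArgumentException(
                "offsets expected but not present in term vector for field: " + fieldName);
        }
        // Assumption of this sketch: adding the same field again replaces it.
        fields.put(fieldName, new Info(fieldName, terms));
    }

    /** Field names in the sorted order the TreeMap maintains. */
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        SortedFieldsSketch index = new SortedFieldsSketch(true);
        index.addField("title", new TermVectorStub(true));
        index.addField("body", new TermVectorStub(true));
        System.out.println(index.fieldNames());
        try {
            index.addField("summary", new TermVectorStub(false));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Each lookup in the TreeMap costs O(log N) in the number of fields, which is the trade-off the issue text mentions as unlikely to matter in practice.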