[jira] [Updated] (LUCENE-6034) MemoryIndex should be able to wrap TermVector Terms

David Smiley (JIRA) Mon, 01 Dec 2014 05:06:07 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Smiley updated LUCENE-6034:
---------------------------------
    Attachment: LUCENE-6034.patch

Thanks for the review, Alan.

I updated the patch to throw an IllegalArgumentException (IAE) if offsets are 
expected but not present in the term vector.
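
For readers following along, the kind of guard described here might look roughly like the sketch below. It is not taken from the patch: TermVectorView is a hypothetical stand-in interface (Lucene's own Terms class exposes a hasOffsets() method, which is presumably what the real check consults), and requireOffsets is an illustrative name.

```java
// Hypothetical sketch of a fail-fast offsets check; not the code from the patch.
final class OffsetsCheckSketch {

    /** Stand-in for a term vector's Terms; only the one capability we need. */
    interface TermVectorView {
        boolean hasOffsets();
    }

    /**
     * Throws IllegalArgumentException when the caller needs offsets
     * but the term vector was indexed without them.
     */
    static void requireOffsets(TermVectorView termVector, boolean offsetsExpected) {
        if (offsetsExpected && !termVector.hasOffsets()) {
            throw new IllegalArgumentException(
                "offsets were expected but the term vector does not have them");
        }
    }

    public static void main(String[] args) {
        requireOffsets(() -> true, true);   // fine: offsets present
        requireOffsets(() -> false, false); // fine: offsets not needed
        try {
            requireOffsets(() -> false, true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing early this way surfaces a misconfigured field at the point where the term vector is handed over, rather than as missing highlights later on.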

> MemoryIndex should be able to wrap TermVector Terms
> ---------------------------------------------------
>
>                 Key: LUCENE-6034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6034
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>         Attachments: LUCENE-6034.patch, LUCENE-6034.patch, LUCENE-6034.patch
>
>
> The default highlighter has a "WeightedSpanTermExtractor" that uses 
> MemoryIndex for certain queries -- basically phrases, SpanQueries, and the 
> like.  For lots of text, this aspect of highlighting is time consuming and 
> consumes a fair amount of memory.  What also consumes memory is that it wraps 
> the tokenStream in CachingTokenFilter in this case.  But if the underlying 
> TokenStream is actually from TokenSources (wrapping TermVector Terms), this 
> is all needless!  Furthermore, MemoryIndex doesn't support payloads.
> The patch here has 3 aspects to it:
> * Internal refactoring of MemoryIndex to simplify it by keeping the 
> fields sorted in a TreeMap.  This reduced the LOC for this file, even 
> with the other features I added.  It also puts the FieldInfo on the Info, 
> so there's one less data structure to keep around.  I suppose that with a 
> huge variety of fields in MemoryIndex, the aggregate N*log(N) cost of field 
> lookups could add up, but that seems very unlikely.  I also brought in 
> MemoryIndexNormDocValues as a simple anonymous inner class; it's 
> super-simple after all, not worth having in a separate file.
> * New MemoryIndex.addField(String fieldName, Terms) method.  In this case, 
> MemoryIndex is providing the supporting wrappers around the underlying Terms 
> so that it appears as an Index.  In so doing, MemoryIndex supports payloads 
> for such fields.
> * WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and 
> it supplies this to MemoryIndex.
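
As a rough illustration of the first point in the description (fields kept sorted in a TreeMap, with per-field metadata stored on the Info itself), here is a minimal, self-contained sketch. It is not the MemoryIndex code: SortedFieldsSketch, addToken, fieldNames and tokenCount are hypothetical names, and a plain field number stands in for Lucene's FieldInfo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch; the class and method names are not Lucene's.
final class SortedFieldsSketch {

    /** Per-field state. The field number lives here, so no side table is needed. */
    static final class Info {
        final int fieldNumber; // stands in for the FieldInfo kept on the Info
        int numTokens;

        Info(int fieldNumber) {
            this.fieldNumber = fieldNumber;
        }
    }

    // TreeMap keeps keys in sorted order, so iteration needs no separate sort step.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    /** Records one token for the field, creating the field's Info on first use. */
    void addToken(String fieldName) {
        Info info = fields.get(fieldName); // O(log N) in the number of fields
        if (info == null) {
            info = new Info(fields.size());
            fields.put(fieldName, info);
        }
        info.numTokens++;
    }

    /** Field names in sorted order, straight from the map. */
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    int tokenCount(String fieldName) {
        Info info = fields.get(fieldName);
        return info == null ? 0 : info.numTokens;
    }

    public static void main(String[] args) {
        SortedFieldsSketch index = new SortedFieldsSketch();
        index.addToken("title");
        index.addToken("body");
        index.addToken("body");
        System.out.println(index.fieldNames());
    }
}
```

The trade-off mentioned in the description shows up in addToken: each lookup costs O(log N) in the number of fields instead of the expected O(1) of a hash map, which only matters when an index holds a very large number of distinct fields.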



