[jira] [Commented] (LUCENE-6034) MemoryIndex should be able to wrap TermVector Terms

Robert Muir (JIRA) Wed, 03 Dec 2014 07:07:39 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233087#comment-14233087
 ]


Robert Muir commented on LUCENE-6034:
-------------------------------------

{quote}
After having implemented that LeafReader subclass, I do tend to think 
LeafReader should have fewer abstract methods. It would be nice if the only 
abstract methods were fields(), getFieldInfos(), and maxDoc(). FieldInfos 
feels like something that should be retrieved from Fields, not LeafReader.
{quote}

How can you think that? You act as if the inverted index is the only thing 
going on. Maybe we should just remove term vectors then, if they aren't very 
important? And stored fields too? And docvalues and norms? This would certainly 
be less code to maintain. And we wouldn't have to store all that stuff in 
fieldinfos that's unrelated to postings lists either.

> MemoryIndex should be able to wrap TermVector Terms
> ---------------------------------------------------
>
>                 Key: LUCENE-6034
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6034
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>         Attachments: LUCENE-6034.patch, LUCENE-6034.patch, LUCENE-6034.patch, 
> LUCENE-6034.patch, LUCENE-6034_Simplify_MemoryIndex.patch
>
>
> The default highlighter has a "WeightedSpanTermExtractor" that uses 
> MemoryIndex for certain queries -- basically phrases, SpanQueries, and the 
> like.  For lots of text, this aspect of highlighting is time consuming and 
> consumes a fair amount of memory.  What also consumes memory is that it wraps 
> the tokenStream in CachingTokenFilter in this case.  But if the underlying 
> TokenStream is actually from TokenSources (wrapping TermVector Terms), this 
> is all needless!  Furthermore, MemoryIndex doesn't support payloads.
> The patch here has 3 aspects to it:
> * Internal refactoring to MemoryIndex to simplify it by maintaining the 
> fields in a sorted state using a TreeMap.  The ramifications of this led to 
> reduced LOC for this file, even with the other features I added.  It also 
> puts the FieldInfo on the Info, and thus there's one less data structure to 
> keep around.  I suppose if there are a huge variety of fields in MemoryIndex, 
> the aggregated N*Log(N) field lookup could add up, but that seems very 
> unlikely.  I also brought in the MemoryIndexNormDocValues as a simple 
> anonymous inner class - it's super-simple after all, not worth having in a 
> separate file.
> * New MemoryIndex.addField(String fieldName, Terms) method.  In this case, 
> MemoryIndex is providing the supporting wrappers around the underlying Terms 
> so that it appears as an Index.  In so doing, MemoryIndex supports payloads 
> for such fields.
> * WeightedSpanTermExtractor now detects TokenSources' wrapping of Terms and 
> it supplies this to MemoryIndex.
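
For illustration only, the first aspect of the description (fields kept sorted in a TreeMap, with the per-field metadata stored on the Info itself) might be sketched roughly as below. This is not the code from the attached patches and it uses no Lucene classes: SortedFieldsSketch, Info, fieldNumber, and termFreqs are hypothetical names chosen for the example. It assumes only that each field lookup is a TreeMap lookup, which costs O(log N) in the number of fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the idea in the issue description; not Lucene code.
public class SortedFieldsSketch {

    // Per-field state. The field's metadata (here just a number) lives on the
    // Info itself, so no separate field-metadata map is needed.
    static final class Info {
        final int fieldNumber;
        final Map<String, Integer> termFreqs = new TreeMap<>();

        Info(int fieldNumber) {
            this.fieldNumber = fieldNumber;
        }
    }

    // TreeMap keeps field names sorted at all times, so no sort step is
    // required before iterating fields in order.
    private final TreeMap<String, Info> fields = new TreeMap<>();

    // Records one occurrence of a term in a field, creating the field on first use.
    void addTerm(String field, String term) {
        Info info = fields.get(field);          // O(log N) lookup
        if (info == null) {
            info = new Info(fields.size());     // number fields in insertion order
            fields.put(field, info);
        }
        info.termFreqs.merge(term, 1, Integer::sum);
    }

    // Field names in sorted order, straight from the map.
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }

    // Frequency of a term in a field, or 0 if either is unknown.
    int freq(String field, String term) {
        Info info = fields.get(field);
        return info == null ? 0 : info.termFreqs.getOrDefault(term, 0);
    }

    // Number assigned to a field, or -1 if the field is unknown.
    int fieldNumber(String field) {
        Info info = fields.get(field);
        return info == null ? -1 : info.fieldNumber;
    }
}
```

In this sketch, adding terms to "title" and then "body" yields the field names in the order body, title without any explicit sorting, which is the property the description relies on.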



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
