[ 
https://issues.apache.org/jira/browse/LUCENE-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-6031:
---------------------------------
    Summary: TokenSources optimization, avoid sort  (was: TokenSources )

> TokenSources optimization, avoid sort
> -------------------------------------
>
>                 Key: LUCENE-6031
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6031
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/termvectors
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>
> TokenSources.java, in the highlight module, is a facade that returns a 
> TokenStream for a field by either un-inverting & converting the TermVector 
> Terms, or by text re-analysis if TermVectors are unavailable or don't have 
> the right options.  TokenSources is used by the default highlighter, which is 
> the most accurate highlighter we've got.  When documents are large (say 
> hundreds of kilobytes on up), I found that most of the highlighter's time 
> was spent up front un-inverting & converting the term vector to a 
> TokenStream, not on the actual highlighting that follows.  Much of that 
> time went to a huge sort of hundreds of thousands of Tokens.  Time was also 
> spent on lots of String conversion and char copying, and the process used a 
> lot of memory, too.
>
> In this patch, I overhauled TokenStreamFromTermPositionVector.java, and I 
> removed similar logic in TokenSources that was used when positions weren't 
> available but offsets were.  The overhauled class can un-invert term 
> vectors that have positions *and/or* offsets (at least one).  It doesn't 
> sort.  It places Tokens _directly_ into an array indexed by position.  When 
> positions aren't available, startOffset/8 is a substitute.  A more 
> light-weight Token inner class replaces the former, deprecated Token; these 
> light-weight Tokens ultimately form a linked list when the process is done.  
> There is no string conversion, and character copying is minimized.  The 
> Token array is only needed during construction, so it can be GC'ed after 
> initialization.
>
> Misc:
> * It implements reset() efficiently so it need not be wrapped in 
> CachingTokenFilter (I'll supply a patch later on this).
> * It only fetches payloads if you ask for them by adding the attribute (the 
> default highlighter won't add the attribute).  
> * It exposes the underlying TermVector terms via a getter too, which is 
> needed by another patch to follow later.
>
> A key assumption is that neither the position increment gap nor the first 
> position is gigantic, as that would create wasted space, and the 
> linked-list formation ultimately has to visit all the slots.  We also 
> assume that there aren't a ton of tokens at the same position, since 
> inserting new tokens in sorted order is O(N^2), where 'N' is the average 
> number of tokens co-occurring at a position.
>
> My performance testing using Lucene's benchmark module on a megabyte 
> document showed a >5x speedup, in conjunction with some other patches to be 
> posted separately.  This patch made the most difference.
>
> As an aside, our JIRA "Components" ought to be updated to reflect our Lucene 
> modules.  There should be a component for highlighting, and not for term 
> vectors.
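
The sort-free un-inversion described in the issue can be sketched roughly as
follows. This is only an illustration of the idea, not the code in the patch:
the names UninvertSketch, Tok, uninvert and termsInOrder are hypothetical, the
input is plain arrays rather than Lucene's term vector API, and ordering
tokens that share a position by start offset is an assumption (the issue only
says they are inserted "in sorted order").

```java
import java.util.ArrayList;
import java.util.List;

class UninvertSketch {
    /** Hypothetical light-weight token: term text, start offset, link to the next token. */
    static final class Tok {
        final String term;
        final int startOffset;
        Tok next;

        Tok(String term, int startOffset) {
            this.term = term;
            this.startOffset = startOffset;
        }
    }

    /**
     * terms[i] occurs at positions[i][j] with start offset startOffsets[i][j].
     * Returns the head of a linked list of tokens in position order, built
     * without sorting the full token list.
     */
    static Tok uninvert(String[] terms, int[][] positions, int[][] startOffsets) {
        // Size the slot array by the largest position seen.
        int maxPos = -1;
        for (int[] termPositions : positions) {
            for (int p : termPositions) {
                maxPos = Math.max(maxPos, p);
            }
        }
        Tok[] slots = new Tok[maxPos + 1];

        // Drop each token straight into the slot for its position.
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < positions[i].length; j++) {
                Tok t = new Tok(terms[i], startOffsets[i][j]);
                int pos = positions[i][j];
                // Tokens sharing a position form a short chain kept ordered by
                // start offset (assumed ordering); this is the per-slot
                // insertion whose cost grows with co-occurring tokens.
                Tok prev = null;
                Tok cur = slots[pos];
                while (cur != null && cur.startOffset <= t.startOffset) {
                    prev = cur;
                    cur = cur.next;
                }
                t.next = cur;
                if (prev == null) {
                    slots[pos] = t;
                } else {
                    prev.next = t;
                }
            }
        }

        // Visit every slot once, linking the non-empty chains together.
        Tok head = null;
        Tok tail = null;
        for (Tok slotHead : slots) {
            if (slotHead == null) {
                continue; // empty slot, e.g. from a position gap
            }
            if (head == null) {
                head = slotHead;
            } else {
                tail.next = slotHead;
            }
            tail = slotHead;
            while (tail.next != null) {
                tail = tail.next;
            }
        }
        return head; // the slots array is now unreferenced and can be collected
    }

    /** Walks the linked list and collects term text, for inspection. */
    static List<String> termsInOrder(Tok head) {
        List<String> out = new ArrayList<>();
        for (Tok t = head; t != null; t = t.next) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Terms arrive in term order, as in a term vector; "fast" is a
        // made-up synonym stacked on the same position as "quick".
        String[] terms = {"brown", "fast", "fox", "quick", "the"};
        int[][] positions = {{2}, {1}, {3}, {1}, {0}};
        int[][] startOffsets = {{10}, {4}, {16}, {4}, {0}};
        // prints [the, fast, quick, brown, fox]
        System.out.println(termsInOrder(uninvert(terms, positions, startOffsets)));
    }
}
```

In this sketch, rewinding to the head of the list would be all that a reset
needs, and a very large first position or position gap shows up directly as
empty slots that the final linking pass still has to walk.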



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
