[jira] [Updated] (LUCENE-6031) TokenSources optimization, avoid sort

David Smiley (JIRA) Thu, 30 Oct 2014 09:16:49 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Smiley updated LUCENE-6031:
---------------------------------
    Component/s:     (was: core/termvectors)
                 modules/highlighter
    Description: 
TokenSources.java, in the highlight module, is a facade that returns a 
TokenStream for a field, either by un-inverting and converting the TermVector 
Terms or by re-analyzing the text if TermVectors are unavailable or lack the 
right options.  TokenSources is used by the default highlighter, which is the 
most accurate highlighter we've got.  When documents are large (say hundreds of 
kilobytes and up), I found that most of the highlighter's time was spent up 
front un-inverting and converting the term vector to a TokenStream, not on the 
actual highlighting that follows.  Much of that time went to a huge sort of 
hundreds of thousands of Tokens.  Time was also spent on lots of String 
conversion and char copying, and the process used a lot of memory, too.

In this patch, I overhauled TokenStreamFromTermPositionVector.java, and I 
removed similar logic in TokenSources that was used when positions weren't 
available but offsets were.  This class can un-invert term vectors that have 
positions *and/or* offsets (at least one).  It doesn't sort.  It places Tokens 
_directly_ into an array indexed by position.  When positions aren't 
available, startOffset/8 is used as a substitute.  A more light-weight Token 
inner class replaces the former, deprecated Token; these tokens ultimately 
form a linked list when the process is done.  There is no string conversion, 
and character copying is minimized.  The Token array is only needed during 
construction, so it can be GC'ed after initialization.
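
To illustrate the idea of un-inverting without a sort, here is a minimal 
sketch. It is not the code from the patch: the class and member names 
(UninvertSketch, Tok, TermEntry, uninvert, render) are hypothetical, it uses 
plain Strings rather than Lucene's term-vector API, and it assumes every term 
carries positions. Each token is dropped into the slot for its position, then 
one pass over the slots links them into a list.

```java
import java.util.Arrays;
import java.util.List;

class UninvertSketch {
    /** Hypothetical light-weight token: one node of the final linked list. */
    static final class Tok {
        final String term;
        final int position;
        Tok next;
        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Hypothetical stand-in for one term-vector entry: a term and its positions. */
    static final class TermEntry {
        final String term;
        final int[] positions;
        TermEntry(String term, int... positions) {
            this.term = term;
            this.positions = positions;
        }
    }

    /** Rebuilds document order from term-ordered entries; returns the list head. */
    static Tok uninvert(List<TermEntry> terms) {
        int maxPos = -1;
        for (TermEntry e : terms) {
            for (int p : e.positions) {
                maxPos = Math.max(maxPos, p);
            }
        }
        // One slot per position; this array is only needed while building.
        Tok[] slots = new Tok[maxPos + 1];
        for (TermEntry e : terms) {
            for (int p : e.positions) {
                Tok t = new Tok(e.term, p);
                t.next = slots[p]; // same-position order is arbitrary in this sketch
                slots[p] = t;
            }
        }
        // Visit every slot once (empty ones included) and chain the tokens.
        Tok head = null;
        Tok tail = null;
        for (Tok slotHead : slots) {
            if (slotHead == null) {
                continue;
            }
            if (head == null) {
                head = slotHead;
            } else {
                tail.next = slotHead;
            }
            tail = slotHead;
            while (tail.next != null) {
                tail = tail.next;
            }
        }
        return head;
    }

    /** Joins the terms of the list with single spaces. */
    static String render(Tok head) {
        StringBuilder sb = new StringBuilder();
        for (Tok t = head; t != null; t = t.next) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Entries arrive in term order, as in a term vector, not document order.
        List<TermEntry> tv = Arrays.asList(
            new TermEntry("brown", 2),
            new TermEntry("fox", 3),
            new TermEntry("quick", 1),
            new TermEntry("the", 0, 4));
        System.out.println(render(uninvert(tv)));
    }
}
```

The work is proportional to the number of tokens plus the number of slots, 
which is why large gaps in the position space cost time as well as memory.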

Misc:
* It implements reset() efficiently so it need not be wrapped in 
CachingTokenFilter (I'll supply a patch later on this).
* It only fetches payloads if you ask for them by adding the attribute (the 
default highlighter won't add the attribute).  
* It exposes the underlying TermVector terms via a getter too, which is needed 
by another patch to follow later.

A key assumption is that the position increment gap or first position isn't 
gigantic, as that would waste space, and the linked-list formation ultimately 
has to visit all the slots.  We also assume that there aren't a ton of tokens 
at the same position, since inserting new tokens in sorted order is O(N^2), 
where 'N' is the average number of tokens co-occurring at a position.
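
The cost of many tokens sharing a slot can be seen in a small sketch of 
sorted insertion. Again, this is an illustration under assumptions rather 
than the patch's code: SlotInsertSketch, Tok, insertSorted and offsets are 
hypothetical names, and ordering within a slot by start offset is my 
assumption. The slot is chosen as startOffset/8, as when positions are 
unavailable.

```java
class SlotInsertSketch {
    /** Hypothetical light-weight token carrying only what this sketch needs. */
    static final class Tok {
        final String term;
        final int startOffset;
        Tok next;
        Tok(String term, int startOffset) {
            this.term = term;
            this.startOffset = startOffset;
        }
    }

    /**
     * Inserts tok into the chain that starts at head, keeping ascending
     * startOffset order, and returns the (possibly new) head of the chain.
     * Each call may walk the whole chain, so filling one slot with N tokens
     * costs O(N^2) comparisons in the worst case.
     */
    static Tok insertSorted(Tok head, Tok tok) {
        if (head == null || tok.startOffset < head.startOffset) {
            tok.next = head;
            return tok;
        }
        Tok prev = head;
        while (prev.next != null && prev.next.startOffset <= tok.startOffset) {
            prev = prev.next;
        }
        tok.next = prev.next;
        prev.next = tok;
        return head;
    }

    /** Renders the start offsets of one chain, comma separated. */
    static String offsets(Tok head) {
        StringBuilder sb = new StringBuilder();
        for (Tok t = head; t != null; t = t.next) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(t.startOffset);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] terms = {"b", "a", "c", "a2", "d"};
        int[] starts = {9, 0, 12, 3, 25};
        Tok[] slots = new Tok[starts[starts.length - 1] / 8 + 1];
        for (int i = 0; i < terms.length; i++) {
            int slot = starts[i] / 8; // substitute for a missing position
            slots[slot] = insertSorted(slots[slot], new Tok(terms[i], starts[i]));
        }
        for (int s = 0; s < slots.length; s++) {
            System.out.println("slot " + s + ": " + offsets(slots[s]));
        }
    }
}
```

With only a handful of tokens per slot the walk is short, which is the 
assumption stated above.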

My performance testing using Lucene's benchmark module on a megabyte document 
showed >5x speedup, in conjunction with some other patches to be posted 
separately. This patch made the most difference.

  was:
TokenSources.java, in the highlight module, is a facade that returns a 
TokenStream for a field by either un-inverting & converting the TermVector 
Terms, or by text re-analysis if TermVectors are unavailable or don't have the 
right options.  TokenSources is used by the default highlighter, which is the 
most accurate highlighter we've got.  When documents are large (say hundreds of 
kilobytes on up), I found that most of the highlighter's activity was up-front 
spent un-inverting & converting the term vector to a TokenStream, not on the 
actual/real highlighting that follows.  Much of that time was on a huge sort of 
hundreds of thousands of Tokens.  Time was also spent doing lots of String 
conversion and char copying, and it used a lot of memory, too.

In this patch, I overhauled TokenStreamFromTermPositionVector.java, and I 
removed similar logic in TokenSources that was used in circumstances when 
positions weren't available but offsets were.  This class can un-invert term 
vectors that have positions *and/or* offsets (at least one).  It doesn't sort.  
It places Tokens _directly_ into an array of tokens directly indexed by 
position.  When positions aren't available, the startOffset/8 is a substitute.  
I've got a more light-weight Token inner class used in place of the former and 
deprecated Token that ultimately forms a linked-list when the process is done.  
There is no string conversion; character copying is minimized.  The Token array 
is GC'ed after initialization, it's only needed during construction.

Misc:
* It implements reset() efficiently so it need not be wrapped in 
CachingTokenFilter (I'll supply a patch later on this).
* It only fetches payloads if you ask for them by adding the attribute (the 
default highlighter won't add the attribute).  
* It exposes the underlying TermVector terms via a getter too, which is needed 
by another patch to follow later.

A key assumption is that the position increment gap or first position isn't 
gigantic, as that will create wasted space and the linked-list formation 
ultimately has to visit all the slots.  We also assume that there aren't a ton 
of tokens at the same position, since inserting new tokens in sorted order is 
O(N^2) where 'N' is the average co-occurring token length.

My performance testing using Lucene's benchmark module on a megabyte document 
showed >5x speedup, in conjunction with some other patches to be posted 
separately. This patch made the most difference.

As an aside, our JIRA "Components" ought to be updated to reflect our Lucene 
modules.  There should be a component for highlighting, and not for term 
vectors.


> TokenSources optimization, avoid sort
> -------------------------------------
>
>                 Key: LUCENE-6031
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6031
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>         Attachments: LUCENE-6031.patch
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
