[ https://issues.apache.org/jira/browse/LUCENE-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Smiley updated LUCENE-6031:
---------------------------------
    Summary: TokenSources optimization, avoid sort  (was: TokenSources )

> TokenSources optimization, avoid sort
> -------------------------------------
>
>                 Key: LUCENE-6031
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6031
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/termvectors
>            Reporter: David Smiley
>            Assignee: David Smiley
>             Fix For: 5.0
>
>
> TokenSources.java, in the highlight module, is a facade that returns a
> TokenStream for a field by either un-inverting & converting the TermVector
> Terms, or by text re-analysis if TermVectors are unavailable or don't have
> the right options. TokenSources is used by the default highlighter, which is
> the most accurate highlighter we've got. When documents are large (say
> hundreds of kilobytes on up), I found that most of the highlighter's time
> was spent up front un-inverting & converting the term vector to a
> TokenStream, not on the actual highlighting that follows. Much of that
> time went to a huge sort of hundreds of thousands of Tokens. Time was also
> spent doing lots of String conversion and char copying, and it used a lot of
> memory, too.
> In this patch, I overhauled TokenStreamFromTermPositionVector.java, and I
> removed similar logic in TokenSources that was used in circumstances when
> positions weren't available but offsets were. This class can un-invert term
> vectors that have positions *and/or* offsets (at least one). It doesn't
> sort. It places Tokens _directly_ into an array of tokens indexed by
> position. When positions aren't available, the startOffset/8 is a
> substitute. I've got a more light-weight Token inner class, used in place of
> the former and deprecated Token, that ultimately forms a linked-list when the
> process is done. There is no string conversion; character copying is
> minimized. The Token array is GC'ed after initialization; it's only needed
> during construction.
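
The un-inversion idea described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the patch: it uses no Lucene classes, and every name in it (PositionBucketSketch, Tok, TermEntry, unInvert, insertSorted, termsInOrder) is hypothetical. It assumes positions are available for every term (the startOffset/8 fallback is left out) and that tokens sharing a position are kept ordered by start offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: rebuild token order from term-vector-like data without a global sort. */
final class PositionBucketSketch {

  /** Lightweight token; 'next' chains tokens within a slot and, later, the whole stream. */
  static final class Tok {
    final String term;
    final int startOffset;
    final int endOffset;
    Tok next;

    Tok(String term, int startOffset, int endOffset) {
      this.term = term;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }
  }

  /** One term with parallel arrays describing each of its occurrences (assumed input shape). */
  static final class TermEntry {
    final String term;
    final int[] positions;
    final int[] startOffsets;
    final int[] endOffsets;

    TermEntry(String term, int[] positions, int[] startOffsets, int[] endOffsets) {
      this.term = term;
      this.positions = positions;
      this.startOffsets = startOffsets;
      this.endOffsets = endOffsets;
    }
  }

  /** Places every occurrence into the slot for its position, then links the slots in order. */
  static Tok unInvert(List<TermEntry> terms) {
    int maxPos = -1;
    for (TermEntry e : terms) {
      for (int p : e.positions) {
        maxPos = Math.max(maxPos, p);
      }
    }
    // Temporary array indexed by position; it becomes garbage once linking is done.
    Tok[] slots = new Tok[maxPos + 1];
    for (TermEntry e : terms) {
      for (int i = 0; i < e.positions.length; i++) {
        Tok tok = new Tok(e.term, e.startOffsets[i], e.endOffsets[i]);
        slots[e.positions[i]] = insertSorted(slots[e.positions[i]], tok);
      }
    }
    // Visit every slot once, skipping empty ones, to form a single linked list.
    Tok head = null;
    Tok tail = null;
    for (Tok slotHead : slots) {
      if (slotHead == null) {
        continue;
      }
      if (head == null) {
        head = slotHead;
      } else {
        tail.next = slotHead;
      }
      tail = slotHead;
      while (tail.next != null) {
        tail = tail.next;
      }
    }
    return head;
  }

  /** Inserts into a slot's short chain ordered by start offset; ties keep arrival order. */
  static Tok insertSorted(Tok head, Tok tok) {
    if (head == null || tok.startOffset < head.startOffset) {
      tok.next = head;
      return tok;
    }
    Tok prev = head;
    while (prev.next != null && prev.next.startOffset <= tok.startOffset) {
      prev = prev.next;
    }
    tok.next = prev.next;
    prev.next = tok;
    return head;
  }

  /** Walks the linked list and collects the terms in stream order. */
  static List<String> termsInOrder(Tok head) {
    List<String> out = new ArrayList<>();
    for (Tok t = head; t != null; t = t.next) {
      out.add(t.term);
    }
    return out;
  }
}
```

In this sketch, calling unInvert on entries supplied in term order and then termsInOrder on the result yields the terms in position order. The cost is one pass over the occurrences plus one pass over the slots, so a large gap in positions shows up as empty slots that still have to be visited.
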
> Misc:
> * It implements reset() efficiently so it need not be wrapped in
> CachingTokenFilter (I'll supply a patch later on this).
> * It only fetches payloads if you ask for them by adding the attribute (the
> default highlighter won't add the attribute).
> * It exposes the underlying TermVector terms via a getter too, which is
> needed by another patch to follow later.
> A key assumption is that the position increment gap or first position isn't
> gigantic, as that will create wasted space and the linked-list formation
> ultimately has to visit all the slots. We also assume that there aren't a
> ton of tokens at the same position, since inserting new tokens in sorted
> order is O(N^2), where 'N' is the average number of tokens sharing a
> position.
> My performance testing using Lucene's benchmark module on a megabyte document
> showed a >5x speedup, in conjunction with some other patches to be posted
> separately. This patch made the most difference.
> As an aside, our JIRA "Components" ought to be updated to reflect our Lucene
> modules. There should be a component for highlighting, and not for term
> vectors.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)