[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

Steven Tamm (JIRA) Fri, 03 Mar 2006 10:59:03 -0800

    [ 
http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368775 ]


Steven Tamm commented on LUCENE-502:
------------------------------------

The main point is this:  when you use TermScorer to score a single document, 
it does a lot of extra work.  It reads 31 extra documents from disk and 
calculates the weight factors for all of them.  The question is how much the 
caching helps when you have multiple documents.  My analysis is that (with a 
modern VM) it helps only if the docFreq of a term is 16-31 and you are using a 
ConjunctionScorer (i.e. not wildcard searches).  I would imagine this use case 
is not uncommon.  Anyone using wildcard searches will see *immediate* benefit 
from installing this patch.
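
To make the "31 extra documents" figure concrete, here is a minimal sketch 
that counts how many postings each approach decodes before the caller has 
seen the documents it asked for.  It is a toy model, not Lucene code: the 
class and method names are hypothetical, and the batch size of 32 is taken 
from the description above.

```java
/** Hypothetical model, not Lucene code: counts postings decoded per approach. */
public class BufferedVsLazySketch {
    // Batch size described in the comment above.
    static final int BUFFER_SIZE = 32;

    /** Buffered scorer: always refills a whole batch before exposing a document. */
    public static int decodedByBufferedScorer(int docFreq, int docsWanted) {
        int decoded = 0;
        int consumed = 0;
        while (consumed < docsWanted && decoded < docFreq) {
            int batch = Math.min(BUFFER_SIZE, docFreq - decoded);
            decoded += batch;
            consumed = Math.min(docsWanted, decoded);
        }
        return decoded;
    }

    /** Unbuffered scorer: decodes exactly one posting per document consumed. */
    public static int decodedByLazyScorer(int docFreq, int docsWanted) {
        return Math.min(docFreq, docsWanted);
    }
}
```

For a term with 1000 postings where only one document is consumed, the 
buffered model decodes 32 postings and the lazy one decodes 1.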

So I'm going to analyze this from the "amount of work to do" perspective.

TermScorer.next():  If you are calling TermScorer.next() there is no real 
difference.  SegmentTermDocs.read(int[], int[]) is no different from calling 
SegmentTermDocs.next() 32 times.  The patch changes TermScorer.next() to 
always call next() on the underlying SegmentTermDocs.  The only cost I'm 
removing is the caching, and I'm not adding any new costs.  Therefore there's 
no change, with the exception of adding the cache for use in skipTo().
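
The equivalence claimed here can be illustrated with a small in-memory 
stand-in.  This is only a sketch under assumptions: the Cursor class below is 
hypothetical and merely mimics the shape of a next()/read() enumerator; it 
is not SegmentTermDocs.  Its bulk read has no bulk optimisation, so it is 
literally next() in a loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model, not Lucene code. */
public class NextVsBulkReadSketch {

    /** In-memory postings cursor standing in for a TermDocs-like enumerator. */
    static final class Cursor {
        private final int[] docs;
        private final int[] freqs;
        private int pos = -1;

        Cursor(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        boolean next() { return ++pos < docs.length; }
        int doc() { return docs[pos]; }
        int freq() { return freqs[pos]; }

        /** Bulk read with no bulk optimisation: just next() in a loop. */
        int read(int[] outDocs, int[] outFreqs) {
            int n = 0;
            while (n < outDocs.length && next()) {
                outDocs[n] = doc();
                outFreqs[n] = freq();
                n++;
            }
            return n;
        }
    }

    /** Collects doc ids by calling next() once per posting. */
    public static List<Integer> docsViaNext(int[] docs, int[] freqs) {
        Cursor c = new Cursor(docs, freqs);
        List<Integer> out = new ArrayList<>();
        while (c.next()) {
            out.add(c.doc());
        }
        return out;
    }

    /** Collects doc ids by repeatedly filling a batch of the given size. */
    public static List<Integer> docsViaBulkRead(int[] docs, int[] freqs, int batchSize) {
        Cursor c = new Cursor(docs, freqs);
        List<Integer> out = new ArrayList<>();
        int[] d = new int[batchSize];
        int[] f = new int[batchSize];
        int n;
        while ((n = c.read(d, f)) > 0) {
            for (int i = 0; i < n; i++) {
                out.add(d[i]);
            }
        }
        return out;
    }
}
```

Both methods visit the same postings in the same order; the bulk path only 
adds the two batch arrays.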

TermScorer.skipTo():  The only case where my patch is worse is if the frequency 
of the term is greater than the skip interval (i.e. >= 16 documents per term).  
In this case, if you are retrieving more than 16 documents (but fewer than 32), 
the existing cache lets you avoid accessing the skipStream entirely.  If you 
are retrieving more than 32 documents, then you need to access the skipStream 
anyway, and since both of the underlying IndexInputs are cached, repositioning 
the freqStream is only pointer manipulation.
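
As a rough illustration of why the skip interval matters, the sketch below 
compares a plain linear skipTo with one that first jumps over whole blocks 
of 16 postings.  It is a simplified in-memory model with hypothetical names, 
not Lucene's on-disk skipStream: the last doc id of each block plays the role 
of a skip entry.  The interval of 16 is the value cited above.

```java
/** Hypothetical model, not Lucene code: skipTo over a sorted doc id array. */
public class SkipToSketch {
    // Skip interval cited in the comment above.
    static final int SKIP_INTERVAL = 16;

    /** Returns the index of the first doc >= target, scanning linearly from 'from'. */
    public static int linearSkipTo(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    /** Same result, but first jumps over whole blocks whose last doc is < target. */
    public static int skipListSkipTo(int[] docs, int from, int target) {
        int i = from;
        // Index of the last posting in the block that contains 'from'.
        int blockEnd = (i / SKIP_INTERVAL + 1) * SKIP_INTERVAL - 1;
        while (blockEnd < docs.length && docs[blockEnd] < target) {
            i = blockEnd + 1;          // every doc in this block is too small
            blockEnd += SKIP_INTERVAL; // move on to the next skip entry
        }
        return linearSkipTo(docs, i, target);
    }
}
```

With fewer than 16 postings no block is ever skipped, so the two methods do 
the same work; the skip entries only pay off on longer postings lists.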

TermScorer.score():
"In some cases JVM's may have evolved so that some of them are no longer 
required."  I can imagine that the scoreCache made a lot of sense in JDK 1.1, 
when the cost of Math.sqrt would have been high.  However, if the TermScorer 
is only going to be used for a single document, precomputing the cache is 
obviously the wrong trade-off.  Like I said before, a call to 
DefaultSimilarity.tf(int) would end up inlined by the HotSpot compiler, and 
Math.sqrt itself is compiled down to a processor instruction, so computing it 
on demand is not a big deal.
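
The trade-off can be sketched as follows.  This is a simplified model, not 
the TermScorer source: tf() mirrors DefaultSimilarity.tf(int), which returns 
the square root of the frequency, while the class name, method names, and the 
cache size of 32 are assumptions made for illustration.

```java
/** Hypothetical model, not Lucene code: cached versus direct scoring. */
public class ScoreCacheSketch {
    // Assumed cache size, chosen for illustration.
    static final int SCORE_CACHE_SIZE = 32;

    /** Term frequency factor: square root of the raw frequency. */
    public static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Eagerly computes one sqrt per cache slot, whether or not it is ever used. */
    public static float[] buildScoreCache(float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int f = 0; f < cache.length; f++) {
            cache[f] = tf(f) * weight;
        }
        return cache;
    }

    /** Looks up small frequencies in the cache, computes large ones directly. */
    public static float scoreWithCache(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : tf(freq) * weight;
    }

    /** Computes the score on demand: one sqrt per scored document. */
    public static float scoreDirect(int freq, float weight) {
        return tf(freq) * weight;
    }
}
```

Building the cache costs 32 square roots up front, so when only one document 
is scored the direct path does strictly less work.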

I want other people to test this and tell me about any problems with it.  
Whether or not you accept the patches is less important to me than making them 
available to other people who have similar performance problems.  Perhaps I 
should have created a parallel structure to TermScorer that you can use when 
you have a low hit/term ratio. 

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for 
> each term scored.  When querying for a lot of terms, this causes a lot of 
> garbage to be created that's unnecessary.  The SegmentTermDocs from which it 
> retrieves its information doesn't have any optimizations for bulk loading, 
> and it's unnecessary.
> In addition, it has a SCORE_CACHE, that's of limited benefit.  It's caching 
> the result of a sqrt that should be placed in DefaultSimilarity, and if 
> you're only scoring a few documents that contain those terms, there's no need 
> to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not 
> cache the docs or freqs.  In the case of a lot of queries, that saves 196 
> bytes/term, the unnecessary disk IO, and extra SQRTs, which add up.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


