[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

paul.elschot (JIRA) Fri, 03 Mar 2006 12:56:02 -0800

    [ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368792 ]


paul.elschot commented on LUCENE-502:
-------------------------------------

>> The question is how does the caching help when you have multiple documents. 
>> My analysis is that (with a modern VM) it helps you only if the docFreq of a 
>> term is 16-31 and you are using a ConjunctiveScorer (i.e. not Wildcard 
>> searches). 
 
> The conjunctive scorer does not call score(HitCollector,int). This is only 
> called in a few cases anymore. It can help a lot with a single-term query for 
> a very common term, or for disjunctive queries involving very common terms, 
> although BooleanScorer2 no longer uses it in this case. That's too bad. If 
> all clauses to a query are optional, then the old BooleanScorer was faster. 
> But it didn't always return documents in order... So it indeed may be time to 
> retire this method. 

With BooleanScorer2 it is quite possible to use different versions of
DisjunctionScorer: one for the top level of a query, which does not need
skipTo(), and one for lower levels, which allows skipTo(). The top-level one
can be implemented just like the "old" BooleanScorer.

IIRC the methods needed for such different behaviour are already in place
(for scoring a range of documents). This only needs to be implemented for
DisjunctionScorer, and the top-level BooleanScorer2 should then use it when
appropriate.
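
A minimal sketch of the two variants, to make the distinction concrete. The
classes below (DisjunctionSketch, Postings, SkippingDisjunction) and the
WINDOW size are hypothetical stand-ins, not the Lucene API. The skipping
variant uses a plain linear scan, and the windowed variant flushes its
buckets in document order, which the old BooleanScorer did not always do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; none of this is the Lucene API.
final class DisjunctionSketch {

    /** Postings of one optional clause: ascending doc ids and a score per doc. */
    static final class Postings {
        final int[] docs;
        final float[] scores;
        int pos = 0; // index of the current posting

        Postings(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        boolean exhausted() { return pos >= docs.length; }
    }

    static final int WINDOW = 8; // docs per window; an arbitrary small value

    /** Top-level variant: never skips, sums clause scores one window of docs at a time. */
    static Map<Integer, Float> scoreTopLevel(Postings[] clauses) {
        Map<Integer, Float> hits = new LinkedHashMap<>();
        float[] bucket = new float[WINDOW];
        boolean[] seen = new boolean[WINDOW];
        boolean more = true;
        for (int base = 0; more; base += WINDOW) {
            more = false;
            for (Postings p : clauses) {
                // Drain every posting of this clause that falls inside the window.
                while (!p.exhausted() && p.docs[p.pos] < base + WINDOW) {
                    int slot = p.docs[p.pos] - base;
                    bucket[slot] += p.scores[p.pos];
                    seen[slot] = true;
                    p.pos++;
                }
                more |= !p.exhausted();
            }
            // Flush the window in slot order, then reset it for reuse.
            for (int i = 0; i < WINDOW; i++) {
                if (seen[i]) {
                    hits.put(base + i, bucket[i]);
                    bucket[i] = 0f;
                    seen[i] = false;
                }
            }
        }
        return hits;
    }

    /** Lower-level variant: document at a time, so an enclosing scorer can call skipTo(). */
    static final class SkippingDisjunction {
        private final Postings[] clauses;
        private int doc = -1;
        private float score;

        SkippingDisjunction(Postings[] clauses) { this.clauses = clauses; }

        /** Moves to the first matching doc >= target; returns false when exhausted. */
        boolean skipTo(int target) {
            int min = Integer.MAX_VALUE;
            for (Postings p : clauses) {
                while (!p.exhausted() && p.docs[p.pos] < target) p.pos++;
                if (!p.exhausted()) min = Math.min(min, p.docs[p.pos]);
            }
            if (min == Integer.MAX_VALUE) return false;
            doc = min;
            score = 0f;
            for (Postings p : clauses) {
                if (!p.exhausted() && p.docs[p.pos] == min) score += p.scores[p.pos];
            }
            return true;
        }

        boolean next() { return skipTo(doc + 1); }
        int doc() { return doc; }
        float score() { return score; }
    }

    /** Two sample clauses; doc 3 matches both. */
    static Postings[] sample() {
        return new Postings[] {
            new Postings(new int[] {1, 3, 9}, new float[] {1f, 1f, 1f}),
            new Postings(new int[] {3, 10}, new float[] {0.5f, 2f})
        };
    }

    public static void main(String[] args) {
        System.out.println(scoreTopLevel(sample()));
        SkippingDisjunction d = new SkippingDisjunction(sample());
        if (d.skipTo(4)) System.out.println(d.doc() + " " + d.score());
    }
}
```

The windowed variant only ever moves forward through each clause, which is
why it can serve at the top level but not below a conjunction, where the
enclosing scorer needs skipTo().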

Regards,
Paul Elschot


> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for 
> each term scored.  When querying for a lot of terms, this causes a lot of 
> garbage to be created that's unnecessary.  The SegmentTermDocs from which it 
> retrieves its information doesn't have any optimizations for bulk loading, 
> and it's unnecessary.
> In addition, it has a SCORE_CACHE, that's of limited benefit.  It's caching 
> the result of a sqrt that should be placed in DefaultSimilarity, and if 
> you're only scoring a few documents that contain those terms, there's no need 
> to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not 
> cache the docs or freqs.  In the case of a lot of queries, that saves 196 
> bytes/term, the unnecessary disk IO, and extra SQRTs, which adds up.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
