[ 
http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368782 ] 

Doug Cutting commented on LUCENE-502:
-------------------------------------

> The question is how does the caching help when you have multiple documents.  
> My analysis is that (with a modern VM) it helps you only if the docFreq of a 
> term is 16-31 and you are using a ConjunctiveScorer (i.e. not Wildcard 
> searches).

The conjunctive scorer does not call score(HitCollector,int); that method is 
only called in a few cases these days.  It can help a lot with a single-term 
query for a very common term, or with disjunctive queries involving very 
common terms, although BooleanScorer2 no longer uses it in that case.  That's 
too bad: if all clauses of a query are optional, the old BooleanScorer was 
faster, but it didn't always return documents in order.  So it may indeed be 
time to retire this method.
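
For anyone not familiar with that code path, the difference is roughly the 
following (a self-contained sketch with made-up classes, not Lucene's real 
API): with score(HitCollector,int) the scorer owns the inner loop and can 
march through its postings in blocks, instead of being advanced one document 
at a time from the outside.

public class BulkScoringSketch {
  // Hypothetical collector callback, standing in for HitCollector.
  interface Collector { void collect(int doc, float score); }

  // Hypothetical postings for one term: parallel doc ids and frequencies.
  static final int[] DOCS  = {1, 4, 7, 10, 42};
  static final int[] FREQS = {2, 1, 3, 1, 5};

  // Outside-driven shape: the caller asks for one posting and one score at a
  // time, so the scorer never gets to amortize any per-call work.
  static void scoreOneByOne(Collector c) {
    for (int i = 0; i < DOCS.length; i++) {
      c.collect(DOCS[i], (float) Math.sqrt(FREQS[i]));  // stand-in tf score
    }
  }

  // Bulk entry point: the scorer controls the loop up to maxDoc, which is
  // where reading and scoring a block of postings at a time pays off.
  static void score(Collector c, int maxDoc) {
    for (int i = 0; i < DOCS.length && DOCS[i] < maxDoc; i++) {
      c.collect(DOCS[i], (float) Math.sqrt(FREQS[i]));
    }
  }

  public static void main(String[] args) {
    score(new Collector() {
      public void collect(int doc, float score) {
        System.out.println("doc " + doc + " -> " + score);
      }
    }, 20);
  }
}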

> SegmentTermDocs.read(int[], int[]) is no different from calling 
> SegmentTermDocs.next() 32 times.

If that were the case, then the TermDocs.read(int[], int[]) method would never 
have been added!  Benchmarking showed it to be much faster.  There's also 
optimized C++ code that implements this method in src/gcj.  In C++, with a 
memory-mapped index, the I/O completely inlines.  When I last benchmarked this 
in GCJ, it was twice as fast as anything HotSpot could do.

But without score(HitCollector,int), TermDocs.read(int[], int[]) will never be 
called.  Sigh.
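
To make the bulk-read point concrete, here is a self-contained sketch (not 
SegmentTermDocs itself) of the two call shapes.  The win is that per-call 
overhead is paid once per block of 32 rather than once per posting, and the 
tight decode loop is exactly the sort of thing the C++/GCJ version can inline:

public class BulkReadSketch {
  // Hypothetical postings: delta-encoded doc ids interleaved with freqs.
  private final int[] stream;
  private int pos = 0;
  private int lastDoc = 0;

  BulkReadSketch(int[] stream) { this.stream = stream; }

  // One posting per call: the shape of calling next() 32 times.
  boolean next(int[] docAndFreq) {
    if (pos >= stream.length) return false;
    lastDoc += stream[pos++];
    docAndFreq[0] = lastDoc;
    docAndFreq[1] = stream[pos++];
    return true;
  }

  // Up to docs.length postings per call: the shape of read(int[], int[]).
  int read(int[] docs, int[] freqs) {
    int n = 0;
    while (n < docs.length && pos < stream.length) {
      lastDoc += stream[pos++];
      docs[n] = lastDoc;
      freqs[n] = stream[pos++];
      n++;
    }
    return n;
  }

  public static void main(String[] args) {
    int[] postings = {1, 2,  3, 1,  6, 5};   // decodes to docs 1, 4, 10
    BulkReadSketch td = new BulkReadSketch(postings);
    int[] docs = new int[32], freqs = new int[32];
    int n = td.read(docs, freqs);
    for (int i = 0; i < n; i++) {
      System.out.println("doc " + docs[i] + " freq " + freqs[i]);
    }
  }
}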

As for the scoreCache, this is certainly useful for terms that occur in 
thousands of documents, and useless for terms that occur only once.  Perhaps we 
should have two TermScorer implementations, one for common terms and one for 
rare terms, and have TermWeight select which to use.
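
Something along these lines (hypothetical names, and the cutoff is only 
illustrative; a real one should be chosen by benchmarking):

public class TermWeightSketch {
  // Illustrative cutoff only, not a measured value.
  private static final int COMMON_TERM_CUTOFF = 32;

  interface SimpleScorer { /* next(), doc(), score(), ... */ }

  // For common terms: reads docs/freqs in blocks and keeps a small score cache.
  static class BufferedTermScorer implements SimpleScorer { }

  // For rare terms: reads one posting at a time, no buffers, no cache.
  static class UnbufferedTermScorer implements SimpleScorer { }

  // What a TermWeight-like class could do when asked for a scorer.
  static SimpleScorer scorer(int docFreq) {
    return docFreq >= COMMON_TERM_CUTOFF
        ? new BufferedTermScorer()
        : new UnbufferedTermScorer();
  }
}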

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for 
> each term scored.  When querying for a lot of terms, this creates a lot of 
> unnecessary garbage.  The SegmentTermDocs from which it retrieves its 
> information doesn't have any optimizations for bulk loading, so the caching 
> is unnecessary.
> In addition, it has a SCORE_CACHE of limited benefit.  It caches the result 
> of a sqrt that should be placed in DefaultSimilarity, and if you're only 
> scoring a few documents that contain those terms, there's no need to 
> precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not 
> cache the docs or freqs.  For queries with many terms, that saves 196 bytes 
> per term, unnecessary disk IO, and extra SQRTs, which adds up.
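
For reference, the SCORE_CACHE being discussed amounts to something like the 
following sketch, assuming the usual sqrt(tf) * weight contribution: the table 
is filled once per term and consulted for small term frequencies.

public class ScoreCacheSketch {
  private static final int CACHE_SIZE = 32;
  private final float[] cache = new float[CACHE_SIZE];
  private final float weightValue;

  ScoreCacheSketch(float weightValue) {
    this.weightValue = weightValue;
    // Filled once per term, so frequent small tf values skip the sqrt below.
    for (int f = 0; f < CACHE_SIZE; f++) {
      cache[f] = (float) Math.sqrt(f) * weightValue;
    }
  }

  float score(int freq) {
    return freq < CACHE_SIZE
        ? cache[freq]                              // common case: table lookup
        : (float) Math.sqrt(freq) * weightValue;   // rare, very large tf
  }

  public static void main(String[] args) {
    ScoreCacheSketch s = new ScoreCacheSketch(0.5f);
    System.out.println(s.score(4));    // 1.0
    System.out.println(s.score(100));  // 5.0
  }
}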


