[jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily

Doug Cutting (JIRA) Fri, 03 Mar 2006 13:29:06 -0800

    [ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368797 ]


Doug Cutting commented on LUCENE-502:
-------------------------------------

>  Which is true? Or, as it seems likely, TermScorer was optimized for a case 
> that is no longer valid (i.e. ConjunctiveScorer). 

No, it was optimized for BooleanScorer's *disjunctive* scoring algorithm, which 
is no longer used by default, but is faster than BooleanScorer2's disjunctive 
scoring algorithm.  This applies to a very common type of query: classic 
vector-space searches.  So this optimization may not be leveraged much in the 
current codebase, but that does not mean that it is no longer valid.  It may, 
however, slow other sorts of searches, like your wildcards.  The challenge is 
not just to make your application as fast as possible, but to do so without 
making others' and future applications slower.

> In short, we should have two TermScorer implementations. One for low 
> documents/term, and one for high documents/term.

Yes, I think that would be useful.  Classically, total query processing time is 
dominated by common terms, so that's an important case to optimize.  But it 
seems that with wildcard queries over smaller collections these optimizations 
become costly.  So two implementations seem like they would make everyone 
happy.
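
As a rough illustration of the two-implementation idea, the choice could be 
keyed on how many documents contain the term.  This is only a sketch: the 
Strategy enum, the choose() method and the threshold of 32 are hypothetical 
and are not part of Lucene or of the attached patch.

```java
final class ScorerChoiceSketch {
    /** Hypothetical labels for the two TermScorer variants discussed above. */
    enum Strategy { BUFFERED, UNBUFFERED }

    /** Assumed cutoff; a real value would have to come from benchmarking. */
    static final int DOC_FREQ_THRESHOLD = 32;

    /**
     * Pick the buffering variant for common terms (many documents per term)
     * and the lightweight variant for rare terms, e.g. wildcard expansions.
     */
    static Strategy choose(int docFreq) {
        return docFreq > DOC_FREQ_THRESHOLD ? Strategy.BUFFERED : Strategy.UNBUFFERED;
    }
}
```

A caller would pass in the term's document frequency, so a wildcard query that 
expands to many rare terms would never allocate the per-term buffers.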

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for 
> each term scored.  When querying for a lot of terms, this causes a lot of 
> garbage to be created that's unnecessary.  The SegmentTermDocs from which it 
> retrieves its information doesn't have any optimizations for bulk loading, 
> and it's unnecessary.
> In addition, it has a SCORE_CACHE, that's of limited benefit.  It's caching 
> the result of a sqrt that should be placed in DefaultSimilarity, and if 
> you're only scoring a few documents that contain those terms, there's no need 
> to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not 
> cache the docs or freqs.  In the case of a lot of queries, that saves 196 
> bytes/term, the unnecessary disk IO, and extra SQRTs, which add up.
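
To make the SCORE_CACHE trade-off concrete, here is a minimal sketch of the 
two approaches under discussion.  It is not the actual TermScorer source: the 
class and method names are made up for illustration, and tf is assumed to be 
sqrt(freq), as the description above implies.

```java
final class ScoreCacheSketch {
    /** Number of precomputed entries, matching the 32 mentioned in the issue. */
    static final int SCORE_CACHE_SIZE = 32;

    /** Assumed term-frequency factor: the square root of the raw frequency. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Eagerly computes tf(freq) * weight for every freq below the cache size. */
    static float[] buildScoreCache(float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int freq = 0; freq < SCORE_CACHE_SIZE; freq++) {
            cache[freq] = tf(freq) * weight;
        }
        return cache;
    }

    /** Cached variant: a table lookup for small freqs, a sqrt otherwise. */
    static float cachedScore(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : tf(freq) * weight;
    }

    /** Uncached variant: always computes the sqrt, allocates nothing. */
    static float directScore(int freq, float weight) {
        return tf(freq) * weight;
    }
}
```

The cached variant pays for 32 sqrt calls and one array per term up front, 
which is worthwhile only when many documents are scored for that term; the 
direct variant costs one sqrt per scored document and nothing otherwise.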


