[ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368775 ]
Steven Tamm commented on LUCENE-502:
------------------------------------

The main point is this: when you use TermScorer to score one document, it does a lot of extra work. It reads 31 extra documents from disk and calculates the weight factors for 31 documents. The question is how the caching helps when you have multiple documents. My analysis is that (with a modern VM) it helps only if the docFreq of a term is 16-31 and you are using a ConjunctionScorer (i.e. not wildcard searches). I would imagine this is a use case that is not uncommon. Anyone using wildcard searches will see *immediate* benefit from installing this patch.

So I'm going to analyze this from the "amount of work to do" perspective.

TermScorer.next():
If you are calling TermScorer.next(), there is no real difference. SegmentTermDocs.read(int[], int[]) is no different from calling SegmentTermDocs.next() 32 times. The patch switches TermScorer.next() to always call next() on the underlying SegmentTermDocs. The only cost I'm removing is the caching, and I'm not adding any new ones. Therefore there's no change, with the exception of adding the cache for use in skipTo().

TermScorer.skipTo():
The only case where my patch is worse is if the frequency of the term is greater than the skip interval (i.e. >= 16 documents per term). In this case, if you are retrieving more than 16 documents (but fewer than 32), the existing cache lets you avoid accessing the skipStream entirely. If you are retrieving more than 32 documents, you need to access the skipStream anyway, and since both of the underlying IndexInputs are cached, repositioning the freqStream is only pointer manipulation.

TermScorer.score():
"In some cases JVM's may have evolved so that some of them are no longer required." I can imagine that the scoreCache made a lot of sense in JDK 1.1, when the cost of Math.sqrt would have been high.
However, if the TermScorer is only going to be used for a single document, this is obviously wrong. Like I said before, caching DefaultSimilarity.tf(int) inside DefaultSimilarity would end up inlined by the HotSpot compiler, but Math.sqrt is inlined into a processor trap, so it's not a big deal.

I want other people to test this and tell me about any problems with it. Whether or not you accept the patches is less important to me than providing them to other people who have similar performance problems. Perhaps I should have created a parallel structure to TermScorer that you can use when you have a low hit/term ratio.

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for
> each term scored. When querying for a lot of terms, this causes a lot of
> garbage to be created that's unnecessary. The SegmentTermDocs from which it
> retrieves its information doesn't have any optimizations for bulk loading,
> and it's unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit. It's caching
> the result of a sqrt that should be placed in DefaultSimilarity, and if
> you're only scoring a few documents that contain those terms, there's no need
> to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not
> cache the docs or freqs. In the case of a lot of queries, that saves 196
> bytes/term, the unnecessary disk IO, and extra SQRTs, which add up.
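
To make the SCORE_CACHE point concrete, here is a minimal, self-contained Java sketch. It is not the Lucene source: the class ScoreCacheSketch, the sqrtCalls counter, and both scoring methods are hypothetical names made up for illustration, and the norm factor is left out. It assumes only a 32-entry cache and a tf(freq) = sqrt(freq) formula as described in the issue, and counts how many square roots each approach computes when a single document is scored.

```java
// Illustrative sketch only; none of these names come from Lucene.
class ScoreCacheSketch {
    // Assumed cache size, matching the 32 entries mentioned in the issue.
    static final int SCORE_CACHE_SIZE = 32;

    // Counts calls to tf() so the two approaches can be compared.
    static int sqrtCalls = 0;

    // Assumed term-frequency factor: the square root of the raw frequency.
    static float tf(int freq) {
        sqrtCalls++;
        return (float) Math.sqrt(freq);
    }

    // Precomputes tf(i) * weight for every i below the cache size,
    // then scores each document by table lookup where possible.
    static float[] scoreWithCache(int[] freqs, float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int i = 0; i < SCORE_CACHE_SIZE; i++) {
            cache[i] = tf(i) * weight;
        }
        float[] scores = new float[freqs.length];
        for (int i = 0; i < freqs.length; i++) {
            int f = freqs[i];
            scores[i] = f < SCORE_CACHE_SIZE ? cache[f] : tf(f) * weight;
        }
        return scores;
    }

    // Computes tf(freq) * weight on demand, once per scored document.
    static float[] scoreDirect(int[] freqs, float weight) {
        float[] scores = new float[freqs.length];
        for (int i = 0; i < freqs.length; i++) {
            scores[i] = tf(freqs[i]) * weight;
        }
        return scores;
    }

    public static void main(String[] args) {
        int[] oneDoc = {3}; // a single matching document with freq = 3

        sqrtCalls = 0;
        scoreWithCache(oneDoc, 2.0f);
        int cached = sqrtCalls;

        sqrtCalls = 0;
        scoreDirect(oneDoc, 2.0f);
        int direct = sqrtCalls;

        // prints "cached=32 direct=1"
        System.out.println("cached=" + cached + " direct=" + direct);
    }
}
```

In this toy model the cache only pays for itself once more than 32 documents with a frequency below 32 are scored by the same scorer instance; whether that break-even point carries over to a real index is something only measurement can tell.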
