[ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

Changes in the patch:
- The cache is now thread-safe.
- Added a ThreadLocal to TermInfosReader, so that each thread now has its own cache of size 1024.
- SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader only puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e.g. a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into the cache. This is an addition to the ThreadLocal that prevents one thread from wiping out its own cache with such a query.
- Added a new package org/apache/lucene/util/cache that has a SimpleMapCache (taken from LUCENE-831) and the SimpleLRUCache that was part of the previous patch here. I decided to put the caches in a separate package, because we can reuse them for different things like LUCENE-831 or, e.g. after deprecating Hits, as an LRU cache for recently loaded stored documents.

I reran the same performance experiments and it turns out that the speedup is still the same and the overhead of the ThreadLocal is in the noise. So I think this should be a good approach now?

I also ran similar performance tests on a bigger index with about 4.3 million documents. The speedup with 50k AND queries was, as expected, not as significant anymore. However, the speedup was still about 7%. I haven't run the OR queries on the bigger index yet, but most likely the speedup will not be very significant anymore.

All unit tests pass.
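To illustrate the caching rule described in the comment above, here is a minimal sketch in modern Java. It is not the code from lucene-1195.patch: the class TermCacheSketch, its methods, the sorted string array standing in for the term dictionary, and the integer ordinal standing in for a TermInfo are all hypothetical, and the "restart from the beginning" step is a simplification of a real index seek. It only shows the two ideas from the list: a per-thread LRU cache held in a ThreadLocal, and caching a term only when the scan needed more than one next() step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a term dictionary reader with a per-thread LRU cache. All names are hypothetical. */
final class TermCacheSketch {

    /** Access-ordered map that evicts the least recently used entry once it exceeds capacity. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = access order, which gives LRU behavior
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    /** Per-thread state: the cache plus the enumerator position. */
    static final class State {
        final LruCache<String, Integer> cache;
        int position = -1; // index of the current term, -1 = before the first term
        int scans = 0;     // number of dictionary scans performed by this thread

        State(int cacheSize) {
            cache = new LruCache<>(cacheSize);
        }
    }

    private final String[] sortedTerms;
    private final ThreadLocal<State> state;

    TermCacheSketch(String[] sortedTerms, int cacheSize) {
        this.sortedTerms = sortedTerms.clone();
        this.state = ThreadLocal.withInitial(() -> new State(cacheSize));
    }

    /** Advances to the first term >= target and returns how many next() steps that took. */
    private int scanTo(State s, String target) {
        if (s.position >= 0 && sortedTerms[s.position].compareTo(target) > 0) {
            s.position = -1; // target is behind us: restart (a real reader would seek via its index)
        }
        int nextCalls = 0;
        while (s.position < 0 || sortedTerms[s.position].compareTo(target) < 0) {
            if (s.position + 1 >= sortedTerms.length) {
                break; // ran off the end of the dictionary
            }
            s.position++;
            nextCalls++;
        }
        s.scans++;
        return nextCalls;
    }

    /** Returns the ordinal of the term (a stand-in for a TermInfo), or null if the term is absent. */
    Integer get(String term) {
        State s = state.get();
        Integer cached = s.cache.get(term);
        if (cached != null) {
            return cached; // cache hit: no dictionary scan needed
        }
        int nextCalls = scanTo(s, term);
        boolean found = s.position >= 0 && sortedTerms[s.position].equals(term);
        if (!found) {
            return null;
        }
        if (nextCalls > 1) {
            s.cache.put(term, s.position); // terms reached by a single next() are not cached
        }
        return s.position;
    }

    int scans() { return state.get().scans; }

    int cachedTerms() { return state.get().cache.size(); }
}
```

In this toy model, looking up a term far from the current position populates the cache, while a query that then walks the following terms one by one adds nothing, so it cannot evict entries that other queries on the same thread still benefit from.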
> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
> twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
> The second time when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, which is why a significant performance improvement is
> possible here if we avoid the second lookup. An easy way to do this is to add a small LRU
> cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a mid-size index of
> 500,000 documents from Wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.