[ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599606#action_12599606 ]
Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}
SegmentTermEnum.scanTo() now returns the number of invocations of next().
TermInfosReader only puts TermInfo objects into the cache if scanTo() has
called next() more than once. Thus, if e.g. a WildcardQuery or RangeQuery
iterates over terms in order, only the first term will be put into the
cache. This is in addition to the ThreadLocal that prevents one thread from
wiping out its own cache with such a query.
{quote}

Hmmm, clever, and pretty much free. It doesn't seem like it would eliminate
something like a RangeQuery adding to the cache, but it does reduce the
amount of pollution. Seems like about 1/64th of the terms would be added to
the cache (every 128th term plus the term following it, i.e. 2 of every 128,
due to the "numScans > 1" check)? Still, it would take a range query
covering 64K terms to completely wipe the cache, and as long as that range
query is slow relative to the term lookups, I suppose it doesn't matter
much if the cache gets wiped anyway. A single additional hash lookup per
term probably shouldn't slow the execution of something like a range query
that much either.

> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary
> lookup is done twice for each term. The first time is in Similarity.idf(),
> where searcher.docFreq() is called; the second is when the posting list is
> opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, so a significant performance
> improvement is possible here if we avoid the second lookup. An easy way to
> do this is to add a small LRU cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a
> mid-size index of 500,000 documents from Wikipedia. Here are some test
> results:
>
> 50,000 AND queries with 3 terms each:
> old: 152 secs
> new (with LRU cache): 112 secs (26% faster)
>
> 50,000 OR queries with 3 terms each:
> old: 175 secs
> new (with LRU cache): 133 secs (24% faster)
>
> For bigger indexes this patch will probably have less impact; for smaller
> ones, more.
> I will attach a patch soon.
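
For illustration, here is a minimal sketch of the kind of small LRU cache
the description proposes, built on java.util.LinkedHashMap in access order.
The class name and constructor arguments are made up for the example; the
actual patch may structure this quite differently.

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: a tiny LRU map from Term to TermInfo, of the
// kind the description proposes adding to TermInfosReader. The name and
// sizing are illustrative, not taken from the patch.
class LRUTermInfoCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxSize;

  LRUTermInfoCache(int maxSize) {
    // accessOrder = true: iteration order becomes least-recently-used
    // first, which is what removeEldestEntry() relies on for LRU eviction.
    super(16, 0.75f, true);
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict the least-recently-used entry once the cap is exceeded.
    return size() > maxSize;
  }
}
{code}

With a cache like this, the first lookup (Similarity.idf() calling
searcher.docFreq()) pays the dictionary seek and populates the entry, and
the second lookup (opening TermDocs or TermPositions) becomes a cheap hash
probe instead of a second seek.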
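
And a sketch of the "numScans > 1" guard discussed in the comment, under
the assumption that the lookup path looks roughly like the method below.
Term, TermInfo, and SegmentTermEnum are Lucene types; the cache field, the
positionEnum() helper, the cache size, and the exact method shape are
stand-ins, not the actual TermInfosReader internals.

{code}
// Hypothetical sketch of the "numScans > 1" guard; positionEnum() and the
// field names are stand-ins for TermInfosReader internals, and the cache
// size of 1024 is purely illustrative.
private final LRUTermInfoCache<Term, TermInfo> cache =
    new LRUTermInfoCache<Term, TermInfo>(1024);
private SegmentTermEnum enumerator;

private TermInfo get(Term term) {
  TermInfo ti = cache.get(term);      // may already be cached, e.g. by docFreq()
  if (ti != null) {
    return ti;
  }
  positionEnum(term);                 // reuse the current enum position when the
                                      // term lies ahead of it, otherwise seek to
                                      // the nearest index entry
  int numScans = enumerator.scanTo(term);  // how often scanTo() called next()
  ti = enumerator.termInfo();
  if (ti != null && numScans > 1) {
    // Only cache terms that needed real scanning. An enumeration that walks
    // terms in order (WildcardQuery, RangeQuery) mostly advances one term
    // at a time (numScans <= 1), so it barely touches the cache.
    cache.put(term, ti);
  }
  return ti;
}
{code}

The initial cache.get() here is the "single additional hash lookup per
term" the comment refers to.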