[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

Michael Busch (JIRA) Wed, 21 May 2008 01:38:22 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

Changes in the patch:
- The cache is now thread-safe.
- Added a ThreadLocal to TermInfosReader, so that each thread now has its own
  cache of size 1024.
- SegmentTermEnum.scanTo() now returns the number of invocations of next().
  TermInfosReader only puts TermInfo objects into the cache if scanTo() has
  called next() more than once. Thus, if e.g. a WildcardQuery or RangeQuery
  iterates over terms in order, only the first term will be put into the cache.
  This is in addition to the ThreadLocal and prevents one thread from wiping
  out its own cache with such a query.
- Added a new package org/apache/lucene/util/cache that has a SimpleMapCache
  (taken from LUCENE-831) and the SimpleLRUCache that was part of the previous
  patch here. I decided to put the caches in a separate package because we can
  reuse them for different things, such as LUCENE-831 or, e.g. after
  deprecating Hits, as an LRU cache for recently loaded stored documents.
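
To make the caching scheme in the list above concrete, here is a minimal,
self-contained sketch. It is not the code from lucene-1195.patch: the class
TermCacheSketch, its nested LruCache and ThreadState, the toy sorted-array
"dictionary", and the use of a plain Integer document frequency in place of a
TermInfo object are all assumptions made for illustration. Only the ideas
themselves (a per-thread LRU cache of size 1024, and caching a term only when
the scan needed more than one next() call) come from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; hypothetical names, not the code in lucene-1195.patch. */
final class TermCacheSketch {

    /** Small LRU map: an access-ordered LinkedHashMap that evicts its eldest entry. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxSize;

        LruCache(int maxSize) {
            super(16, 0.75f, true); // true = access order, least recently used first
            this.maxSize = maxSize;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxSize;
        }
    }

    /** Per-thread state: a private cache plus the position of a toy term enumerator. */
    private static final class ThreadState {
        final LruCache<String, Integer> cache = new LruCache<>(1024);
        int position = -1; // -1 = positioned before the first term
    }

    private final String[] sortedTerms; // stand-in for the term dictionary
    private final int[] docFreqs;       // stand-in for the per-term TermInfo
    private final ThreadLocal<ThreadState> state =
            ThreadLocal.withInitial(ThreadState::new);

    TermCacheSketch(String[] sortedTerms, int[] docFreqs) {
        this.sortedTerms = sortedTerms;
        this.docFreqs = docFreqs;
    }

    /** Advances the toy enumerator to the target; returns the number of next() steps. */
    private int scanTo(ThreadState s, String target) {
        boolean pastTarget = s.position >= sortedTerms.length
                || (s.position >= 0 && sortedTerms[s.position].compareTo(target) > 0);
        if (pastTarget) {
            s.position = -1; // toy model of seeking backwards: restart from the top
        }
        int nextCalls = 0;
        while (s.position < sortedTerms.length
                && (s.position < 0 || sortedTerms[s.position].compareTo(target) < 0)) {
            s.position++; // one simulated next()
            nextCalls++;
        }
        return nextCalls;
    }

    /** Returns the document frequency of the term, or null if the term is absent. */
    Integer docFreq(String term) {
        ThreadState s = state.get();
        Integer cached = s.cache.get(term);
        if (cached != null) {
            return cached; // second lookup of the same term skips the scan
        }
        int nextCalls = scanTo(s, term);
        if (s.position >= sortedTerms.length || !sortedTerms[s.position].equals(term)) {
            return null;
        }
        int df = docFreqs[s.position];
        if (nextCalls > 1) {
            s.cache.put(term, df); // in-order iteration (one next() per term) is not cached
        }
        return df;
    }

    /** Number of terms cached for the calling thread. */
    int cachedTermCount() {
        return state.get().cache.size();
    }

    public static void main(String[] args) {
        TermCacheSketch sketch = new TermCacheSketch(
                new String[] {"apple", "banana", "cherry", "date", "elder"},
                new int[] {1, 2, 3, 4, 5});
        System.out.println(sketch.docFreq("cherry") + " cached=" + sketch.cachedTermCount());
        System.out.println(sketch.docFreq("date") + " cached=" + sketch.cachedTermCount());
    }
}
```

In this sketch the first lookup of a term that requires a real scan is cached,
while stepping to the immediately following term is not, which is the behavior
the scanTo() heuristic is meant to give for in-order term iteration. Because
each thread owns its cache, the map itself needs no locking here; the real
patch may handle synchronization differently.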
  
I reran the same performance experiments, and it turns out that the speedup is
still the same and the overhead of the ThreadLocal is in the noise. So I think
this should be a good approach now?

I also ran similar performance tests on a bigger index with about 4.3 million
documents. The speedup with 50k AND queries was, as expected, not as
significant anymore. However, the speedup was still about 7%. I haven't run
the OR queries on the bigger index yet, but most likely the speedup will not
be very significant anymore.

All unit tests pass.

> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup
> is being done twice for each term. The first time is in Similarity.idf(),
> where searcher.docFreq() is called. The second time is when the posting list
> is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, which is why a significant performance
> improvement is possible here if we avoid the second lookup. An easy way to do
> this is to add a small LRU cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a
> mid-size index of 500,000 documents from Wikipedia. Here are some test
> results:
>
> 50,000 AND queries with 3 terms each:
>   old:                  152 secs
>   new (with LRU cache): 112 secs (26% faster)
>
> 50,000 OR queries with 3 terms each:
>   old:                  175 secs
>   new (with LRU cache): 133 secs (24% faster)
>
> For bigger indexes this patch will probably have less impact, for smaller
> ones more.
> I will attach a patch soon.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


