[ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599606#action_12599606 ]

Yonik Seeley commented on LUCENE-1195:
--------------------------------------

{quote}SegmentTermEnum.scanTo() now returns the number of invocations of
next(). TermInfosReader only puts TermInfo objects into the cache if scanTo()
has called next() more than once. Thus, if e.g. a WildcardQuery or RangeQuery
iterates over terms in order, only the first term will be put into the cache.
This is an addition to the ThreadLocal that prevents one thread from wiping
out its own cache with such a query.
{quote}

Hmmm, clever, and pretty much free.

It doesn't seem like this would eliminate something like a RangeQuery adding to
the cache, but it does reduce the amount of pollution.  Seems like about 1/64th
of the terms would be added to the cache?  (every 128th term and the term
following it... due to the "numScans > 1" check).
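The numScans gate can be sketched as a tiny standalone example. Note this is a
hypothetical simplification, not Lucene's actual internals: the class, the fake
scanTo(), and the cache here are all illustrative. The idea is just that a
sequential lookup advances the enum by at most one entry (numScans <= 1) and so
skips the cache, while a genuine random-access lookup scans further and gets
cached:

```java
import java.util.HashMap;
import java.util.Map;

public class ScanToGate {
    static final Map<String, Integer> cache = new HashMap<>();

    // Pretend scanTo(): linear scan over a sorted block of terms,
    // returning the number of next()-style advances it performed.
    static int scanTo(String[] block, int startPos, String target) {
        int numScans = 0;
        int pos = startPos;
        while (pos + 1 < block.length && block[pos].compareTo(target) < 0) {
            pos++;
            numScans++;
        }
        return numScans;
    }

    // Cache the (fake) TermInfo only for lookups that had to scan
    // more than one entry -- i.e. random access, not in-order iteration.
    static void lookup(String[] block, int startPos, String target, int termInfo) {
        int numScans = scanTo(block, startPos, target);
        if (numScans > 1) {
            cache.put(target, termInfo);
        }
    }

    public static void main(String[] args) {
        String[] block = {"apple", "banana", "cherry", "date", "fig"};
        lookup(block, 0, "banana", 1); // 1 scan: sequential-style, not cached
        lookup(block, 0, "date", 2);   // 3 scans: random access, cached
        System.out.println(cache.keySet()); // only "date"
    }
}
```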

Still, it would take a range query covering 64K terms to completely wipe the 
cache, and as long as that range query is slow relative to the term lookups, I 
suppose it doesn't matter much if the cache gets wiped anyway.  A single 
additional hash lookup per term probably shouldn't slow the execution of 
something like a range query that much either.



> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup
> is done twice for each term. The first time is in Similarity.idf(), where
> searcher.docFreq() is called; the second is when the posting list is opened
> (TermDocs or TermPositions).
> The dictionary lookup is not cheap, which is why a significant performance
> improvement is possible here if we avoid the second lookup. An easy way to do
> this is to add a small LRU cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a
> mid-size index of 500,000 documents from Wikipedia. Here are some test
> results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact; for smaller
> ones, more.
> I will attach a patch soon.
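For reference, a small LRU cache of the kind the description proposes falls out
almost for free from java.util.LinkedHashMap in access-order mode. This is only
an illustrative sketch under that assumption -- the class name and generics are
hypothetical, and the actual implementation is in the attached patches:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU cache: LinkedHashMap with accessOrder=true iterates
// entries from least- to most-recently-accessed, so evicting the eldest
// entry once capacity is exceeded yields LRU behavior.
public class TermInfoLRUCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public TermInfoLRUCache(int capacity) {
        super(16, 0.75f, true); // true = access order (LRU), not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least-recently-used past capacity
    }
}
```

With a capacity of 20, as in the experiment above, the lookup done for
Similarity.idf()/docFreq() would populate the cache and the later TermDocs or
TermPositions open for the same term would hit it, avoiding the second
dictionary lookup.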

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

