[
https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780944#action_12780944
]
Michael McCandless commented on LUCENE-2075:
--------------------------------------------
bq. a single numeric range query will be doing many term seeks (one at the start
of each enumeration). It doesn't look like these will currently utilize the
cache - can someone refresh my memory on why this is?
You're right -- here's the code/comment:
{code}
/** Returns an enumeration of terms starting at or after the named term. */
public SegmentTermEnum terms(Term term) throws IOException {
  // don't use the cache in this call because we want to reposition the
  // enumeration
  get(term, false);
  return (SegmentTermEnum) getThreadResources().termEnum.clone();
}
{code}
I think this is because "useCache" (the 2nd arg to get) is overloaded
-- if you look at get(), when useCache is true and you have a cache hit,
it doesn't do its "normal" side effect of repositioning the
thread-private TermEnum. So you'd get incorrect results.
If get had a 2nd arg "repositionTermEnum", to decouple caching from
repositioning, then we could make use of the cache for NRQ (& soon
AutomatonTermEnum as well). Though, this isn't so simple, because the
cache entry (just a TermInfo) doesn't store the term's ord. And we
don't want to add ord to TermInfo since, eg, that sucks up a lot of
extra RAM when storing the terms index. Probably we should make a new
class that's used for caching, and not reuse TermInfo.
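Something along these lines (just a sketch -- CachedTermInfo, the
repositionTermEnum arg, and the hit-path seek are all made up here, not
real APIs):
{code}
// Sketch only: a dedicated cache-entry class that carries the term's ord
// alongside the TermInfo, so TermInfo itself stays lean for the terms index.
final class CachedTermInfo {
  final TermInfo termInfo;
  final long ord;  // lets a cache hit still reposition the enum
  CachedTermInfo(TermInfo termInfo, long ord) {
    this.termInfo = termInfo;
    this.ord = ord;
  }
}

// get() with caching decoupled from repositioning (hypothetical signature;
// "cache" is the shared Term -> CachedTermInfo map):
private TermInfo get(Term term, boolean useCache, boolean repositionTermEnum)
    throws IOException {
  if (useCache) {
    CachedTermInfo entry = cache.get(term);
    if (entry != null) {
      if (repositionTermEnum) {
        // cache hit, but the caller still needs the thread-private enum
        // positioned -- use the cached ord to seek it (elided in this sketch)
      }
      return entry.termInfo;
    }
  }
  // cache miss: scan the terms index as today, reposition the enum, and
  // record a new CachedTermInfo (with ord); elided in this sketch
  return null;
}
{code}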
This code also predates NumericRangeQuery, ie, all MTQs before NRQ
did only a single seek.
BTW the flex branch fixes this -- TermsEnum.seek always checks the
cache.
> Share the Term -> TermInfo cache across threads
> -----------------------------------------------
>
> Key: LUCENE-2075
> URL: https://issues.apache.org/jira/browse/LUCENE-2075
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 3.1
>
> Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch,
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch,
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread private) SimpleLRUCache,
> holding up to 1024 terms.
> This is rather wasteful, since if there are a high number of threads
> that come through Lucene, you're multiplying the RAM usage. You're
> also cutting way back on the likelihood of a cache hit (except for the
> known case where we look up a term multiple times within one query,
> which uses a single thread).
> In NRT search we often open new SegmentReaders (on tiny segments), and
> each thread must then spend CPU/RAM creating & populating its own cache
> for each of them.
> Now that we are on 1.5 we can use java.util.concurrent.*, eg
> ConcurrentHashMap. One simple approach could be a double-barrel LRU
> cache, using 2 maps (primary, secondary). You check the cache by
> first checking primary; if that's a miss, you check secondary and if
> you get a hit you promote it to primary. Once primary is full you
> clear secondary and swap them.
> Or... any other suggested approach?
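For concreteness, here's a minimal sketch of that double-barrel idea (the
class name, locking, and sizing policy here are made up for illustration,
not an actual Lucene API):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DoubleBarrelLRUCache<K,V> {

  private final int maxSize;

  // primary holds recently used entries; secondary holds entries that
  // survived the last swap but haven't been touched since
  private volatile Map<K,V> primary = new ConcurrentHashMap<K,V>();
  private volatile Map<K,V> secondary = new ConcurrentHashMap<K,V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    Map<K,V> p = primary;
    V value = p.get(key);
    if (value == null) {
      value = secondary.get(key);
      if (value != null) {
        // promote a secondary hit so it survives the next swap; if a swap
        // races us, the promotion may be lost, which is harmless for a cache
        p.put(key, value);
      }
    }
    return value;
  }

  public synchronized void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() > maxSize) {
      // primary is full: drop everything that was never promoted, then
      // swap so the old primary becomes the new secondary
      secondary.clear();
      Map<K,V> tmp = primary;
      primary = secondary;  // now empty
      secondary = tmp;
    }
  }
}
{code}
Lookups stay lock-free; only put synchronizes, and a reader racing the swap
can at worst lose a promotion or see a stale entry, which is acceptable for
a cache.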