[ https://issues.apache.org/jira/browse/LUCENE-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781863#action_12781863 ]
Michael McCandless commented on LUCENE-2075:
--------------------------------------------

Well, I just kept 1024 since that's what we currently do ;)

OK, I just did a rough tally -- I think we're looking at ~100 bytes (on a 32-bit JRE) per entry, including CHM's HashEntry, the array in CHM, TermInfoAndOrd, and the Term & its String text. Not to mention DBLRU has a 2X multiplier at peak, so 200 bytes. So at 1024 entries we're already looking at ~200KB peak used by this cache, per segment that is able to saturate the cache... so for a 20-segment index you're at ~4MB of additional RAM consumed... so I don't think we should increase this default. (A back-of-the-envelope version of this tally is spelled out in code below.)

Also, I don't think this cache is/should be attempting to achieve a high hit rate *across* queries, only *within* a single query, when that query resolves the same Term more than once. Caches that wrap more CPU, like Solr's query cache, are where the app should aim for a high hit rate.

Maybe we should even decrease the default size here -- what's important is preventing in-flight queries from evicting one another's cache entries. For NRQ, 1024 is apparently already plenty big (relatively few seeks occur). For automaton query, which does lots of seeking, once the flex branch lands there is no need for the cache (each lookup is done only once, because the TermsEnum actualEnum is able to seek). Before flex lands, the cache is important, but I think only for automaton query.

And honestly I'm still tempted to do away with this cache altogether and create a "query scope", private to each query while it's running, where the terms dict (and, over time, other places that need to) could store stuff. That'd give a perfect within-query hit rate and wouldn't tie up any long-term RAM... (see the hypothetical sketch below)

> Share the Term -> TermInfo cache across threads
> -----------------------------------------------
>
>          Key: LUCENE-2075
>          URL: https://issues.apache.org/jira/browse/LUCENE-2075
>      Project: Lucene - Java
>   Issue Type: Improvement
>   Components: Index
>     Reporter: Michael McCandless
>     Assignee: Michael McCandless
>     Priority: Minor
>      Fix For: 3.1
>
>  Attachments: ConcurrentLRUCache.java, LUCENE-2075.patch,
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch,
> LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch, LUCENE-2075.patch,
> LUCENE-2075.patch
>
>
> Right now each thread creates its own (thread-private) SimpleLRUCache, holding up to 1024 terms.
>
> This is rather wasteful, since if a high number of threads come through Lucene, you're multiplying the RAM usage. You're also cutting way back on the likelihood of a cache hit (except for the known cases where we look up a term multiple times within a query, which runs on one thread).
>
> In NRT search we often open new SegmentReaders (on tiny segments), and each thread must then spend CPU/RAM creating & populating its own cache for them.
>
> Now that we are on 1.5 we can use java.util.concurrent.*, e.g. ConcurrentHashMap. One simple approach could be a double-barrel LRU cache, using 2 maps (primary, secondary). You check the cache by first checking primary; if that's a miss, you check secondary, and if you get a hit there you promote the entry to primary. Once primary is full, you clear secondary and swap them. (A sketch of this approach follows below.)
>
> Or... any other suggested approach?
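Below is a minimal, illustrative sketch of the double-barrel idea described in the issue -- two ConcurrentHashMaps with promote-on-hit, and a clear-and-swap once primary fills. The class name and API here are assumptions for illustration, not Lucene's actual implementation:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DoubleBarrelLRUCache<K,V> {
  private final int maxSize;
  private volatile Map<K,V> primary = new ConcurrentHashMap<K,V>();
  private volatile Map<K,V> secondary = new ConcurrentHashMap<K,V>();

  public DoubleBarrelLRUCache(int maxSize) {
    this.maxSize = maxSize;
  }

  public V get(K key) {
    V value = primary.get(key);
    if (value == null) {
      // Miss in primary: check secondary, and promote on a hit so
      // recently used entries survive the next swap.
      value = secondary.get(key);
      if (value != null) {
        put(key, value);
      }
    }
    return value;
  }

  public void put(K key, V value) {
    primary.put(key, value);
    if (primary.size() >= maxSize) {
      swap();
    }
  }

  // Once primary fills up, the old secondary's entries are dropped:
  // primary becomes the new secondary and a fresh map takes over as
  // primary. Synchronized so two threads don't both swap.
  private synchronized void swap() {
    if (primary.size() >= maxSize) { // re-check under the lock
      secondary = primary;
      primary = new ConcurrentHashMap<K,V>();
    }
  }
}
{code}

The appeal of the swap is that eviction is O(1) with no per-entry LRU bookkeeping on the read path, at the cost of holding up to 2X maxSize entries at peak -- the 2X multiplier mentioned in the comment above.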
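For reference, here is the rough RAM tally from the comment, spelled out; the ~100 bytes/entry figure is the comment's estimate for a 32-bit JRE, not a measured value:

{code}
public class TermCacheRamEstimate {
  public static void main(String[] args) {
    long bytesPerEntry = 100; // rough: CHM HashEntry + map array slot
                              // + TermInfoAndOrd + Term & its String
    long peakFactor = 2;      // double-barrel cache holds up to 2X at peak
    long entries = 1024;      // default cache size
    long segments = 20;

    long perSegment = bytesPerEntry * peakFactor * entries;
    System.out.println("peak per segment: ~" + (perSegment / 1024) + " KB"); // ~200 KB
    System.out.println("20-segment index: ~"
        + String.format("%.1f", perSegment * segments / (1024.0 * 1024.0))
        + " MB"); // ~3.9 MB
  }
}
{code}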
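And a hypothetical sketch of the "query scope" idea from the comment -- none of this is Lucene API; QueryScope and its usage are invented here purely for illustration:

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical: a scope object created when a query starts and
// discarded when it finishes, so anything cached in it lives exactly
// as long as the query that made it.
public class QueryScope {
  // Private to one running query (one thread), so a plain HashMap
  // suffices: no locking, no eviction policy, no size limit needed.
  private final Map<Object,Object> entries = new HashMap<Object,Object>();

  public Object get(Object key) {
    return entries.get(key);
  }

  public void put(Object key, Object value) {
    entries.put(key, value);
  }
}
{code}

Under this sketch the terms dict would consult the scope before seeking: a Term resolved once is free on every later lookup within the same query, and the whole scope becomes garbage the moment the query completes, so no long-term RAM is tied up.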