[ https://issues.apache.org/jira/browse/LUCENE-6077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6077:
---------------------------------
    Attachment: LUCENE-6077.patch

Here is a patch. It divides the work into two pieces:
 - FilterCache, whose responsibility is to act as a per-segment cache for 
filters, but which makes no decision about which filters should be cached
 - FilterCachingPolicy, whose responsibility is to decide whether a filter is 
worth caching, given the filter itself, the current segment and the produced 
(uncached) DocIdSet.
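
To make the split concrete, here is a rough sketch of the two abstractions. 
The names match the patch, but the signatures are illustrative assumptions, 
not the patch's exact API:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

// Per-segment cache for filters; knows nothing about *when* to cache.
interface FilterCache {
  // Returns a wrapper around the given filter whose per-segment DocIdSets
  // are cached whenever the policy says the filter is worth caching.
  Filter doCache(Filter filter, FilterCachingPolicy policy);
}

// Decides whether a given filter is worth caching on a given segment.
interface FilterCachingPolicy {
  // Called every time a filter is used, so the policy can track usage.
  void onUse(Filter filter);

  // Returns true if the (uncached) DocIdSet produced for this filter on this
  // segment should be loaded into the cache.
  boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set)
      throws IOException;
}
{code}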

FilterCache has an implementation called LRUFilterCache that accepts a maximum 
size (number of cached filters) and a maximum RAM usage, and evicts 
least-recently-used filters first. It exposes protected methods that allow 
configuring which implementation should be used to cache DocIdSets 
(RoaringDocIdSet by default) and how to measure the RAM usage of filters (the 
default implementation uses Accountable#ramBytesUsed if the filter implements 
Accountable, and falls back to an arbitrary constant of 1024 bytes otherwise).
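
For illustration, a subclass overriding these hooks might look roughly like 
the following. The constructor arguments, hook names and the BitDocIdSet 
wrapping are assumptions based on the description above, not the patch's 
exact API:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.BitDocIdSet;
import org.apache.lucene.util.FixedBitSet;

class MyFilterCache extends LRUFilterCache {

  MyFilterCache() {
    // e.g. at most 256 cached filters and 32MB of cached DocIdSets
    super(256, 32 * 1024 * 1024);
  }

  // Cache into a plain FixedBitSet instead of the default RoaringDocIdSet.
  @Override
  protected DocIdSet cacheImpl(DocIdSetIterator iterator, LeafReader reader)
      throws IOException {
    FixedBitSet bits = new FixedBitSet(reader.maxDoc());
    bits.or(iterator);
    return new BitDocIdSet(bits);
  }

  // Same fallback as described above: trust Accountable when available,
  // otherwise assume a small constant.
  @Override
  protected long ramBytesUsed(Filter filter) {
    if (filter instanceof Accountable) {
      return ((Accountable) filter).ramBytesUsed();
    }
    return 1024;
  }
}
{code}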

FilterCachingPolicy has an implementation called 
UsageTrackingFilterCachingPolicy that tries to provide sensible defaults:
 - it tracks the 256 most recently used filters (through their hash codes) 
globally (not per segment)
 - it only caches on segments whose source is a merge or addIndexes (not 
flushes)
 - it uses some heuristics to decide how many times a filter should appear in 
the history of 256 filters in order to be cached.
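
Other policies are possible by implementing the interface directly. As a 
minimal hand-rolled sketch (under the same assumed signatures as above), a 
policy could for instance ignore usage entirely and only cache on sufficiently 
large segments:

{code:java}
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;

// Hypothetical policy: cache on any segment that holds at least 100k
// documents, regardless of how often the filter has been used. Small
// segments are likely to be merged away soon, so caching there is wasted work.
class LargeSegmentsOnlyPolicy implements FilterCachingPolicy {

  @Override
  public void onUse(Filter filter) {
    // no usage tracking needed for this policy
  }

  @Override
  public boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set)
      throws IOException {
    return context.reader().maxDoc() >= 100_000;
  }
}
{code}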

The caching policy can be configured on a per-filter basis, so even if some 
filters should be cached more aggressively than others, they can all share a 
single FilterCache instance.
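
For example (again with assumed signatures, and QueryWrapperFilter used purely 
for illustration), two filters with different policies could share one cache 
like this:

{code:java}
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;

class FilterCacheWiring {

  static void wire() {
    // One shared cache with a bound on entry count and RAM usage.
    FilterCache cache = new LRUFilterCache(256, 32 * 1024 * 1024);

    // This filter is cached lazily, based on the usage-tracking heuristics.
    Filter countryFilter = cache.doCache(
        new QueryWrapperFilter(new TermQuery(new Term("country", "FR"))),
        new UsageTrackingFilterCachingPolicy());

    // This filter is cached unconditionally, via a trivial inline policy.
    Filter visibleFilter = cache.doCache(
        new QueryWrapperFilter(new TermQuery(new Term("status", "visible"))),
        new FilterCachingPolicy() {
          @Override
          public void onUse(Filter filter) {}

          @Override
          public boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set) {
            return true;
          }
        });
  }
}
{code}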

> Add a filter cache
> ------------------
>
>                 Key: LUCENE-6077
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6077
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-6077.patch
>
>
> Lucene already has filter caching abilities through CachingWrapperFilter, but 
> CachingWrapperFilter requires you to know which filters you want to cache 
> up-front.
> 
> Caching filters is not trivial. If you cache too aggressively, then you slow 
> things down since you need to iterate over all documents that match the 
> filter in order to load it into an in-memory cacheable DocIdSet. On the other 
> hand, if you don't cache at all, you are potentially missing interesting 
> speed-ups on frequently-used filters.
> 
> Something that would be nice would be to have a generic filter cache that 
> would track usage of individual filters and decide whether or not to cache a 
> filter on a given segment based on usage statistics and various heuristics, 
> such as:
>  - the overhead to cache the filter (for instance some filters produce 
> DocIdSets that are already cacheable)
>  - the cost to build the DocIdSet (the getDocIdSet method is very expensive 
> on some filters such as MultiTermQueryWrapperFilter that potentially need to 
> merge lots of postings lists)
>  - the segment we are searching on (freshly flushed segments will likely be 
> merged right away, so it's probably not worth building a cache on such 
> segments)


