MemoryCachedRangeFilter to boost performance of Range queries
-------------------------------------------------------------

                 Key: LUCENE-855
                 URL: https://issues.apache.org/jira/browse/LUCENE-855
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.1
            Reporter: Andy Liu


Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
within the specified range.  This requires iterating through every single term 
in the index and can get rather slow for large document sets.

MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, sorts 
by value, and stores in a SortedFieldCache.  During bits(), binary searches are 
used to find the start and end indices of the lower and upper bound values.  
The BitSet is populated by all the docId values that fall in between the start 
and end indices.

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index 
with random date values within a 5 year range.  Executing bits() 1000 times on 
standard RangeQuery using random date intervals took 63904ms.  Using 
MemoryCachedRangeFilter, it took 876ms.  Performance increase is less dramatic 
when you have less unique terms in a field or using less number of documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are 
stored in a long[] array) but it can be easily changed to support Strings.  A 
side "benefit" of storing the values are stored as longs, is that there's no 
longer the need to make the values lexographically comparable, i.e. padding 
numeric values with zeros.

The downside of using MemoryCachedRangeFilter is there's a fairly significant 
memory requirement.  So it's designed to be used in situations where range 
filter performance is critical and memory consumption is not an issue.  The 
memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  

MemoryCachedRangeFilter also requires a warmup step which can take a while to 
run in large datasets (it took 40s to run on a 3M document corpus).  Warmup can 
be called explicitly or is automatically called the first time 
MemoryCachedRangeFilter is applied using a given field.

So in summery, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains large amount of documents


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to