[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

Matt Ericson (JIRA) Fri, 06 Apr 2007 13:54:53 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: FieldCacheRangeFilter.patch

Here is my version of a FieldCacheRangeFilter.

I used different class names so that both patches can be applied to Lucene, but 
they do almost the same thing, so I do not think both should be committed. 

This filter uses the field cache to get values out of the index and then 
creates BitSets that are proxies to the field cache.  So when you call 
BitSet.get(int bitIndex), it checks the field cache. 

One thing to note: since this is a proxy to the field cache, it will not work 
with the current version of ChainedFilters.java (but I have a fix for that 
also). The chained filter makes a copy of the bit set and flips the bits, and 
that does not work with this bit set because it holds no bits of its own. 

This version uses less memory, since there is only one copy of the data and 
the BitSet is just a proxy with no data in it. 
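
To illustrate the proxy idea described above, here is a minimal sketch. It is 
not the code from the attached patch: the class names ProxyBitSetSketch and 
RangeProxyBitSet are hypothetical, and a plain long[] stands in for the array 
the field cache would return. It also shows why operations that read the 
BitSet's internal bits (such as or() or cardinality()) see nothing.

```java
import java.util.BitSet;

public class ProxyBitSetSketch {

    /**
     * Hypothetical proxy BitSet: it stores no bits of its own and answers
     * get() by checking the cached per-document value against the range.
     */
    static class RangeProxyBitSet extends BitSet {
        private final long[] values; // assumption: stand-in for the field cache array
        private final long lower;    // inclusive lower bound
        private final long upper;    // inclusive upper bound

        RangeProxyBitSet(long[] values, long lower, long upper) {
            this.values = values;
            this.lower = lower;
            this.upper = upper;
        }

        @Override
        public boolean get(int docId) {
            if (docId < 0) {
                throw new IndexOutOfBoundsException("docId < 0: " + docId);
            }
            if (docId >= values.length) {
                return false; // no such document, so the bit is unset
            }
            long v = values[docId];
            return v >= lower && v <= upper;
        }
    }

    public static void main(String[] args) {
        long[] cache = {5L, 20L, 10L, 42L};
        BitSet proxy = new RangeProxyBitSet(cache, 8L, 20L);

        // get() consults the cached values.
        for (int doc = 0; doc < cache.length; doc++) {
            System.out.println("doc " + doc + " in range: " + proxy.get(doc));
        }

        // Bulk operations read the internal words, which are never populated.
        BitSet copy = new BitSet();
        copy.or(proxy);
        System.out.println("copy is empty: " + copy.isEmpty());
    }
}
```

Only get() is overridden in this sketch, so any caller that combines or flips 
bit sets through the bulk BitSet methods would need to be adapted separately.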

I want to thank Andy, as I have used/stolen all of your tests and modified them 
just a bit so they work with my version. And since we have the same performance 
tests, here are the numbers.

Using Andy's code 

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Sun Apr 07 13:49:15 PDT 2002
    [junit] End interval: Fri Apr 06 13:49:15 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 58585ms
    [junit] MemoryCachedRangeFilter inished in 825ms
    [junit] ------------- ---------------- ---------------

Using my code 

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Sun Apr 07 13:40:52 PDT 2002
    [junit] End interval: Fri Apr 06 13:40:52 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...
    [junit] Standard RangeFilter finished in 58528ms
    [junit] FieldCacheRangeFilter inished in 30ms
    [junit] ------------- ---------------- ---------------


> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Attachments: FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there is no longer any 
> need to make the values lexicographically comparable, i.e. padding numeric 
> values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents
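
The approach in the quoted description (sort the <docId, value> pairs by 
value, binary search for both bounds, then set the docIds in between) can be 
sketched roughly as follows. This is an illustration only, not the code from 
MemoryCachedRangeFilter.patch: the class name SortedRangeSketch and its methods 
are hypothetical, and a long[] indexed by docId stands in for the values read 
from the index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

public class SortedRangeSketch {
    private final int[] docIds;  // docIds ordered by their value
    private final long[] values; // sorted values, parallel to docIds

    /** Warmup step: sort all <docId, value> pairs by value. */
    SortedRangeSketch(long[] valueByDoc) {
        Integer[] order = new Integer[valueByDoc.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingLong((Integer d) -> valueByDoc[d]));
        docIds = new int[order.length];
        values = new long[order.length];
        for (int i = 0; i < order.length; i++) {
            docIds[i] = order[i];
            values[i] = valueByDoc[order[i]];
        }
    }

    /** First index whose value is >= key (or > key when strict is true). */
    private int bound(long key, boolean strict) {
        int lo = 0;
        int hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            boolean goRight = strict ? values[mid] <= key : values[mid] < key;
            if (goRight) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Builds a BitSet of all docIds whose value lies in [lower, upper]. */
    BitSet bits(long lower, long upper) {
        int start = bound(lower, false); // first value >= lower
        int end = bound(upper, true);    // first value > upper
        BitSet result = new BitSet(docIds.length);
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        SortedRangeSketch cache = new SortedRangeSketch(new long[] {30, 10, 20, 10, 50});
        System.out.println(cache.bits(10, 20));
    }
}
```

In this sketch the two arrays hold one int and one long per document, which 
matches the (sizeof(int) + sizeof(long)) * numDocs estimate: at 12 bytes per 
document, a 3M document corpus would need roughly 36 MB, not counting array 
object overhead.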

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


