[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: TestRangeFilterPerformanceComparison.java

Andy, thank you for that test.

I took it, moved it to contrib/miscellaneous, and added a few more tests,
including the Chained Filter test. Here is my version. I also fixed a few bugs
in my code, which I will be attaching next.

I also reformatted my results; I think they are a little easier to read.
Here is what I get, and you're right: if you use a MatchAllDocsQuery, our two
versions of the code are about the same.

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Thu Apr 11 10:55:02 PDT 2002
    [junit] End interval: Tue Apr 10 10:55:02 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents.  Creating RangeFilters...

    [junit] TermQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 13ms
    [junit]   * Bits: 0ms
    [junit]   * Search: 9ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 209ms
    [junit]   * Bits: 90ms
    [junit]   * Search: 115ms
    [junit] RangeFilter
    [junit]   * Total: 12068ms
    [junit]   * Bits: 6009ms
    [junit]   * Search: 6051ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 15ms
    [junit]   * Bits: 1ms
    [junit]   * Search: 10ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 177ms
    [junit]   * Bits: 83ms
    [junit]   * Search: 90ms

    [junit] ConstantScoreQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 480ms
    [junit]   * Bits: 1ms
    [junit]   * Search: 474ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 757ms
    [junit]   * Bits: 90ms
    [junit]   * Search: 663ms
    [junit] RangeFilter
    [junit]   * Total: 18749ms
    [junit]   * Bits: 6083ms
    [junit]   * Search: 12655ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 11ms
    [junit]   * Bits: 0ms
    [junit]   * Search: 8ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 776ms
    [junit]   * Bits: 87ms
    [junit]   * Search: 682ms

    [junit] MatchAllDocsQuery

    [junit] FieldCacheRangeFilter
    [junit]   * Total: 1344ms
    [junit]   * Bits: 5ms
    [junit]   * Search: 1334ms
    [junit] MemoryCachedRangeFilter
    [junit]   * Total: 1468ms
    [junit]   * Bits: 81ms
    [junit]   * Search: 1381ms
    [junit] RangeFilter
    [junit]   * Total: 13360ms
    [junit]   * Bits: 6091ms
    [junit]   * Search: 7254ms
    [junit] Chained FieldCacheRangeFilter
    [junit]   * Total: 924ms
    [junit]   * Bits: 4ms
    [junit]   * Search: 916ms
    [junit] Chained MemoryCachedRangeFilter
    [junit]   * Total: 1507ms
    [junit]   * Bits: 84ms
    [junit]   * Search: 1415ms
    [junit] ------------- ---------------- ---------------


> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
>         Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field, 
> sorts them by value, and stores them in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  The performance increase is less 
> dramatic when a field has fewer unique terms or the index has fewer 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values as longs is that there's no longer any 
> need to make the values lexicographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents
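
For readers following the description quoted above, here is a minimal sketch of the sorted-cache idea: sort <docId, value> pairs by value, binary-search both bounds, and set the bits for every docId in between. It is not the code from the attached patches. The class name SortedRangeCacheSketch and its methods are hypothetical, it uses only java.util rather than the Lucene API, and it assumes one numeric value per document with both range bounds inclusive.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

/** Hypothetical stand-in for the SortedFieldCache idea; not the patch code. */
public class SortedRangeCacheSketch {
    private final long[] sortedValues; // field values in ascending order
    private final int[] docIds;        // docIds[i] is the doc holding sortedValues[i]
    private final int maxDoc;

    /** Warmup step: valuesByDoc[docId] is that document's numeric field value. */
    public SortedRangeCacheSketch(long[] valuesByDoc) {
        maxDoc = valuesByDoc.length;
        Integer[] order = new Integer[maxDoc];
        for (int i = 0; i < maxDoc; i++) order[i] = i;
        // Sort doc ids by their field value.
        Arrays.sort(order, Comparator.comparingLong((Integer d) -> valuesByDoc[d]));
        sortedValues = new long[maxDoc];
        docIds = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            docIds[i] = order[i];
            sortedValues[i] = valuesByDoc[order[i]];
        }
    }

    /** First index whose value is >= key (or > key when strict is true). */
    private int firstIndex(long key, boolean strict) {
        int lo = 0, hi = sortedValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            boolean goRight = strict ? sortedValues[mid] <= key : sortedValues[mid] < key;
            if (goRight) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Sets a bit for every document whose value lies in [lower, upper]. */
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(maxDoc);
        int start = firstIndex(lower, false); // first value >= lower
        int end = firstIndex(upper, true);    // first value > upper
        for (int i = start; i < end; i++) result.set(docIds[i]);
        return result;
    }

    /** (sizeof(int) + sizeof(long)) * numDocs, ignoring JVM array overhead. */
    public long estimatedBytes() {
        return (long) (Integer.BYTES + Long.BYTES) * maxDoc;
    }

    public static void main(String[] args) {
        SortedRangeCacheSketch cache =
            new SortedRangeCacheSketch(new long[] {50, 10, 30, 20, 40});
        System.out.println(cache.bits(20, 40));     // prints {2, 3, 4}
        System.out.println(cache.estimatedBytes()); // prints 60
    }
}
```

By the same formula, the 3M document corpus mentioned in the description would need about 12 bytes * 3,000,000 = 36 MB for the two arrays, before any JVM overhead.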

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

