[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Ericson updated LUCENE-855:
--------------------------------

    Attachment: TestRangeFilterPerformanceComparison.java

Andy, thank you for that test. I took it, moved it to contrib/miscellaneous, and added a few more tests, including the Chained Filter test. Here is my version. I also fixed a few bugs in my code, which I will be attaching next.

I also reformatted my results; I think they are a little easier to read.

Here is what I get. You're right: if you use a MatchAllDocsQuery, our two versions of the code are about the same.

    [junit] ------------- Standard Output ---------------
    [junit] Start interval: Thu Apr 11 10:55:02 PDT 2002
    [junit] End interval: Tue Apr 10 10:55:02 PDT 2007
    [junit] Creating RAMDirectory index...
    [junit] Reader opened with 100000 documents. Creating RangeFilters...
    [junit] TermQuery
    [junit]   FieldCacheRangeFilter
    [junit]     * Total: 13ms
    [junit]     * Bits: 0ms
    [junit]     * Search: 9ms
    [junit]   MemoryCachedRangeFilter
    [junit]     * Total: 209ms
    [junit]     * Bits: 90ms
    [junit]     * Search: 115ms
    [junit]   RangeFilter
    [junit]     * Total: 12068ms
    [junit]     * Bits: 6009ms
    [junit]     * Search: 6051ms
    [junit]   Chained FieldCacheRangeFilter
    [junit]     * Total: 15ms
    [junit]     * Bits: 1ms
    [junit]     * Search: 10ms
    [junit]   Chained MemoryCachedRangeFilter
    [junit]     * Total: 177ms
    [junit]     * Bits: 83ms
    [junit]     * Search: 90ms
    [junit] ConstantScoreQuery
    [junit]   FieldCacheRangeFilter
    [junit]     * Total: 480ms
    [junit]     * Bits: 1ms
    [junit]     * Search: 474ms
    [junit]   MemoryCachedRangeFilter
    [junit]     * Total: 757ms
    [junit]     * Bits: 90ms
    [junit]     * Search: 663ms
    [junit]   RangeFilter
    [junit]     * Total: 18749ms
    [junit]     * Bits: 6083ms
    [junit]     * Search: 12655ms
    [junit]   Chained FieldCacheRangeFilter
    [junit]     * Total: 11ms
    [junit]     * Bits: 0ms
    [junit]     * Search: 8ms
    [junit]   Chained MemoryCachedRangeFilter
    [junit]     * Total: 776ms
    [junit]     * Bits: 87ms
    [junit]     * Search: 682ms
    [junit] MatchAllDocsQuery
    [junit]   FieldCacheRangeFilter
    [junit]     * Total: 1344ms
    [junit]     * Bits: 5ms
    [junit]     * Search: 1334ms
    [junit]   MemoryCachedRangeFilter
    [junit]     * Total: 1468ms
    [junit]     * Bits: 81ms
    [junit]     * Search: 1381ms
    [junit]   RangeFilter
    [junit]     * Total: 13360ms
    [junit]     * Bits: 6091ms
    [junit]     * Search: 7254ms
    [junit]   Chained FieldCacheRangeFilter
    [junit]     * Total: 924ms
    [junit]     * Bits: 4ms
    [junit]     * Search: 916ms
    [junit]   Chained MemoryCachedRangeFilter
    [junit]     * Total: 1507ms
    [junit]     * Bits: 84ms
    [junit]     * Search: 1415ms
    [junit] ------------- ---------------- ---------------

> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
>         Attachments: FieldCacheRangeFilter.patch,
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch,
> TestRangeFilterPerformanceComparison.java,
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range.  This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all <docId, value> pairs of a given field,
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary
> searches are used to find the start and end indices of the lower and upper
> bound values.  The BitSet is populated by all the docId values that fall in
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5 year range.  Executing bits() 1000
> times on standard RangeQuery using random date intervals took 63904ms.  Using
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less
> dramatic when you have fewer unique terms in a field or fewer
> documents.
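The lookup described in the issue (sort the <docId, value> pairs by value, then binary-search the lower and upper bounds inside bits()) can be sketched roughly as follows. This is a minimal, JDK-only illustration and not the code from the attached patches: the class name SortedValueCacheSketch, its constructor argument, and the choice of inclusive bounds are assumptions made for the example, and a plain long[] stands in for the values that would be read from the index.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;

/** Hypothetical stand-in for the SortedFieldCache idea; not the patch's actual class. */
class SortedValueCacheSketch {
    private final long[] values; // field values in ascending order
    private final int[] docIds;  // docIds[i] is the document that holds values[i]
    private final int maxDoc;

    /** "Warmup" step: valueByDoc[d] is the numeric field value of document d. */
    SortedValueCacheSketch(long[] valueByDoc) {
        maxDoc = valueByDoc.length;
        Integer[] order = new Integer[maxDoc];
        for (int d = 0; d < maxDoc; d++) {
            order[d] = d;
        }
        // Sort document ids by their field value.
        Arrays.sort(order, Comparator.comparingLong((Integer d) -> valueByDoc[d]));
        values = new long[maxDoc];
        docIds = new int[maxDoc];
        for (int i = 0; i < maxDoc; i++) {
            docIds[i] = order[i];
            values[i] = valueByDoc[order[i]];
        }
    }

    /** Index of the first entry whose value is >= key (values.length if none). */
    private int lowerBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Index of the first entry whose value is > key (values.length if none). */
    private int upperBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Sets one bit per document whose value lies in [lower, upper], both ends inclusive (an assumption). */
    BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(maxDoc);
        int start = lowerBound(lower);
        int end = upperBound(upper);
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }
}
```

For example, with values {50, 10, 30, 10, 40} for documents 0 to 4, bits(10, 30) sets the bits for documents 1, 2 and 3. Each call costs two binary searches plus one pass over the matching entries, rather than a walk over every term in the field.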
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array) but it can be easily changed to support Strings.  A
> side "benefit" of storing the values as longs is that there's no
> longer the need to make the values lexicographically comparable, i.e. padding
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant
> memory requirement.  So it's designed to be used in situations where range
> filter performance is critical and memory consumption is not an issue.  The
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step which can take a while to
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup
> can be called explicitly or is automatically called the first time
> MemoryCachedRangeFilter is applied using a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains a large number of documents
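To put the memory formula from the issue description, (sizeof(int) + sizeof(long)) * numDocs, into concrete numbers, here is a small worked sketch. The class and method names are hypothetical, and the figure is only the raw payload of the two arrays for one cached field; array headers and other JVM overhead are ignored.

```java
/** Hypothetical helper; evaluates the per-field formula quoted in the issue. */
class RangeCacheMemoryEstimate {
    /** (sizeof(int) + sizeof(long)) * numDocs, with Java's fixed 4-byte int and 8-byte long. */
    static long estimateBytes(long numDocs) {
        return (Integer.BYTES + Long.BYTES) * numDocs;
    }
}
```

Under these assumptions, the 100K-document test index needs about 1,200,000 bytes per cached field, and the 3M-document corpus mentioned for the warmup timing needs about 36,000,000 bytes.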