[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ]

Andy Liu commented on LUCENE-855:
---------------------------------

Yes, it looks the same. Glad this will finally make it into the source!

> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.1
>            Reporter: Andy Liu
>         Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch,
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch,
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch,
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch,
> TestRangeFilterPerformanceComparison.java,
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall
> within the specified range. This requires iterating through every single
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all (docId, value) pairs of a given field,
> sorts them by value, and stores them in a SortedFieldCache. During bits(),
> binary searches are used to find the start and end indices of the lower and
> upper bound values. The BitSet is populated with all the docId values that
> fall between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed
> index with random date values within a 5 year range. Executing bits() 1000
> times on a standard RangeQuery using random date intervals took 63904ms.
> Using MemoryCachedRangeFilter, it took 876ms. The performance increase is
> less dramatic when a field has fewer unique terms or the index has fewer
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are
> stored in a long[] array), but it can easily be changed to support Strings.
> A side "benefit" of storing the values as longs is that there's no longer
> any need to make the values lexicographically comparable, i.e. by padding
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is that there's a fairly
> significant memory requirement, so it's designed for situations where range
> filter performance is critical and memory consumption is not an issue. The
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.
> MemoryCachedRangeFilter also requires a warmup step, which can take a while
> on large datasets (it took 40s to run on a 3M document corpus). Warmup can
> be called explicitly, or it is called automatically the first time
> MemoryCachedRangeFilter is applied to a given field.
> So in summary, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - The field contains many unique numeric values
> - The index contains a large number of documents

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
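For reference, the cache-and-binary-search scheme described in the issue summary can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the attached patch: the class name SortedFieldCacheSketch and its methods are invented for the example.

```java
import java.util.BitSet;

// Sketch of the MemoryCachedRangeFilter idea: cache all (docId, value)
// pairs sorted by value, then answer a range query with two binary
// searches and a scan over the matching slice.
public class SortedFieldCacheSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the doc holding values[i]

    public SortedFieldCacheSketch(long[] sortedValues, int[] docIdsByValue) {
        this.values = sortedValues;
        this.docIds = docIdsByValue;
    }

    // All docs whose value lies in [lower, upper], as described for bits().
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        int start = lowerBound(lower); // first index with value >= lower
        int end = upperBound(upper);   // first index with value > upper
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    private int lowerBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    private int upperBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

The two O(log n) searches replace the per-term TermEnum/TermDocs walk, which is where the speedup in the benchmark comes from.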
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527134 ]

Andy Liu commented on LUCENE-794:
---------------------------------

Ah, I wasn't crazy. I had the test data wrong. Here's the code I'm using to produce the failing result:

    String text = "y z x y z a b";
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("\"x y z\"");
    CachingTokenFilter tokenStream =
        new CachingTokenFilter(analyzer.tokenStream("body", new StringReader(text)));
    Highlighter highlighter = new Highlighter(new SpanScorer(query, "body", tokenStream));
    highlighter.setTextFragmenter(new NullFragmenter());
    tokenStream.reset();
    String result = highlighter.getBestFragments(tokenStream, text, 1, "...");
    System.out.println(result);

This produces (with brackets denoting highlighted terms):

    [y] [z] x [y] [z] a b

The beginning y and z shouldn't be highlighted. If I change the beginning y and z to x and y, I get the correct result:

    "x y x y z a b" => x y [x] [y] [z] a b

Here are a couple of other failing results:

    "z x y z a b" => [z] [x] [y] [z] a b
    "z a x y z a b" => [z] a [x] [y] [z] a b

FYI, I'm using the latest version of Lucene.
> Extend contrib Highlighter to properly support phrase queries and span queries
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-794
>                 URL: https://issues.apache.org/jira/browse/LUCENE-794
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Other
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: CachedTokenStream.java, CachedTokenStream.java,
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java,
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java,
> Highlighter.java, HighlighterTest.java, HighlighterTest.java,
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java,
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java,
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch,
> spanhighlighter10.patch, spanhighlighter2.patch, spanhighlighter3.patch,
> spanhighlighter5.patch, spanhighlighter6.patch, spanhighlighter7.patch,
> spanhighlighter8.patch, spanhighlighter9.patch, spanhighlighter_patch_4.zip,
> SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java,
> SpanScorer.java, WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter
> package that scores just like QueryScorer, but scores 0 for terms that did
> not cause the query hit. This gives 'actual' hit highlighting for the range
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526847 ]

Andy Liu commented on LUCENE-794:
---------------------------------

Hmm, I tried it again and now it's working correctly. Maybe I had interpreted the output incorrectly. Sorry for the false alarm.
[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries
[ https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526803 ]

Andy Liu commented on LUCENE-794:
---------------------------------

I gave this patch a whirl, and it looks great. I do see one problem. Say a document contains:

    x y z a b y z

and the query is:

    "x y z"

The highlighter will return (with terms in brackets denoting highlighted terms):

    [x] [y] [z] a b [y] [z]

Since the last y and z are not part of the full phrase, they should not be highlighted.
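The behavior requested above — highlight only tokens that fall inside a complete phrase match — amounts to a position check against the matched spans. A minimal invented sketch (SpanHighlightSketch and its inputs are not the patch's API; spans are assumed as {start, end-exclusive} position pairs):

```java
// Marks only tokens whose position lies inside a matched span, so a
// trailing partial "y z" outside the full "x y z" match stays plain.
public class SpanHighlightSketch {
    // tokens: the document's terms in position order
    // spans:  {start, end} position ranges (end exclusive) where the phrase matched
    public static String highlight(String[] tokens, int[][] spans) {
        StringBuilder sb = new StringBuilder();
        for (int pos = 0; pos < tokens.length; pos++) {
            boolean inSpan = false;
            for (int[] s : spans) {
                if (pos >= s[0] && pos < s[1]) { inSpan = true; break; }
            }
            if (sb.length() > 0) sb.append(' ');
            sb.append(inSpan ? "[" + tokens[pos] + "]" : tokens[pos]);
        }
        return sb.toString();
    }
}
```

For the example document "x y z a b y z" with one matched span at positions 0-3, this yields "[x] [y] [z] a b y z", i.e. the trailing y and z are left unhighlighted as the comment requires.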
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: contrib-filters.tar.gz

I made a few changes to MemoryCachedRangeFilter:

- SortedFieldCache's values[] now contains only sorted unique values, while docId[] has been changed to a ragged 2D array with an array of docIds corresponding to each unique value. Since there are no longer repeated values in values[], forward() and rewind() are no longer required. This also addresses the O(n) special case that Hoss brought up where every value is identical.

- bits() now returns OpenBitSetWrapper, a subclass of BitSet that uses Solr's OpenBitSet as a delegate. Wrapping OpenBitSet presents some challenges: since the internal bit store of BitSet is private, it's difficult to perform operations between a BitSet and an OpenBitSet (like or, and, etc.).

- An in-memory OpenBitSet cache is kept. During warmup, the global range is partitioned and OpenBitSet instances are created for each partition. During bits(), the cached OpenBitSet instances that fall between the lower and upper bounds are used.

- Moved MCRF to contrib/ due to the Solr dependency.

Using the current (and incomplete) benchmark, MemoryCachedRangeFilter is slightly faster than FCRF when used in conjunction with ConstantRangeQuery and MatchAllDocsQuery:

Reader opened with 10 documents. Creating RangeFilters...

TermQuery
  FieldCacheRangeFilter            * Total: 88ms    * Bits: 0ms     * Search: 14ms
  MemoryCachedRangeFilter          * Total: 89ms    * Bits: 17ms    * Search: 31ms
  RangeFilter                      * Total: 9034ms  * Bits: 4483ms  * Search: 4521ms
  Chained FieldCacheRangeFilter    * Total: 33ms    * Bits: 3ms     * Search: 9ms
  Chained MemoryCachedRangeFilter  * Total: 77ms    * Bits: 19ms    * Search: 30ms

ConstantScoreQuery
  FieldCacheRangeFilter            * Total: 541ms   * Bits: 2ms     * Search: 485ms
  MemoryCachedRangeFilter          * Total: 473ms   * Bits: 23ms    * Search: 390ms
  RangeFilter                      * Total: 13777ms * Bits: 4451ms  * Search: 9298ms
  Chained FieldCacheRangeFilter    * Total: 12ms    * Bits: 2ms     * Search: 5ms
  Chained MemoryCachedRangeFilter  * Total: 80ms    * Bits: 16ms    * Search: 44ms

MatchAllDocsQuery
  FieldCacheRangeFilter            * Total: 1231ms  * Bits: 3ms     * Search: 1115ms
  MemoryCachedRangeFilter          * Total: 1222ms  * Bits: 53ms    * Search: 1149ms
  RangeFilter                      * Total: 10689ms * Bits: 4954ms  * Search: 5583ms
  Chained FieldCacheRangeFilter    * Total: 937ms   * Bits: 1ms     * Search: 862ms
  Chained MemoryCachedRangeFilter  * Total: 921ms   * Bits: 19ms    * Search: 894ms

Hoss, those were great comments you made. I'd be happy to continue on and make those changes, although if the feeling around town is that Matt's range filter is the preferred implementation, I'll stop here.
> MemoryCachedRangeFilter to boost performance of Range queries
> -------------------------------------------------------------
>
>                 Key: LUCENE-855
>                 URL: https://issues.apache.org/jira/browse/LUCENE-855
>            Reporter: Andy Liu
>         Assigned To: Otis Gospodnetic
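The unique-values/ragged-array layout described in the update above can be sketched as follows. This is an invented illustration under the comment's description, not the attached contrib code; all names are hypothetical.

```java
import java.util.BitSet;

// Sketch of the reworked cache: values[] holds only sorted *unique*
// field values, and docIds[i] holds every docId whose field value
// equals values[i] (a ragged 2D array).
public class UniqueValueCacheSketch {
    private final long[] values;   // sorted unique values
    private final int[][] docIds;  // ragged: docIds[i] = docs with values[i]

    public UniqueValueCacheSketch(long[] uniqueSortedValues, int[][] docIdsPerValue) {
        this.values = uniqueSortedValues;
        this.docIds = docIdsPerValue;
    }

    // All docs whose value lies in [lower, upper]. Because values[] has
    // no duplicates, the binary search needs no forward()/rewind() scan
    // to step over runs of equal values, even when every value is identical.
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        for (int i = firstAtLeast(lower); i < values.length && values[i] <= upper; i++) {
            for (int doc : docIds[i]) {
                result.set(doc);
            }
        }
        return result;
    }

    private int firstAtLeast(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

With one entry per distinct value, the all-identical-values case Hoss raised collapses to a single values[] slot whose docId array is simply copied into the result.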
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897 ]

Andy Liu commented on LUCENE-855:
---------------------------------

Hey Matt, I get this exception when running your newest FCRF with the performance test. Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 10
    at org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
    at org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
    at org.apache.lucene.search.Scorer.score(Scorer.java:49)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
    at org.apache.lucene.search.Hits.<init>(Hits.java:53)
    at org.apache.lucene.search.Searcher.search(Searcher.java:46)
    at org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
    at org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Liu updated LUCENE-855:
----------------------------

    Attachment: TestRangeFilterPerformanceComparison.java

Here's my new benchmark.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595 ]

Andy Liu commented on LUCENE-855:
---------------------------------

In your updated benchmark, you're combining the range filter with a term query that matches one document. I don't believe that's the typical use case for a range filter. Usually the user employs a range to filter a large document set.

I created a different benchmark to compare the standard RangeFilter, MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc, like the last benchmark). Here are the results:

Reader opened with 10 documents. Creating RangeFilters...

RangeFilter w/MatchAllDocsQuery:              * Bits: 4421  * Search: 5285
RangeFilter w/ConstantScoreQuery:             * Bits: 4200  * Search: 8694
RangeFilter w/TermQuery:                      * Bits: 4088  * Search: 4133
MemoryCachedRangeFilter w/MatchAllDocsQuery:  * Bits: 80    * Search: 1142
MemoryCachedRangeFilter w/ConstantScoreQuery: * Bits: 79    * Search: 482
MemoryCachedRangeFilter w/TermQuery:          * Bits: 73    * Search: 95
FieldCacheRangeFilter w/MatchAllDocsQuery:    * Bits: 0     * Search: 1146
FieldCacheRangeFilter w/ConstantScoreQuery:   * Bits: 1     * Search: 356
FieldCacheRangeFilter w/TermQuery:            * Bits: 0     * Search: 19

Here are some points:

1. When searching with a filter, bits() is called, so the search time includes the bits() time.

2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not by much. Using MatchAllDocsQuery, they run neck-and-neck. FCRF is much faster for TermQuery, since MCRF has to create the BitSet for the range before the search is executed.

3. I get fewer document hits when running FieldCacheRangeFilter with ConstantScoreQuery. Matt, there may be a bug in getNextSetBit(). Not sure if this would affect the benchmark.

4. I'd be interested to see performance numbers when FieldCacheRangeFilter is used with ChainedFilter. I suspect that MCRF would be faster in this case, since I'm assuming that FCRF has to reconstruct a standard BitSet during clone().
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487378 ]

Andy Liu commented on LUCENE-855:
---------------------------------

Hey Matt,

The way you implemented FieldCacheRangeFilter is very simple and clever! Here are a couple of comments:

1. The performance test that we both used is no longer valid, since FieldCacheRangeFilter.bits() only returns a wrapper around a BitSet and the test only calls bits(). Since you're wrapping BitSet, there's some overhead incurred when applying it to an actual search. I reran the performance test applying the Filter to a search, and your implementation is still faster, although only slightly.

2. Your filter currently doesn't work with ConstantRangeQuery. CRQ calls bits.nextSetBit(), which fails in your wrapped BitSet implementation. Your incomplete implementation of BitSet may cause problems elsewhere.

If you can fix #2 I'd vote for your implementation, since it's cleaner and faster, although I might take another stab at trying to improve my implementation.
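The nextSetBit() failure described above is the classic hazard of partially overriding java.util.BitSet: any method you don't override keeps consulting BitSet's own private word array rather than your delegate. A minimal invented illustration (not the actual FCRF code; the delegate here is just a boolean[] standing in for the real bit store):

```java
import java.util.BitSet;

// A BitSet subclass that delegates get() but forgets to override
// nextSetBit(): iteration-style callers such as ConstantScoreQuery
// then see an apparently empty set even though get() reports set bits.
public class LeakyBitSetWrapper extends BitSet {
    private final boolean[] delegate; // stand-in for the wrapped bit store

    public LeakyBitSetWrapper(boolean[] delegate) {
        this.delegate = delegate;
    }

    @Override
    public boolean get(int index) {
        return index < delegate.length && delegate[index];
    }

    // nextSetBit() is deliberately NOT overridden: the inherited
    // implementation scans BitSet's internal words, which were never
    // touched, so it returns -1 (no set bits) for every starting index.
}
```

This is why an incomplete BitSet override can pass a get()-based test yet break any caller that walks the set via nextSetBit(), cardinality(), or the logical and/or operations.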
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Liu updated LUCENE-855: Attachment: MemoryCachedRangeFilter_1.4.patch Here's a patch that should compile in Java 1.4 . It includes src/java/org/apache/lucene/search/MemoryCachedRangeFilter.java src/test/org/apache/lucene/search/TestMemoryCachedRangeFilter.java src/test/org/apache/lucene/search/TestMemoryCachedRangeFilterPerformance.java You can try using TestMemoryCachedRangeFilterPerformance to compare runtime speed numbers. Let me know if you have any problem running these. > MemoryCachedRangeFilter to boost performance of Range queries > - > > Key: LUCENE-855 > URL: https://issues.apache.org/jira/browse/LUCENE-855 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.1 >Reporter: Andy Liu > Attachments: MemoryCachedRangeFilter.patch, > MemoryCachedRangeFilter_1.4.patch > > > Currently RangeFilter uses TermEnum and TermDocs to find documents that fall > within the specified range. This requires iterating through every single > term in the index and can get rather slow for large document sets. > MemoryCachedRangeFilter reads all pairs of a given field, > sorts by value, and stores in a SortedFieldCache. During bits(), binary > searches are used to find the start and end indices of the lower and upper > bound values. The BitSet is populated by all the docId values that fall in > between the start and end indices. > TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed > index with random date values within a 5 year range. Executing bits() 1000 > times on standard RangeQuery using random date intervals took 63904ms. Using > MemoryCachedRangeFilter, it took 876ms. Performance increase is less > dramatic when you have less unique terms in a field or using less number of > documents. 
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791 ] Andy Liu commented on LUCENE-855: - Ah, you're right. I didn't read closely enough!
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767 ] Andy Liu commented on LUCENE-855: - Otis, looking forward to your colleague's patch. LUCENE-798 caches RangeFilters so that if the exact same range is executed again, the cached RangeFilter is used. However, the first time a range is encountered you still have to calculate the RangeFilter, which can be slow. I haven't looked at the patch, but I'm sure LUCENE-798 can be used in conjunction with MemoryCachedRangeFilter to further boost performance for repeated range queries.
[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Liu updated LUCENE-855: Attachment: MemoryCachedRangeFilter.patch Patch produced against the latest SVN.
[jira] Created: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is then populated with all the docId values that fall between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K-document RAMDirectory-backed index with random date values within a 5-year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms; using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents. Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side "benefit" of storing the values as longs is that there's no longer any need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is its fairly significant memory requirement, so it's designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirement is (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step, which can take a while on large datasets (it took 40s on a 3M-document corpus).
Warmup can be called explicitly, or it is automatically run the first time MemoryCachedRangeFilter is applied to a given field. So in summary, MemoryCachedRangeFilter can be useful when: - Performance is critical - Memory is not an issue - The field contains many unique numeric values - The index contains a large number of documents
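The cache-then-binary-search approach described above can be sketched in a few lines. This is a minimal illustration of the idea, not the patch's actual API: the class and method names here are hypothetical, and the real MemoryCachedRangeFilter builds its arrays from the index during warmup rather than taking them in a constructor.

```java
import java.util.BitSet;

// Hypothetical sketch of the SortedFieldCache idea: each document's field
// value is cached as a long, the (docId, value) pairs are sorted by value,
// and a range query becomes two binary searches plus a scan of the slice
// in between, instead of a full TermEnum walk.
public class SortedFieldCacheSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the doc holding values[i]

    public SortedFieldCacheSketch(long[] sortedValues, int[] parallelDocIds) {
        this.values = sortedValues;
        this.docIds = parallelDocIds;
    }

    // First index whose value is >= target (lower bound).
    private int lowerBound(long target) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Set a bit for every doc whose value lies in [lower, upper].
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        int start = lowerBound(lower);
        int end = lowerBound(upper + 1); // exclusive end of the range
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Values already sorted; docIds permuted to match.
        long[] values = {10, 20, 30, 40, 50};
        int[] docIds  = { 3,  0,  4,  1,  2};
        SortedFieldCacheSketch cache =
            new SortedFieldCacheSketch(values, docIds);
        System.out.println(cache.bits(20, 40)); // docs 0, 1, and 4 match
    }
}
```

The memory formula above follows directly from the two parallel arrays: with 4-byte docIds and 8-byte values, a 3M-document corpus needs roughly 12 bytes * 3,000,000 = ~36 MB for one cached field.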