[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653900#action_12653900 ]

Otis Gospodnetic commented on LUCENE-855:
-----------------------------------------

Hi Matt! :) Tim, want to benchmark the two? (Since you already benchmarked 1461, you should be able to plug in Matt's thing and see how it compares.)

MemoryCachedRangeFilter to boost performance of Range queries
-------------------------------------------------------------

                Key: LUCENE-855
                URL: https://issues.apache.org/jira/browse/LUCENE-855
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Search
   Affects Versions: 2.1
           Reporter: Andy Liu
        Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java

Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets.

MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, sorts them by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docId values that fall between the start and end indices.

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5-year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there is no longer any need to make the values lexicographically comparable, i.e. no padding numeric values with zeros.

The downside of MemoryCachedRangeFilter is a fairly significant memory requirement, so it's designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirement is (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step, which can take a while on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly, or it is called automatically the first time MemoryCachedRangeFilter is applied to a given field.

So in summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains a large number of documents

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
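The core idea described above can be sketched in a few lines of Java. This is an illustrative sketch only, not the patch's actual API: cache the field's (value, docId) pairs sorted by value, then answer each bits() call with a binary search over the sorted values instead of a full term scan.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of a sorted field cache for range filtering (names are illustrative).
public class SortedFieldCacheSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the doc whose value is values[i]

    // Build the cache once from the per-document field values (the warmup step).
    public SortedFieldCacheSketch(long[] fieldValuePerDoc) {
        int n = fieldValuePerDoc.length;
        long[][] pairs = new long[n][2];
        for (int doc = 0; doc < n; doc++) {
            pairs[doc][0] = fieldValuePerDoc[doc];
            pairs[doc][1] = doc;
        }
        Arrays.sort(pairs, (a, b) -> Long.compare(a[0], b[0]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = pairs[i][0];
            docIds[i] = (int) pairs[i][1];
        }
    }

    // Inclusive range [lower, upper]: a binary search bounds the slice,
    // then every docId in the slice is set in the BitSet.
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(docIds.length);
        for (int i = lowerBound(lower); i < values.length && values[i] <= upper; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    // First index whose value >= key (classic binary search lower bound).
    private int lowerBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

With this layout, each bits() call costs O(log n) for the bound search plus O(k) for the k matching documents, which is where the speedup over a per-term scan comes from.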
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653414#action_12653414 ]

Tim Sturge commented on LUCENE-855:
-----------------------------------

Matt, Andy: please take a look at LUCENE-1461. As far as I can tell it is identical in purpose and design to this patch. Matt, I would like to add you to the CHANGES.txt credits for LUCENE-1461. Are you OK with that?
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450 ]

Andy Liu commented on LUCENE-855:
---------------------------------

Yes, it looks the same. Glad this will finally make it to the source!
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I can't seem to post to Jira, so I am attaching here... I attached QueryFilter.java.

In reading this patch, and other similar ones, the problem seems to be that if the index is modified, the cache is invalidated, causing a complete reload of the cache. Do I have this correct?

The attached patch works really well in a highly interactive environment, as the cache is only invalidated at the segment level. The MyMultiReader is a subclass that allows access to the underlying SegmentReaders. The patch cannot be applied as-is, but I think the implementation works far better in many cases, and it is also far less memory intensive. Scanning the bitset could also be optimized very easily using internal skip values.

Maybe this is completely off-base, but the solution has worked very well for us. Maybe this is a completely different issue and a separate incident should be opened? Is there any interest in this?

QueryFilter.java
Description: Binary data

On Dec 4, 2008, at 2:10 PM, Andy Liu (JIRA) wrote:

> Yes, it looks the same. Glad this will finally make it to the source!
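The segment-level approach described above can be sketched as follows. This is a hedged illustration, not the attached QueryFilter.java: the Segment interface stands in for access to the underlying SegmentReaders, and the cache key is assumed to stay stable across reopens, so after a reopen only new segments recompute their bits while unchanged segments hit the cache.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;

// Per-segment filter cache sketch: one BitSet per segment, keyed on the
// segment's identity. A WeakHashMap lets entries for closed segments be
// garbage-collected instead of being invalidated wholesale.
public class PerSegmentFilterCache {
    public interface Segment {     // stand-in for a SegmentReader
        Object cacheKey();         // stable identity across reopens
        BitSet computeBits();      // the expensive per-segment filter computation
    }

    private final Map<Object, BitSet> cache = new WeakHashMap<>();
    public int misses = 0;         // exposed so the caching effect is observable

    public synchronized BitSet bits(Segment segment) {
        BitSet cached = cache.get(segment.cacheKey());
        if (cached == null) {
            misses++;
            cached = segment.computeBits();
            cache.put(segment.cacheKey(), cached);
        }
        return cached;
    }
}
```

The point of keying on segments rather than on the top-level reader is that adding a small segment to a large index leaves every existing segment's cached bits valid.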
RE: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
I keep looking at LUCENE-831, which is a new version of FieldCache that is compatible with IndexReader.reopen() and invalidates only reloaded segments. With each release of Lucene I am very unhappy, because it is still not in. The same problem as yours occurs if you have a one-million-document index that is updated by adding a few documents every half hour: if you sort by a field, then whenever the index is reopened, even though really only a very small segment was added, the complete FieldCache is rebuilt. Very bad :(. So I think the ultimate fix would be to hopefully apply LUCENE-831 soon and also use LUCENE-1461 as the RangeFilter cache.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
Lucene-831 is far more comprehensive. I also think that by exposing access to the sub-readers it can be far simpler (closer to what I have provided). In the meantime, you should be able to use the provided class with a few modifications. Reloading the entire cache was a deal breaker for us, so I came up with the attached. It works very well.
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
The biggest benefit I see of using the field cache to do filter caching is that the same cache can be used for sorting, thereby improving performance and memory usage. The downside I see is that if you have a common filter that is built from many fields, you are going to use a lot more memory, as every field used needs to be cached. With my code you would only have a single bitset for the filter.
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
On Thursday 04 December 2008 23:03:40, robert engels wrote:

> The biggest benefit I see of using the field cache to do filter caching
> is that the same cache can be used for sorting, thereby improving the
> performance and memory usage.

Would it be possible to build such filter caching into CachingWrapperFilter instead of into QueryFilter? Both the filter caching and the field value caching will need access to the underlying (segment?) readers.

> The downside I see is that if you have a common filter that is built
> from many fields, you are going to use a lot more memory, as every
> field used needs to be cached. With my code you would only have a
> single bitset for the filter.

But with many ranges that would mean many bitsets, and MemoryCachedRangeFilter only needs to cache the field values once for any number of ranges. It's a tradeoff.

Regards,
Paul Elschot
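The tradeoff can be made concrete with rough numbers, assuming one bit per document for a cached filter bitset and the (sizeof(int) + sizeof(long)) = 12 bytes per document figure from the issue description:

```java
// Back-of-the-envelope comparison of the two caching strategies.
public class RangeCacheMemory {
    // one cached filter BitSet costs about numDocs / 8 bytes (one bit per doc)
    static long bitsetBytes(long numDocs) { return numDocs / 8; }

    // the field value cache costs 12 bytes per document, paid once,
    // and then serves any number of ranges (and sorting)
    static long fieldCacheBytes(long numDocs) { return 12 * numDocs; }

    public static void main(String[] args) {
        long numDocs = 3_000_000L;   // the 3M-document corpus from the issue
        System.out.println(fieldCacheBytes(numDocs)); // 36000000 bytes, roughly 34 MB
        System.out.println(bitsetBytes(numDocs));     // 375000 bytes per cached range
        // break-even: caching ~96 distinct range bitsets costs about as much
        // as the one field cache
        System.out.println(fieldCacheBytes(numDocs) / bitsetBytes(numDocs)); // 96
    }
}
```

So for a handful of fixed filters the bitset approach is far cheaper, while for many distinct ad-hoc ranges the field cache wins, which is exactly the tradeoff being discussed.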
Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
It would be cool to be able to explicitly list the subreaders that were added/removed as a result of reopen(), or to have some kind of notification mechanism. We have filter caches and custom field/sort caches here, and they are all reader-bound. Currently the warm-up delay is negated by reopening and warming up in the background before switching to the new reader/caches, but it still limits our minimum between-reopens delay.

--
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785
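The warm-in-background-then-swap pattern described above can be sketched like this. SearcherState and its methods are hypothetical, not a Lucene API: queries keep using the old reader while the new one warms, and the switch is a single atomic reference update.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of warming a freshly reopened reader's caches before making it live.
public class WarmThenSwap {
    public interface SearcherState {
        void warmCaches();   // fill filter/sort caches before going live
    }

    private final AtomicReference<SearcherState> live = new AtomicReference<>();

    // Queries always read the current live state.
    public SearcherState current() { return live.get(); }

    // Warm the new state first (possibly on a background thread), then swap.
    // No query ever observes `fresh` before its caches are populated.
    public void reopen(SearcherState fresh) {
        fresh.warmCaches();
        live.set(fresh);
    }
}
```

As noted above, this hides the warm-up cost from queries but still bounds how frequently reopens can happen, since each reopen must finish warming before the next swap.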
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653635#action_12653635 ]

Matt Ericson commented on LUCENE-855:
-------------------------------------

Looks similar to what I wrote, but it uses more data structures. I liked what I built because it has direct access to the FieldCache and needs no other data structures; once you load the data into the FieldCache, you can run any other search on that field without rebuilding anything and just reuse the data. I think all 3 are improvements on what's there, but I am prejudiced: I really like the one I wrote, and I think it will stack up faster than 1461 if you run load tests on it. Just my $0.02.

Matt
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12651837#action_12651837 ] Paul Elschot commented on LUCENE-855:

On the face of it, this has some overlap with the recent FieldCacheRangeFilter of LUCENE-1461. Any comments?
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12560912#action_12560912 ] vivek commented on LUCENE-855:

Any plans to have this as part of Lucene 2.3?
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491677 ] Matt Ericson commented on LUCENE-855:

Can someone take a look at the code I attached and let me know if there is anything we need to change? Or did it get added to Lucene? I don't really know how long this should take.

Matt
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488291 ] Yiqing Jin commented on LUCENE-855:

Hi Matt,

When I tried the FieldCacheRangeFilter I ran into a problem. I added a test block at the end of TestFieldCacheRangeFilter:

    FieldCacheRangeFilter f1 = new FieldCacheRangeFilter("id", (float)minIP, (float)maxIP, T, F);
    FieldCacheRangeFilter f2 = new FieldCacheRangeFilter("id", (float)minIP, (float)maxIP, F, T);
    ChainedFilter f = new ChainedFilter(new Filter[]{f1, f2}, ChainedFilter.AND);
    result = search.search(q, f);
    assertEquals("all but ends", numDocs - 2, result.length());

This does not pass; in fact result.length() is 0, and nothing can be found. I checked my code and traced the run but still can't get the expected result. It seems the filter won't work with ChainedFilter: after doChain the BitSet appears to be empty (for either the 'and' or the 'or' operation).

Code:

    case AND:
        BitSet bit = filter.bits(reader);
        result.and(bit);

The bit is already empty before it's added to the result.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488297 ] Yiqing Jin commented on LUCENE-855:

After I changed the code in ChainedFilter#doChain to

    case AND:
        BitSet bit = (BitSet) filter.bits(reader).clone();
        result.and(bit);
        break;

the result is fine, but I know that's a bad way, since the FieldCacheBitSet is not a real BitSet and uses a fake get() method that just reads values from the FieldCache. I think the current implementation is still not a good fit for ChainedFilter, because FieldCacheBitSet does not have a proper implementation of the logical operations such as 'and'. Maybe we could make FieldCacheBitSet public and implement all the methods in its own way, instead of having a convertToBitSet() that makes things messy.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488412 ] Matt Ericson commented on LUCENE-855:

I have done a little research, and I do not think I can get my bit set to act the same as a normal BitSet, so this will not work with ChainedFilter, which calls BitSet.and() or BitSet.or(). I looked at those functions: they access private variables inside BitSet and perform the 'and', 'or', 'xor' on the bits in memory. Since my BitSet is just a proxy for the field cache, ChainedFilter will not work unless we also change ChainedFilter.

Matt
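Matt's point can be demonstrated with plain java.util.BitSet code. The ProxyBitSet class below is a made-up stand-in for the patch's FieldCacheBitSet, not actual patch code: BitSet.and() operates on BitSet's private word array, which a get()-overriding proxy never fills in, so AND'ing against the proxy clears everything.

```java
import java.util.BitSet;

public class ProxyAndDemo {
    // A "bit set" that answers get() from an external source but never
    // touches BitSet's internal storage, like the FieldCacheBitSet described.
    static class ProxyBitSet extends BitSet {
        @Override public boolean get(int i) { return i % 2 == 0; } // pretend cache lookup
    }

    public static void main(String[] args) {
        BitSet result = new BitSet();
        result.set(0, 8); // docs 0..7 all match so far

        BitSet broken = (BitSet) result.clone();
        broken.and(new ProxyBitSet()); // proxy's internal words are empty, so this wipes everything
        System.out.println(broken.cardinality()); // 0

        // Workaround in the spirit of the comments above: materialize the
        // proxy into a real BitSet first, then AND.
        ProxyBitSet proxy = new ProxyBitSet();
        BitSet real = new BitSet();
        for (int i = 0; i < 8; i++) if (proxy.get(i)) real.set(i);
        BitSet fixed = (BitSet) result.clone();
        fixed.and(real);
        System.out.println(fixed.cardinality()); // 4 (docs 0, 2, 4, 6)
    }
}
```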
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488547 ] Yiqing Jin commented on LUCENE-855:

That's true, you can't do the 'and' or 'or' as usual, but I'm thinking the FieldCacheBitSet could hold some private variables storing the range and field information, and we could do the 'and', 'or', 'xor' in a tricky way by setting those variables and implementing #get() to use them as the judgment. Changing ChainedFilter is a good way too; maybe we could have a special FieldCacheChainedFilter ^_^. I'm having a busy day, but I'll try to run some experiments on it if I have time.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488125 ] Hoss Man commented on LUCENE-855:

Another thing that occurred to me this morning is that the comparison test doesn't consider the performance of the various Filters when cached and reused (with something like CacheWrappingFilter) ... you may actually see the stock RangeFilter be faster than either implementation when you can reuse the same exact Filter over and over on the same IndexReader, a fairly common use case. In general, the numbers that really need to be compared are:

1) the time overhead of an implementation when opening a new IndexReader (and whether that overhead is per field)
2) the time overhead of an implementation the first time a specific Filter is used on an IndexReader
3) the time on average that it takes to use a Filter
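As an illustration of separating those three costs (this is not code from the issue; Filter here is a local stand-in interface, not Lucene's, and the lambda filter is a dummy):

```java
import java.util.BitSet;

public class FilterCostSketch {
    // Local stand-in for Lucene's Filter; the reader is likewise a placeholder.
    interface Filter { BitSet bits(Object reader); }

    // Time a block of work in milliseconds.
    static long millis(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        Object reader = new Object();
        Filter filter = r -> { BitSet b = new BitSet(); b.set(0, 1000); return b; };

        // Costs (1)+(2): the first bits() call on a fresh reader pays any
        // warmup cost (for MemoryCachedRangeFilter, building the sorted cache).
        long firstUse = millis(() -> filter.bits(reader));

        // Cost (3): steady-state cost once any per-reader cache is hot.
        long reuse = millis(() -> { for (int i = 0; i < 1000; i++) filter.bits(reader); });

        System.out.println("first use: " + firstUse + " ms, 1000 reuses: " + reuse + " ms");
    }
}
```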
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897 ] Andy Liu commented on LUCENE-855: - Hey Matt, I get this exception when running your newest FCRF with the performance test. Can you check to see if you get this also? java.lang.ArrayIndexOutOfBoundsException: 10 at org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231) at org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136) at org.apache.lucene.search.Scorer.score(Scorer.java:49) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113) at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74) at org.apache.lucene.search.Hits.init(Hits.java:53) at org.apache.lucene.search.Searcher.search(Searcher.java:46) at org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312) at org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at junit.framework.TestCase.runTest(TestCase.java:154) at junit.framework.TestCase.runBare(TestCase.java:127) at junit.framework.TestResult$1.protect(TestResult.java:106) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.framework.TestResult.run(TestResult.java:109) at junit.framework.TestCase.run(TestCase.java:118) at junit.framework.TestSuite.runTest(TestSuite.java:208) at junit.framework.TestSuite.run(TestSuite.java:203) at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128) at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196) MemoryCachedRangeFilter to boost performance of Range queries - Key: LUCENE-855 URL: https://issues.apache.org/jira/browse/LUCENE-855 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.1 Reporter: Andy Liu Assigned To: Otis Gospodnetic Attachments: FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets. MemoryCachedRangeFilter reads all docId, value pairs of a given field, sorts by value, and stores in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated by all the docId values that fall in between the start and end indices. TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5 year range. Executing bits() 1000 times on standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. Performance increase is less dramatic when you have less unique terms in a field or using less number of documents. 
Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there's no longer a need to make the values lexicographically comparable, i.e. padding numeric values with zeros. The downside of using MemoryCachedRangeFilter is that it has a fairly significant memory requirement, so it's designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while to run on large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly, or it is automatically run the first time MemoryCachedRangeFilter is applied to a given field. So in summary, MemoryCachedRangeFilter can be useful when: - Performance is critical - Memory is not an issue - The field contains many unique numeric values - The index contains a large number of documents
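The sorted-cache-plus-binary-search approach described above can be sketched roughly as follows. This is an illustration only, not the patch's code; the class and method names here are hypothetical stand-ins for SortedFieldCache and bits():

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: docIds[] and values[] are parallel arrays sorted by
// value, so all documents inside [lower, upper] occupy one contiguous slice.
public class SortedFieldCacheSketch {
    private final int[] docIds;   // docIds[i] pairs with values[i]
    private final long[] values;  // sorted ascending

    public SortedFieldCacheSketch(int[] docIds, long[] values) {
        this.docIds = docIds;
        this.values = values;
    }

    // Equivalent of bits(): binary-search the bounds, then turn on the bit
    // for every docId whose value falls in the inclusive range.
    public BitSet bits(long lower, long upper) {
        int start = lowerBound(lower);  // first index with value >= lower
        int end = upperBound(upper);    // first index with value > upper
        BitSet bits = new BitSet();
        for (int i = start; i < end; i++) {
            bits.set(docIds[i]);
        }
        return bits;
    }

    private int lowerBound(long key) {
        int i = Arrays.binarySearch(values, key);
        if (i < 0) return -i - 1;  // not found: insertion point
        while (i > 0 && values[i - 1] == key) i--;  // rewind over duplicates
        return i;
    }

    private int upperBound(long key) {
        int i = Arrays.binarySearch(values, key);
        if (i < 0) return -i - 1;
        while (i < values.length - 1 && values[i + 1] == key) i++;  // forward over duplicates
        return i + 1;
    }
}
```

The linear rewind/forward over duplicates is exactly the pathological case Hoss raises later in this thread; a second binary search over the duplicate run would bound it.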
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962 ] Hoss Man commented on LUCENE-855: -

On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote:
: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lot of use of (Range?)Filters, I believe.

This is one of those Jira issues that I didn't really have time to follow when it was first opened, and so the Jira emails have just been piling up waiting for me to read. Here are the raw notes I took as I read through the patches...

FieldCacheRangeFilter.patch from 10/Apr/07 01:52 PM
* javadoc cut/paste errors (FieldCache)
* FieldCacheRangeFilter should work with simple strings (using FieldCache.getStrings or FieldCache.getStringIndex) just like regular RangeFilter
* it feels like the various parser versions should be in separate subclasses (common abstract base class?)
* why does clone need to construct a raw BitSet? what exactly didn't work about ChainedFilter without this? (could cause other BitSet usage problems)
* or/and/andNot/xor can all be implemented using convertToBitSet
* need FieldCacheBitSet methods: cardinality, get(int,int)
* need equals and hashCode methods in all new classes
* FieldCacheBitSet.clear should be UnsuppOp
* convertToBitSet can be cached
* FieldCacheBitSet should be abstract, requiring get(int) be implemented

MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM
* tuples should be initialized to fieldCache.length ... serious ArrayList resizing going on there (why is it an ArrayList, why not just Tuples[]?)
* doesn't cache need synchronization? ... seems like the same CreationPlaceholder pattern used in FieldCache might make sense here
* this looks wrong...
    } else if ((!includeLower) && (lowerIndex >= 0)) {
  ...consider the case where lower==5, includeLower==false, and all values in the index are 5: binary search could leave us in the middle of the index, so we still need to move forward to the end?
* ditto the above concern for finding upperIndex
* what is the pathological worst case for rewind/forward when there are *lots* of duplicate values in the index? should another binarySearch be used?
* a lot of code in MemoryCachedRangeFilter.bits for finding lowerIndex/upperIndex would probably make more sense as methods in SortedFieldCache
* only seems to handle longs; at a minimum it should deal with arbitrary strings, with optional add-ons for longs/ints/etc.
* I can't help but wonder how MemoryCachedRangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)

TestRangeFilterPerformanceComparison.java from 10/Apr/07
* I can't help but wonder how RangeFilter would compare if it used Solr's OpenBitSet (facaded to implement the BitSet API)
* no test of includeLower==false or includeUpper==false
* I don't think the ranges being compared are the same for RangeFilter as they are for the other Filters ... note the use of DateTools when building the index, vs straight string usage in RangeFilter, vs Long.parseLong in MemoryCachedRangeFilter and FieldCacheRangeFilter
* is it really a fair comparison to call MemoryCachedRangeFilter.warmup or FieldCacheRangeFilter.bits outside of the timing code? for indexes where the IndexReader is reopened periodically this may be a significant number to be aware of.

Questions about the legitimacy of the testing aside... In general, I like the approach of FieldCacheBitSet -- but it should be generalized into an AbstractReadOnlyBitSet where all methods are implemented via get(int) in subclasses -- we should make sure that every method in the BitSet API works as advertised in Java 1.4. I don't really like the various hoops FieldCacheRangeFilter has to jump through to support int/float/long ... I think at its core it should support simple Strings, with alternate/sub classes for dealing with other FieldCache formats ... I just really dislike all the crazy nested ifs to deal with the different Parser types; if there are going to be separate constructors for longs/floats/ints, they might as well be separate subclasses. The really nice thing this has over RangeFilter is that people can index raw numeric values without needing to massage them into lexicographically ordered Strings (since the FieldCache will take care of parsing them appropriately). My gut tells me that the MemoryCachedRangeFilter approach will never ever be able to compete with the FieldCacheRangeFilter facading-BitSet approach, since it needs to build the FieldCache, then the SortedFieldCache, then a BitSet ... it seems like any optimization of that pipeline can always be beaten by using the same logic but then facading the BitSet.
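Hoss's AbstractReadOnlyBitSet suggestion -- deriving every BitSet operation from a single abstract get(int), with mutators throwing UnsupportedOperationException -- might look roughly like this. The class name comes from his comment; everything else is an assumed sketch, not code from either patch:

```java
import java.util.BitSet;

// Sketch of a read-only BitSet facade: subclasses supply only get(int)
// (e.g. by consulting a FieldCache entry), and derived methods like
// cardinality() and nextSetBit() are implemented on top of it.
public abstract class AbstractReadOnlyBitSet extends BitSet {
    protected final int size;  // number of addressable bits (e.g. maxDoc)

    protected AbstractReadOnlyBitSet(int size) {
        this.size = size;
    }

    // The one method subclasses must implement.
    public abstract boolean get(int index);

    public int cardinality() {
        int count = 0;
        for (int i = 0; i < size; i++) {
            if (get(i)) count++;
        }
        return count;
    }

    // Needed by ConstantScoreQuery-style iteration; returns -1 when
    // no set bit exists at or after fromIndex, per the BitSet contract.
    public int nextSetBit(int fromIndex) {
        for (int i = fromIndex; i < size; i++) {
            if (get(i)) return i;
        }
        return -1;
    }

    // Simplification: report the facade's fixed size rather than
    // (highest set bit + 1) as java.util.BitSet.length() does.
    public int length() { return size; }

    // Mutators are unsupported, per the "clear should be UnsuppOp" note.
    public void set(int index) { throw new UnsupportedOperationException("read-only"); }
    public void clear(int index) { throw new UnsupportedOperationException("read-only"); }
}
```

A concrete subclass for a numeric range would implement get(i) as a comparison against the cached field value for document i.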
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487587 ] Otis Gospodnetic commented on LUCENE-855: - OK. I'll wait for the new performance numbers before committing. Andy, if you see anything funky in Matt's patch, or if you managed to make your version faster, let us know, please.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487590 ] Otis Gospodnetic commented on LUCENE-855: -

Comments about the patch so far:

Cosmetics:
- You don't want to refer to Andy's class in javadocs, as that class won't go in unless Andy makes it faster.
- I see some incorrect (copy/paste error) javadocs and javadocs/comments with typos in both the test classes and non-test classes.
- Please configure your Lucene project in Eclipse to use 2 spaces instead of 4. In general, once you get the code formatting settings right, it's a good practice to format your code with that setting before submitting a patch.

Testing:
- You can put the testPerformance() code from TestFieldCacheRangeFilterPerformance in the other unit test class you have there.
- Your testPerformance() doesn't actually assert...() anything, just prints out numbers to stdout. You can keep the printing, but it would be better to also do some asserts, so we can always verify that the FCRangeFilter beats the vanilla RangeFilter without looking at the stdout.
- You may want to close that searcher in testPerformance() before opening a new one. Probably won't make any difference, but still.
- You may also want to just close the searcher at the end of the method.

Impl:
- In the inner FieldCacheBitSet class, I see:

+public boolean intersects(BitSet set) {
+  for (int i = 0; i < length; i++) {
+    if (get(i) && set.get(i)) {
+      return true;
+    }
+  }
+  return false;
+}

Is there room for a small optimization? What if the BitSets are not of equal size? Wouldn't it make sense to loop through the smaller BitSet then? Sorry if I'm off, I hardly ever work with BitSets.
- I see you made the *_PARSERs in FCImpl public (they were private). Is that really needed? Would package-protected be enough?
- Make sure the ASL is in all test and non-test classes; I don't see it there now.

Overall, I like it - slick and elegant usage of FC!
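Otis's "smaller BitSet" optimization could be read as stopping the scan at the shorter of the two sets, since a position past the end of either set cannot be set in both. A standalone illustration (not the patch's code):

```java
import java.util.BitSet;

// Illustrative intersects(): positions beyond min(a.length(), b.length())
// cannot be set in both sets, so the scan stops at the smaller length.
// (java.util.BitSet.length() is the highest set bit + 1.)
public class IntersectsSketch {
    public static boolean intersects(BitSet a, BitSet b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            if (a.get(i) && b.get(i)) return true;
        }
        return false;
    }
}
```

Note that java.util.BitSet has shipped its own intersects(BitSet) since Java 1.4; the patch presumably reimplements it because FieldCacheBitSet only supports get(int).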
I'd love to know what Hoss and other big Filter users think about this. Solr makes a lot of use of (Range?)Filters, I believe.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595 ] Andy Liu commented on LUCENE-855: -

In your updated benchmark, you're combining the range filter with a term query that matches one document. I don't believe that's the typical use case for a range filter. Usually the user employs a range to filter a large document set. I created a different benchmark to compare the standard range filter, MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the last benchmark). Here are the results:

Reader opened with 10 documents. Creating RangeFilters...
RangeFilter w/MatchAllDocsQuery: * Bits: 4421 * Search: 5285
RangeFilter w/ConstantScoreQuery: * Bits: 4200 * Search: 8694
RangeFilter w/TermQuery: * Bits: 4088 * Search: 4133
MemoryCachedRangeFilter w/MatchAllDocsQuery: * Bits: 80 * Search: 1142
MemoryCachedRangeFilter w/ConstantScoreQuery: * Bits: 79 * Search: 482
MemoryCachedRangeFilter w/TermQuery: * Bits: 73 * Search: 95
FieldCacheRangeFilter w/MatchAllDocsQuery: * Bits: 0 * Search: 1146
FieldCacheRangeFilter w/ConstantScoreQuery: * Bits: 1 * Search: 356
FieldCacheRangeFilter w/TermQuery: * Bits: 0 * Search: 19

Here are some points:
1. When searching with a filter, bits() is called, so the search time includes the bits() time.
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not by much. Using MatchAllDocsQuery, they run neck-and-neck. FCRF is much faster for TermQuery, since MCRF has to create the BitSet for the range before the search is executed.
3. I get fewer document hits when running FieldCacheRangeFilter with ConstantScoreQuery. Matt, there may be a bug in getNextSetBit(). Not sure if this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is used with ChainedFilter.
I suspect that MCRF would be faster in this case, since I'm assuming that FCRF has to reconstruct a standard BitSet during clone().
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487378 ] Andy Liu commented on LUCENE-855: -

Hey Matt, the way you implemented FieldCacheRangeFilter is very simple and clever! Here are a couple of comments:
1. The performance test that we both used is no longer valid, since FieldCacheRangeFilter.bits() only returns a wrapper around a BitSet, and the test only calls bits(). Since you're wrapping BitSet, there's some overhead incurred when applying it to an actual search. I reran the performance test applying the Filter to a search, and your implementation is still faster, although only slightly.
2. Your filter currently doesn't work with ConstantRangeQuery. CRQ calls bits.nextSetBit(), which fails in your wrapped BitSet implementation. Your incomplete implementation of BitSet may cause problems elsewhere.

If you can fix #2, I'd vote for your implementation since it's cleaner and faster, although I might take another stab at trying to improve my implementation.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487380 ] Matt Ericson commented on LUCENE-855: - I will be happy to try to fix #2. The test had the real work done outside the timing. The other thing I like about it is that there is less data saved in the cache. Some of our indexes are 10 gigs, so every byte counts, at least in my application.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487108 ] Matt Ericson commented on LUCENE-855: -

I am almost done with my patch, and I wanted to test it against this patch to see who has the faster version. But MemoryCachedRangeFilter is written using Java 1.5, and as far as I know Lucene is still on Java 1.4. Lines like

private static WeakHashMap<IndexReader, Map<String, SortedFieldCache>> cache = new WeakHashMap<IndexReader, Map<String, SortedFieldCache>>();

will not compile in Java 1.4. Andy, I would love to see who has the faster patch; if you would convert your patch to use Java 1.4, I would be happy to put them side by side.
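The generic declaration Matt quotes could be rewritten with raw types so it compiles under Java 1.4. A hypothetical sketch of such a holder (the class name FieldCacheHolder is invented here, and Object stands in for the patch's SortedFieldCache to keep the example self-contained):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Java 1.4-compatible version of the per-reader cache: raw types instead
// of generics. WeakHashMap keyed by the reader lets entries disappear
// when an IndexReader is garbage collected.
public class FieldCacheHolder {
    // In spirit: Map<IndexReader, Map<String fieldName, SortedFieldCache>>,
    // expressed as raw types for 1.4. Casts replace generic type checks.
    private static final Map cache = new WeakHashMap();

    public static synchronized Object get(Object reader, String field) {
        Map byField = (Map) cache.get(reader);
        return byField == null ? null : byField.get(field);
    }

    public static synchronized void put(Object reader, String field, Object sortedFieldCache) {
        Map byField = (Map) cache.get(reader);
        if (byField == null) {
            byField = new HashMap();
            cache.put(reader, byField);
        }
        byField.put(field, sortedFieldCache);
    }
}
```

The synchronized accessors also address Hoss's later question about whether the cache needs synchronization.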
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486758 ] Otis Gospodnetic commented on LUCENE-855: - A colleague of mine is working on something similar, but possibly more efficient (less sorting and binary searching). He'll probably attach his patch to this issue.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486763 ] Yonik Seeley commented on LUCENE-855: - There is also something from Mark Harwood: https://issues.apache.org/jira/browse/LUCENE-798
So it's designed to be used in situations where range filter performance is critical and memory consumption is not an issue. The memory requirements are: (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step which can take a while to run in large datasets (it took 40s to run on a 3M document corpus). Warmup can be called explicitly or is automatically called the first time MemoryCachedRangeFilter is applied using a given field. So in summery, MemoryCachedRangeFilter can be useful when: - Performance is critical - Memory is not an issue - Field contains many unique numeric values - Index contains large amount of documents -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
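[Editor's sketch] The sorted-cache-plus-binary-search idea in the issue description can be illustrated with a minimal standalone example. This is not the actual patch: the class and method names below (SortedRangeCache, lowerBound) are hypothetical, and a real implementation would avoid the index boxing used here for brevity. The warmup corresponds to the one-time sort in the constructor; each bits() call then costs two binary searches plus one pass over the matching slice.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of the sorted-cache idea: (value, docId) pairs are sorted by
// value once (the "warmup"); each range query is then two binary
// searches plus one pass over the matching slice of docIds.
public class SortedRangeCache {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the doc holding values[i]

    public SortedRangeCache(long[] fieldValues) {
        int n = fieldValues.length;
        // Sort indices by field value. (Boxed Integer[] keeps the sketch
        // short; the real cache would use primitive arrays throughout.)
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Long.compare(fieldValues[a], fieldValues[b]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            docIds[i] = order[i];
            values[i] = fieldValues[order[i]];
        }
    }

    // First index whose value is >= target (classic lower bound).
    private int lowerBound(long target) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // BitSet of docIds whose value lies in [lower, upper], inclusive.
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet();
        int start = lowerBound(lower);
        int end = lowerBound(upper + 1); // first index past the range
        for (int i = start; i < end; i++) result.set(docIds[i]);
        return result;
    }
}
```

At (sizeof(int) + sizeof(long)) = 12 bytes per document, the 3M-document corpus mentioned above would need roughly 36 MB for one field's cache, which matches the "memory is not an issue" caveat.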
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767 ]

Andy Liu commented on LUCENE-855:
-
Otis, looking forward to your colleague's patch. LUCENE-798 caches RangeFilters so that if the exact same range is executed again, the cached RangeFilter is used. However, the first time a range is encountered you still have to calculate the RangeFilter, which can be slow. I haven't looked at the patch, but I'm sure LUCENE-798 can be used in conjunction with MemoryCachedRangeFilter to further boost performance for repeated range queries.
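[Editor's sketch] The caching behavior Andy describes, where the first request for a range pays the full filter cost and identical subsequent ranges are served from a cache, can be illustrated in a few lines. This is a hypothetical illustration, not LUCENE-798's implementation (which, as noted below, can also reuse parts of ranges); the names RangeFilterCache and FilterComputation are invented here.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Exact-range filter caching: the expensive computation runs only the
// first time a given (field, lower, upper) key is seen.
public class RangeFilterCache {
    public interface FilterComputation { BitSet compute(); }

    private final Map<String, BitSet> cache = new HashMap<>();

    public BitSet bits(String field, long lower, long upper, FilterComputation slow) {
        String key = field + ":" + lower + ":" + upper;
        // computeIfAbsent invokes the slow computation only on a cache miss.
        return cache.computeIfAbsent(key, k -> slow.compute());
    }
}
```

Under this scheme a cache entry helps only when the identical range recurs, which is why combining it with something like MemoryCachedRangeFilter (fast even on a cold range) is attractive.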
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486788 ]

Yonik Seeley commented on LUCENE-855:
-
> LUCENE-798 caches RangeFilters so that if the same exact range is executed again [...]
It's not just the exact same range though... it can reuse parts of ranges AFAIK.
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791 ]

Andy Liu commented on LUCENE-855:
-
Ah, you're right. I didn't read closely enough!