[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488291 ]

Yiqing Jin commented on LUCENE-855:
-----------------------------------

Hi Matt,

When I tried the FieldCacheRangeFilter I ran into a problem. I added a test block at the end of TestFieldCacheRangeFilter:

    FieldCacheRangeFilter f1 = new FieldCacheRangeFilter("id", (float) minIP, (float) maxIP, T, F);
    FieldCacheRangeFilter f2 = new FieldCacheRangeFilter("id", (float) minIP, (float) maxIP, F, T);
    ChainedFilter f = new ChainedFilter(new Filter[] {f1, f2}, ChainedFilter.AND);
    result = search.search(q, f);
    assertEquals("all but ends", numDocs - 2, result.length());

This does not pass; result.length() is in fact 0 and nothing is found. I checked my code and traced the execution but still could not get the expected result. The filter appears not to work with ChainedFilter: after doChain the BitSet seems to be empty (for either the 'and' or the 'or' operation).

    case AND:
        BitSet bit = filter.bits(reader);
        result.and(bit);

The bit set is already empty before it is combined into the result.


MemoryCachedRangeFilter to boost performance of Range queries
-------------------------------------------------------------

                 Key: LUCENE-855
                 URL: https://issues.apache.org/jira/browse/LUCENE-855
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Search
    Affects Versions: 2.1
            Reporter: Andy Liu
         Assigned To: Otis Gospodnetic
         Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, TestRangeFilterPerformanceComparison.java

Currently RangeFilter uses TermEnum and TermDocs to find documents that fall within the specified range. This requires iterating through every single term in the index and can get rather slow for large document sets.
MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, sorts by value, and stores them in a SortedFieldCache. During bits(), binary searches are used to find the start and end indices of the lower and upper bound values. The BitSet is populated with all the docIds that fall between the start and end indices.

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index with random date values within a 5-year range. Executing bits() 1000 times on a standard RangeQuery using random date intervals took 63904ms. Using MemoryCachedRangeFilter, it took 876ms. The performance increase is less dramatic when a field has fewer unique terms or the index has fewer documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are stored in a long[] array), but it can easily be changed to support Strings. A side benefit of storing the values as longs is that there is no longer any need to make the values lexicographically comparable, i.e. to pad numeric values with zeros.

The downside of MemoryCachedRangeFilter is a fairly significant memory requirement, so it is designed for situations where range filter performance is critical and memory consumption is not an issue. The memory requirement is (sizeof(int) + sizeof(long)) * numDocs. MemoryCachedRangeFilter also requires a warmup step, which can take a while on large datasets (it took 40s on a 3M document corpus). Warmup can be called explicitly, or it runs automatically the first time MemoryCachedRangeFilter is applied to a given field.

In summary, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- The field contains many unique numeric values
- The index contains a large number of documents

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
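The sorted-cache-plus-binary-search scheme described in the issue can be sketched roughly as follows. All names here (SortedFieldCacheSketch, lowerBound) and the parallel-array layout are illustrative assumptions, not the actual patch code:

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of the described approach: cache (docId, value) pairs sorted by
// value, then answer a range query with a binary search for the lower bound
// and a linear walk up to the upper bound.
public class SortedFieldCacheSketch {
    private final long[] values;  // field values, sorted ascending
    private final int[] docIds;   // docIds[i] is the doc whose value is values[i]

    public SortedFieldCacheSketch(long[] values, int[] docIds) {
        this.values = values;
        this.docIds = docIds;
    }

    // Populate a BitSet with every doc whose value falls in [lower, upper].
    public BitSet bits(long lower, long upper, int maxDoc) {
        BitSet result = new BitSet(maxDoc);
        for (int i = lowerBound(lower); i < values.length && values[i] <= upper; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    // First index i with values[i] >= key.
    private int lowerBound(long key) {
        int idx = Arrays.binarySearch(values, key);
        if (idx < 0) return -idx - 1;                    // insertion point
        while (idx > 0 && values[idx - 1] == key) idx--; // step back to first duplicate
        return idx;
    }

    public static void main(String[] args) {
        long[] vals = {5, 10, 10, 20, 30};
        int[] docs  = {3, 0, 4, 1, 2};
        SortedFieldCacheSketch cache = new SortedFieldCacheSketch(vals, docs);
        System.out.println(cache.bits(10, 20, 5)); // {0, 1, 4}
    }
}
```

At (sizeof(int) + sizeof(long)) = 12 bytes per document, the stated memory cost works out to roughly 36 MB for the 3M-document corpus mentioned above.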
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488297 ]

Yiqing Jin commented on LUCENE-855:
-----------------------------------

After I changed the code in ChainedFilter#doChain to

    case AND:
        BitSet bit = (BitSet) filter.bits(reader).clone();
        result.and(bit);
        break;

the result is fine, but I know that's a bad way to do it, since FieldCacheBitSet is not a real BitSet and uses a fake get() method that just reads values from the FieldCache. I think the current implementation is still not a good fit for ChainedFilter, because FieldCacheBitSet has no proper implementation of logical operations such as 'and'. Maybe we could make FieldCacheBitSet public and implement all the methods in its own way, instead of having a convertToBitSet() that makes things messy.
[jira] Created: (LUCENE-861) Contrib queries package Query implementations do not override equals()
Contrib queries package Query implementations do not override equals()
-----------------------------------------------------------------------

                 Key: LUCENE-861
                 URL: https://issues.apache.org/jira/browse/LUCENE-861
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
    Affects Versions: 2.1
         Environment: All
            Reporter: Antony Bowesman
            Priority: Minor

Query implementations should override equals() so that Query instances can be cached and so that Filters can know whether a Query has been used before. See the discussion in this thread: http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

The test cases below show the problem.

    package com.teamware.office.lucene.search;

    import static org.junit.Assert.*;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BoostingQuery;
    import org.apache.lucene.search.FuzzyLikeThisQuery;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.similar.MoreLikeThisQuery;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class ContribQueriesEqualsTest {

        @Before
        public void setUp() throws Exception {
        }

        @After
        public void tearDown() throws Exception {
        }

        /**
         * Show that the BoostingQuery in the queries contrib package
         * does not implement equals() correctly.
         */
        @Test
        public void testBoostingQueryEquals() {
            TermQuery q1 = new TermQuery(new Term("subject:", "java"));
            TermQuery q2 = new TermQuery(new Term("subject:", "java"));
            assertEquals("Two TermQueries with same attributes should be equal", q1, q2);

            BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
            BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
            assertEquals("BoostingQuery with same attributes is not equal", bq1, bq2);
        }

        /**
         * Show that the MoreLikeThisQuery in the queries contrib package
         * does not implement equals() correctly.
         */
        @Test
        public void testMoreLikeThisQueryEquals() {
            String[] moreLikeFields = new String[] {"subject", "body"};
            MoreLikeThisQuery mltq1 = new MoreLikeThisQuery("java", moreLikeFields, new StandardAnalyzer());
            MoreLikeThisQuery mltq2 = new MoreLikeThisQuery("java", moreLikeFields, new StandardAnalyzer());
            assertEquals("MoreLikeThisQuery with same attributes is not equal", mltq1, mltq2);
        }

        /**
         * Show that the FuzzyLikeThisQuery in the queries contrib package
         * does not implement equals() correctly.
         */
        @Test
        public void testFuzzyLikeThisQueryEquals() {
            FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, new StandardAnalyzer());
            fltq1.addTerms("javi", "subject", 0.5f, 2);
            FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, new StandardAnalyzer());
            fltq2.addTerms("javi", "subject", 0.5f, 2);
            assertEquals("FuzzyLikeThisQuery with same attributes is not equal", fltq1, fltq2);
        }
    }
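For reference, the equals()/hashCode() contract these tests expect could be satisfied with the usual value-object pattern. The class below is a standalone stand-in whose fields only model BoostingQuery's match/context/boost attributes; it is not the actual contrib code:

```java
// Value-style equals()/hashCode() pattern a Query subclass could adopt so
// that instances built from the same attributes compare equal.
public class BoostingQuerySketch {
    private final String match;    // stands in for the wrapped match Query
    private final String context;  // stands in for the context Query
    private final float boost;

    public BoostingQuerySketch(String match, String context, float boost) {
        this.match = match;
        this.context = context;
        this.boost = boost;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BoostingQuerySketch)) return false;
        BoostingQuerySketch other = (BoostingQuerySketch) o;
        // Compare floats via their bit patterns so NaN and -0.0f behave sanely.
        return match.equals(other.match)
            && context.equals(other.context)
            && Float.floatToIntBits(boost) == Float.floatToIntBits(other.boost);
    }

    @Override
    public int hashCode() {
        // Combine the same fields equals() compares, so equal objects hash alike.
        int h = match.hashCode();
        h = 31 * h + context.hashCode();
        h = 31 * h + Float.floatToIntBits(boost);
        return h;
    }

    public static void main(String[] args) {
        BoostingQuerySketch a = new BoostingQuerySketch("java", "ruby", 0.1f);
        BoostingQuerySketch b = new BoostingQuerySketch("java", "ruby", 0.1f);
        System.out.println(a.equals(b)); // true
    }
}
```

The real fix would apply the same pattern, over their actual fields, to BoostingQuery, MoreLikeThisQuery, and FuzzyLikeThisQuery.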
[jira] Created: (LUCENE-862) Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy
Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy
------------------------------------------------------------------------------------------------------

                 Key: LUCENE-862
                 URL: https://issues.apache.org/jira/browse/LUCENE-862
             Project: Lucene - Java
          Issue Type: Bug
          Components: Search
    Affects Versions: 2.1
         Environment: All
            Reporter: Antony Bowesman
            Priority: Minor

BoostingQuery sets the boost value on the passed context Query:

    public BoostingQuery(Query match, Query context, float boost) {
        this.match = match;
        this.context = (Query) context.clone();  // clone before boost
        this.boost = boost;
        context.setBoost(0.0f);                  // ignore context-only matches
    }

The last line should be:

    this.context.setBoost(0.0f);                 // ignore context-only matches

Also, a boost value of 0.0 may have the wrong effect - see the discussion at http://www.mail-archive.com/[EMAIL PROTECTED]/msg12243.html
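The aliasing mistake is easy to reproduce without Lucene. In this standalone sketch, Ctx is a hypothetical stand-in for Query: setting the boost on the constructor argument leaves the stored clone unchanged and mutates the caller's object instead.

```java
// Stand-in for Query: one mutable field plus a shallow clone().
class Ctx implements Cloneable {
    float boost = 1.0f;

    @Override
    public Ctx clone() {
        try {
            return (Ctx) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: Ctx is Cloneable
        }
    }
}

public class BoostingBugSketch {
    public static void main(String[] args) {
        // Buggy version: clone is stored, but the boost is set on the argument.
        Ctx caller = new Ctx();
        Ctx stored = caller.clone();       // clone before boost
        caller.boost = 0.0f;               // wrong target, like context.setBoost(0.0f)
        System.out.println(stored.boost);  // 1.0 - the stored clone was never boosted
        System.out.println(caller.boost);  // 0.0 - the caller's query was mutated

        // Fixed version: boost set on the stored clone.
        Ctx caller2 = new Ctx();
        Ctx stored2 = caller2.clone();
        stored2.boost = 0.0f;              // like this.context.setBoost(0.0f)
        System.out.println(stored2.boost); // 0.0
        System.out.println(caller2.boost); // 1.0 - caller untouched
    }
}
```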
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488412 ]

Matt Ericson commented on LUCENE-855:
-------------------------------------

I have done a little research and I do not think I can get my bit set to act the same as a normal BitSet, so this will not work with ChainedFilter, as ChainedFilter calls BitSet.and() or BitSet.or(). I looked at these methods and they access private variables inside the BitSet and perform the 'and', 'or', or 'xor' on the bits in memory. Since my BitSet is just a proxy for the field cache, ChainedFilter will not work unless we also change ChainedFilter.

Matt
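Matt's observation is easy to demonstrate: java.util.BitSet.and() works directly on the private word array, so a subclass that only overrides get() contributes nothing to the operation. A minimal reproduction follows (the class name is illustrative, not the patch's FieldCacheBitSet):

```java
import java.util.BitSet;

// A BitSet subclass that fakes get() without ever touching the internal
// word storage, mimicking a proxy over a field cache. It looks non-empty
// through get(), but and()/or() see only the empty words, so chained
// filtering wipes out every match - exactly the behavior reported above.
public class ProxyBitSetSketch extends BitSet {
    @Override
    public boolean get(int index) {
        return index % 2 == 0; // pretend every even doc matches
    }

    public static void main(String[] args) {
        ProxyBitSetSketch proxy = new ProxyBitSetSketch();
        System.out.println(proxy.get(4));  // true, via the fake get()

        BitSet result = new BitSet();
        result.set(4);
        result.and(proxy);                 // operates on proxy's empty words
        System.out.println(result.get(4)); // false: the match is lost
    }
}
```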
[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries
[ https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488547 ]

Yiqing Jin commented on LUCENE-855:
-----------------------------------

That's true, you can't do the 'and' or 'or' as usual. But I am thinking the FieldCacheBitSet could hold some private variables storing the range and field information, and we could do the 'and', 'or', and 'xor' in a tricky way by adjusting the values of those variables, implementing get() using the variables as the test. Changing ChainedFilter is a good option too; maybe we could have a special FieldCacheChainedFilter ^_^. I'm having a busy day, but I'll try to do some experiments on it if I have time.
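One way the suggestion above could play out, as a hypothetical sketch: keep the field values plus a [lower, upper] window, answer get() by comparison, and implement AND between two range filters over the same field by intersecting their windows. Names and API here are assumptions, not patch code:

```java
// Range-aware pseudo-BitSet: instead of materializing bits, store the
// per-doc field values (FieldCache-style) and the current [lower, upper]
// window. get() compares; and() with another range over the SAME field
// just intersects the two windows.
public class RangeBitsSketch {
    private final long[] valueByDoc; // value for each docId
    private long lower, upper;

    public RangeBitsSketch(long[] valueByDoc, long lower, long upper) {
        this.valueByDoc = valueByDoc;
        this.lower = lower;
        this.upper = upper;
    }

    public boolean get(int docId) {
        long v = valueByDoc[docId];
        return v >= lower && v <= upper;
    }

    // AND of two ranges over the same field = intersection of the windows.
    public void and(RangeBitsSketch other) {
        this.lower = Math.max(this.lower, other.lower);
        this.upper = Math.min(this.upper, other.upper);
    }

    public static void main(String[] args) {
        long[] values = {5, 15, 25, 35};
        RangeBitsSketch a = new RangeBitsSketch(values, 10, 30); // matches docs 1, 2
        RangeBitsSketch b = new RangeBitsSketch(values, 20, 40); // matches docs 2, 3
        a.and(b);                                                // window becomes [20, 30]
        System.out.println(a.get(2)); // true
        System.out.println(a.get(1)); // false
    }
}
```

This only composes filters over the same field; combining ranges on different fields, or OR of disjoint windows, would still need a different representation, which is why changing ChainedFilter remains on the table.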
Re: svn commit: r528298 - /lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java
Thanks Hoss! I thought I committed that, but svn log shows I did not. I must have left the fix local on my other computer. Too many computers, too few brain cells.

Otis

----- Original Message -----
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, April 12, 2007 8:59:29 PM
Subject: svn commit: r528298 - /lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java

Author: hossman
Date: Thu Apr 12 17:59:28 2007
New Revision: 528298

URL: http://svn.apache.org/viewvc?view=rev&rev=528298
Log: minor followup to LUCENE-857, fixing a small mistake in Otis's original commit to ensure that the deprecated QueryFilter still caches

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java?view=diff&rev=528298&r1=528297&r2=528298

    --- lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java (original)
    +++ lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java Thu Apr 12 17:59:28 2007
    @@ -25,13 +25,13 @@
      * @version $Id$
      * @deprecated use a CachingWrapperFilter with QueryWrapperFilter
      */
    -public class QueryFilter extends QueryWrapperFilter {
    +public class QueryFilter extends CachingWrapperFilter {

       /** Constructs a filter which only matches documents matching
        * <code>query</code>. */
       public QueryFilter(Query query) {
    -    super(query);
    +    super(new QueryWrapperFilter(query));
       }

       public boolean equals(Object o) {
Re: svn commit: r528298 - /lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java
: Thanks Hoss! I thought I committed that, but svn log shows I did not.
: I must have left the fix local on my other computer. Too many

I figured as much. I was going to email you, and then I realized it would take less typing to commit the change :)

-Hoss