[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488291
 ] 

Yiqing Jin commented on LUCENE-855:
---

Hi Matt,
When I tried the FieldCacheRangeFilter, I ran into a problem.

I added a test block at the end of TestFieldCacheRangeFilter:

FieldCacheRangeFilter f1 = new FieldCacheRangeFilter("id", 
    (float)minIP, (float)maxIP, T, F);
FieldCacheRangeFilter f2 = new FieldCacheRangeFilter("id", 
    (float)minIP, (float)maxIP, F, T);

ChainedFilter f = new ChainedFilter(new Filter[]{f1, f2}, ChainedFilter.AND);
result = search.search(q, f);
assertEquals("all but ends", numDocs - 2, result.length());

This does not pass; in fact, result.length() is 0 and nothing is found.

I checked my code and traced the execution, but I still could not get the 
expected result.  It seems the Filter does not work with ChainedFilter: after 
doChain, the BitSet appears to be empty (for either the 'and' or the 'or' 
operation).
The code in ChainedFilter#doChain is:

case AND:
    BitSet bit = filter.bits(reader);
    result.and(bit);

The bit set is already empty before it is combined into the result.


 MemoryCachedRangeFilter to boost performance of Range queries
 -

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
 Assigned To: Otis Gospodnetic
 Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
 TestRangeFilterPerformanceComparison.java, 
 TestRangeFilterPerformanceComparison.java


 Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
 within the specified range.  This requires iterating through every single 
 term in the index and can get rather slow for large document sets.
 MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
 sorts them by value, and stores them in a SortedFieldCache.  During bits(), 
 binary searches are used to find the start and end indices of the lower and 
 upper bound values.  The BitSet is populated with all the docId values that 
 fall between the start and end indices.
 TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
 index with random date values within a 5 year range.  Executing bits() 1000 
 times on a standard RangeQuery using random date intervals took 63904ms.  
 Using MemoryCachedRangeFilter, it took 876ms.  The performance increase is 
 less dramatic when a field has fewer unique terms or when the index has 
 fewer documents.
 Currently MemoryCachedRangeFilter only works with numeric values (values are 
 stored in a long[] array), but it can easily be changed to support Strings.  
 A side benefit of storing the values as longs is that there's no longer any 
 need to make the values lexicographically comparable, i.e. by padding 
 numeric values with zeros.
 The downside of using MemoryCachedRangeFilter is that it has a fairly 
 significant memory requirement, so it's designed for situations where range 
 filter performance is critical and memory consumption is not an issue.  The 
 memory requirement is (sizeof(int) + sizeof(long)) * numDocs.  
 MemoryCachedRangeFilter also requires a warmup step, which can take a while 
 on large datasets (it took 40s on a 3M document corpus).  Warmup can be 
 called explicitly, or it is invoked automatically the first time 
 MemoryCachedRangeFilter is applied to a given field.
 So in summary, MemoryCachedRangeFilter can be useful when:
 - Performance is critical
 - Memory is not an issue
 - The field contains many unique numeric values
 - The index contains a large number of documents
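The sorted-cache approach described above can be sketched in plain Java, independent of Lucene. This is a minimal illustration only: the class and method names below are invented for the sketch, while the real MemoryCachedRangeFilter caches per-field data and answers bits() against an IndexReader.

```java
import java.util.Arrays;
import java.util.BitSet;

// Sketch of the MemoryCachedRangeFilter idea: cache (value, docId) pairs
// sorted by value, then answer a range query with two binary searches.
public class SortedCacheSketch {
    private final long[] values; // field values, sorted ascending
    private final int[] docIds;  // docIds[i] is the doc holding values[i]

    public SortedCacheSketch(long[] fieldValueByDoc) {
        int n = fieldValueByDoc.length;
        long[][] pairs = new long[n][2];
        for (int doc = 0; doc < n; doc++) {
            pairs[doc][0] = fieldValueByDoc[doc];
            pairs[doc][1] = doc;
        }
        Arrays.sort(pairs, (a, b) -> Long.compare(a[0], b[0]));
        values = new long[n];
        docIds = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = pairs[i][0];
            docIds[i] = (int) pairs[i][1];
        }
    }

    // Inclusive [lower, upper] range, analogous to bits().
    // (Sketch assumes upper < Long.MAX_VALUE.)
    public BitSet bits(long lower, long upper) {
        BitSet result = new BitSet(docIds.length);
        int start = lowerBound(lower);    // first index with value >= lower
        int end = lowerBound(upper + 1);  // first index with value > upper
        for (int i = start; i < end; i++) {
            result.set(docIds[i]);
        }
        return result;
    }

    // Classic binary search for the first index whose value is >= key.
    private int lowerBound(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

Building the cache is O(n log n) once per field; each bits() call is then two O(log n) binary searches plus work proportional to the number of matching documents, which is why the benchmark above beats the term-iterating RangeFilter.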

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488297
 ] 

Yiqing Jin commented on LUCENE-855:
---

After I changed the code in ChainedFilter#doChain to

case AND:
    BitSet bit = (BitSet)filter.bits(reader).clone();
    result.and(bit);
    break;

the result is fine, but I know that's a bad way to do it.
Since the FieldCacheBitSet is not a real BitSet and uses a fake get() method 
that just reads values from the FieldCache, I think the current implementation 
is still not a fit for ChainedFilter, because FieldCacheBitSet does not have a 
proper implementation of logical operations such as 'and'.
Maybe we could make FieldCacheBitSet public and implement all the methods in 
its own way, instead of having a convertToBitSet() that makes things messy.




[jira] Created: (LUCENE-861) Contrib queries package Query implementations do not override equals()

2007-04-12 Thread Antony Bowesman (JIRA)
Contrib queries package Query implementations do not override equals()
--

 Key: LUCENE-861
 URL: https://issues.apache.org/jira/browse/LUCENE-861
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.1
 Environment: All
Reporter: Antony Bowesman
Priority: Minor


Query implementations should override equals() so that Query instances can be 
cached and so that Filters can know whether a Query has been used before.  See 
the discussion in this thread:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg13061.html

Test cases below show the problem.

package com.teamware.office.lucene.search;

import static org.junit.Assert.*;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BoostingQuery;
import org.apache.lucene.search.FuzzyLikeThisQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.similar.MoreLikeThisQuery;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class ContribQueriesEqualsTest
{
/**
 * @throws java.lang.Exception
 */
@Before
public void setUp() throws Exception
{
}

/**
 * @throws java.lang.Exception
 */
@After
public void tearDown() throws Exception
{
}

/**
 *  Show that the BoostingQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testBoostingQueryEquals()
{
TermQuery q1 = new TermQuery(new Term("subject:", "java"));
TermQuery q2 = new TermQuery(new Term("subject:", "java"));
assertEquals("Two TermQueries with same attributes should be equal", 
q1, q2);
BoostingQuery bq1 = new BoostingQuery(q1, q2, 0.1f);
BoostingQuery bq2 = new BoostingQuery(q1, q2, 0.1f);
assertEquals(BoostingQuery with same attributes is not equal, bq1, 
bq2);
}

/**
 *  Show that the MoreLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testMoreLikeThisQueryEquals()
{
String moreLikeFields[] = new String[] {"subject", "body"};

MoreLikeThisQuery mltq1 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
MoreLikeThisQuery mltq2 = new MoreLikeThisQuery("java", moreLikeFields, 
new StandardAnalyzer());
assertEquals("MoreLikeThisQuery with same attributes is not equal", 
mltq1, mltq2);
}
/**
 *  Show that the FuzzyLikeThisQuery in the queries contrib package 
 *  does not implement equals() correctly.
 */
@Test
public void testFuzzyLikeThisQueryEquals()
{
FuzzyLikeThisQuery fltq1 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq1.addTerms("javi", "subject", 0.5f, 2);
FuzzyLikeThisQuery fltq2 = new FuzzyLikeThisQuery(10, new 
StandardAnalyzer());
fltq2.addTerms("javi", "subject", 0.5f, 2);
assertEquals("FuzzyLikeThisQuery with same attributes is not equal", 
fltq1, fltq2);
}
}
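For reference, a correct equals()/hashCode() pair for such a wrapper query could look like the sketch below, on a simplified stand-alone class. The String fields here are stand-ins for the real wrapped Query objects, and note that overriding equals() without hashCode() would still break hash-based caches.

```java
import java.util.Objects;

// Sketch of the equals()/hashCode() contract the contrib queries are missing,
// shown on a simplified stand-in for a wrapper query like BoostingQuery.
public class WrapperQuerySketch {
    private final String match;   // stands in for the wrapped match Query
    private final String context; // stands in for the context Query
    private final float boost;

    public WrapperQuerySketch(String match, String context, float boost) {
        this.match = match;
        this.context = context;
        this.boost = boost;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof WrapperQuerySketch)) return false;
        WrapperQuerySketch other = (WrapperQuerySketch) o;
        // compare every field that defines the query's identity
        return Float.compare(boost, other.boost) == 0
                && match.equals(other.match)
                && context.equals(other.context);
    }

    @Override
    public int hashCode() {
        // equal objects must produce equal hash codes, or caching breaks
        return Objects.hash(match, context, boost);
    }
}
```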





[jira] Created: (LUCENE-862) Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor Query, not cloned copy

2007-04-12 Thread Antony Bowesman (JIRA)
Contrib query org.apache.lucene.search.BoostingQuery sets boost on constructor 
Query, not cloned copy
-

 Key: LUCENE-862
 URL: https://issues.apache.org/jira/browse/LUCENE-862
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.1
 Environment: All
Reporter: Antony Bowesman
Priority: Minor


BoostingQuery sets the boost value on the passed-in context Query:

public BoostingQuery(Query match, Query context, float boost) {
  this.match = match;
  this.context = (Query)context.clone();   // clone before boost
  this.boost = boost;

  context.setBoost(0.0f);                  // ignore context-only matches
}

This should be:

  this.context.setBoost(0.0f);             // ignore context-only matches
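The effect of the one-line difference can be shown with a tiny self-contained model (Q below is a hypothetical stand-in for Query, not the Lucene class): mutating the constructor argument leaves the stored clone unboosted and silently changes the caller's object.

```java
// Minimal illustration of the clone-before-mutate bug: setting the boost on
// the constructor argument instead of the stored clone leaves the clone's
// boost unchanged and mutates the caller's object instead.
public class CloneBoostDemo {
    public static class Q implements Cloneable {
        public float boost = 1.0f;
        @Override
        public Q clone() {
            try {
                return (Q) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    public final Q context;

    // buggy == true mirrors the current constructor
    public CloneBoostDemo(Q ctx, boolean buggy) {
        this.context = ctx.clone();    // clone before boost
        if (buggy) {
            ctx.boost = 0.0f;          // bug: mutates the caller's object
        } else {
            this.context.boost = 0.0f; // fix: mutates the stored clone
        }
    }
}
```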

Also, a boost value of 0.0 may have the wrong effect; see the discussion at

http://www.mail-archive.com/[EMAIL PROTECTED]/msg12243.html 






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488412
 ] 

Matt Ericson commented on LUCENE-855:
-

I have done a little research, and I do not think I can make my bit set act
the same as a normal BitSet, so this will not work with ChainedFilter, as
ChainedFilter calls BitSet.and() or BitSet.or().

I looked at these methods; they access private variables inside BitSet and
perform the 'and', 'or', and 'xor' directly on the bits in memory. Since my
BitSet is just a proxy for the field cache, ChainedFilter will not work unless
we also change ChainedFilter.
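The failure mode described here can be reproduced with java.util.BitSet alone (FakeBitSet below is a hypothetical stand-in for the proxy, not the actual FieldCacheBitSet): BitSet.and() reads the argument's internal word array directly and never calls get(), so a get()-only subclass looks empty to it.

```java
import java.util.BitSet;

// Demonstrates why a get()-only "proxy" BitSet breaks BitSet.and():
// and() operates on the argument's internal storage, never on get(),
// so a subclass that only overrides get() contributes all-zero words.
public class ProxyBitSetDemo {
    // Stand-in for a FieldCacheBitSet-style proxy: claims bits 0..9 are set,
    // but never touches the inherited backing storage.
    public static class FakeBitSet extends BitSet {
        @Override
        public boolean get(int index) {
            return index >= 0 && index < 10; // computed, not stored
        }
    }

    public static int andCardinality() {
        BitSet real = new BitSet();
        real.set(0, 10);              // bits 0..9 stored for real
        real.and(new FakeBitSet());   // proxy's backing words are all zero
        return real.cardinality();    // everything got cleared
    }
}
```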

Matt






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488547
 ] 

Yiqing Jin commented on LUCENE-855:
---

That's true, you can't do the 'and' or 'or' as usual, but I am thinking the 
FieldCacheBitSet could hold some private variables to store the range and 
field information; we could then do the 'and', 'or', and 'xor' in a tricky way 
by setting the values of those variables, and implement #get() using the 
variables as the judgement.
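This idea can be sketched as follows (a hypothetical stand-in, not the actual FieldCacheBitSet): for two range filters over the same field, 'and' is just the intersection of the two ranges, and get() becomes a bounds check against the stored range.

```java
// Sketch of a lazy range "bit set": keep the range itself as state,
// implement the logical ops by combining ranges, and defer get() to a
// bounds check against the cached field values.
public class LazyRangeBits {
    private final long[] valueByDoc; // stands in for the FieldCache values
    private long lower, upper;       // current effective range, inclusive

    public LazyRangeBits(long[] valueByDoc, long lower, long upper) {
        this.valueByDoc = valueByDoc;
        this.lower = lower;
        this.upper = upper;
    }

    // 'and' of two range filters over the same field = range intersection
    public void and(LazyRangeBits other) {
        lower = Math.max(lower, other.lower);
        upper = Math.min(upper, other.upper);
    }

    // get() judges membership from the stored range; no bits are stored
    public boolean get(int doc) {
        long v = valueByDoc[doc];
        return v >= lower && v <= upper;
    }
}
```

Note this only works for this narrow case: 'or' of two disjoint ranges, or combining filters over different fields, cannot be represented by a single interval, which is part of why changing ChainedFilter itself may be the cleaner route.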

Changing the ChainedFilter is a good way; maybe we could have a special 
FieldCacheChainedFilter ^_^.

I'm having a busy day, but I'll try to do some experiments on it if I have 
time.




Re: svn commit: r528298 - /lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java

2007-04-12 Thread Otis Gospodnetic
Thanks Hoss!  I thought I committed that, but svn log shows I did not.  I must 
have left the fix local on my other computer.  Too many computers, too few 
brain cells.

Otis


- Original Message 
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, April 12, 2007 8:59:29 PM
Subject: svn commit: r528298 - 
/lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java

Author: hossman
Date: Thu Apr 12 17:59:28 2007
New Revision: 528298

URL: http://svn.apache.org/viewvc?view=rev&rev=528298
Log:
minor followup to LUCENE-857, fixing a small mistake in Otis's original commit 
to ensure that the deprecated QueryFilter still caches

Modified:
lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java

Modified: lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java
URL: 
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java?view=diff&rev=528298&r1=528297&r2=528298
==
--- lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java 
(original)
+++ lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java Thu 
Apr 12 17:59:28 2007
@@ -25,13 +25,13 @@
  * @version $Id$
  * @deprecated use a CachingWrapperFilter with QueryWrapperFilter
  */
-public class QueryFilter extends QueryWrapperFilter {
+public class QueryFilter extends CachingWrapperFilter {
 
   /** Constructs a filter which only matches documents matching
* <code>query</code>.
*/
   public QueryFilter(Query query) {
-super(query);
+super(new QueryWrapperFilter(query));
   }
 
   public boolean equals(Object o) {
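The shape of the patch above, QueryFilter turning into a caching decorator around QueryWrapperFilter, can be illustrated with a simplified stand-alone sketch. Filter and the String "reader" key below are stand-ins, not the Lucene types.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the caching-decorator pattern in the patch: the outer filter
// does no matching of its own; it caches the wrapped filter's result
// per reader and delegates the first computation.
public class CachingWrapperSketch {
    public interface Filter {
        boolean[] bits(String reader); // stands in for bits(IndexReader)
    }

    // Counts invocations so the caching behavior is observable.
    public static class CountingFilter implements Filter {
        public int calls = 0;
        public boolean[] bits(String reader) {
            calls++;
            return new boolean[] { true, false, true };
        }
    }

    public static class CachingWrapper implements Filter {
        private final Filter delegate;
        private final Map<String, boolean[]> cache = new HashMap<>();

        public CachingWrapper(Filter delegate) {
            this.delegate = delegate;
        }

        public boolean[] bits(String reader) {
            // compute once per reader, then serve from the cache
            return cache.computeIfAbsent(reader, delegate::bits);
        }
    }
}
```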









Re: svn commit: r528298 - /lucene/java/trunk/src/java/org/apache/lucene/search/QueryFilter.java

2007-04-12 Thread Chris Hostetter

: Thanks Hoss!  I thought I committed that, but svn log shows I did not.
: I must have left the fix local on my other computer.  Too many

i figured as much, i was going to email you and then i realized it would
take less typing to commit the change :)


-Hoss

