[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450
 ] 

Andy Liu commented on LUCENE-855:
-

Yes, it looks the same.  Glad this will finally make it to the source!

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2007-09-13 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12527134
 ] 

Andy Liu commented on LUCENE-794:
-

Ah, I wasn't crazy.  I had the test data wrong.  Here's the code I'm using to 
produce the failing result:

String text = "y z x y z a b";

Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("body", analyzer);
Query query = parser.parse("\"x y z\"");

CachingTokenFilter tokenStream = new 
CachingTokenFilter(analyzer.tokenStream("body", new StringReader(text)));
Highlighter highlighter = new Highlighter(new SpanScorer(query, "body", 
tokenStream));
highlighter.setTextFragmenter(new NullFragmenter());
tokenStream.reset();

String result = highlighter.getBestFragments(tokenStream, text, 1, 
"...");
System.out.println(result);

This produces:

y z x y z a b

The beginning y and z shouldn't be highlighted.

If I change the the beginning y and z to x and y, I get the correct result:

"x y x y z a b" => x y x y z a b

Here's a couple other failing results:

"z x y z a b" => z x y z a b
"z a x y z a b" => z a x y z a b

FYI, I'm using the latest version of Lucene.

> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter10.patch, spanhighlighter2.patch, spanhighlighter3.patch, 
> spanhighlighter5.patch, spanhighlighter6.patch, spanhighlighter7.patch, 
> spanhighlighter8.patch, spanhighlighter9.patch, spanhighlighter_patch_4.zip, 
> SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java, 
> SpanScorer.java, WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2007-09-12 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526847
 ] 

Andy Liu commented on LUCENE-794:
-

Hmm, I tried it again and now it's working correctly.  Maybe I had interpreted 
the output incorrectly.  Sorry for the false alarm.

> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter10.patch, spanhighlighter2.patch, spanhighlighter3.patch, 
> spanhighlighter5.patch, spanhighlighter6.patch, spanhighlighter7.patch, 
> spanhighlighter8.patch, spanhighlighter9.patch, spanhighlighter_patch_4.zip, 
> SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java, 
> SpanScorer.java, WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-794) Extend contrib Highlighter to properly support phrase queries and span queries

2007-09-12 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12526803
 ] 

Andy Liu commented on LUCENE-794:
-

I gave this patch a whirl, and it looks great.

I do see one problem.  Say a document contains:

x y z a b y z

and the query is:

"x y z"

the highlighter will return (with terms in brackets denoting highlighted terms):

[x] [y] [z] a b [y] [z]

Since the last y and z are not part of the full phrase, they should not be 
highlighted.

> Extend contrib Highlighter to properly support phrase queries and span queries
> --
>
> Key: LUCENE-794
> URL: https://issues.apache.org/jira/browse/LUCENE-794
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Other
>Reporter: Mark Miller
>Priority: Minor
> Attachments: CachedTokenStream.java, CachedTokenStream.java, 
> CachedTokenStream.java, DefaultEncoder.java, Encoder.java, Formatter.java, 
> Highlighter.java, Highlighter.java, Highlighter.java, Highlighter.java, 
> Highlighter.java, HighlighterTest.java, HighlighterTest.java, 
> HighlighterTest.java, HighlighterTest.java, MemoryIndex.java, 
> QuerySpansExtractor.java, QuerySpansExtractor.java, QuerySpansExtractor.java, 
> QuerySpansExtractor.java, SimpleFormatter.java, spanhighlighter.patch, 
> spanhighlighter10.patch, spanhighlighter2.patch, spanhighlighter3.patch, 
> spanhighlighter5.patch, spanhighlighter6.patch, spanhighlighter7.patch, 
> spanhighlighter8.patch, spanhighlighter9.patch, spanhighlighter_patch_4.zip, 
> SpanHighlighterTest.java, SpanHighlighterTest.java, SpanScorer.java, 
> SpanScorer.java, WeightedSpanTerm.java
>
>
> This patch adds a new Scorer class (SpanQueryScorer) to the Highlighter 
> package that scores just like QueryScorer, but scores a 0 for Terms that did 
> not cause the Query hit. This gives 'actual' hit highlighting for the range 
> of SpanQuerys and PhraseQuery. There is also a new Fragmenter that attempts 
> to fragment without breaking up Spans.
> See http://issues.apache.org/jira/browse/LUCENE-403 for some background.
> There is a dependency on MemoryIndex.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-11 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: contrib-filters.tar.gz

I made a few changes to MemoryCachedRangeFilter:

- SortedFieldCache's values[] now contains only sorted unique values, while 
docId[] has been changed to a ragged 2D array with an array of docId's 
corresponding to each unique value.  Since there's no longer repeated values in 
values[]. forward() and rewind() are no longer required.  This also addresses 
the O(n) special case that Hoss brought up where every value is identical.
- bits() now returns OpenBitSetWrapper, a subclass of BitSet that uses Solr's 
OpenBitSet as a delegate.  Wrapping OpenBitSet presents some challenges.  Since 
the internal bits store of BitSet is private, it's difficult to perform 
operations between BitSet and OpenBitSet (like or, and, etc).
- An in-memory OpenBitSet cache is kept.  During warmup, the global range is 
partitioned and OpenBitSet instances are created for each partition.  During 
bits(), these cached OpenBitSet instances that fall in between the lower and 
upper ranges are used.
- Moved MCRF to contrib/ due to the Solr dependancy

Using the current (and incomplete) benchmark, MemoryCachedRangeFilter is 
slightly faster than FCRF when used in conjuction with ConstantRangeQuery and 
MatchAllDocsQuery:

Reader opened with 10 documents.  Creating RangeFilters...

TermQuery

FieldCacheRangeFilter
  * Total: 88ms
  * Bits: 0ms
  * Search: 14ms

MemoryCachedRangeFilter
  * Total: 89ms
  * Bits: 17ms
  * Search: 31ms

RangeFilter
  * Total: 9034ms
  * Bits: 4483ms
  * Search: 4521ms

Chained FieldCacheRangeFilter
  * Total: 33ms
  * Bits: 3ms
  * Search: 9ms

Chained MemoryCachedRangeFilter
  * Total: 77ms
  * Bits: 19ms
  * Search: 30ms


ConstantScoreQuery

FieldCacheRangeFilter
  * Total: 541ms
  * Bits: 2ms
  * Search: 485ms

MemoryCachedRangeFilter
  * Total: 473ms
  * Bits: 23ms
  * Search: 390ms

RangeFilter
  * Total: 13777ms
  * Bits: 4451ms
  * Search: 9298ms

Chained FieldCacheRangeFilter
  * Total: 12ms
  * Bits: 2ms
  * Search: 5ms

Chained MemoryCachedRangeFilter
  * Total: 80ms
  * Bits: 16ms
  * Search: 44ms


MatchAllDocsQuery

FieldCacheRangeFilter
  * Total: 1231ms
  * Bits: 3ms
  * Search: 1115ms

MemoryCachedRangeFilter
  * Total: 1222ms
  * Bits: 53ms
  * Search: 1149ms

RangeFilter
  * Total: 10689ms
  * Bits: 4954ms
  * Search: 5583ms

Chained FieldCacheRangeFilter
  * Total: 937ms
  * Bits: 1ms
  * Search: 862ms

Chained MemoryCachedRangeFilter
  * Total: 921ms
  * Bits: 19ms
  * Search: 894ms

Hoss, those were great comments you made.  I'd be happy to continue on and make 
those changes, although if the feeling around town is that Matt's range filter 
is the preferred implementation, I'll stop here.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> 

[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897
 ] 

Andy Liu commented on LUCENE-855:
-

Hey Matt, I get this exception when running your newest FCRF with the 
performance test.  Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 10
at 
org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
at 
org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
at org.apache.lucene.search.Scorer.score(Scorer.java:49)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
at org.apache.lucene.search.Hits.(Hits.java:53)
at org.apache.lucene.search.Searcher.search(Searcher.java:46)
at 
org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
at 
org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at junit.framework.TestSuite.runTest(TestSuite.java:208)
at junit.framework.TestSuite.run(TestSuite.java:203)
at 
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)



> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, MemoryCachedRangeFilter.patch, 
> MemoryCachedRangeFilter_1.4.patch, TestRangeFilterPerformanceComparison.java, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The

[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: TestRangeFilterPerformanceComparison.java

Here's my new benchmark.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
> TestRangeFilterPerformanceComparison.java
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595
 ] 

Andy Liu commented on LUCENE-855:
-

In your updated benchmark, you're combining the range filter with a term query 
that matches one document.  I don't believe that's the typical use case for a 
range filter.  Usually the user employs a range to filter a large document set. 
 

I created a different benchmark to compare standard range filter, 
MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using 
MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the 
last benchmark).  Here are the results:

Reader opened with 10 documents.  Creating RangeFilters...
RangeFilter w/MatchAllDocsQuery:

  * Bits: 4421
  * Search: 5285

RangeFilter w/ConstantScoreQuery:

  * Bits: 4200
  * Search: 8694

RangeFilter w/TermQuery:

  * Bits: 4088
  * Search: 4133

MemoryCachedRangeFilter w/MatchAllDocsQuery:

  * Bits: 80
  * Search: 1142

MemoryCachedRangeFilter w/ConstantScoreQuery:

  * Bits: 79
  * Search: 482

MemoryCachedRangeFilter w/TermQuery:

  * Bits: 73
  * Search: 95

FieldCacheRangeFilter w/MatchAllDocsQuery:

  * Bits: 0
  * Search: 1146

FieldCacheRangeFilter w/ConstantScoreQuery:

  * Bits: 1
  * Search: 356

FieldCacheRangeFilter w/TermQuery:

  * Bits: 0
  * Search: 19

Here's some points:

1. When searching in a filter, bits() is called, so the search time includes 
bits() time.
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not 
by much.  Using MatchAllDocsQuery, they run neck-and-neck.  FCRF is much faster 
for TermQuery since MCRF has to create the BItSet for the range before the 
search is executed.
3. I get less document hits when running FieldCacheRangeFilter with 
ConstantScoreQuery.  Matt, there may be a bug in getNextSetBit().  Not sure if 
this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is 
used with ChainedFilter.  I suspect that MCRF would be faster in this case, 
since I'm assuming that FCRF has to reconstruct a standard BitSet during 
clone().

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Assigned To: Otis Gospodnetic
> Attachments: FieldCacheRangeFilter.patch, 
> FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
>

[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-06 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487378
 ] 

Andy Liu commented on LUCENE-855:
-

Hey Matt,

The way you implemented FieldCacheRangeFilter is very simple and clever!  
Here's a couple comments:

1. My performance test that we both used is no longer valid, since 
FieldCacheRangeFilter.bits() only returns a wrapper around a BitSet.  The test 
only calls bits() .  Since you're wrapping BitSet, there's some overhead 
incurred when applying it to an actual search.  I reran the performance test 
applying the Filter to a search, and your implementation is still faster, 
although only slightly.

2. Your filter currently doesn't work with ConstantRangeQuery.  CRQ calls 
bits.nextSetBit() which fails in your wrapped BitSet implementation.  Your 
incomplete implementation of BitSet may cause problems elsewhere.

If you can fix #2 I'd vote for your implementation since it's cleaner and 
faster, although I might take another stab at trying to improve my 
implementation.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: FieldCacheRangeFilter.patch, 
> MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-06 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: MemoryCachedRangeFilter_1.4.patch

Here's a patch that should compile in Java 1.4 .  It includes

src/java/org/apache/lucene/search/MemoryCachedRangeFilter.java
src/test/org/apache/lucene/search/TestMemoryCachedRangeFilter.java
src/test/org/apache/lucene/search/TestMemoryCachedRangeFilterPerformance.java

You can try using TestMemoryCachedRangeFilterPerformance to compare runtime 
speed numbers.  Let me know if you have any problem running these.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: MemoryCachedRangeFilter.patch, 
> MemoryCachedRangeFilter_1.4.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791
 ] 

Andy Liu commented on LUCENE-855:
-

Ah, you're right.  I didn't read closely enough!

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: MemoryCachedRangeFilter.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767
 ] 

Andy Liu commented on LUCENE-855:
-

Otis, looking forward to your colleague's patch.

LUCENE-798 caches RangeFilters so that if the same exact range is executed 
again, the cached RangeFilter is used.  However, the first time a range is 
encountered, you'll still have to calculate the RangeFilter, which can be slow. 
 I haven't looked at the patch, but I'm sure LUCENE-798 can be used in 
conjunction with MemoryCachedRangeFilter to further boost performance for 
repeated range queries.

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: MemoryCachedRangeFilter.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-02 Thread Andy Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Liu updated LUCENE-855:


Attachment: MemoryCachedRangeFilter.patch

Patch produced from latest from SVN

> MemoryCachedRangeFilter to boost performance of Range queries
> -
>
> Key: LUCENE-855
> URL: https://issues.apache.org/jira/browse/LUCENE-855
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.1
>Reporter: Andy Liu
> Attachments: MemoryCachedRangeFilter.patch
>
>
> Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
> within the specified range.  This requires iterating through every single 
> term in the index and can get rather slow for large document sets.
> MemoryCachedRangeFilter reads all  pairs of a given field, 
> sorts by value, and stores in a SortedFieldCache.  During bits(), binary 
> searches are used to find the start and end indices of the lower and upper 
> bound values.  The BitSet is populated by all the docId values that fall in 
> between the start and end indices.
> TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed 
> index with random date values within a 5 year range.  Executing bits() 1000 
> times on standard RangeQuery using random date intervals took 63904ms.  Using 
> MemoryCachedRangeFilter, it took 876ms.  Performance increase is less 
> dramatic when you have less unique terms in a field or using less number of 
> documents.
> Currently MemoryCachedRangeFilter only works with numeric values (values are 
> stored in a long[] array) but it can be easily changed to support Strings.  A 
> side "benefit" of storing the values are stored as longs, is that there's no 
> longer the need to make the values lexographically comparable, i.e. padding 
> numeric values with zeros.
> The downside of using MemoryCachedRangeFilter is there's a fairly significant 
> memory requirement.  So it's designed to be used in situations where range 
> filter performance is critical and memory consumption is not an issue.  The 
> memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  
> MemoryCachedRangeFilter also requires a warmup step which can take a while to 
> run in large datasets (it took 40s to run on a 3M document corpus).  Warmup 
> can be called explicitly or is automatically called the first time 
> MemoryCachedRangeFilter is applied using a given field.
> So in summery, MemoryCachedRangeFilter can be useful when:
> - Performance is critical
> - Memory is not an issue
> - Field contains many unique numeric values
> - Index contains large amount of documents

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Created: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-02 Thread Andy Liu (JIRA)
MemoryCachedRangeFilter to boost performance of Range queries
-

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu


Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
within the specified range.  This requires iterating through every single term 
in the index and can get rather slow for large document sets.

MemoryCachedRangeFilter reads all  pairs of a given field, sorts 
by value, and stores in a SortedFieldCache.  During bits(), binary searches are 
used to find the start and end indices of the lower and upper bound values.  
The BitSet is populated by all the docId values that fall in between the start 
and end indices.

TestMemoryCachedRangeFilterPerformance creates a 100K RAMDirectory-backed index 
with random date values within a 5 year range.  Executing bits() 1000 times on 
standard RangeQuery using random date intervals took 63904ms.  Using 
MemoryCachedRangeFilter, it took 876ms.  Performance increase is less dramatic 
when you have less unique terms in a field or using less number of documents.

Currently MemoryCachedRangeFilter only works with numeric values (values are 
stored in a long[] array) but it can be easily changed to support Strings.  A 
side "benefit" of storing the values are stored as longs, is that there's no 
longer the need to make the values lexographically comparable, i.e. padding 
numeric values with zeros.

The downside of using MemoryCachedRangeFilter is there's a fairly significant 
memory requirement.  So it's designed to be used in situations where range 
filter performance is critical and memory consumption is not an issue.  The 
memory requirements are: (sizeof(int) + sizeof(long)) * numDocs.  

MemoryCachedRangeFilter also requires a warmup step which can take a while to 
run in large datasets (it took 40s to run on a 3M document corpus).  Warmup can 
be called explicitly or is automatically called the first time 
MemoryCachedRangeFilter is applied using a given field.

So in summery, MemoryCachedRangeFilter can be useful when:
- Performance is critical
- Memory is not an issue
- Field contains many unique numeric values
- Index contains large amount of documents


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]