[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653900#action_12653900
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

Hi Matt! :)

Tim, want to benchmark the two? (since you already benchmarked 1461, you should 
be able to plug in Matt's thing and see how it compares)


 MemoryCachedRangeFilter to boost performance of Range queries
 -

 Key: LUCENE-855
 URL: https://issues.apache.org/jira/browse/LUCENE-855
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.1
Reporter: Andy Liu
 Attachments: contrib-filters.tar.gz, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter.patch, 
 FieldCacheRangeFilter.patch, FieldCacheRangeFilter_Lucene_2.3.0.patch, 
 MemoryCachedRangeFilter.patch, MemoryCachedRangeFilter_1.4.patch, 
 TestRangeFilterPerformanceComparison.java, 
 TestRangeFilterPerformanceComparison.java


 Currently RangeFilter uses TermEnum and TermDocs to find documents that fall 
 within the specified range.  This requires iterating through every single 
 term in the index and can get rather slow for large document sets.
 MemoryCachedRangeFilter reads all (docId, value) pairs of a given field, 
 sorts them by value, and stores them in a SortedFieldCache.  During bits(), 
 binary searches are used to find the start and end indices of the lower and 
 upper bound values.  The BitSet is populated with all the docId values that 
 fall between the start and end indices.  (A sketch of this lookup appears 
 after the summary below.)
 TestMemoryCachedRangeFilterPerformance creates a 100K-document 
 RAMDirectory-backed index with random date values within a 5-year range.  
 Executing bits() 1000 times on a standard RangeQuery using random date 
 intervals took 63904ms.  Using MemoryCachedRangeFilter, it took 876ms.  The 
 performance increase is less dramatic when the field has fewer unique terms 
 or the index has fewer documents.
 Currently MemoryCachedRangeFilter only works with numeric values (values are 
 stored in a long[] array), but it can easily be changed to support Strings.  
 A side benefit of storing the values as longs is that there's no longer any 
 need to make the values lexicographically comparable, i.e. padding numeric 
 values with zeros.
 The downside of using MemoryCachedRangeFilter is that it has a fairly 
 significant memory requirement, so it's designed for situations where range 
 filter performance is critical and memory consumption is not an issue.  The 
 memory requirement is (sizeof(int) + sizeof(long)) * numDocs, i.e. 12 bytes 
 per document, or roughly 36 MB for a 3M-document index.  
 MemoryCachedRangeFilter also requires a warmup step, which can take a while 
 on large datasets (it took 40s on a 3M-document corpus).  Warmup can be 
 called explicitly, or it runs automatically the first time 
 MemoryCachedRangeFilter is applied to a given field.
 So in summary, MemoryCachedRangeFilter can be useful when:
 - Performance is critical
 - Memory is not an issue
 - The field contains many unique numeric values
 - The index contains a large number of documents
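
A minimal sketch of the lookup described above, assuming a parallel-array
layout sorted by value; the names (SortedFieldCache, bits) follow the
description, but the actual patch's API may differ:

    import java.util.BitSet;

    class SortedFieldCache {
      private final long[] values; // field values, sorted ascending
      private final int[] docIds;  // docIds[i] is the doc whose value is values[i]

      SortedFieldCache(long[] sortedValues, int[] parallelDocIds) {
        this.values = sortedValues;
        this.docIds = parallelDocIds;
      }

      // Set a bit for every document whose value lies in [lower, upper].
      BitSet bits(long lower, long upper, int maxDoc) {
        int start = firstAtLeast(lower);   // binary search for the lower bound
        int end = firstGreaterThan(upper); // binary search past the upper bound
        BitSet result = new BitSet(maxDoc);
        for (int i = start; i < end; i++) {
          result.set(docIds[i]);
        }
        return result;
      }

      // First index whose value is >= key.
      private int firstAtLeast(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (values[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
      }

      // First index whose value is > key.
      private int firstGreaterThan(long key) {
        int lo = 0, hi = values.length;
        while (lo < hi) {
          int mid = (lo + hi) >>> 1;
          if (values[mid] <= key) lo = mid + 1; else hi = mid;
        }
        return lo;
      }
    }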




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Tim Sturge (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653414#action_12653414
 ] 

Tim Sturge commented on LUCENE-855:
---

Matt, Andy,

Please take a look at LUCENE-1461. As far as I can tell it is identical in 
purpose and design to this patch.

Matt,

I would like to add you to the CHANGES.txt credits for LUCENE-1461. Are you OK 
with that?






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653450#action_12653450
 ] 

Andy Liu commented on LUCENE-855:
-

Yes, it looks the same.  Glad this will finally make it to the source!




Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread robert engels

I can't seem to post to Jira, so I am attaching here...

I attached QueryFilter.java.

In reading this patch, and other similar ones, the problem seems to be that
if the index is modified, the cache is invalidated, causing a complete
reload of the cache. Do I have this correct?

The attached patch works really well in a highly interactive environment,
as the cache is only invalidated at the segment level.

The MyMultiReader is a subclass that allows access to the underlying
SegmentReaders.

The patch cannot be applied as-is, but I think the implementation works far
better in many cases, and it is also far less memory intensive. Scanning
the bitset could also be optimized very easily using internal skip values.

Maybe this is completely off-base, but the solution has worked very well
for us. Or maybe this is a completely different issue and a separate
incident should be opened?

Is there any interest in this?

Attachment: QueryFilter.java (binary data)
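
The attachment itself is not reproduced in the archive; the following is a
hedged sketch of the segment-level caching Robert describes. Here
getSubReaders() stands in for whatever accessor his MyMultiReader exposes;
it is not a Lucene 2.x API:

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Map;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    abstract class SegmentCachingFilter extends Filter {
      private final Filter wrapped;
      // Weak keys: an entry disappears once its segment reader is collected.
      private final Map<IndexReader, BitSet> cache =
          new WeakHashMap<IndexReader, BitSet>();

      SegmentCachingFilter(Filter wrapped) {
        this.wrapped = wrapped;
      }

      // Hypothetical: enumerate the top-level reader's segment readers,
      // as Robert's MyMultiReader subclass does.
      protected abstract IndexReader[] getSubReaders(IndexReader reader);

      public synchronized BitSet bits(IndexReader reader) throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        int docBase = 0;
        for (IndexReader segment : getSubReaders(reader)) {
          BitSet segBits = cache.get(segment);
          if (segBits == null) { // only segments new since the last reopen pay this
            segBits = wrapped.bits(segment);
            cache.put(segment, segBits);
          }
          // Shift the per-segment bits into the top-level doc id space.
          for (int doc = segBits.nextSetBit(0); doc >= 0;
               doc = segBits.nextSetBit(doc + 1)) {
            result.set(docBase + doc);
          }
          docBase += segment.maxDoc();
        }
        return result;
      }
    }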

RE: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Uwe Schindler
I keep looking at LUCENE-831, which is a new version of FieldCache that is
compatible with IndexReader.reopen() and invalidates only reloaded segments.
With each release of Lucene I am very unhappy, because it is still not in.
The same problem as yours occurs if you have a one-million-document index
that is updated by adding a few documents every half hour: if you sort by a
field, then whenever the index is reopened, even though only a very small
segment was added, the complete FieldCache is rebuilt. Very bad :(.


So I think the ultimate fix would be to apply LUCENE-831 soon, and also to
use LUCENE-1461 as the RangeFilter cache.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]




Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread robert engels

Lucene-831 is far more comprehensive.

I also think that by exposing access to the sub-readers it can be far
simpler (closer to what I have provided).

In the meantime, you should be able to use the provided class with a few
modifications.

The reload-the-entire-cache behavior was a deal breaker for us, so I came
up with the attached. Works very well.





Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread robert engels
The biggest benefit I see of using the field cache to do filter caching is
that the same cache can be used for sorting, thereby improving performance
and memory usage.

The downside I see is that if you have a common filter that is built from
many fields, you are going to use a lot more memory, as every field used
needs to be cached. With my code you would only have a single bitset for
the filter.





Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Paul Elschot
On Thursday 04 December 2008 23:03:40, robert engels wrote:
> The biggest benefit I see of using the field cache to do filter
> caching, is that the same cache can be used for sorting - thereby
> improving the performance and memory usage.

Would it be possible to build such filter caching into
CachingWrapperFilter instead of into QueryFilter?

Both the filter caching and the field value caching will need
access to the underlying (segment?) readers.

> The downside I see is that if you have a common filter that is built
> from many fields, you are going to use a lot more memory, as every
> field used needs to be cached. With my code you would only have a
> single bitset for the filter.

But with many ranges that would mean many bitsets, and
MemoryCachedRangeFilter only needs to cache the field values once for any
number of ranges. It's a tradeoff: at 3M documents, for example, each
cached bitset costs about 370 KB while the cached field values cost about
36 MB once, so the bitsets stay cheaper until roughly a hundred distinct
ranges are cached.

Regards,
Paul Elschot
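
For context, a condensed sketch of the per-reader memoization that
CachingWrapperFilter performs (written from memory of the Lucene 2.x API,
so treat the details as approximate). Paul's suggestion amounts to applying
the same memoization per segment reader rather than per top-level reader,
so a reopen only misses on new segments:

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Map;
    import java.util.WeakHashMap;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    class PerReaderCachingFilter extends Filter {
      private final Filter filter;
      // Weak keys so closed, collected readers drop out of the cache.
      private final Map<IndexReader, BitSet> cache =
          new WeakHashMap<IndexReader, BitSet>();

      PerReaderCachingFilter(Filter filter) {
        this.filter = filter;
      }

      public synchronized BitSet bits(IndexReader reader) throws IOException {
        BitSet cached = cache.get(reader);
        if (cached == null) {
          cached = filter.bits(reader); // computed once per reader instance
          cache.put(reader, cached);
        }
        return cached;
      }
    }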






Re: [jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Earwin Burrfoot
It would be cool to be able to explicitly list the subreaders that were
added/removed as a result of reopen(), or to have some kind of notification
mechanism. We have filter caches and custom field/sort caches here, and
they are all reader-bound. Currently the warm-up delay is hidden by
reopening and warming up in the background before switching to the new
reader/caches, but it still limits our minimum between-reopens delay.
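
A hedged sketch of the diff such a notification could deliver, assuming
only that the old and new segment readers can be enumerated somehow; since
reopen() reuses unchanged segment reader instances, identity comparison
suffices:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.index.IndexReader;

    class ReopenDiff {
      final Set<IndexReader> added = new HashSet<IndexReader>();
      final Set<IndexReader> removed = new HashSet<IndexReader>();

      ReopenDiff(IndexReader[] before, IndexReader[] after) {
        Set<IndexReader> old = new HashSet<IndexReader>();
        for (IndexReader r : before) {
          old.add(r);
        }
        for (IndexReader r : after) {
          if (!old.remove(r)) {
            added.add(r); // not carried over from the old reader: warm its caches
          }
        }
        removed.addAll(old); // dropped by reopen(): evict their cache entries
      }
    }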



-- 
Kirill Zakharenko/Кирилл Захаренко ([EMAIL PROTECTED])
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-12-04 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653635#action_12653635
 ] 

Matt Ericson commented on LUCENE-855:
-

Looks similar to what I wrote, but it uses more data structures. I liked
what I built because it has direct access to the FieldCache and there are
no other data structures: once you load the data into the FieldCache, you
can run any other search on that field without rebuilding anything; you
just reuse the data.

But I think all 3 are improvements on what's there. As I am prejudiced, I
really like the one I wrote, and I think it will prove faster than 1461 if
you run load tests on it.

Just my $0.02

Matt
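
A sketch of the direct-FieldCache approach Matt describes, not his actual
attachment: this version materializes a real BitSet, whereas his patch
returns a BitSet proxy, which is what the ChainedFilter discussion later in
this archive is about.

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Filter;

    class SimpleFieldCacheRangeFilter extends Filter {
      private final String field;
      private final int lowerVal, upperVal;

      SimpleFieldCacheRangeFilter(String field, int lowerVal, int upperVal) {
        this.field = field;
        this.lowerVal = lowerVal;
        this.upperVal = upperVal;
      }

      public BitSet bits(IndexReader reader) throws IOException {
        // getInts is memoized per (reader, field): the first call pays the
        // load cost; later sorts and range filters on the field reuse it.
        int[] values = FieldCache.DEFAULT.getInts(reader, field);
        BitSet result = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < values.length; doc++) {
          if (values[doc] >= lowerVal && values[doc] <= upperVal) {
            result.set(doc);
          }
        }
        return result;
      }
    }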








[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-11-30 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651837#action_12651837
 ] 

Paul Elschot commented on LUCENE-855:
-

On the face of it, this has some overlap with the recent
FieldCacheRangeFilter of LUCENE-1461.
Any comments?




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2008-01-20 Thread vivek (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12560912#action_12560912
 ] 

vivek commented on LUCENE-855:
--

Any plans to make this part of Lucene 2.3?




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-25 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491677
 ] 

Matt Ericson commented on LUCENE-855:
-

Can someone take a look at the code I attached and let me know if there is
anything we need to change? Or did it already get added to Lucene?

I don't really know how long this should take.

Matt




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488291
 ] 

Yiqing Jin commented on LUCENE-855:
---

Hi Matt,
When I tried the FieldCacheRangeFilter I ran into a problem.

I added a test block at the end of TestFieldCacheRangeFilter:

    FieldCacheRangeFilter f1 = new FieldCacheRangeFilter("id",
        (float) minIP, (float) maxIP, T, F);
    FieldCacheRangeFilter f2 = new FieldCacheRangeFilter("id",
        (float) minIP, (float) maxIP, F, T);

    ChainedFilter f = new ChainedFilter(new Filter[] {f1, f2},
        ChainedFilter.AND);
    result = search.search(q, f);
    assertEquals("all but ends", numDocs - 2, result.length());

This does not pass; in fact result.length() is 0, and nothing is found.

I checked my code and traced the execution but still can't get the expected
result. It seems the filter won't work with ChainedFilter: after doChain
the BitSet appears to be empty (for either the 'and' or the 'or'
operation).
The relevant code in ChainedFilter:

    case AND:
        BitSet bit = filter.bits(reader);
        result.and(bit);

The bit is already empty before it is and-ed into the result.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488297
 ] 

Yiqing Jin commented on LUCENE-855:
---

After I changed the code in ChainedFilter#doChain to

    case AND:
        BitSet bit = (BitSet) filter.bits(reader).clone();
        result.and(bit);
        break;

the result is fine, but I know that's a bad way. Since the FieldCacheBitSet
is not a real BitSet and uses a fake get() method that just reads values
from the FieldCache, I think the current implementation is still not a fit
for ChainedFilter, because FieldCacheBitSet does not have a real
implementation of logical operations such as 'and'. Maybe we could make
FieldCacheBitSet public and implement all the methods in its own way,
instead of having a convertToBitSet() that makes things messy.




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488412
 ] 

Matt Ericson commented on LUCENE-855:
-

I have done a little research, and I do not think I can make my bit set act
the same as a normal bit set, so this will not work with ChainedFilter,
since ChainedFilter calls BitSet.and() or BitSet.or().

I looked at these functions: they access private variables inside the
BitSet and perform the 'and', 'or', 'xor' on the bits in memory. Since my
BitSet is just a proxy for the field cache, ChainedFilter will not work
unless we also change ChainedFilter.

Matt
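
A tiny runnable illustration of the point; LazyBitSet here is hypothetical,
standing in for the patch's FieldCacheBitSet. java.util.BitSet.and() reads
the argument's internal word array directly, so an overridden get() never
participates:

    import java.util.BitSet;

    public class ProxyBitSetDemo {
      // Stand-in for a FieldCacheBitSet-style proxy: pretends every even
      // doc matches, but never sets any internal bits.
      static class LazyBitSet extends BitSet {
        public boolean get(int index) {
          return index % 2 == 0;
        }
      }

      public static void main(String[] args) {
        BitSet real = new BitSet();
        real.set(0);
        real.set(2);

        LazyBitSet lazy = new LazyBitSet();
        // and() intersects real's words with lazy's (empty) internal words;
        // lazy.get() is never called.
        real.and(lazy);
        System.out.println(real); // prints {}, not {0, 2}
      }
    }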






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-12 Thread Yiqing Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488547
 ] 

Yiqing Jin commented on LUCENE-855:
---

That's true, you can't do the 'and' or 'or' as usual, but I am thinking the 
FieldCacheBitSet may hold some private variables to store the range and field 
information, and we could do the 'and', 'or', 'xor' in a tricky way by setting 
the values of those variables, implementing #get() to use the variables as the 
judgement.

Changing the ChainedFilter is a good way; maybe we could have a special 
FieldCacheChainedFilter ^_^.

I'm having a busy day, but I'll try to do some experiments on it if I have time.
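
One rough way to code up that "tricky" combination (purely illustrative; none
of these names come from a patch) is to keep the two operand sets and the
operation as private variables and evaluate them lazily inside get():

import java.util.BitSet;

// A read-only view that defers 'and'/'or'/'xor' to get() time instead of
// mutating any bits -- the combination is stored, not computed.
public class LazyCombinedBitSet extends BitSet {
  public static final int AND = 0, OR = 1, XOR = 2;

  private final BitSet left, right;  // e.g. two FieldCache-backed proxies
  private final int op;

  public LazyCombinedBitSet(BitSet left, BitSet right, int op) {
    this.left = left;
    this.right = right;
    this.op = op;
  }

  public boolean get(int i) {
    switch (op) {
      case AND: return left.get(i) && right.get(i);
      case OR:  return left.get(i) || right.get(i);
      default:  return left.get(i) ^ right.get(i);
    }
  }

  public int length() {
    return Math.max(left.length(), right.length());
  }
}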




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-11 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12488125
 ] 

Hoss Man commented on LUCENE-855:
-

Another thing that occurred to me this morning is that the comparison test 
doesn't consider the performance of the various Filters when cached and reused 
(with something like CachingWrapperFilter) ... you may actually see the stock 
RangeFilter be faster than either implementation when you can reuse the same 
exact Filter over and over on the same IndexReader -- a fairly common use case 
(a sketch of that reuse pattern follows the list below).

In general the numbers that really need to be compared are...

  1) the time overhead of an implementation when opening a new IndexReader (and 
whether that overhead is per field)
  2) the time overhead of an implementation the first time a specific Filter is 
used on an IndexReader
  3) the time on average that it takes to use a Filter
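
Such reuse might look like the following (a sketch against the Lucene 2.1 API;
the "date" field and the bounds are made-up values):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.*;

public class FilterReuseSketch {
  // The reuse case described above: the same cached Filter applied over and
  // over to the same IndexReader, so bits() is only computed the first time.
  public static void searchTwice(IndexReader reader) throws Exception {
    IndexSearcher searcher = new IndexSearcher(reader);
    Filter dateRange = new CachingWrapperFilter(
        new RangeFilter("date", "20070101", "20071231", true, true));

    Hits warm = searcher.search(new MatchAllDocsQuery(), dateRange); // computes bits()
    Hits hot = searcher.search(new MatchAllDocsQuery(), dateRange);  // reuses cached bits
    searcher.close();
  }
}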




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487897
 ] 

Andy Liu commented on LUCENE-855:
-

Hey Matt, I get this exception when running your newest FCRF with the 
performance test.  Can you check to see if you get this also?

java.lang.ArrayIndexOutOfBoundsException: 10
        at org.apache.lucene.search.FieldCacheRangeFilter$5.get(FieldCacheRangeFilter.java:231)
        at org.apache.lucene.search.IndexSearcher$1.collect(IndexSearcher.java:136)
        at org.apache.lucene.search.Scorer.score(Scorer.java:49)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
        at org.apache.lucene.search.Hits.<init>(Hits.java:53)
        at org.apache.lucene.search.Searcher.search(Searcher.java:46)
        at org.apache.lucene.misc.TestRangeFilterPerformanceComparison$Benchmark.go(TestRangeFilterPerformanceComparison.java:312)
        at org.apache.lucene.misc.TestRangeFilterPerformanceComparison.testPerformance(TestRangeFilterPerformanceComparison.java:201)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at junit.framework.TestCase.runTest(TestCase.java:154)
        at junit.framework.TestCase.runBare(TestCase.java:127)
        at junit.framework.TestResult$1.protect(TestResult.java:106)
        at junit.framework.TestResult.runProtected(TestResult.java:124)
        at junit.framework.TestResult.run(TestResult.java:109)
        at junit.framework.TestCase.run(TestCase.java:118)
        at junit.framework.TestSuite.runTest(TestSuite.java:208)
        at junit.framework.TestSuite.run(TestSuite.java:203)
        at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:128)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-10 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487962
 ] 

Hoss Man commented on LUCENE-855:
-

On Mon, 9 Apr 2007, Otis Gospodnetic (JIRA) wrote:

: I'd love to know what Hoss and other big Filter users think about this.
: Solr makes a lof of use of (Range?)Filters, I believe.

This is one of those Jira issues that I didn't really have time to follow when 
it was first opened, and so the Jira emails have just been piling up waiting 
for me to read.

Here are the raw notes I took as I read through the patches...


FieldCacheRangeFilter.patch  from 10/Apr/07 01:52 PM

 * javadoc cut/paste errors (FieldCache)
 * FieldCacheRangeFilter should work with simple strings
   (using FieldCache.getStrings or FieldCache.getStringIndex)
   just like regular RangeFilter
 * it feels like the various parser versions should be in
   separate subclasses (common abstract base class?)
 * why does clone need to construct a raw BitSet?  what exactly didn't
   work about ChainedFilter without this?
   (could cause other BitSet usage problems)
 * or/and/andNot/xor can all be implemented using convertToBitSet
 * need FieldCacheBitSet methods: cardinality, get(int,int)
 * need equals and hashCode methods in all new classes
 * FieldCacheBitSet.clear should throw UnsupportedOperationException
 * convertToBitSet can be cached.
 * FieldCacheBitSet should be abstract, requiring get(int) be implemented


MemoryCachedRangeFilter_1.4.patch from 06/Apr/07 06:14 AM

 * tuples should be initialized to fieldCache.length ... serious
   ArrayList resizing going on there
   (why is it an ArrayList, why not just Tuples[]?)
 * doesn't cache need synchronization? ... seems like the same
   CreationPlaceholder pattern used in FieldCache might make sense here.
 * this looks wrong...
 } else if ((!includeLower) && (lowerIndex >= 0)) {
   ...consider the case where lower==5, includeLower==false, and all values
   in the index are 5: binary search could leave us in the middle of the
   index, so we still need to move forward to the end? (see the sketch after
   this list)
 * ditto the above concern for finding upperIndex
 * what is the pathological worst case for rewind/forward when *lots* of
   duplicate values are in the index?  should another binarySearch be used?
 * a lot of code in MemoryCachedRangeFilter.bits for finding
   lowerIndex/upperIndex would probably make more sense as methods in
   SortedFieldCache
 * only seems to handle longs; at a minimum it should deal with arbitrary
   strings, with optional add-ons for longs/ints/etc...
 * I can't help but wonder how MemoryCachedRangeFilter would compare if it
   used Solr's OpenBitSet (facaded to implement the BitSet API)
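
For the duplicate-values concern above, the usual fix is to step from wherever
binarySearch lands to the edge of the run of equal values. A sketch, assuming
an ascending long[] of cached values (the linear walk is exactly the
pathological case the note asks about; a second binary search would bound it):

import java.util.Arrays;

public class RangeBoundsSketch {
  // binarySearch may land anywhere inside a run of equal values, so walk
  // to the run's boundary before using the index as a range endpoint.
  static int lowerIndex(long[] sorted, long lower, boolean includeLower) {
    int i = Arrays.binarySearch(sorted, lower);
    if (i < 0) {
      return -i - 1;  // not found: insertion point is the first value > lower
    }
    if (includeLower) {
      while (i > 0 && sorted[i - 1] == lower) i--;          // back up to the first occurrence
    } else {
      while (i < sorted.length && sorted[i] == lower) i++;  // skip past the whole run
    }
    return i;
  }
}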

TestRangeFilterPerformanceComparison.java   from 10/Apr/07

 * I can't help but wonder how RangeFilter would compare if it used Solr's
   OpenBitSet (facaded to implement the BitSet API)
 * no test of includeLower==false or includeUpper==false
 * I don't think the ranges being compared are the same for RangeFilter as
   they are for the other Filters ... note the use of DateTools when building
   the index, vs straight string usage in RangeFilter, vs Long.parseLong in
   MemoryCachedRangeFilter and FieldCacheRangeFilter
 * is it really a fair comparison to call MemoryCachedRangeFilter.warmup
   or FieldCacheRangeFilter.bits outside of the timing code?
   for indexes where the IndexReader is reopened periodically this may
   be a significant number to be aware of.


Questions about the legitimacy of the testing aside...

In general, I like the approach of FieldCacheBitSet -- but it should be 
generalized into an AbstractReadOnlyBitSet where all methods are implemented 
via get(int) in subclasses -- we should make sure that every method in the 
BitSet API works as advertised in Java 1.4.

I don't really like the various hoops FieldCacheRangeFilter has to jump through 
to support int/float/long ... I think at its core it should support simple 
Strings, with alternate/sub classes for dealing with other FieldCache formats 
... I just really dislike all the crazy nested ifs to deal with the different 
Parser types; if there are going to be separate constructors for 
longs/floats/ints, they might as well be separate subclasses.

The really nice thing this has over RangeFilter is that people can index raw 
numeric values without needing to massage them into lexicographically ordered 
Strings (since the FieldCache will take care of parsing them appropriately).

My gut tells me that the MemoryCachedRangeFilter approach will never ever be 
able to compete with the FieldCacheRangeFilter facading-BitSet approach, since 
it needs to build the FieldCache, then the SortedFieldCache, then a BitSet 
... it seems like any optimization of that pipeline can always be beaten by 
using the same logic but then facading the BitSet.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487587
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

OK.  I'll wait for the new performance numbers before committing.  Andy, if you 
see anything funky in Matt's patch or if you managed to make your version 
faster, let us know, please.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487590
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

Comments about the patch so far:
Cosmetics:
- You don't want to refer to Andy's class in javadocs, as that class won't go 
in unless Andy makes it faster.
- I see some incorrect (copy/paste error) javadocs and javadocs/comments with 
typos in both the test classes and non-test classes.
- Please configure your Lucene project in Eclipse to use 2 spaces instead of 4. 
 In general, once you get the code formatting settings right, it's good 
practice to format your code with that setting before submitting a patch.

Testing:
- You can put the testPerformance() code from 
TestFieldCacheRangeFilterPerformance in the other unit test class you have 
there.
- Your testPerformance() doesn't actually assert...() anything, it just prints 
numbers to stdout.  You can keep the printing, but it would be better to also 
do some asserts, so we can always verify that the FCRangeFilter beats the 
vanilla RangeFilter without looking at stdout (see the sketch after this list).
- You may want to close that searcher in testPerformance() before opening a new 
one.  Probably won't make any difference, but still.
- You may also want to just close the searcher at the end of the method.
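
For the assert point, something along these lines would make the test
self-checking (a sketch only; timeSearch() is a hypothetical helper that runs
a query with the given filter and returns elapsed milliseconds):

public void testPerformance() throws Exception {
  long vanillaMs = timeSearch(vanillaRangeFilter);  // hypothetical helper
  long fcrfMs = timeSearch(fieldCacheRangeFilter);  // hypothetical helper
  System.out.println("vanilla=" + vanillaMs + "ms, fcrf=" + fcrfMs + "ms");
  // keep the printout, but also fail loudly if the speedup ever disappears
  assertTrue("FieldCacheRangeFilter should beat RangeFilter", fcrfMs < vanillaMs);
}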


Impl:
- In the inner FieldCacheBitSet class, I see:
+    public boolean intersects(BitSet set) {
+      for (int i = 0; i < length; i++) {
+        if (get(i) && set.get(i)) {
+          return true;
+        }
+      }
+      return false;
+    }

Is there room for a small optimization?  What if the BitSets are not of equal 
size?  Wouldn't it make sense to loop through the smaller BitSet then?  Sorry 
if I'm off, I hardly ever work with BitSets.
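
In code, that optimization might look like this (a sketch, not the patch; 
'length' is the field from the quoted snippet) -- bits past the end of the 
shorter set are necessarily clear, so only scan up to the smaller bound:

public boolean intersects(BitSet set) {
  int limit = Math.min(length, set.length());  // nothing past the shorter set can match
  for (int i = 0; i < limit; i++) {
    if (get(i) && set.get(i)) {
      return true;
    }
  }
  return false;
}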

- I see you made the *_PARSERs in FCImpl public (they were private).  Is that 
really needed?  Would package-protected be enough?

- Make sure ASL is in all test and non-test classes, I don't see it there now.


Overall, I like it - slick and elegant usage of FC!

I'd love to know what Hoss and other big Filter users think about this.  Solr 
makes a lot of use of (Range?)Filters, I believe.



[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-09 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487595
 ] 

Andy Liu commented on LUCENE-855:
-

In your updated benchmark, you're combining the range filter with a term query 
that matches one document.  I don't believe that's the typical use case for a 
range filter.  Usually the user employs a range to filter a large document set.

I created a different benchmark to compare standard range filter, 
MemoryCachedRangeFilter, and Matt's FieldCacheRangeFilter using 
MatchAllDocsQuery, ConstantScoreQuery, and TermQuery (matching one doc like the 
last benchmark).  Here are the results:

Reader opened with 10 documents.  Creating RangeFilters...
RangeFilter w/MatchAllDocsQuery:

  * Bits: 4421
  * Search: 5285

RangeFilter w/ConstantScoreQuery:

  * Bits: 4200
  * Search: 8694

RangeFilter w/TermQuery:

  * Bits: 4088
  * Search: 4133

MemoryCachedRangeFilter w/MatchAllDocsQuery:

  * Bits: 80
  * Search: 1142

MemoryCachedRangeFilter w/ConstantScoreQuery:

  * Bits: 79
  * Search: 482

MemoryCachedRangeFilter w/TermQuery:

  * Bits: 73
  * Search: 95

FieldCacheRangeFilter w/MatchAllDocsQuery:

  * Bits: 0
  * Search: 1146

FieldCacheRangeFilter w/ConstantScoreQuery:

  * Bits: 1
  * Search: 356

FieldCacheRangeFilter w/TermQuery:

  * Bits: 0
  * Search: 19

Here are some points:

1. When searching with a filter, bits() is called, so the search time includes 
bits() time.
2. Matt's FieldCacheRangeFilter is faster for ConstantScoreQuery, although not 
by much.  Using MatchAllDocsQuery, they run neck and neck.  FCRF is much faster 
for TermQuery since MCRF has to create the BitSet for the range before the 
search is executed.
3. I get less document hits when running FieldCacheRangeFilter with 
ConstantScoreQuery.  Matt, there may be a bug in getNextSetBit().  Not sure if 
this would affect the benchmark.
4. I'd be interested to see performance numbers when FieldCacheRangeFilter is 
used with ChainedFilter.  I suspect that MCRF would be faster in this case, 
since I'm assuming that FCRF has to reconstruct a standard BitSet during 
clone().


[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-06 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487378
 ] 

Andy Liu commented on LUCENE-855:
-

Hey Matt,

The way you implemented FieldCacheRangeFilter is very simple and clever!  
Here are a couple of comments:

1. My performance test that we both used is no longer valid, since 
FieldCacheRangeFilter.bits() only returns a wrapper around a BitSet.  The test 
only calls bits().  Since you're wrapping BitSet, there's some overhead 
incurred when applying it to an actual search.  I reran the performance test 
applying the Filter to a search, and your implementation is still faster, 
although only slightly.

2. Your filter currently doesn't work with ConstantScoreRangeQuery.  CRQ calls 
bits.nextSetBit(), which fails in your wrapped BitSet implementation.  Your 
incomplete implementation of BitSet may cause problems elsewhere.

If you can fix #2, I'd vote for your implementation since it's cleaner and 
faster, although I might take another stab at trying to improve my 
implementation.
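
For #2, the simplest (if not the fastest) way to round out the wrapper would
be to express nextSetBit() in terms of get(), e.g. (a sketch under the same
facade idea, not code from the patch):

// ConstantScoreQuery-style iteration needs nextSetBit(); on a facade that
// never materializes bits it can be derived from get().
public int nextSetBit(int fromIndex) {
  for (int i = fromIndex; i < length(); i++) {
    if (get(i)) {
      return i;
    }
  }
  return -1;  // java.util.BitSet contract: -1 when no further bits are set
}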




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-06 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487380
 ] 

Matt Ericson commented on LUCENE-855:
-

I will be happy to fix #2, or at least to try to fix #2.

The test had the real work done outside the timing code.

The other thing I like about it is that there is less data saved in the cache.  
Some of our indexes are 10 gigs, so every byte counts, at least in my 
application.







[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-05 Thread Matt Ericson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12487108
 ] 

Matt Ericson commented on LUCENE-855:
-

I am almost done with my patch and I wanted to test it against this patch to 
see who has the faster version.  But the MemoryCachedRangeFilter is written 
using Java 1.5, and as far as I know Lucene is still on Java 1.4.

Lines like

private static WeakHashMap<IndexReader, Map<String, SortedFieldCache>> cache = 
    new WeakHashMap<IndexReader, Map<String, SortedFieldCache>>();

will not compile in Java 1.4.  Andy, I would love to see who has the faster 
patch; if you would convert your patch to use Java 1.4, I would be happy to 
put them side by side.
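
For what it's worth, the 1.4-friendly equivalent of that declaration just
drops the generics, losing only the compile-time type checking (the keys are
still IndexReaders and the values Maps of field name to SortedFieldCache):

// Java 1.4 version: raw types instead of generics (java.util.WeakHashMap).
private static WeakHashMap cache = new WeakHashMap();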




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486758
 ] 

Otis Gospodnetic commented on LUCENE-855:
-

A colleague of mine is working on something similar, but possibly more 
efficient (less sorting and binary searching).  He'll probably attach his patch 
to this issue.





[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486763
 ] 

Yonik Seeley commented on LUCENE-855:
-

There is also something from Mark Harwood:
https://issues.apache.org/jira/browse/LUCENE-798




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486767
 ] 

Andy Liu commented on LUCENE-855:
-

Otis, looking forward to your colleague's patch.

LUCENE-798 caches RangeFilters so that if the same exact range is executed 
again, the cached RangeFilter is used.  However, the first time a range is 
encountered, you'll still have to calculate the RangeFilter, which can be slow. 
 I haven't looked at the patch, but I'm sure LUCENE-798 can be used in 
conjunction with MemoryCachedRangeFilter to further boost performance for 
repeated range queries.




[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486788
 ] 

Yonik Seeley commented on LUCENE-855:
-

 LUCENE-798 caches RangeFilters so that if the same exact range is executed 
 again [...]

It's not just the exact same range though... it can reuse parts of ranges AFAIK.






[jira] Commented: (LUCENE-855) MemoryCachedRangeFilter to boost performance of Range queries

2007-04-04 Thread Andy Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486791
 ] 

Andy Liu commented on LUCENE-855:
-

Ah, you're right.  I didn't read closely enough!
