[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-12 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13495353#comment-13495353
 ] 

David Smiley commented on LUCENE-4548:
--

I'm going to use FilteredQuery with the strategy indicators -- perfect.  I was 
using ChainedFilter but I'll back that out.  I won't need BooleanFilter since 
I've just got clauses to AND, and FilteredQuery accomplishes that.

 BooleanFilter should optionally pass down further restricted acceptDocs in 
 the MUST case (and acceptDocs in general)
 

 Key: LUCENE-4548
 URL: https://issues.apache.org/jira/browse/LUCENE-4548
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Uwe Schindler
 Attachments: LUCENE-4548.patch


 Spin-off from dev@lao:
 {quote}
 bq. I am about to write a Filter that only operates on a set of documents 
 that have already passed other filter(s).  It's rather expensive, since it 
 has to use DocValues to examine a value and then determine if it's a match.  
 So it scales O(n) where n is the number of documents it must see.  The 2nd 
 arg of getDocIdSet is Bits acceptDocs.  Unfortunately Bits doesn't have an 
 int iterator but I can deal with that seeing if it extends DocIdSet.
 bq. I'm looking at BooleanFilter which I want to use and I notice that it 
 passes null to filter.getDocIdSet for acceptDocs, and it justifies this with 
 the following comment:
 bq. // we dont pass acceptDocs, we will filter at the end using an additional 
 filter
 the idea of passing the already built bits for the MUST is a good idea and 
 can be implemented easily.
 The reason why the acceptDocs were not passed down is the new way filters 
 work in Lucene 4.0, and to optimize caching. Accept docs are the only 
 thing that changes when deletions are applied, and filters are required to 
 handle them separately: whenever something is able to cache (e.g. 
 CachingWrapperFilter), the acceptDocs are not cached, so the underlying 
 filters get a null acceptDocs to produce the full bitset, and the filtering is 
 done when CachingWrapperFilter gets the "up-to-date" acceptDocs. In this 
 case it does not matter if the first filter clause does not get acceptDocs, 
 but later MUST clauses of course can get them (they are not 
 deletion-specific)!
 Can you open an issue to optimize the MUST case (possibly MUST_NOT, too)?
 Another thing that could help here: You can stop using BooleanFilter if you 
 can apply the filters sequentially (only MUST clauses) by wrapping with 
 multiple FilteredQuery: new FilteredQuery(new FilteredQuery(originalQuery, 
 clause1), clause2). If the DocIdSets enable bits() and the FilteredQuery 
 autodetection decides to use random access filters, the acceptdocs are also 
 passed down from the outside to the inner, removing the documents filtered 
 out.
 {quote}
 Maybe BooleanFilter should have 2 modes (Boolean ctor argument): passing down 
 the acceptDocs to every filter (for the case where Filter calculation is 
 expensive and accept docs help to limit the calculations), or not passing them 
 down (more efficient if the filter is cheap, e.g. when the Filter is only a 
 cached bitset, and the repeated acceptDocs bit checks in every single filter 
 would cost more than they save). The first mode would also optimize the 
 MUST/MUST_NOT case to pass down the further restricted acceptDocs on later 
 filters (just like FilteredQuery does).
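
The two modes described above can be illustrated with a minimal, stand-alone sketch. It deliberately does not use Lucene classes: DocFilter, CountingFilter and andMust are hypothetical names, and java.util.BitSet stands in for both the acceptDocs Bits and the resulting DocIdSet. It is a model of the idea under those assumptions, not the code of the attached patch.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

/** Stand-alone model of AND-ing MUST clauses; all names are hypothetical, not Lucene API. */
public class AcceptDocsSketch {

    /** Stand-in for a Filter: returns matching docs, optionally limited to acceptDocs. */
    interface DocFilter {
        BitSet getDocIdSet(int maxDoc, BitSet acceptDocs); // null acceptDocs means "all docs"
    }

    /** An "expensive" per-document check that counts how many documents it examined. */
    static final class CountingFilter implements DocFilter {
        private final IntPredicate matches;
        int examined = 0;

        CountingFilter(IntPredicate matches) {
            this.matches = matches;
        }

        @Override
        public BitSet getDocIdSet(int maxDoc, BitSet acceptDocs) {
            BitSet result = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc++) {
                if (acceptDocs != null && !acceptDocs.get(doc)) {
                    continue; // cheap bit check instead of the expensive examination
                }
                examined++;
                if (matches.test(doc)) {
                    result.set(doc);
                }
            }
            return result;
        }
    }

    /**
     * ANDs all clauses. With passDown, the first clause sees liveDocs and every later
     * clause sees the already restricted result. Without it, every clause gets null
     * and liveDocs is applied once at the end.
     */
    static BitSet andMust(List<? extends DocFilter> clauses, int maxDoc,
                          BitSet liveDocs, boolean passDown) {
        BitSet result = null;
        for (DocFilter clause : clauses) {
            BitSet accept = null;
            if (passDown) {
                accept = (result == null) ? liveDocs : result;
            }
            BitSet bits = clause.getDocIdSet(maxDoc, accept);
            if (result == null) {
                result = bits;
            } else {
                result.and(bits);
            }
        }
        if (result == null) {
            return new BitSet(maxDoc); // no clauses: nothing matches in this sketch
        }
        if (!passDown && liveDocs != null) {
            result.and(liveDocs); // filter deletions at the end
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet liveDocs = new BitSet(100);
        liveDocs.set(0, 100);
        liveDocs.clear(6); // pretend doc 6 is deleted
        for (boolean passDown : new boolean[] {true, false}) {
            CountingFilter even = new CountingFilter(d -> d % 2 == 0);
            CountingFilter third = new CountingFilter(d -> d % 3 == 0);
            BitSet hits = andMust(Arrays.asList(even, third), 100, liveDocs, passDown);
            System.out.println("passDown=" + passDown + " hits=" + hits.cardinality()
                + " examined=" + even.examined + "+" + third.examined);
        }
    }
}
```

Both modes return the same documents; they differ only in how many documents each clause has to examine, which is the trade-off between expensive and cheap filters discussed above.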

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494855#comment-13494855
 ] 

Uwe Schindler commented on LUCENE-4548:
---

bq. My broad comments on this having looked at a variety of these classes, is 
that the whole situation is very confusing. There are a bunch of classes here 
related to filtering that if you consider the sum total of them, it seems like 
a bit much to get a handle on: Filter, ChainedFilter, BooleanFilter, 
FilteredQuery, FilteredDocIdSet, BitsFilteredDocIdSet. I'm probably missing 
some. And then of course Filter != Query but sometimes they need to be adapted 
to each other. I bet there are a dozen ways I could skin this cat. That's a 
problem.

You are mixing user-facing classes and internal @lucene.internal classes!

My general preference would be to nuke Filters completely from Lucene and make 
everything a Query (this is how Solr handles the stuff, too). A filter is just 
a Query with a constant score. Those queries could optionally use a Bitset for 
matching...

Some comments:
- BitsFilteredDocIdSet, FilteredDocIdSet: These are just helper classes to avoid 
repeating the same stuff everywhere in Lucene. Users never face them.
- FilteredQuery is *the one and only approach* to apply filters in recent Lucene 
versions! Since Lucene 4.0, IndexSearcher.search(Query, Filter) just wraps the 
Query and Filter with FilteredQuery; there is no Filter logic in 
IndexSearcher anymore! IndexSearcher.search(Query, Filter) is just a 
convenience method and an alias for IndexSearcher.search(new FilteredQuery(Query, 
Filter))!
- ChainedFilter should be deprecated; this class is so broken. It also still 
uses the outdated OpenBitSet. At least we should move it to sandbox. E.g., to chain 
and'ed filters just use new FilteredQuery(new FilteredQuery(query, filter1), 
filter2) or use BooleanFilter.
- BooleanFilter may be useful, but I don't really like it. Once Filters 
and Queries are the same class, one could use BooleanQuery to achieve the same with 
constant score queries. BooleanFilter is also inconsistent with BooleanQuery 
when it comes to pure negative clauses!
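
The nesting and the "convenience method is just an alias" points in the list above can be modeled in a few lines. This is a stand-alone sketch that does not depend on Lucene: MatchSource, Filtered, search and bits are hypothetical names, and java.util.BitSet stands in for the matching documents of a query or filter. It only illustrates the composition pattern, not how FilteredQuery is implemented.

```java
import java.util.BitSet;

/** Stand-alone model of nesting filtered queries; names are hypothetical, not Lucene API. */
public class ChainedFilterSketch {

    /** Stand-in for a Query: yields the set of matching doc IDs. */
    interface MatchSource {
        BitSet matches(int maxDoc);
    }

    /** Wraps a source and keeps only the documents that the filter bits also contain. */
    static final class Filtered implements MatchSource {
        private final MatchSource inner;
        private final BitSet filterBits;

        Filtered(MatchSource inner, BitSet filterBits) {
            this.inner = inner;
            this.filterBits = filterBits;
        }

        @Override
        public BitSet matches(int maxDoc) {
            // Clone so that the inner source's result is never mutated.
            BitSet result = (BitSet) inner.matches(maxDoc).clone();
            result.and(filterBits);
            return result;
        }
    }

    /** The only "real" search entry point in this model. */
    static BitSet search(MatchSource query, int maxDoc) {
        return query.matches(maxDoc);
    }

    /** Convenience overload: it merely wraps the query and delegates. */
    static BitSet search(MatchSource query, BitSet filterBits, int maxDoc) {
        return search(new Filtered(query, filterBits), maxDoc);
    }

    /** Small helper to build a BitSet from doc IDs. */
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    public static void main(String[] args) {
        MatchSource query = maxDoc -> bits(1, 2, 3, 4, 5);
        BitSet filter1 = bits(2, 3, 4, 8);
        BitSet filter2 = bits(3, 4, 9);
        // AND-ing two filters by nesting the wrapper twice.
        MatchSource chained = new Filtered(new Filtered(query, filter1), filter2);
        System.out.println(search(chained, 10));
    }
}
```

In this model the outer wrapper only ever sees documents that already passed the inner one, which is why nesting is enough when all clauses are AND-ed.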


[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13494858#comment-13494858
 ] 

Eks Dev commented on LUCENE-4548:
-

...would be to nuke Filters completely from Lucene ...

User +1

Conceptually, a Filter is nothing more than a non-scoring clause with the 
possibility of an implementation that can be cached. 

From the user API point of view, there is really no need to bother users with 
the Filter abstraction. Both are just attributes of the query (do you 
need to score this clause, or would you like to have it cached?). 



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13495118#comment-13495118
 ] 

David Smiley commented on LUCENE-4548:
--

Uwe: Thanks tremendously for helping me understand some of these things.

I never looked at FilteredQuery before, but now that I have I like it a whole 
lot.  I like that I can compose them and pick the filter strategy.  That 
pretty much addresses one of my concerns about how to compose them given different 
algorithms, and the solution looks very well designed.

It's especially interesting to me that Filter logic isn't in IndexSearcher 
anymore.  Not that I knew before, but what this basically tells me is that 
Lucene internally just deals with Queries, not Filters.  That's fantastic.  
Based on your remarks, it seems there is a lot of cleanup potential.  There is 
a lot of stuff to confuse people, like me, and that was a big part of my 
concern.

RE OpenBitSet -- it's not deprecated nor marked as internal.  FixedBitSet on the 
other hand is marked as internal.



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-11 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13495131#comment-13495131
 ] 

Uwe Schindler commented on LUCENE-4548:
---

bq. RE OpenBitSet -- it's not deprecated nor marked as internal. FixedBitSet on 
the other hand is marked as internal.

We should fix this! The problem here is that FixedBitSet is meant as a 
replacement and performs better because of fewer checks. The problem with 
ChainedFilter is that it uses some features not yet in FixedBitSet, like XORing 
the bitsets. The use of OpenBitSet instead of FixedBitSet makes some 
optimizations inside FixedBitSet harder (e.g. FixedBitSet.or(DocIdSetIterator) 
checks if the DISI comes from a FixedBitSet and then directly ORs the 
bits instead of using the iterator, provided the iterator is still positioned 
before the first document; fortunately, both OBS and FBS have the same iterator 
implementation).
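
The fast path described above can be sketched without Lucene. The classes below (SimpleBits, SimpleIterator) and the or method are hypothetical stand-ins, not the real FixedBitSet or DocIdSetIterator; the sketch assumes the source is no larger than the target. It shows why the word-wise OR is only valid while the iterator has not been advanced: an iterator that has already moved must contribute only its remaining documents.

```java
/** Stand-alone model of an or(iterator) fast path; names are hypothetical, not Lucene API. */
public class BitSetOrSketch {

    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal fixed-size bit set backed by 64-bit words. */
    static final class SimpleBits {
        final long[] words;
        final int numBits;

        SimpleBits(int numBits) {
            this.numBits = numBits;
            this.words = new long[(numBits + 63) >>> 6];
        }

        void set(int i) {
            words[i >>> 6] |= 1L << (i & 63);
        }

        boolean get(int i) {
            return (words[i >>> 6] & (1L << (i & 63))) != 0;
        }

        SimpleIterator iterator() {
            return new SimpleIterator(this);
        }
    }

    /** Iterator over set bits; doc == -1 means "before the first document". */
    static final class SimpleIterator {
        final SimpleBits source;
        int doc = -1;

        SimpleIterator(SimpleBits source) {
            this.source = source;
        }

        int nextDoc() {
            if (doc == NO_MORE_DOCS) {
                return doc;
            }
            for (int i = doc + 1; i < source.numBits; i++) {
                if (source.get(i)) {
                    doc = i;
                    return doc;
                }
            }
            doc = NO_MORE_DOCS;
            return doc;
        }
    }

    /**
     * ORs the iterator's remaining docs into target.
     * Returns true if the word-wise fast path was taken, false if it iterated.
     */
    static boolean or(SimpleBits target, SimpleIterator it) {
        if (it.source.numBits > target.numBits) {
            throw new IllegalArgumentException("source must not be larger than target");
        }
        if (it.doc == -1) {
            // Untouched iterator: every set bit is still pending, so OR whole words.
            for (int w = 0; w < it.source.words.length; w++) {
                target.words[w] |= it.source.words[w];
            }
            it.doc = NO_MORE_DOCS; // the iterator is now exhausted
            return true;
        }
        // Already advanced: only the remaining docs may be added, one by one.
        for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
            target.set(d);
        }
        return false;
    }
}
```

A caller that has already consumed part of the iterator therefore falls back to the per-document loop, and documents before the current position are not added.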

Have you looked at my attached patch? It should fix your problems, but I don't 
want to commit it now; it is just for testing. Or do you use FilteredQuery now for 
what you intended to do?



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493090#comment-13493090
 ] 

Robert Muir commented on LUCENE-4548:
-

I think this dirties up the API: think about cases where the filter isn't 
random access.

I think it would be best if this stuff stayed confined to FilteredQuery.



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493109#comment-13493109
 ] 

Uwe Schindler commented on LUCENE-4548:
---

I am talking about the contrib BooleanFilter; FilteredQuery is not affected at 
all.



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493112#comment-13493112
 ] 

Robert Muir commented on LUCENE-4548:
-

So am I. Please re-read my comment.

I think it's bad enough that this makes FilteredQuery's API complex, but it's 
over the top for this stuff to start *ALSO* making the APIs of concrete filters 
complex: it should stay confined to FilteredQuery.

 BooleanFilter should optionally pass down further restricted acceptDocs in 
 the MUST case (and acceptDocs in general)
 

 Key: LUCENE-4548
 URL: https://issues.apache.org/jira/browse/LUCENE-4548
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Uwe Schindler

 Spin-off from dev@lao:
 {quote}
 bq. I am about to write a Filter that only operates on a set of documents 
 that have already passed other filter(s).  It's rather expensive, since it 
 has to use DocValues to examine a value and then determine if its a match.  
 So it scales O(n) where n is the number of documents it must see.  The 2nd 
 arg of getDocIdSet is Bits acceptDocs.  Unfortunately Bits doesn't have an 
 int iterator but I can deal with that seeing if it extends DocIdSet.
 bq. I'm looking at BooleanFilter which I want to use and I notice that it 
 passes null to filter.getDocIdSet for acceptDocs, and it justifies this with 
 the following comment:
 bq. // we dont pass acceptDocs, we will filter at the end using an additional 
 filter
 The idea of passing the already built bits for the MUST clauses is a good one 
 and can be implemented easily.
 The acceptDocs were not passed down because of the new way filters work in 
 Lucene 4.0, and to optimize caching. Accept docs are the only thing that 
 changes when deletions are applied, and filters are required to handle them 
 separately: whenever something is able to cache (e.g. CachingWrapperFilter), 
 the acceptDocs are not cached, so the underlying filters get a null acceptDocs 
 to produce the full bitset, and the filtering is done when 
 CachingWrapperFilter gets the up-to-date acceptDocs. In this case, though, it 
 does not matter if the first filter clause does not get acceptDocs; later 
 MUST clauses can of course get them (they are not deletion-specific)!
 Can you open an issue to optimize the MUST case (possibly MUST_NOT, too)?
 Another thing that could help here: you can stop using BooleanFilter if you 
 can apply the filters sequentially (only MUST clauses) by wrapping with 
 multiple FilteredQuery instances: new FilteredQuery(new 
 FilteredQuery(originalQuery, clause1), clause2). If the DocIdSets enable 
 bits() and the FilteredQuery autodetection decides to use random access 
 filters, the acceptDocs are also passed down from the outside to the inner, 
 removing the documents filtered out.
 {quote}
 Maybe BooleanFilter should have 2 modes (Boolean ctor argument): passing down 
 the acceptDocs to every filter (for the case where Filter calculation is 
 expensive and accept docs help to limit the calculations), or not passing 
 them down (more effective when the filter is cheap and the multiple 
 acceptDocs bit checks for every single filter are more expensive, e.g. when 
 the Filter is only a cached bitset). The first mode would also optimize the 
 MUST/MUST_NOT case by passing down the further restricted acceptDocs to later 
 filters (just like FilteredQuery does).
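
 To make the two proposed modes concrete, here is a minimal sketch that models 
 acceptDocs with java.util.BitSet instead of Lucene's Bits. CountingFilter, 
 MustClausesSketch and the passDown flag are hypothetical names made up for 
 this illustration; none of this is BooleanFilter's actual code or API.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a Filter; NOT a Lucene class.
final class CountingFilter {
    private final IntPredicate matches;
    int examined = 0; // how many documents this filter had to look at

    CountingFilter(IntPredicate matches) {
        this.matches = matches;
    }

    // acceptDocs == null means "every document is acceptable".
    BitSet getDocIdSet(int maxDoc, BitSet acceptDocs) {
        BitSet result = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (acceptDocs != null && !acceptDocs.get(doc)) {
                continue; // skipped without running the (possibly costly) check
            }
            examined++;
            if (matches.test(doc)) {
                result.set(doc);
            }
        }
        return result;
    }
}

final class MustClausesSketch {
    private static BitSet allDocs(int maxDoc) {
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);
        return all;
    }

    // Intersects all MUST clauses; passDown selects between the two modes.
    static BitSet intersect(List<CountingFilter> clauses, int maxDoc,
                            BitSet acceptDocs, boolean passDown) {
        if (passDown) {
            // Mode 1: each clause only sees what survived the previous ones.
            BitSet restricted = acceptDocs;
            for (CountingFilter clause : clauses) {
                restricted = clause.getDocIdSet(maxDoc, restricted);
            }
            return restricted != null ? restricted : allDocs(maxDoc);
        }
        // Mode 2: every clause sees null; acceptDocs are applied once at the end.
        BitSet result = allDocs(maxDoc);
        for (CountingFilter clause : clauses) {
            result.and(clause.getDocIdSet(maxDoc, null));
        }
        if (acceptDocs != null) {
            result.and(acceptDocs);
        }
        return result;
    }

    public static void main(String[] args) {
        CountingFilter cheap = new CountingFilter(doc -> doc % 10 == 0);
        CountingFilter costly = new CountingFilter(doc -> doc % 20 == 0);
        BitSet hits = intersect(Arrays.asList(cheap, costly), 100, null, true);
        System.out.println(hits + " costly.examined=" + costly.examined);
    }
}
```

 With 100 documents, the second clause in pass-down mode examines only the 10 
 documents that survived the first clause; in the other mode it examines all 
 100, and both modes return the same set.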

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-08 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493130#comment-13493130
 ] 

Uwe Schindler commented on LUCENE-4548:
---

I agree with you: in my opinion, BooleanFilter is somewhat obsolete (at least 
for the MUST case), because you can easily chain FilteredQuery to combine 
multiple MUST filters. David Smiley can use this, and the issue would be a 
no-op. He could additionally force FilteredQuery to use the random access mode 
(using the new 4.1 FilterModes), so acceptDocs are passed down in all cases.

The question is then (for this issue): Should we pass the acceptDocs down the 
BooleanFilter chain of clauses or apply them at the very end of BooleanFilter 
bitset creation (like we do currently)?
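
As a rough illustration of the chaining idea, the toy model below mimics how 
nested wrappers hand restricted acceptDocs from the outside to the inner when 
random access is used. ToyQuery, ToyFilter, ToyFilteredQuery and 
MatchAllToyQuery are hypothetical classes written for this sketch; they are 
not Lucene's FilteredQuery, and real scoring and iteration are left out.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Toy query: returns matching doc ids restricted to acceptDocs (null = all).
interface ToyQuery {
    BitSet match(int maxDoc, BitSet acceptDocs);
}

// Toy filter that records how many documents it was asked to check.
final class ToyFilter {
    private final IntPredicate predicate;
    int checked = 0;

    ToyFilter(IntPredicate predicate) {
        this.predicate = predicate;
    }

    BitSet bits(int maxDoc, BitSet acceptDocs) {
        BitSet out = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (acceptDocs != null && !acceptDocs.get(doc)) {
                continue; // already filtered out by an outer wrapper
            }
            checked++;
            if (predicate.test(doc)) {
                out.set(doc);
            }
        }
        return out;
    }
}

// Wrapper in the spirit of "random access" mode: the filter's bits become
// the acceptDocs of the wrapped query.
final class ToyFilteredQuery implements ToyQuery {
    private final ToyQuery inner;
    private final ToyFilter filter;

    ToyFilteredQuery(ToyQuery inner, ToyFilter filter) {
        this.inner = inner;
        this.filter = filter;
    }

    @Override
    public BitSet match(int maxDoc, BitSet acceptDocs) {
        BitSet restricted = filter.bits(maxDoc, acceptDocs);
        return inner.match(maxDoc, restricted);
    }
}

final class MatchAllToyQuery implements ToyQuery {
    @Override
    public BitSet match(int maxDoc, BitSet acceptDocs) {
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);
        if (acceptDocs != null) {
            all.and(acceptDocs);
        }
        return all;
    }
}
```

In this model the outermost filter runs first, so an expensive filter would go 
innermost: new ToyFilteredQuery(new ToyFilteredQuery(query, expensive), cheap) 
lets the expensive filter check only the documents the cheap one accepted.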





[jira] [Commented] (LUCENE-4548) BooleanFilter should optionally pass down further restricted acceptDocs in the MUST case (and acceptDocs in general)

2012-11-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493456#comment-13493456
 ] 

David Smiley commented on LUCENE-4548:
--

My broad comment on this, having looked at a variety of these classes, is that 
the whole situation is very confusing.  There are a bunch of filtering-related 
classes here, and taken together they seem like a bit much to get a handle on: 
Filter, ChainedFilter, BooleanFilter, FilteredQuery, FilteredDocIdSet, 
BitsFilteredDocIdSet.  I'm probably missing some.  And then of course 
Filter != Query, but sometimes they need to be adapted to each other.  I bet 
there are a dozen ways I could skin this cat (?).  That's a problem.

Looking ahead to Lucene 5, can we think of a smaller set of classes for filters 
and for chaining them (with AND, OR, ...), annotating which are expensive, and 
considering random access vs. doc id iteration?  What the title of this issue 
proposes seems like a band-aid for the API complexity that probably makes it 
worse.

