RE: Filtering question

Uwe Schindler Thu, 12 Mar 2015 11:46:57 -0700

Hi Chris,


> Hi Uwe, thanks for your suggestions.  I have tried a couple of things with no
> luck yet:
> 
> > Sorry,
> > I just noticed, you are using TermFilter not TermsFilter: This one
> > does not support random access (using bits()). Because of this the
> > filtered docs cannot be passed down using acceptDocs.
> >
> TermsFilter made no difference, still no acceptDocs passed to the filter.

I know; the problem is that a TermsFilter with a single term behaves like a
TermFilter :-) In any case, you could use CachingWrapperFilter to force the
creation of a bitset, but I don't think it's worth the hassle. It is still
strange, but I have not dug into it.

> > In addition, the SHOULD clause
> > forces the ConstantScoreQuery to try all documents, because there is
> > nothing else that could drive the query.
> >
> As an experiment I tried MUST, this didn't help either.

I checked your implementation: you just create a BitSet, so switching to MUST
won't help. Please look at how other DocValues filters, such as
DocValuesRangeFilter, implement the iterator. Creating a BitSet is pure
overhead, and while building it you have no chance to take other query
constraints into account (because the bitset is built *before* the query is
executed).

Instead, you should implement a custom DocIdSet. Lucene 4.10 offers
FieldCacheDocIdSet as a base class; in 5.0 it was renamed to DocValuesDocIdSet.
You only need to implement the abstract matchDoc() method there. The base class
handles everything else correctly for you, including acceptDocs and advance().
It does not build a bitset; it evaluates matchDoc() on the fly for each
candidate document. You just have to put the matching logic into
matchDoc(int docId).
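
To make the idea concrete, here is a minimal sketch in plain Java. It is a toy
model and not Lucene code: the class MatchDocIterator, its fields, and the
IntPredicate stand-ins for matchDoc() and acceptDocs are all hypothetical names
chosen for illustration. It only shows the principle: nothing is precomputed,
and matchDoc is evaluated only for documents that are actually visited.

```java
import java.util.function.IntPredicate;

// Toy model, NOT Lucene code: mimics how a matchDoc()-based DocIdSet can
// iterate lazily instead of filling a bitset up front.
final class MatchDocIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;              // number of documents in the segment
    private final IntPredicate matchDoc;   // stand-in for the abstract matchDoc(int)
    private final IntPredicate acceptDocs; // stand-in for acceptDocs; null accepts all
    private int doc = -1;                  // current position, -1 before iteration
    int matchDocCalls = 0;                 // how often matchDoc was evaluated

    MatchDocIterator(int maxDoc, IntPredicate matchDoc, IntPredicate acceptDocs) {
        this.maxDoc = maxDoc;
        this.matchDoc = matchDoc;
        this.acceptDocs = acceptDocs;
    }

    // Moves to the next matching document after the current one.
    int nextDoc() {
        return doc == NO_MORE_DOCS ? doc : advance(doc + 1);
    }

    // Jumps to the first matching document >= target. Documents before
    // target are never inspected; this is what lets another clause "drive".
    int advance(int target) {
        for (doc = Math.max(target, 0); doc < maxDoc; doc++) {
            if (acceptDocs != null && !acceptDocs.test(doc)) {
                continue; // rejected by acceptDocs: matchDoc is not even called
            }
            matchDocCalls++;
            if (matchDoc.test(doc)) {
                return doc;
            }
        }
        doc = NO_MORE_DOCS;
        return doc;
    }

    public static void main(String[] args) {
        String[] values = {"foo", "bar", "foo", "baz", "foo", "bar"};
        MatchDocIterator it = new MatchDocIterator(
                values.length, d -> "foo".equals(values[d]), d -> d != 2);
        System.out.println(it.nextDoc());
        System.out.println(it.advance(3));
        System.out.println("matchDoc calls: " + it.matchDocCalls);
    }
}
```

In the real base class you would only supply the body of matchDoc(int docId);
the iteration logic sketched here is what the base class takes care of.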

> > An alternative approach would be (in Lucene 4.10 or 5.0) to add the
> > TermFilter as ConstantScoreFilter(TermQuery) with boost=0 to the
> > BooleanQuery. In that case it can drive the query and does not affect
> > scoring. In later Lucene versions you may use the new
> > BooleanQuery.Occur type "FILTER" which can add any query as filter.
> > Filters will be deprecated once this is ready.
> >
> This is interesting and I will try it when I get a chance.

I meant ConstantScoreQuery, not ConstantScoreFilter. But you still need to
implement your own DocIdSetIterator with a proper DocIdSetIterator.advance(),
otherwise it won't help (see above).

> >> My goal is to slowly transform a particular field from StringField to
> >> BinaryDocValues so that during the transition a doc may hold the
> >> value either in the old location or the new. Therefore a query must
> >> be able to say
> >>     oldField:"foo" OR newField:"foo"
> >> Where oldField is a StringField and newField is a BinaryDocValues.
> >
> > Why do you want to do this?
> >
> Good question!  In our architecture we build indexes by pulling data from
> several sources and it is _expensive_.  Increasingly we are requested to
> change one or two fields which currently requires a full re-index of the doc.
> When I attended the Dublin Lucene conference I spoke to Shai Erera about
> this problem and he pointed me at DocValues which allow you to update
> fields without incurring the full doc reindex cost.  That is the appeal for 
> us.
> As I said before, we want to transform docs only as they are updated, where
> transformation involves dropping the old TextField and creating a new
> BinaryDocValuesField containing the same value.  Hence the need for the
> query to be able to search 'old OR new'.
> 
> > If you want to query like this on the field, it is a bad idea to use
> > DocValues.
> >
> Why is it a bad idea?

Indeed, DocValues are updatable. But they have a downside: they give you no way
to ask the index for a term and get back the documents that contain it (that is
the job of the inverted index, the reason why we use Lucene!). DocValues are
just a large array with random access. If you want to query on them, you have
to brute-force, unless something else in the query structure can "drive" your
query (by calling advance() on the filter's iterator). In a BooleanQuery
consisting of 2 SHOULD clauses, nothing can drive the query, so the only
possibility is a full doc-by-doc scan of the docvalues.
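
A rough illustration of the difference, again as a toy model in plain Java and
not Lucene code (DocValuesScanSketch, should(), must() and the lookups counter
are hypothetical names for this sketch). The inverted index maps a term
directly to its documents, while the docvalues column can only be checked one
document at a time. With an OR, every document has to be checked; with an AND,
the posting list of the other clause drives and only its candidates are checked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model, NOT Lucene code: contrasts a term lookup in an inverted index
// with a per-document check against a docvalues-like column.
final class DocValuesScanSketch {
    private final Map<String, int[]> inverted; // term -> sorted doc ids
    private final String[] docValues;          // docValues[docId], null if absent
    int lookups = 0;                           // per-document docvalues reads

    DocValuesScanSketch(Map<String, int[]> inverted, String[] docValues) {
        this.inverted = inverted;
        this.docValues = docValues;
    }

    // oldField:term OR newField:term. Nothing restricts the docvalues
    // clause, so every document in the column has to be inspected.
    List<Integer> should(String term) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (int d : inverted.getOrDefault(term, new int[0])) {
            hits.add(d); // direct lookup, no scan needed for the indexed field
        }
        for (int d = 0; d < docValues.length; d++) {
            lookups++;
            if (term.equals(docValues[d])) {
                hits.add(d);
            }
        }
        return new ArrayList<>(hits);
    }

    // otherField:driverTerm AND newField:term. The posting list drives,
    // so the docvalues column is read only for the candidate documents.
    List<Integer> must(String driverTerm, String term) {
        List<Integer> hits = new ArrayList<>();
        for (int d : inverted.getOrDefault(driverTerm, new int[0])) {
            lookups++;
            if (term.equals(docValues[d])) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

The lookups counter grows with the total number of documents in should(), but
only with the length of the driving posting list in must().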

Uwe


