RE: Filtering question

Uwe Schindler Mon, 16 Mar 2015 05:49:42 -0700

Hi,

http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/FieldCacheDocIdSet.html


You just have to implement the "protected boolean matchDoc(int docId)" method. 
You should return this DocIdSet from your filter instead of the manual code you 
created.
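
As a rough illustration of the idea only: the sketch below is a plain-Java
model, not Lucene code. The class names (LazyDocIdSet, ValueEqualsDocIdSet),
the String[] standing in for a doc-values column, and the IntPredicate standing
in for acceptDocs are all hypothetical. It shows how a set can answer advance()
by calling matchDoc() on the fly, without building a bitset first.

```java
import java.util.function.IntPredicate;

// Hypothetical stand-in for the idea behind FieldCacheDocIdSet; NOT Lucene's class.
abstract class LazyDocIdSet {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int maxDoc;
    private final IntPredicate acceptDocs; // null = accept every document

    LazyDocIdSet(int maxDoc, IntPredicate acceptDocs) {
        this.maxDoc = maxDoc;
        this.acceptDocs = acceptDocs;
    }

    // The only method a subclass supplies: does this document match?
    protected abstract boolean matchDoc(int docId);

    // First matching doc id >= target, evaluated on the fly (no bitset is built).
    int advance(int target) {
        for (int doc = Math.max(target, 0); doc < maxDoc; doc++) {
            boolean accepted = acceptDocs == null || acceptDocs.test(doc);
            if (accepted && matchDoc(doc)) {
                return doc;
            }
        }
        return NO_MORE_DOCS;
    }
}

// Matches documents whose per-document value equals the wanted term.
final class ValueEqualsDocIdSet extends LazyDocIdSet {
    private final String[] valuesByDoc; // assumed stand-in for a doc-values column
    private final String wanted;

    ValueEqualsDocIdSet(String[] valuesByDoc, String wanted, IntPredicate acceptDocs) {
        super(valuesByDoc.length, acceptDocs);
        this.valuesByDoc = valuesByDoc;
        this.wanted = wanted;
    }

    @Override
    protected boolean matchDoc(int docId) {
        return wanted.equals(valuesByDoc[docId]); // null-safe: docs without a value never match
    }
}

final class LazyDocIdSetDemo {
    public static void main(String[] args) {
        String[] values = {"foo", "bar", null, "foo", "foo"};
        // Pretend doc 3 was already excluded by another constraint.
        LazyDocIdSet set = new ValueEqualsDocIdSet(values, "foo", doc -> doc != 3);
        for (int doc = set.advance(0);
             doc != LazyDocIdSet.NO_MORE_DOCS;
             doc = set.advance(doc + 1)) {
            System.out.println("match: " + doc);
        }
    }
}
```

In real code the subclass would extend the Lucene base class linked above and
read the value from the segment's doc values inside matchDoc(); the loop and
the acceptDocs check are what the base class is described as handling for you.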

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: chrisbamford [mailto:chrisbamf...@chrisbamford.plus.com]
> Sent: Monday, March 16, 2015 1:07 PM
> To: Uwe Schindler
> Cc: java-user@lucene.apache.org; ch...@bammers.net
> Subject: RE: Filtering question
> 
> Hi Uwe
> 
> I have downloaded Lucene 5.0.0 source to look at the Filters you mention.
> DocValuesTermsFilter looks promising, however I cannot find
> FieldCacheDocIdSet anywhere in Lucene 4.10.2 or in 5.0.0.  Where should I be
> looking?
> 
> I take your point about brute-forcing the DocValues search and all I can do is
> implement / test and decide if it is acceptable.  This is the main driver
> behind getting filtering working correctly!
> 
> Thanks for your continued help.
> 
> - Chris
> 
> On 12.03.2015 18:45, Uwe Schindler wrote:
> > Hi Chris,
> >
> >
> >> Hi Uwe, thanks for your suggestions.  I have tried a couple of things
> >> with no luck yet:
> >>
> >> > Sorry,
> >> > I just noticed, you are using TermFilter not TermsFilter: This one
> >> > does not support random access (using bits()). Because of this the
> >> > filtered docs cannot be passed down using acceptDocs.
> >> >
> >> TermsFilter made no difference, still no acceptDocs passed to the
> >> filter.
> >
> > I know, the problem is that a TermsFilter with one term behaves like a
> > TermFilter :-) In any case you could use CachingWrapperFilter to
> > forcefully create a bitset, but I don't think it's worth the hassle.
> > It is still strange, I did not dig into it.
> >
> >> > In addition, the SHOULD clause forces the ConstantScoreQuery to try
> >> > all documents, because there is nothing else that could drive the
> >> > query.
> >> >
> >> As an experiment I tried MUST, this didn't help either.
> >
> > I checked your impl: You just create a Bitset, so it won't help.
> > Please look at other DocValues filters like DocValuesRangeFilter how
> > they implement the iterator. Creating a BitSet is just overhead and
> > while doing so, you have no chance to take other query constraints
> > into account (because the bitset is built *before* the query is
> > executed).
> >
> > Instead you should implement a custom DocIdSet (Lucene 4.10 offers
> > FieldCacheDocIdSet as base class; in 5.0 it was renamed to
> > DocValuesDocIdSet, you can implement the abstract matchDoc() method
> > there). This one automatically handles everything correctly, like
> > acceptDocs or uses advance(). It does not build a bitset, it does
> > everything by calling the abstract matchDoc() method on the fly. You
> > just have to put the matching logic into matchDoc(int docId).
> >
> >> > An alternative approach would be (in Lucene 4.10 or 5.0) to add the
> >> > TermFilter as ConstantScoreFilter(TermQuery) with boost=0 to the
> >> > BooleanQuery. In that case it can drive the query and does not affect
> >> > scoring. In later Lucene versions you may use the new
> >> > BooleanQuery.Occur type "FILTER" which can add any query as filter.
> >> > Filters will be deprecated once this is ready.
> >> >
> >> This is interesting and I will try it when I get a chance.
> >
> > I mean ConstantScoreQuery, not ConstantScoreFilter. But you need to
> > implement your own DocIdSetIterator with DocIdSetIterator.advance(),
> > otherwise it won't help (see above).
> >
> >> >> My goal is to slowly transform a particular field from StringField to
> >> >> BinaryDocValues so that during the transition a doc may hold the
> >> >> value either in the old location or the new. Therefore a query must
> >> >> be able to say
> >> >>     oldField:"foo" OR newField:"foo"
> >> >> Where oldField is a StringField and newField is a BinaryDocValues.
> >> >
> >> > Why do you want to do this?
> >> >
> >> Good question!  In our architecture we build indexes by pulling data
> >> from several sources and it is _expensive_.  Increasingly we are
> >> requested to change one or two fields which currently requires a full
> >> re-index of the doc.
> >> When I attended the Dublin Lucene conference I spoke to Shai Erera
> >> about this problem and he pointed me at DocValues which allow you to
> >> update fields without incurring the full doc reindex cost.  That is
> >> the appeal for us.
> >> As I said before, we want to transform docs only as they are updated,
> >> where transformation involves dropping the old TextField and creating
> >> a new BinaryDocValuesField containing the same value.  Hence the need
> >> for the query to be able to search 'old OR new'.
> >>
> >> > If you want to query like this on the field, it is a bad idea to use
> >> > DocValues.
> >> >
> >> Why is it a bad idea?
> >
> > Indeed, DocValues are update-able. But they have the downside that
> > they don't provide a way to query the index for a term and have it tell
> > you which documents contain the term (our inverted index - the reason why
> > we use Lucene!). DocValues are just a large array with random access.
> > If you want to query on it, you have to brute force, unless there is
> > something else in the query structure that can "drive" your query
> > (advance() on the filter's iterator). On a BooleanQuery consisting of
> > 2 should clauses, nothing can drive the query, so there is only the
> > possibility to do a full scan of the docvalues doc-by-doc.
> >
> > Uwe
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
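
The point in the quoted thread about something needing to "drive" the query
can be sketched in plain Java. This is a simplified model, not Lucene code:
the class DrivenScanSketch, the sorted int[] standing in for a postings list
from the inverted index, and the IntPredicate standing in for matchDoc() are
all assumptions made for illustration. It contrasts checking the doc-values
predicate only on documents supplied by another clause with testing every
document when nothing else narrows the candidates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical model of "driven" versus "brute force" evaluation; NOT Lucene code.
final class DrivenScanSketch {

    // A required clause drives: its sorted postings supply the candidate docs,
    // and the doc-values predicate is evaluated only on those candidates.
    static List<Integer> drivenByPostings(int[] postings, IntPredicate matchDoc) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : postings) {
            if (matchDoc.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Nothing drives: every document in [0, maxDoc) has to be tested.
    static List<Integer> fullScan(int maxDoc, IntPredicate matchDoc) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matchDoc.test(doc)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Counting predicate: every tenth document "has the value".
        IntPredicate matchDoc = doc -> {
            calls[0]++;
            return doc % 10 == 0;
        };

        List<Integer> driven = drivenByPostings(new int[] {10, 25, 40, 990}, matchDoc);
        System.out.println("driven hits=" + driven + " matchDoc calls=" + calls[0]);

        calls[0] = 0;
        List<Integer> scanned = fullScan(1000, matchDoc);
        System.out.println("scan hits=" + scanned.size() + " matchDoc calls=" + calls[0]);
    }
}
```

The number of matchDoc calls in the driven case is bounded by the length of the
driving postings list, while the full scan always pays for maxDoc calls. That is
the cost being weighed when the doc-values clause is one of two SHOULD clauses.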


