The other part of your proposal was to somehow "number" the term text such
that term range comparisons can be implemented as fast int comparisons.

I like the idea of building dynamic filters on top of a
"column-stride" array of field values.  You could extend it to be a
real Scorer, too.  E.g., I could imagine holding a "last changed"
column-stride array, building a Scorer on top of that which returns a
"recency score", and then using a function query to multiply that with
the actual relevance score.

Maybe the right place for this to land is function queries, since they
are already creating queries based on DocValues (= a column-stride
store)?  But your change would extend function queries to also be able
to do filtering based on the DocValues (today function queries can
only alter the score, I think).
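
Roughly, that recency idea might look like the sketch below, assuming
the function-query classes (CustomScoreQuery, IntFieldSource,
ValueSourceQuery); the "lastChanged" field, the dayNow value and the
decay formula are made up for illustration:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.IntFieldSource;
import org.apache.lucene.search.function.ValueSourceQuery;

// A hedged sketch: multiply relevance by a recency factor derived from a
// per-doc "lastChanged" day number (FieldCache-backed via IntFieldSource).
Query relevance = new TermQuery(new Term("body", "lucene"));
ValueSourceQuery recency =
    new ValueSourceQuery(new IntFieldSource("lastChanged"));
final int dayNow = 14194;  // hypothetical "today", as a day number
Query scored = new CustomScoreQuery(relevance, recency) {
  public float customScore(int doc, float subQueryScore, float lastChanged) {
    // decay the relevance score by how long ago the doc last changed
    return subQueryScore * (1.0f / (1.0f + Math.max(0, dayNow - lastChanged)));
  }
};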

However, this filtering approach is still a linear O(maxDoc) scan.  It
should have a very low constant factor in front, since 1) it's all in
RAM, and 2) if you number the term text then you're using much faster
int comparisons, not String comparisons.  Also, once LUCENE-1231
(column-stride fields) is in, loading your column-stride array should
be much faster than it is today with FieldCache.
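
In fact, FieldCache's StringIndex already gives you that term
numbering.  A rough sketch (the "age" field and the bounds are made
up; it assumes both bound terms occur in the field, and skips
lookup[0], which is the "no term" slot):

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Sketch: order[doc] is the doc's term ordinal, lookup[] maps ordinal to
// term text, so the per-doc range test is just two int comparisons.
boolean inAgeRange(IndexReader reader, int doc) throws IOException {
  FieldCache.StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, "age");
  int lo = Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, "18");
  int hi = Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, "35");
  return idx.order[doc] >= lo && idx.order[doc] <= hi;
}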

Yet another option, which gives faster-than-linear performance for
term range searching, is the set of ideas in this paper for enabling
efficient range searching:

  http://fontoura.org/papers/paramsearch.pdf

However, that'd be a much deeper change to Lucene.

Mike

Tim Sturge wrote:

Reading this I realize how unclear it is, so let me give a concrete example:

I want to do a search restricting users by age range, so someone can ask for
the users aged 18-35, 40-60, etc.

Here are the options I considered:

1) construct a RangeQuery. This is a 20-40 clause boolean subquery in an
otherwise 1 or 2 clause query, and I'd like to avoid that. It also has
scoring artifacts I wish to avoid (I don't want users to rank higher just
because we have fewer users of that particular age).

2) construct a ConstantScoreRangeQuery. Then I'm forced to iterate over all
the users in the age range for each query. This was cost-prohibitive.

3) cache filters for each age range. Problem is there are 50 starting points
and 50 ending points and caching 2500 filters is unrealistic.

4) cache filters for each age, and OR them together for each query (there's
stuff in contrib that does this). This is the best so far, but it does
require caching 50 filters and doing 20 bitset ORs per query on an 8MB
bitset.

So what I thought would be interesting is:

5) build an array of bytes where the n-th byte contains the age of user n.
Given the range, it's fairly trivial to make this behave like a filter (i.e.
it's relatively easy to implement next() and skipTo() efficiently, and
trivial to decide whether a document is in the range or not; see the sketch
below).
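
Here's a hedged sketch of that iterator over a byte-per-doc age array,
against the 2.4-era DocIdSetIterator API (the class and field names are
made up, and the ages array is assumed to be prebuilt, one byte per doc):

import org.apache.lucene.search.DocIdSetIterator;

// Sketch of option 5: the guts of a filter backed by a byte[] of ages.
public class AgeRangeIterator extends DocIdSetIterator {
  private final byte[] ages;  // ages[n] = age of the user in doc n
  private final int lo, hi;   // inclusive range bounds
  private int doc = -1;

  public AgeRangeIterator(byte[] ages, int lo, int hi) {
    this.ages = ages; this.lo = lo; this.hi = hi;
  }

  public int doc() { return doc; }

  public boolean next() {              // linear scan, but all in RAM
    while (++doc < ages.length) {
      if (ages[doc] >= lo && ages[doc] <= hi) return true;
    }
    return false;
  }

  public boolean skipTo(int target) {  // trivial, thanks to random access
    doc = target - 1;
    return next();
  }
}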

Then I realized this approach would make sense not just for ages, but also
for countries, date ranges and zip code sets, so I thought I'd ask if anyone
had tried it before.

Part of me assumes that someone must have done this already; so either
there's an implementation out there, or there's a good reason I'm not seeing
why this is entirely impractical. So I'm interested to get feedback.

Tim




On 11/10/08 2:26 PM, "Tim Sturge" <[EMAIL PROTECTED]> wrote:

I think we've gone around in a loop here. It's exactly due to the inadequacy
of cached filters that I'm considering what I'm doing.

Here's the section from my first email that is most illuminating:
"
The reason I have this question is that I am writing a multi-filter for
single-term fields. My index contains many fields for which each document
contains a single term (e.g. date, zipcode, country) and I need to perform
range queries or set matches over these fields, many of which are very
inclusive (they match 10% of the total documents).

A cached RangeFilter works well when there are a small number of potential
options (e.g. for countries), but when there are many options (consider a
date range or a set of zipcodes) there are too many potential choices to
cache each possibility, and it is too inefficient to build a filter on the
fly for each query (as you have to visit 10% of the documents to build the
filter despite the query itself matching 0.1%).

Therefore I was considering building an int[reader.maxDoc()] array for each
field and putting into it the term number for each document. This relies on
the fact that each document contains only a single term for this field, but
with it I should be able to quickly construct a "multi-filter" (that is,
something that iterates the array and checks that the term is in the range
or set).
"

Does this help explain my rationale? The reason I'm posting here is that I
imagine there are lots of people with this issue. In particular, date ranges
seem to be something that lots of people use but Lucene implements fairly
poorly.

Tim

On 11/10/08 1:58 PM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:

On Monday 10 November 2008 22:21:20, Tim Sturge wrote:
Hmmm -- I hadn't thought about that so I took a quick look at the
term vector support.

What I'm really looking for is a compact but performant
representation of a set of filters over the same single-term field.
Using term vectors would mean an algorithm similar to:

String myfield = "country";
String myterm = "US";        // hypothetical term to filter on
TermFreqVector tv;
for (int i = 0; i < reader.maxDoc(); i++) {
    tv = reader.getTermFreqVector(i, myfield);     // may be null
    if (tv != null && tv.indexOf(myterm) != -1) {
        // include this doc...
    }
}

The key thing I am looking to achieve here is performance comparable
to filters. I suspect getTermFreqVector() is not efficient enough, but
I'll give it a try.


Better to use a TermDocs on myterm for this; have a look at the code of
RangeFilter.

Filters are normally created from a slower query by setting a bit in an
OpenBitSet at the "include this doc" point. Then they are reused for their
speed.
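
For example, along the lines of RangeFilter (a sketch; myfield/myterm
as in your snippet above):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Sketch: build a filter's bit set with TermDocs, as RangeFilter does.
OpenBitSet buildBits(IndexReader reader, String myfield, String myterm)
    throws IOException {
  OpenBitSet bits = new OpenBitSet(reader.maxDoc());
  TermDocs td = reader.termDocs(new Term(myfield, myterm));
  try {
    while (td.next()) {
      bits.set(td.doc());  // "include this doc"
    }
  } finally {
    td.close();
  }
  return bits;
}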

Filter caching could help. In case memory becomes a problem
and the filters are sparse enough, try using SortedVIntList
as the underlying data structure in the cache. (Sparse enough means
fewer than 1 in 8 of all docs in the index reader.)
See also LUCENE-1296 for caching another data structure than the
one used to collect the filtered docs.
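
Something like this sketch, assuming SortedVIntList's OpenBitSet
constructor in o.a.l.util:

import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;
import org.apache.lucene.util.SortedVIntList;

// Sketch: cache sparse filters as a SortedVIntList, dense ones as the
// OpenBitSet itself (both are DocIdSets); "1 in 8" as mentioned above.
DocIdSet cacheForm(OpenBitSet bits, int maxDoc) {
  if (bits.cardinality() * 8 < maxDoc) {
    return new SortedVIntList(bits);  // compact encoding for sparse sets
  }
  return bits;
}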

Regards,
Paul Elschot
