Chris Hostetter wrote:
i haven't read the patch, but based on the jira description i don't think
this is attempting to reuse cached info across updates -- i think it's
just trying to address the issue of eliminating redundent TermEnum/TermDoc
iterating for similar ranges.
Correct. I have conveniently ignored the issue of index updates and
concerned myself with how to best handle similar (but not exactly the
same) queries.
The logic is this:
* Certain types of field (dates/location coordinates) are commonly
accessed as a range rather than as individual terms
* Reading these ranges requires reading a lot of info off disk
* To avoid reading too much disk we cache a set doc ids (typically in a
bitset)
The options for caching these sets of doc ids are:
1) Caching sets representing a particular query - cache hits are likely
to be low because users request different ranges of values and they
would need to be exactly the same start/end range for such a scheme to work.
2) Caching sets representing uniform ranges of values e.g. years/months
- this requires knowledge of the term value format plus some ranges may
be very sparse e.g. months or years with no content - when using bitsets
for caching this is inefficient.
3) Caching sets representing non-uniform value ranges depending on index
content distribution
Option 3 is the approach I took. Initialization involves ripping
through a TermEnum counting the total number of docs the terms
represent. I then create a collection of ascending term ranges (e.g.
20010101 to 20020327, 20020328 to 20020415) which represent
equally-sized numbers of doc references and therefore a guarantee of a
reasonably non-sparse set for caching purposes.
At query time I create range filters dynamically by comparing the query
range against these partitions to see which ranges are wholly contained
- these are deemed cache hits. Each range records the number of "cache
hits" and bitsets are cached for only the top "n" ranges that seem
popular. Unpopular ranges continue to hit the disk (via
TermEnum/TermDocs) and do not cache this info. Popular ranges that have
a bitset cache "OR" their cached bitset into the final filter's bitset.
Performance seems to vary depending on the number of ranges you
configure, the number of bitsets you allow to be cached and the query
ranges in use.
Cheers
Mark
The approach I took therefore is to cache
: Given, a sample date range:
: 01/01/2001 to 01/01/2007
: has a bit set associated.
:
: If I did a query for
: 12/31/2000 to 01/01/2007
:
: the current filter code would cause a new filter to be created (which
: is what I assume you are attempting to solve). Under you solution I
i don't think it's an attempt at reducing the number of bitsets, just at
reducing the amount of time spent computing those bitsets -- with the
filter on "01/01/2001 to 01/01/2007" already in the cache, the filter on
"12/31/2000 to 01/01/2007" only needs to look at the terms between
12/31/2000 and 01/01/2001 - and can and the result with the existing
bitset.
-Hoss
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
___________________________________________________________
All new Yahoo! Mail "The new Interface is stunning in its simplicity and ease of use." - PC Magazine
http://uk.docs.yahoo.com/nowyoucan.html
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]