Chris Hostetter wrote:
I haven't read the patch, but based on the JIRA description I don't think
this is attempting to reuse cached info across updates -- I think it's
just trying to address the issue of eliminating redundant TermEnum/TermDocs
iterating for similar ranges.
Correct. I have conveniently ignored the issue of index updates and concerned myself with how to best handle similar (but not exactly the same) queries.
The logic is this:
* Certain types of field (dates, location coordinates) are commonly accessed as a range rather than as individual terms
* Reading these ranges requires reading a lot of information off disk
* To avoid excessive disk reads, we cache a set of doc ids (typically in a bitset)

The options for caching these sets of doc ids are:
1) Caching sets representing a particular query - cache hits are likely to be low because users request different ranges of values, and the start/end of a range would need to match exactly for such a scheme to work.
2) Caching sets representing uniform ranges of values, e.g. years/months - this requires knowledge of the term value format, and some ranges may be very sparse (e.g. months or years with no content), which is inefficient when caching with bitsets.
3) Caching sets representing non-uniform value ranges determined by the distribution of the index content.
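The weakness of option 1 can be seen in a tiny sketch (hypothetical class and method names, not the patch's code): keying the cache on the exact start/end pair means two almost identical queries share nothing.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of option 1: cache keyed on the exact range.
class ExactRangeCache {
    private final Map<String, BitSet> cache = new HashMap<>();

    // Key is the literal start/end pair, so "20010101-20070101" and
    // "20001231-20070101" are distinct entries even though the ranges
    // overlap almost entirely.
    BitSet get(String lower, String upper) {
        return cache.get(lower + "-" + upper);
    }

    void put(String lower, String upper, BitSet docs) {
        cache.put(lower + "-" + upper, docs);
    }
}
```

Any user who nudges either end of the range by a single day misses the cache entirely, which is why the hit rate stays low.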

Option 3 is the approach I took. Initialization involves ripping through a TermEnum, counting the total number of docs the terms represent. I then create a collection of ascending term ranges (e.g. 20010101 to 20020327, 20020328 to 20020415) which each cover roughly equal numbers of doc references, and therefore guarantee a reasonably non-sparse set for caching purposes. At query time I create range filters dynamically by comparing the query range against these partitions to see which are wholly contained - these are deemed cache hits. Each range records its number of "cache hits", and bitsets are cached for only the top "n" ranges that prove popular. Unpopular ranges continue to hit the disk (via TermEnum/TermDocs) and their results are not cached. Popular ranges that have a cached bitset "OR" it into the final filter's bitset.
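The partitioning step above can be sketched roughly as follows (hypothetical types and names; the real patch walks a Lucene TermEnum rather than in-memory lists):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of partitioning sorted terms into ranges covering roughly
// equal numbers of doc references. Hypothetical code, not the patch.
class RangePartitioner {
    static class Partition {
        final String lower, upper; // inclusive term bounds
        int hitCount;              // popularity counter for cache decisions
        Partition(String lower, String upper) { this.lower = lower; this.upper = upper; }
    }

    // terms: ascending term values; docFreqs: doc count per term.
    static List<Partition> partition(List<String> terms, List<Integer> docFreqs, int numPartitions) {
        long totalDocs = 0;
        for (int f : docFreqs) totalDocs += f;
        long target = Math.max(1, totalDocs / numPartitions);

        List<Partition> result = new ArrayList<>();
        int start = 0;
        long running = 0;
        for (int i = 0; i < terms.size(); i++) {
            running += docFreqs.get(i);
            if (running >= target || i == terms.size() - 1) {
                result.add(new Partition(terms.get(start), terms.get(i)));
                start = i + 1;
                running = 0;
            }
        }
        return result;
    }

    // A partition counts as a cache hit when it is wholly contained in the
    // query range [lower, upper]; plain string comparison works for
    // fixed-width date terms like "20010101".
    static boolean whollyContained(Partition p, String lower, String upper) {
        return p.lower.compareTo(lower) >= 0 && p.upper.compareTo(upper) <= 0;
    }
}
```

Because partition boundaries follow the actual term distribution, sparse stretches of the value space get folded into wider partitions instead of wasting bitset cache slots.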

Performance seems to vary with the number of ranges you configure, the number of bitsets you allow to be cached, and the query ranges in use.

Cheers
Mark



: Given a sample date range:
: 01/01/2001 to 01/01/2007
: has a bit set associated.
:
: If I did a query for
: 12/31/2000 to 01/01/2007
:
: the current filter code would cause a new filter to be created (which
: is what I assume you are attempting to solve). Under your solution I

I don't think it's an attempt at reducing the number of bitsets, just at
reducing the amount of time spent computing those bitsets -- with the
filter on "01/01/2001 to 01/01/2007" already in the cache, the filter on
"12/31/2000 to 01/01/2007" only needs to look at the terms between
12/31/2000 and 01/01/2001, and can OR the result into the existing
bitset.
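This incremental reuse can be sketched in a few lines (a hypothetical helper, not the actual filter code): the wider range's bitset is just the cached subrange's bitset OR'd with the docs for the handful of extra terms.

```java
import java.util.BitSet;

// Sketch of widening a cached range filter: clone the bitset cached for
// the contained subrange and OR in only the docs for the extra terms.
// Hypothetical helper, not the actual filter code.
class IncrementalRange {
    static BitSet widen(BitSet cachedSubrange, BitSet extraTermDocs) {
        BitSet result = (BitSet) cachedSubrange.clone();
        result.or(extraTermDocs); // e.g. docs for terms in 12/31/2000..01/01/2001
        return result;
    }
}
```

Cloning before the OR keeps the cached entry itself untouched, so it can still serve other queries that contain the same subrange.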


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


