Re: [jira] Created: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads

markharw00d Tue, 13 Feb 2007 00:42:53 -0800

Chris Hostetter wrote:

i haven't read the patch, but based on the jira description i don't think
this is attempting to reuse cached info across updates -- i think it's
just trying to address the issue of eliminating redundent TermEnum/TermDoc
iterating for similar ranges.

Correct. I have conveniently ignored the issue of index updates andconcerned myself with how to best handle similar (but not exactly thesame) queries.

The logic is this:

* Certain types of field (dates/location coordinates) are commonlyaccessed as a range rather than as individual terms

* Reading these ranges requires reading a lot of info off disk

* To avoid reading too much disk we cache a set doc ids (typically in abitset)


The options for caching these sets of doc ids are:

1) Caching sets representing a particular query - cache hits are likelyto be low because users request different ranges of values and theywould need to be exactly the same start/end range for such a scheme to work.2) Caching sets representing uniform ranges of values e.g. years/months- this requires knowledge of the term value format plus some ranges maybe very sparse e.g. months or years with no content - when using bitsetsfor caching this is inefficient.3) Caching sets representing non-uniform value ranges depending on indexcontent distribution

Option 3 is the approach I took. Initialization involves rippingthrough a TermEnum counting the total number of docs the termsrepresent. I then create a collection of ascending term ranges (e.g.20010101 to 20020327, 20020328 to 20020415) which representequally-sized numbers of doc references and therefore a guarantee of areasonably non-sparse set for caching purposes.At query time I create range filters dynamically by comparing the queryrange against these partitions to see which ranges are wholly contained- these are deemed cache hits. Each range records the number of "cachehits" and bitsets are cached for only the top "n" ranges that seempopular. Unpopular ranges continue to hit the disk (viaTermEnum/TermDocs) and do not cache this info. Popular ranges that havea bitset cache "OR" their cached bitset into the final filter's bitset.

Performance seems to vary depending on the number of ranges youconfigure, the number of bitsets you allow to be cached and the queryranges in use.


Cheers
Mark







The approach I took therefore is to cache

: Given, a sample date range:
: 01/01/2001 to 01/01/2007
: has a bit set associated.
:
: If I did a query for
: 12/31/2000 to 01/01/2007
:
: the current filter code would cause a new filter to be created (which
: is what I assume you are attempting to solve). Under you solution I

i don't think it's an attempt at reducing the number of bitsets, just at
reducing the amount of time spent computing those bitsets -- with the
filter on "01/01/2001 to 01/01/2007" already in the cache, the filter on
"12/31/2000 to 01/01/2007" only needs to look at the terms between
12/31/2000 and 01/01/2001 - and can and the result with the existing
bitset.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

___________________________________________________________All new Yahoo! Mail "The new Interface is stunning in its simplicity and ease of use." - PC Magazinehttp://uk.docs.yahoo.com/nowyoucan.html


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: [jira] Created: (LUCENE-798) Factory for RangeFilters that caches sections of ranges to reduce disk reads

Reply via email to