RE: Improve worst-case performance of TrieRange queries

Uwe Schindler Tue, 24 Mar 2009 15:07:22 -0700

The biggest problem is: to find out if it is an improvement for a specific
range (when using TrieRange), you have to count the trie encoded terms from
the index that fall into the sub-ranges... And if you have done this, you
have done half of the filter work. The important thing with TrieRange is,
that the number of terms is not dependent on index size and how large the
range is. So a very short range may need more terms than a very large range.


Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Michael Busch [mailto:[email protected]]
> Sent: Tuesday, March 24, 2009 10:48 PM
> To: [email protected]
> Subject: Re: Improve worst-case performance of TrieRange queries
> 
> Uwe and I talked a little bit about this at the ApacheCon. We figured
> that this will probably only improve a very small amount of ranges, so
> as Uwe recommended, this is probably not worth the effort and complexity.
> 
> Never mind, just an idea :)
> 
> -Michael
> 
> On 3/24/09 12:40 AM, Michael Busch wrote:
> > Let me give an example to explain my idea - I'm using dates in my
> > example, because it's easier to imagine :)
> >
> > Let's say we have the following posting lists. There are 20 docs in the
> > index and an X means that a doc contains the corresponding term:
> >
> > Jan                X   X
> > Feb X    X      X
> > Mar          X
> > Apr        X        X
> > May    X
> > Jun
> > Jul           XX
> > Aug       X      X
> > Sep   X
> > Oct               X
> > Nov  X      X
> > Dec     X             X
> >
> > Then we index another term 'ALL'. It gets added for any document that
> > has a numeric value in this bucket:
> >
> > All XXXXXXXXXXXXXXXXX XX
> >
> > If the query is [Jun TO Jul], then we process the query normally
> > (ORing terms Jun and Jul). If the query is [Feb TO Nov], then we
> > basically translate that into All AND NOT (Jan OR Dec).
> >
> > Since you only evaluate the complement of the terms, you can (almost)
> > double the worst case performance.
> >
> > Downsides:
> > - you have to have another BitSet in memory to perform the AND NOT
> > operation, so it needs more memory
> > - this complement approach is only this simple for numeric fields
> > where one document has only a single value; similar things are doable
> > for multi-valued numeric fields, however more complex and possibly
> > with less performance gain
> > - you need to index an additional term per bucket, so the index size
> > increases slightly
> >
> > Does this make sense? Maybe this has even been discussed already?
> >
> > -Michael
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: Improve worst-case performance of TrieRange queries

Reply via email to