It all depends on the statistics: how the ranges are correlated. If the integer domain is small (say, 1-20000), you might consider indexing every integer in each range as a separate term, especially if most documents will only have a small number of small ranges.

If there are too many integers, but the ranges overlap a great deal, you could consider taking a partition of all the integers that appear in the documents: a division into non-overlapping sets whose unions reconstruct the original intervals. Then index the ids of the partition sets. This is complicated because the partition needs to be updated as you add documents (since it depends on the distribution of the numbers among your documents), but you can also take a hybrid approach, allowing some degree of overlap among these sets, which lets you update documents while keeping a fixed partition.

As an example, suppose you had documents A and B:

A: 1-3, 12-20, 1000-1200
B: 10-100, 900-1100

Then you could create a partition:
[ a{1-3}, b{10-11}, c{12-20}, d{21-100}, e{900-999}, f{1000-1100}, g{1101-1200} ]

and index each of the documents with the tokens a ... g.
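
Here's a sketch of how that partition could be computed (the class and method names are just illustrative, and this leaves Lucene out entirely): collect every range start and every range end + 1 as a cut point, form the elementary intervals between consecutive cut points, and keep only those covered by at least one document range.

```java
import java.util.*;

public class RangePartition {
    // A closed integer interval [lo, hi].
    record Interval(int lo, int hi) {}

    // Split all document ranges into non-overlapping pieces whose
    // unions reconstruct every original range.
    static List<Interval> partition(List<List<Interval>> docs) {
        // Every range boundary becomes a cut point; hi + 1 marks
        // the first integer *after* a range ends.
        TreeSet<Integer> cuts = new TreeSet<>();
        for (List<Interval> doc : docs)
            for (Interval r : doc) { cuts.add(r.lo()); cuts.add(r.hi() + 1); }
        List<Integer> pts = new ArrayList<>(cuts);
        List<Interval> parts = new ArrayList<>();
        for (int i = 0; i + 1 < pts.size(); i++) {
            Interval piece = new Interval(pts.get(i), pts.get(i + 1) - 1);
            // Keep only pieces covered by at least one document range;
            // gaps like [4-9] above fall out here.
            boolean covered = docs.stream().flatMap(List::stream)
                .anyMatch(r -> r.lo() <= piece.lo() && piece.hi() <= r.hi());
            if (covered) parts.add(piece);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Interval> a = List.of(new Interval(1, 3), new Interval(12, 20),
            new Interval(1000, 1200));
        List<Interval> b = List.of(new Interval(10, 100), new Interval(900, 1100));
        // Yields the seven pieces a{1-3} ... g{1101-1200} from the example.
        System.out.println(partition(List.of(a, b)));
    }
}
```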

Then, when you query for an integer, you compare it against the partition to see which range it falls in (you could do this as a Lucene index lookup, for example), and then search for that range's token.
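
Since the partition sets are non-overlapping and can be kept sorted, that comparison is just a binary search; the `lookup` method below is a hypothetical sketch, using the piece's position in the sorted list as its token id:

```java
import java.util.*;

public class PartitionLookup {
    record Interval(int lo, int hi) {}

    // Binary-search the sorted, non-overlapping partition for the piece
    // containing `value`; returns its index (the token id), or -1 if no
    // document range covers the value at all.
    static int lookup(List<Interval> parts, int value) {
        int lo = 0, hi = parts.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Interval p = parts.get(mid);
            if (value < p.lo()) hi = mid - 1;
            else if (value > p.hi()) lo = mid + 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The a ... g partition from the example above.
        List<Interval> parts = List.of(new Interval(1, 3), new Interval(10, 11),
            new Interval(12, 20), new Interval(21, 100), new Interval(900, 999),
            new Interval(1000, 1100), new Interval(1101, 1200));
        System.out.println(lookup(parts, 2));  // 0, i.e. token a
        System.out.println(lookup(parts, 4));  // -1, no range covers 4
    }
}
```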

The complication is that when a new document comes in,

C: 50-60

you can either update the partition, in which case all of the affected documents (B in the example above) have to be reindexed as well, or you can delay that and in the meantime index C with a new range token [50-60] (i.e. allowing some overlap, and increasing the number of tokens you index and query for) until you have a chance to reduce the partition to a minimal one in some periodic rebalancing.
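
Under that hybrid scheme, a query has to OR together the fixed-partition token with any ad-hoc overlap tokens that also cover the value. A rough sketch (again, names are illustrative, and "p" + index stands in for whatever token scheme you use):

```java
import java.util.*;

public class HybridQuery {
    record Interval(int lo, int hi) {}

    // Tokens to OR together when querying for `value`: the partition
    // token that covers it, plus any ad-hoc overlap tokens from
    // documents indexed since the last rebalance.
    static List<String> queryTokens(List<Interval> partition,
                                    Map<String, Interval> adHoc, int value) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            Interval p = partition.get(i);
            if (p.lo() <= value && value <= p.hi())
                tokens.add("p" + i);        // token for the i-th partition set
        }
        for (var e : adHoc.entrySet()) {
            Interval r = e.getValue();
            if (r.lo() <= value && value <= r.hi())
                tokens.add(e.getKey());     // e.g. "50-60" for document C
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Interval> partition = List.of(new Interval(1, 3), new Interval(10, 11),
            new Interval(12, 20), new Interval(21, 100));
        // Document C arrived after the partition was fixed.
        Map<String, Interval> adHoc = Map.of("50-60", new Interval(50, 60));
        System.out.println(queryTokens(partition, adHoc, 55));  // [p3, 50-60]
    }
}
```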

I worked up this idea while looking at indexing access-control lists: users can be granted permission to arbitrary sets of documents (a many-to-many relation, basically), and we want a way of filtering by user access. We found we were getting queries with a large number (thousands) of OR-ed terms, and in some cases performance suffered. The idea is that all these users' permissions, although not correlated by rule or by design, do in fact exhibit a high degree of correlation, and we can use that to reduce the size of the term lists.

In the end this seemed too complex a solution for our ACL problem, which we were able to solve by simply caching the slow queries as filters and prewarming them in the background at login time, so I don't have any working code to share (just some numerical experiments), but perhaps this idea might help here?

-Mike

On 6/4/2014 10:30 PM, Paul Tyson wrote:
I'm new to Lucene and search technology. I've read just enough to be
confused and dangerous, so please bear with me.

My documents have sets of integer ranges, like 1-3,
12-20,....13290-16509, ....

The total enumeration of ranges across all documents will be tens of
thousands. Most documents will have only a few ranges, but some will
have dozens or hundreds.

I need to search for documents with a range that includes a given
number. So in the above example, the document would match "2" or
"15001", but not "4" or "11".

Is this an appropriate use case for Lucene, and if so can someone sketch
out a solution so I can connect the dots? Or is there example code, or
documentation for this sort of thing? I've studied dynamic range facets,
but those don't seem right because I have all the ranges at index time.

Thanks,
--Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


