It all depends on the statistics: how the ranges are correlated. If the integer domain is small (say, 1-20000), you might consider indexing every integer in each range as a separate term, especially if most documents will only have a small number of small ranges.

If there are too many integers, but the ranges overlap a great deal, you could consider taking a partition of all the integers that appear in the documents: a division into non-overlapping sets whose unions reconstruct the original intervals. Then index the ids of the partition sets. This is complicated because the partition needs to be updated as you add documents (since it depends on the distribution of the numbers among your documents), but you can also take a hybrid approach, allowing some degree of overlap among these sets, which lets you update documents while keeping a fixed partition.

As an example, suppose you had documents A and B:

A: 1-3, 12-20, 1000-1200
B: 10-100, 900-1100

Then you could create a partition:
[ a{1-3}, b{10-11}, c{12-20}, d{21-100}, e{900-999}, f{1000-1100}, g{1101-1200} ]

and index each of the documents with the tokens a ... g.
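
Here's a sketch of how that partition could be computed (the class and method names are just illustrative, and this leaves Lucene out entirely): collect every range start and every range end + 1 as a cut point, form the elementary intervals between consecutive cut points, and keep only those covered by at least one document range.

```java
import java.util.*;

public class RangePartition {
    // A closed integer interval [lo, hi].
    record Interval(int lo, int hi) {}

    // Split all document ranges into non-overlapping pieces whose
    // unions reconstruct every original range.
    static List<Interval> partition(List<List<Interval>> docs) {
        // Every range boundary becomes a cut point; hi + 1 marks
        // the first integer *after* a range ends.
        TreeSet<Integer> cuts = new TreeSet<>();
        for (List<Interval> doc : docs)
            for (Interval r : doc) { cuts.add(r.lo()); cuts.add(r.hi() + 1); }
        List<Integer> pts = new ArrayList<>(cuts);
        List<Interval> parts = new ArrayList<>();
        for (int i = 0; i + 1 < pts.size(); i++) {
            Interval piece = new Interval(pts.get(i), pts.get(i + 1) - 1);
            // Keep only pieces covered by at least one document range;
            // gaps like [4-9] above fall out here.
            boolean covered = docs.stream().flatMap(List::stream)
                .anyMatch(r -> r.lo() <= piece.lo() && piece.hi() <= r.hi());
            if (covered) parts.add(piece);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<Interval> a = List.of(new Interval(1, 3), new Interval(12, 20),
            new Interval(1000, 1200));
        List<Interval> b = List.of(new Interval(10, 100), new Interval(900, 1100));
        // Yields the seven pieces a{1-3} ... g{1101-1200} from the example.
        System.out.println(partition(List.of(a, b)));
    }
}
```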

Then, when you query for an integer, you compare it against the partition to see which range it falls in (you could do this as a Lucene index lookup, for example), and then search for that range's token.
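
Since the partition sets are non-overlapping and can be kept sorted, that comparison is just a binary search; the `lookup` method below is a hypothetical sketch, using the piece's position in the sorted list as its token id:

```java
import java.util.*;

public class PartitionLookup {
    record Interval(int lo, int hi) {}

    // Binary-search the sorted, non-overlapping partition for the piece
    // containing `value`; returns its index (the token id), or -1 if no
    // document range covers the value at all.
    static int lookup(List<Interval> parts, int value) {
        int lo = 0, hi = parts.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Interval p = parts.get(mid);
            if (value < p.lo()) hi = mid - 1;
            else if (value > p.hi()) lo = mid + 1;
            else return mid;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The a ... g partition from the example above.
        List<Interval> parts = List.of(new Interval(1, 3), new Interval(10, 11),
            new Interval(12, 20), new Interval(21, 100), new Interval(900, 999),
            new Interval(1000, 1100), new Interval(1101, 1200));
        System.out.println(lookup(parts, 2));  // 0, i.e. token a
        System.out.println(lookup(parts, 4));  // -1, no range covers 4
    }
}
```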

The complication is that when a new document comes in,

C: 50-60

you can either update the partition, in which case all of the affected documents (B in the example above) have to be reindexed as well, or you can delay that and in the meantime index C with a new range token [50-60] (i.e. allowing some overlap, and increasing the number of tokens you index and query for) until you have a chance to reduce the partition to a minimal one in some periodic rebalancing.
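
Under that hybrid scheme, a query has to OR together the fixed-partition token with any ad-hoc overlap tokens that also cover the value. A rough sketch (again, names are illustrative, and "p" + index stands in for whatever token scheme you use):

```java
import java.util.*;

public class HybridQuery {
    record Interval(int lo, int hi) {}

    // Tokens to OR together when querying for `value`: the partition
    // token that covers it, plus any ad-hoc overlap tokens from
    // documents indexed since the last rebalance.
    static List<String> queryTokens(List<Interval> partition,
                                    Map<String, Interval> adHoc, int value) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            Interval p = partition.get(i);
            if (p.lo() <= value && value <= p.hi())
                tokens.add("p" + i);        // token for the i-th partition set
        }
        for (var e : adHoc.entrySet()) {
            Interval r = e.getValue();
            if (r.lo() <= value && value <= r.hi())
                tokens.add(e.getKey());     // e.g. "50-60" for document C
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<Interval> partition = List.of(new Interval(1, 3), new Interval(10, 11),
            new Interval(12, 20), new Interval(21, 100));
        // Document C arrived after the partition was fixed.
        Map<String, Interval> adHoc = Map.of("50-60", new Interval(50, 60));
        System.out.println(queryTokens(partition, adHoc, 55));  // [p3, 50-60]
    }
}
```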

I worked up this idea while looking at indexing access-control lists: users can be granted permission to arbitrary sets of documents (a many-to-many relation, basically), and we want a way of filtering by user access. We found we were getting queries with a large number (thousands) of OR-ed terms, and in some cases performance suffered. The idea is that all these users' permissions, although not correlated by rule or by design, do in fact exhibit a high degree of correlation, and we can use that to reduce the size of the term lists.

In the end this seemed too complex a solution for our ACL problem, which we were able to solve by simply caching the slow queries as filters and prewarming them in the background at login time, so I don't have any working code to share (just some numerical experiments), but perhaps this idea might help here?

-Mike

On 6/4/2014 10:30 PM, Paul Tyson wrote:
I'm new to Lucene and search technology. I've read just enough to be
confused and dangerous, so please bear with me.

My documents have sets of integer ranges, like 1-3,
12-20,....13290-16509, ....

The total enumeration of ranges across all documents will be tens of
thousands. Most documents will have only a few ranges, but some will
have dozens or hundreds.

I need to search for documents with a range that includes a given
number. So in the above example, the document would match "2" or
"15001", but not "4" or "11".

Is this an appropriate use case for Lucene, and if so can someone sketch
out a solution so I can connect the dots? Or is there example code, or
documentation for this sort of thing? I've studied dynamic range facets,
but those don't seem right because I have all the ranges at index time.

Thanks,
--Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


