Is your entire corpus a single document? Because I'm having trouble
imagining a single document where this would be a problem, unless
your increment gap is huge. The term positions are relative to
a single document...

It is getting pretty huge, yes (see below). The term positions are also relative to a single field, aren't they?

<MyField>
<Level_1>
<Level_2>
<Level_3>

Let me plug in some figures to help clarify. Each Level_3 contains hundreds of tokens. To be able to search for two or more terms in MyField within the same Level_3, I put a position gap of 1000 between consecutive Level_3 entries. Each Level_2 might contain hundreds of Level_3 entries. As I want to be able to restrict a search to all Level_3 entries of one Level_2, I set the position increment gap for Level_2 to 1000 x 1000 = 1,000,000 (1000 for the tokens in a Level_3 and 1000 for the Level_3 entries in a Level_2).

This done, Level_1 still needs to be accommodated. With 500 Level_2 entries per Level_1, a gap of 1,000,000 x 500 = 500,000,000 is needed per Level_1 entry to be able to search only within each of the Level_1 elements. That way only four Level_1 entries can be included before the maximum int value (2,147,483,647) is reached.
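
To make the arithmetic concrete, here is a minimal plain-Java sketch of the gap budget. It does not touch any Lucene API; the class and method names are made up for illustration, and only the figures (1000, 1000, 500) come from the description above.

```java
public class PositionGapBudget {
    // Figures taken from the description; names are hypothetical.
    static final long LEVEL_3_GAP = 1_000L;          // position room reserved for one Level_3
    static final long LEVEL_3_PER_LEVEL_2 = 1_000L;  // Level_3 slots reserved per Level_2
    static final long LEVEL_2_PER_LEVEL_1 = 500L;    // Level_2 slots reserved per Level_1

    // Position room reserved for one Level_2 entry.
    static long level2Gap() {
        return LEVEL_3_GAP * LEVEL_3_PER_LEVEL_2;
    }

    // Position room reserved for one Level_1 entry.
    static long level1Gap() {
        return level2Gap() * LEVEL_2_PER_LEVEL_1;
    }

    // How many complete Level_1 entries fit below Integer.MAX_VALUE.
    static long maxLevel1Entries() {
        return Integer.MAX_VALUE / level1Gap();
    }

    // True if the given number of Level_1 entries no longer fits in an int position.
    static boolean overflowsInt(long level1Entries) {
        try {
            Math.toIntExact(level1Entries * level1Gap());
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("Level_2 gap: " + level2Gap());
        System.out.println("Level_1 gap: " + level1Gap());
        System.out.println("Level_1 entries that fit: " + maxLevel1Entries());
        System.out.println("Five entries overflow int: " + overflowsInt(5));
    }
}
```

Computing in long and narrowing with Math.toIntExact makes the overflow visible instead of letting it wrap around silently.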

Queries I am looking to support might look like this in an easy case:

Search in MyField: terms T1 and T2 in the same Level_2, and terms T3, T4, and T5 in the same Level_3, with both matches in the same Level_1.
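
As a rough model of what such a query has to decide, the following plain-Java sketch maps a (Level_1, Level_2, Level_3, token) coordinate to a position and tests whether two positions fall into the same block at a given level. This is only an illustration of the position layout described above, not Lucene code; all names are hypothetical, and positions are held in long so the model itself cannot overflow.

```java
public class LevelPositions {
    // Gap sizes from the description above; names are hypothetical.
    static final long L3_GAP = 1_000L;          // one Level_3 block
    static final long L2_GAP = 1_000L * L3_GAP; // one Level_2 block (1000 Level_3 slots)
    static final long L1_GAP = 500L * L2_GAP;   // one Level_1 block (500 Level_2 slots)

    // Position of a token given zero-based indices at each level.
    static long positionOf(long level1, long level2, long level3, long token) {
        if (token < 0 || token >= L3_GAP || level3 < 0 || level3 >= 1_000L
                || level2 < 0 || level2 >= 500L || level1 < 0) {
            throw new IllegalArgumentException("index outside the reserved gap");
        }
        return level1 * L1_GAP + level2 * L2_GAP + level3 * L3_GAP + token;
    }

    // Two positions share a block when integer division by the block size agrees.
    static boolean sameBlock(long a, long b, long gap) {
        return a / gap == b / gap;
    }

    public static void main(String[] args) {
        long t3 = positionOf(0, 1, 2, 5);
        long t4 = positionOf(0, 1, 2, 300);
        long t1 = positionOf(0, 1, 7, 42);
        System.out.println("T3/T4 same Level_3: " + sameBlock(t3, t4, L3_GAP));
        System.out.println("T1/T3 same Level_3: " + sameBlock(t1, t3, L3_GAP));
        System.out.println("T1/T3 same Level_2: " + sameBlock(t1, t3, L2_GAP));
        System.out.println("T1/T3 same Level_1: " + sameBlock(t1, t3, L1_GAP));
    }
}
```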

Sorry if this is confusing, what with all these levels going on. I think what it comes down to is whether the integer-based position counting could be replaced by long. Can this be done at all? Would there be performance or other implications? Or is the current implementation too deeply wired into Lucene's core workings to make this a reasonable endeavour?

Cheers
Rene

