Is your entire corpus a single document? Because I'm having trouble
imagining a single document where this would be a problem, unless
your increment gap is huge. The term positions are relative to
a single document...
It is getting pretty huge, yes (see below). The term positions are also
relative to a single field, aren't they?
<MyField>
<Level_1>
<Level_2>
<Level_3>
Let me plug in some figures to help clarify. On Level 3 there are
hundreds of tokens. So to be able to search two or more terms in MyField
in the same Level_3, I put a position gap of 1000 between all Level_3's.
Per Level_2 there might be hundreds of Level_3 entries. As I want to
restrict the search to all Level_3 entries of a Level_2, I set the
position increment gap for Level_2 at 1000x1000 = 1,000,000 (1000 for
the Tokens in Level_3 and 1000 for the Level_3 entries in Level_2).
This done, Level_1 still needs to be accomodated. If you're looking at
500 Level_2 entries, a gap of 1,000,000x500 is needed per Level_1 entry,
to be able to search only within each of the Level_1 elements.That way
only four Level_1 entries can be included before the maximum value is
reached.
Queries I am looking to support might look like this in an easy case:
Search in MyField: Terms T1 and T2 on Level_2 and T3, T4, and T5 on
Level_3, which should both be in the same Level_1.
Sorry if this is confusing, what with all these levels going on. I think
what it comes down to is whether the integer based position counting
might be replaced by long. Can this be done at all? Are performance or
other implications conceivable? Or is the current implementation too
deeply wired to Lucene core workings to make this a reasonable endeavour?
Cheers
Rene
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org