Re: Fwd: changing index format

Paul Elschot Wed, 25 Jun 2008 00:46:17 -0700

On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> Hi guys:
>     Perhaps I should have posted this to this list in the first
> place.
>
>     I am trying to work on a patch that exposes minDoc and maxDoc
> for each term. These values can be retrieved while constructing the
> TermInfo.
>
>     Knowing these two values can be very helpful in caching DocIdSet
> for a given Term. This would help to determine what type of
> underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
> etc.


I suppose you know about
https://issues.apache.org/jira/browse/LUCENE-1296 ?

But how about using TermScorer? In the trunk it's a subclass of
DocIdSetIterator (via Scorer) and the caching is already done by
Lucene and the underlying OS file cache.
TermScorer does some extra work for its scoring, but I don't think
that would affect performance.

>      The problem I am having is stated below: I don't know how to add
> the minDoc and maxDoc values to the index while keeping backward
> compatibility.

I doubt they would help very much. The most important info for choosing
an implementation is maxDoc from the index reader and the document
frequency of the term, and both are easily determined.
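As a rough illustration of that choice, here is a minimal sketch that
compares the estimated memory of a bit set against a sorted int list.
The class DocIdSetChooser, its methods and the 4-bytes-per-doc estimate
are assumptions made up for this example, not Lucene code. In Lucene the
two inputs would come from IndexReader.maxDoc() and
IndexReader.docFreq(Term).

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class DocIdSetChooser {
    enum Kind { BIT_SET, SORTED_INT_LIST }

    // A bit set needs one bit per document in the index,
    // no matter how many documents contain the term.
    static long bitSetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    // Assumption: an uncompressed sorted list needs 4 bytes per matching doc.
    // A VInt-compressed list would normally need less.
    static long sortedListBytes(int docFreq) {
        return 4L * docFreq;
    }

    // Pick whichever representation is estimated to be smaller.
    static Kind choose(int maxDoc, int docFreq) {
        return sortedListBytes(docFreq) < bitSetBytes(maxDoc)
                ? Kind.SORTED_INT_LIST
                : Kind.BIT_SET;
    }
}
```

With these estimates the break-even point is a document frequency of
about maxDoc / 32; a real implementation would also weigh iteration and
skipping speed, not only size.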

Btw, I've just started to add encoding of intervals of consecutive doc
ids to SortedVIntList. For very high document frequencies, that might
actually be faster than TermScorer and more compact than the current
index. Once I've got some working code I'll open an issue for it.
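To make the interval idea concrete, here is a minimal sketch, not the
SortedVIntList code itself. Each run of consecutive doc ids is stored as
two VInts: the gap from the end of the previous run, and the run length
minus one. The class IntervalVIntCodec and this exact layout are
assumptions chosen for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical codec for illustration only; not part of Lucene. */
final class IntervalVIntCodec {

    // Variable-length int: 7 bits per byte, low-order bits first,
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // pos[0] is the read cursor and is advanced past the decoded value.
    static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }

    // Input must be strictly increasing, non-negative doc ids.
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prevEnd = -1; // last doc id of the previous run
        int i = 0;
        while (i < sortedDocIds.length) {
            int j = i;
            // Extend the run while the next doc id is consecutive.
            while (j + 1 < sortedDocIds.length
                    && sortedDocIds[j + 1] == sortedDocIds[j] + 1) {
                j++;
            }
            writeVInt(out, sortedDocIds[i] - prevEnd - 1); // gap to run start
            writeVInt(out, j - i);                         // run length - 1
            prevEnd = sortedDocIds[j];
            i = j + 1;
        }
        return out.toByteArray();
    }

    static List<Integer> decode(byte[] bytes) {
        List<Integer> docs = new ArrayList<>();
        int[] pos = {0};
        int prevEnd = -1;
        while (pos[0] < bytes.length) {
            int start = prevEnd + 1 + readVInt(bytes, pos);
            int end = start + readVInt(bytes, pos);
            for (int doc = start; doc <= end; doc++) {
                docs.add(doc);
            }
            prevEnd = end;
        }
        return docs;
    }
}
```

In this layout a term that occurs in 1000 consecutive documents takes
3 bytes, while isolated doc ids cost one extra byte each compared with
plain gap encoding, so it only pays off when long runs are common.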

Regards,
Paul Elschot
