On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> Hi guys:
> Perhaps I should have posted this to this list in the first
> place.
>
> I am trying to work on a patch to expose minDoc and maxDoc for
> each term. These values can be retrieved while constructing the
> TermInfo.
>
> Knowing these two values can be very helpful in caching DocIdSet
> for a given Term. This would help to determine what type of
> underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
> etc.

I suppose you know about
https://issues.apache.org/jira/browse/LUCENE-1296 ?

But how about using TermScorer? In the trunk it's a subclass of
DocIdSetIterator (via Scorer) and the caching is already done by
Lucene and the underlying OS file cache.
TermScorer does some extra work for its scoring, but I don't think
that would affect performance.

> The problem I am having is stated below, I don't know how to add
> the minDoc and maxDoc values to the index while keeping backward
> compatibility.

I doubt they would help very much. The most important info for this
is maxDoc from the index reader and the document frequency of the
term, and these are easily determined.

Btw, I've just started to add encoding intervals of consecutive doc
ids to SortedVIntList. For very high document frequencies, that might
actually be faster than TermScorer and more compact than the current
index. Once I've got some working code I'll open an issue for it.

Regards,
Paul Elschot
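
As a rough illustration of the remark above that maxDoc and the term's document frequency are the main inputs for picking a representation, here is a minimal sketch. It is not Lucene code: the class DocIdSetChooser, the Kind labels, and the simple byte-count comparison are all assumptions made for the example. In a real index the two inputs would presumably come from IndexReader.maxDoc() and IndexReader.docFreq(Term).

```java
public class DocIdSetChooser {
    /** Hypothetical labels for candidate representations; these are not Lucene classes. */
    public enum Kind { BIT_SET, SORTED_INT_ARRAY }

    // Approximate size in bytes of a bit set with one bit per doc id in 0..maxDoc-1.
    static long bitSetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    // Approximate size in bytes of a sorted int[] holding one entry per matching doc.
    static long intArrayBytes(int docFreq) {
        return 4L * docFreq;
    }

    // Assumed heuristic: pick whichever representation needs less memory.
    // The crossover is at roughly docFreq = maxDoc / 32.
    public static Kind choose(int maxDoc, int docFreq) {
        return intArrayBytes(docFreq) < bitSetBytes(maxDoc)
                ? Kind.SORTED_INT_ARRAY
                : Kind.BIT_SET;
    }

    public static void main(String[] args) {
        System.out.println(choose(1_000_000, 100));      // rare term
        System.out.println(choose(1_000_000, 500_000));  // very frequent term
    }
}
```

With these assumptions a term matching 100 of 1,000,000 documents gets the sorted array (400 bytes against 125,000), and a term matching half of them gets the bit set. Neither decision needs per-term minDoc or maxDoc values.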