Thanks Paul and Mike for the feedback. Paul, for us, the sparsity of the docIds determines which data structure to use. While cardinality gives some of that information, knowing the min/max docId would help further. For example: say maxdoc=1000000, cardinality=7, and the docIds are {0,1,...,6} or {99993,99994,...,99999}. An ArrayDocIdSet would take 28 bytes, while a bitset sized from the min/max docId would take only 1 byte (a bitset spanning all of maxdoc would take 125,000 bytes). Furthermore, knowing the min/max docId helps predetermine the size needed when constructing a given DocIdSet data structure, avoiding growth during population. The sketch below illustrates the kind of selection logic we have in mind.
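To make that concrete, here is a minimal sketch of the heuristic; the class and method names are hypothetical illustrations, not existing Lucene APIs, and a real implementation would weigh iteration cost as well as raw size:

// Hypothetical sketch of the selection heuristic; names are illustrative.
public class DocIdSetSelector {

  // Cost of an int-array representation: one 4-byte int per docId.
  static long arrayCost(long cardinality) {
    return 4L * cardinality;
  }

  // Cost of a bit set that only spans [minDoc, maxDoc]: one bit per doc
  // in the range, rounded up to whole bytes.
  static long bitsetCost(long minDoc, long maxDoc) {
    long bits = maxDoc - minDoc + 1;
    return (bits + 7) / 8;
  }

  // Decide, before reading any postings, which structure to build.
  static boolean useBitSet(long cardinality, long minDoc, long maxDoc) {
    return bitsetCost(minDoc, maxDoc) <= arrayCost(cardinality);
  }

  public static void main(String[] args) {
    // The example above: maxdoc = 1,000,000, cardinality = 7.
    // Clustered docIds {99993..99999}: 1 byte as a ranged bit set
    // versus 28 bytes as an int array.
    System.out.println(useBitSet(7, 99993, 99999)); // true
    // Without min/max, a bit set must span all of maxdoc.
    System.out.println(bitsetCost(0, 999999));      // 125000
  }
}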
Thanks for pointing me to SortedVIntList; what is the underlying compression algorithm? We have developed a DocIdSet implementation using a variation of the P4Delta compression algorithm (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to contribute at some point. In our benchmarks we get about 70% compression (30% of the original size) relative to int arrays, and it still supports iteration in the compressed format with performance similar to OpenBitSet. (Iterating over plain arrays is much faster than iterating over OpenBitSet.)

I am not sure TermScorer serves the purpose here. TermScorer reads a batch of 32 docs at a time (I don't understand why 32 was picked, or whether it should be customizable), and we can't rely on "getting lucky" with the underlying OS cache. Often we want to move the construction of some filters ahead of query time, while the IndexReader loads. Here is an example: say we have a field called gender with only two terms, M and F, and our queries are always of the form "content:<query text> AND gender:M/F". It is ideal to keep the DocIdSets for M and F in memory for the life of the IndexReader; I can't imagine that constructing a TermScorer for each query would perform comparably. The sketch below shows the kind of per-reader caching we have in mind.
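As a rough illustration, here is a minimal sketch of such a per-reader cache, assuming Lucene 2.x-era APIs (IndexReader.termDocs, OpenBitSet); the cache class itself is hypothetical:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Hypothetical sketch: build the bit set for a low-cardinality term once
// and keep it for the life of the IndexReader. The WeakHashMap lets the
// per-reader entries be collected once the reader is no longer referenced.
public class TermDocIdSetCache {

  private final Map<IndexReader, Map<Term, OpenBitSet>> cache =
      new WeakHashMap<IndexReader, Map<Term, OpenBitSet>>();

  public synchronized OpenBitSet get(IndexReader reader, Term term)
      throws IOException {
    Map<Term, OpenBitSet> perReader = cache.get(reader);
    if (perReader == null) {
      perReader = new HashMap<Term, OpenBitSet>();
      cache.put(reader, perReader);
    }
    OpenBitSet bits = perReader.get(term);
    if (bits == null) {
      bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(term);
      try {
        while (td.next()) {
          bits.set(td.doc());
        }
      } finally {
        td.close(); // always release the underlying TermDocs
      }
      perReader.put(term, bits);
    }
    return bits;
  }
}

In the gender example, get(reader, new Term("gender", "M")) is computed once per reader, and, if I read the trunk right, OpenBitSet is itself a DocIdSet, so the result can be handed to a filter directly instead of re-reading the gender postings on every request.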
Reading the trunk code for TermScorer, I don't see the internal TermDocs being closed in skipTo. skipTo returns a boolean that tells the caller whether the end has been reached, and the caller may not (and should not) call next again just to have it closed. So wouldn't that scenario leak? Also, in explain(docId), what happens if the TermDocs has already been closed by the next() call?

Thanks

-John

On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:

> On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > Hi guys:
> >     Perhaps I should have posted this to this list in the first
> > place.
> >
> > I am trying to work on a patch to expose, for each term, minDoc
> > and maxDoc. These values can be retrieved while constructing the
> > TermInfo.
> >
> > Knowing these two values can be very helpful in caching a DocIdSet
> > for a given Term. This would help to determine what type of
> > underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
> > etc.
>
> I suppose you know about
> https://issues.apache.org/jira/browse/LUCENE-1296 ?
>
> But how about using TermScorer? In the trunk it's a subclass of
> DocIdSetIterator (via Scorer) and the caching is already done by
> Lucene and the underlying OS file cache.
> TermScorer does some extra work for its scoring, but I don't think
> that would affect performance.
>
> > The problem I am having is stated below: I don't know how to add
> > the minDoc and maxDoc values to the index while keeping backward
> > compatibility.
>
> I doubt they would help very much. The most important info for this
> is maxDoc from the index reader and the document frequency of the term,
> and these are easily determined.
>
> Btw, I've just started to add encoding of intervals of consecutive
> doc ids to SortedVIntList. For very high document frequencies, that
> might actually be faster than TermScorer and more compact than the
> current index. Once I've got some working code I'll open an issue
> for it.
>
> Regards,
> Paul Elschot