Hi Paul:

Regarding your comment on adding a DocIdSet as required or prohibited to
BooleanQuery:

On top of the new DocIdSet and DocIdSetIterator abstractions, we have also
developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, and
furthermore a DocIdSetQuery class that honors the Query API contracts. Given
these tools, we are able to build a customized, scored BooleanQuery-like
query infrastructure. We'd be happy to contribute them. A simplified sketch
of the AND case follows below.
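This is only a rough sketch against the DocIdSetIterator API as it stands on
the trunk (doc()/next()/skipTo()); the class and variable names here are
illustrative, not the actual code we'd contribute:

import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;

/**
 * Simplified sketch of an AND decorator over two DocIdSets.
 * Illustrative only -- names and details differ from our real code.
 */
public class AndDocIdSet extends DocIdSet {
  private final DocIdSet a, b;

  public AndDocIdSet(DocIdSet a, DocIdSet b) {
    this.a = a;
    this.b = b;
  }

  public DocIdSetIterator iterator() {
    final DocIdSetIterator ia = a.iterator();
    final DocIdSetIterator ib = b.iterator();
    return new DocIdSetIterator() {
      private int doc = -1;

      public int doc() {
        return doc;
      }

      public boolean next() throws IOException {
        return ia.next() && alignTo(ia.doc());
      }

      public boolean skipTo(int target) throws IOException {
        return ia.skipTo(target) && alignTo(ia.doc());
      }

      // Leapfrog: skip each iterator to the other's doc id until
      // both sit on the same document, or one of them is exhausted.
      private boolean alignTo(int candidate) throws IOException {
        while (true) {
          if (!ib.skipTo(candidate)) return false;
          if (ib.doc() == candidate) { doc = candidate; return true; }
          if (!ia.skipTo(ib.doc())) return false;
          candidate = ia.doc();
          if (candidate == ib.doc()) { doc = candidate; return true; }
        }
      }
    };
  }
}

OrDocIdSet follows the same pattern with a merge instead of a leapfrog, and
NotDocIdSet additionally needs the reader's maxDoc to enumerate the
complement.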
Thanks,

-John

(One follow-up note inline below, and a P.S. with a caching sketch at the
end.)

On Wed, Jun 25, 2008 at 9:29 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Wednesday 25 June 2008 17:05:17, John Wang wrote:
> > Thanks Paul and Mike for the feedback.
> >
> > Paul, for us, the sparsity of the doc ids determines which data
> > structure to use. Cardinality gives some of that, but min/max doc id
> > would also help. For example: say maxDoc = 1000000 and cardinality = 7,
> > with doc ids {0,1,...,6} or {99993,99994,...,99999}. An int-array
> > DocIdSet takes 28 bytes in either case, while a bitset over the range
> > [minDocId, maxDocId] would take only 1 byte -- a bitset sized by the
> > reader's maxDoc would take about 125 KB.
> >
> > Furthermore, knowing min/max doc id would help to predetermine the size
> > needed when constructing a given DocIdSet data structure, to avoid
> > growing it.
> >
> > Thanks for pointing me to SortedVIntList; what is the underlying
> > compression algorithm?
>
> A SortedVIntList uses a byte array to store the docid differences as
> a series of VInts, with a VInt being a series of bytes in which the
> high bit is a continuation bit, and the remaining bits are data for an
> unsigned integer. The same VInt is used in a Lucene index in various
> places.
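Thanks -- so, if I read IndexInput.readVInt on the trunk correctly, decoding
a single VInt from the byte array is essentially this (paraphrased, not the
actual SortedVIntList code):

// Decode one VInt starting at bytes[pos]: 7 data bits per byte,
// least-significant group first; a set high bit means another byte follows.
static int readVInt(byte[] bytes, int pos) {
  byte b = bytes[pos++];
  int value = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = bytes[pos++];
    value |= (b & 0x7F) << shift;
  }
  return value;
}

And the iterator would then keep a running sum of the decoded differences to
recover the absolute doc ids.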
> > We have developed a DocIdSet implementation using a variation of the
> > P4Delta compression algorithm
> > (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to
> > contribute sometime. From our benchmarks, we get about 70% compression
> > (30% of the original size) relative to int arrays, and it still allows
> > iteration in the compressed format with performance similar to
> > OpenBitSet. (Iterating over plain int arrays is much faster than over
> > OpenBitSet.)
>
> Andrzej recently pointed to a paper on PForDelta, and since then
> I have a Java implementation rather low on my todo list.
> Needless to say, I'm interested to see it contributed.
>
> > I am not sure TermScorer serves the purpose here. TermScorer reads a
> > batch of 32 docs at a time (I don't understand why 32 was picked, or
> > whether it should be customizable), and we can't rely on "getting
> > lucky" with the underlying OS cache. Many times we want to move the
> > construction of some filters ahead of the IndexReader's reads. Here is
> > an example: say we have a field called gender with only two terms, M
> > and F, and our query is always of the form "content:query text AND
> > gender:M/F". It is ideal to keep the DocIdSets for M and F in memory
> > for the life of the IndexReader (see the P.S. below for a sketch). I
> > can't imagine constructing a TermScorer for each query would perform
> > comparably.
>
> Well, you can give TermScorer a try before writing other code.
> Adding a DocIdSet as required or prohibited to a BooleanQuery
> would be nice, but that is not yet possible.
>
> > Reading the trunk code for TermScorer, I don't see that the internal
> > termDocs is closed in skipTo. skipTo returns a boolean which tells the
> > caller whether the end has been reached, and once it returns false the
> > caller may not (and should not) call next again to have it closed. So
> > wouldn't this scenario leak?
>
> Closing of Scorers has been discussed before; the only conclusion
> I remember now is that there is no bug in the current code.
>
> > Also, in explain(docid), what happens if termDocs has already been
> > closed by the next() call?
>
> When explain() is called on a Scorer, next() and skipTo() should
> not be called. A Scorer can either explain, or search, but not both.
>
> Regards,
> Paul Elschot
>
> >
> > Thanks
> >
> > -John
> >
> > On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot <[EMAIL PROTECTED]>
> > wrote:
> > > On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > > > Hi guys:
> > > >
> > > > Perhaps I should have posted this to this list in the first
> > > > place.
> > > >
> > > > I am trying to work on a patch to expose, for each term, minDoc
> > > > and maxDoc. These values can be retrieved while constructing the
> > > > TermInfo.
> > > >
> > > > Knowing these two values can be very helpful when caching a
> > > > DocIdSet for a given term. They would help to determine what type
> > > > of underlying implementation to use, e.g. BitSet, HashSet,
> > > > ArraySet, etc.
> > >
> > > I suppose you know about
> > > https://issues.apache.org/jira/browse/LUCENE-1296 ?
> > >
> > > But how about using TermScorer? In the trunk it's a subclass of
> > > DocIdSetIterator (via Scorer), and the caching is already done by
> > > Lucene and the underlying OS file cache.
> > > TermScorer does some extra work for its scoring, but I don't think
> > > that would affect performance.
> > >
> > > > The problem I am having is stated below: I don't know how to
> > > > add the minDoc and maxDoc values to the index while keeping
> > > > backward compatibility.
> > >
> > > I doubt they would help very much. The most important info for this
> > > is maxDoc from the index reader and the document frequency of the
> > > term, and these are easily determined.
> > >
> > > Btw, I've just started to add encoding of intervals of consecutive
> > > doc ids to SortedVIntList. For very high document frequencies, that
> > > might actually be faster than TermScorer and more compact than the
> > > current index. Once I've got some working code I'll open an issue
> > > for it.
> > >
> > > Regards,
> > > Paul Elschot
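P.S. To make the gender example above concrete: what we have in mind is one
cached DocIdSet per term, kept for the life of the IndexReader. A simplified
sketch follows -- class and method names are made up for illustration, and it
assumes OpenBitSet (as a DocIdSet) and the TermDocs API on the current trunk:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;

/**
 * Per-reader cache of one DocIdSet per term; discard it together with
 * its IndexReader. Illustrative sketch, not production code.
 */
public class TermDocIdSetCache {
  private final IndexReader reader;
  private final Map cache = new HashMap(); // Term -> DocIdSet

  public TermDocIdSetCache(IndexReader reader) {
    this.reader = reader;
  }

  // E.g. get(new Term("gender", "M")) builds the set on first use
  // and serves it from memory afterwards.
  public synchronized DocIdSet get(Term term) throws IOException {
    DocIdSet cached = (DocIdSet) cache.get(term);
    if (cached == null) {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(term);
      try {
        while (td.next()) {
          bits.fastSet(td.doc());
        }
      } finally {
        td.close();
      }
      cache.put(term, cached = bits);
    }
    return cached;
  }
}

With the M and F sets built once per reader, the gender clause of each query
then reduces to wrapping the cached set and the content query's doc ids in
the AND decorator sketched at the top.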