On Wednesday 25 June 2008 17:05:17, John Wang wrote:
> Thanks Paul and Mike for the feedback.
>
> Paul, for us, sparsity of the docIds determines which data structure
> to use. Cardinality gives some of that, but min/max docId would also
> help. Example:
>
> say maxdoc=1000000, cardinality=7, and the docids are {0,1,...,6} or
> {99993,99994,...,99999}: an arrayDocIdSet would take 28 bytes, while
> a bitset covering just the [min,max] range would take only 1 byte.
>
> Furthermore, knowing min/maxDocId helps to predetermine the size
> needed when constructing a given DocIdSet data structure, avoiding
> regrowth.
>
> Thanks for pointing me to SortedVIntList, what is the underlying
> compression algorithm?

A SortedVIntList uses a byte array that stores the differences between
successive docids as a series of VInts. A VInt is a sequence of bytes
in which the high bit of each byte is a continuation bit and the
remaining seven bits are data for an unsigned integer, least
significant bits first. The same VInt encoding is used in various
places in a Lucene index.
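In code, the format looks about like this. This is a standalone sketch
for illustration, not the actual implementation in
org.apache.lucene.store; the class and method names here are made up:

import java.io.ByteArrayOutputStream;

public class VIntDemo {

  /** Append i as a VInt: 7 data bits per byte, low bits first;
   *  the high bit of each byte is set when more bytes follow. */
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {        // more than 7 bits remain
      out.write((i & 0x7F) | 0x80);   // 7 data bits plus continuation bit
      i >>>= 7;                       // unsigned shift
    }
    out.write(i);                     // last byte, continuation bit clear
  }

  /** Read one VInt from bytes, starting at pos[0] and advancing it. */
  static int readVInt(byte[] bytes, int[] pos) {
    byte b = bytes[pos[0]++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = bytes[pos[0]++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  public static void main(String[] args) {
    int[] docIds = { 5, 120, 121, 1000, 99993 };
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int docId : docIds) {        // store differences, as SortedVIntList does
      writeVInt(out, docId - previous);
      previous = docId;
    }
    byte[] bytes = out.toByteArray();
    System.out.println(docIds.length * 4 + " bytes as int[], "
        + bytes.length + " bytes as VInt deltas");

    int[] pos = { 0 };                // decode by prefix-summing the deltas
    int docId = 0;
    while (pos[0] < bytes.length) {
      docId += readVInt(bytes, pos);
      System.out.println(docId);
    }
  }
}

Each difference smaller than 128 takes a single byte, which is where
the compression comes from on docid sets without large gaps.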
> We have developed a DocIdSet implementation using a variation of the
> P4Delta compression algorithm
> (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to
> contribute sometime. From our benchmarks we get about 70% compression
> (30% of the original size) of arrays, and it also gives you iteration
> in the compressed format with performance similar to OpenBitSet.
> (Iterating over arrays is much faster than over OpenBitSet.)

Andrzej recently pointed to a paper on PForDelta, and since then I
have a java implementation rather low on my todo list. Needless to say
I'm interested to see it contributed. A much-simplified sketch of the
patching idea behind PForDelta is at the end of this message.

> I am not sure TermScorer serves the purpose here. TermScorer reads a
> batch of 32 docs at a time (I don't understand why 32 was picked, or
> whether it should be customizable), so we can't rely on "getting
> lucky" with the underlying OS cache. Many times we want to move the
> construction of some filters ahead, while the IndexReader loads.
> Here is an example: say we have a field called gender with only 2
> terms, M and F, and our query is always of the form "content:query
> text AND gender:M/F". Then it is ideal to keep the DocIdSets for M
> and F in memory for the life of the IndexReader. I can't imagine
> that constructing a TermScorer for each query gives similar
> performance.

Well, you can give TermScorer a try before writing other code. Adding
a DocIdSet as a required or prohibited clause to a BooleanQuery would
be nice, but that is not yet possible.
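In case you do want to cache, a per-term DocIdSet kept for the life of
the IndexReader could be built with something like the sketch below.
It uses OpenBitSet (which extends DocIdSet in the trunk) and the
TermDocs API; the class name and the cache layout are made up here:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;

/** Sketch: one cached DocIdSet per term, for the life of an IndexReader. */
public class TermDocIdSetCache {

  private final IndexReader reader;
  private final Map<Term, DocIdSet> byTerm = new HashMap<Term, DocIdSet>();

  public TermDocIdSetCache(IndexReader reader) {
    this.reader = reader;
  }

  /** Returns the DocIdSet for the given term, building it on first use. */
  public synchronized DocIdSet get(Term term) throws IOException {
    DocIdSet cached = byTerm.get(term);
    if (cached == null) {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs termDocs = reader.termDocs(term);
      try {
        while (termDocs.next()) {     // one bit per matching document
          bits.set(termDocs.doc());
        }
      } finally {
        termDocs.close();
      }
      cached = bits;
      byTerm.put(term, cached);
    }
    return cached;
  }
}

For your gender example that would mean one such cache per
IndexReader, with get(new Term("gender", "M")) filtering the content
query. For a sparse term, an int array behind the DocIdSet interface
would take less memory than the OpenBitSet used here, which is where
your min/maxDocId idea comes in.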
> Reading the trunk code for TermScorer, I don't see that the internal
> termDocs is closed in skipTo. skipTo returns a boolean that tells
> the caller whether the end is reached, so the caller may not, and
> should not, call next again to have it closed. Wouldn't this
> scenario leak?

Closing of Scorers has been discussed before; the only conclusion I
remember now is that there is no bug in the current code.

> Also, in explain(docid), what happens if the termDocs is already
> closed from the next() call?

When explain() is called on a Scorer, next() and skipTo() should not
be called. A Scorer can either explain, or search, but not both.

Regards,
Paul Elschot

> Thanks
>
> -John
>
> On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot
> <[EMAIL PROTECTED]> wrote:
> > On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > > Hi guys:
> > >    Perhaps I should have posted this to this list in the first
> > > place.
> > >
> > >    I am trying to work on a patch to expose, for each term,
> > > minDoc and maxDoc. These values can be retrieved while
> > > constructing the TermInfo.
> > >
> > >    Knowing these two values can be very helpful in caching a
> > > DocIdSet for a given Term. They would help to determine what
> > > type of underlying implementation to use, e.g. BitSet, HashSet,
> > > or ArraySet, etc.
> >
> > I suppose you know about
> > https://issues.apache.org/jira/browse/LUCENE-1296 ?
> >
> > But how about using TermScorer? In the trunk it's a subclass of
> > DocIdSetIterator (via Scorer), and the caching is already done by
> > Lucene and the underlying OS file cache.
> > TermScorer does some extra work for its scoring, but I don't think
> > that would affect performance.
> >
> > > The problem I am having is stated below: I don't know how to add
> > > the minDoc and maxDoc values to the index while keeping backward
> > > compatibility.
> >
> > I doubt they would help very much. The most important info for
> > this is maxDoc from the index reader and the document frequency of
> > the term, and these are easily determined.
> >
> > Btw, I've just started to add encoding of intervals of consecutive
> > doc ids to SortedVIntList. For very high document frequencies,
> > that might actually be faster than TermScorer and more compact
> > than the current index. Once I've got some working code I'll open
> > an issue for it.
> >
> > Regards,
> > Paul Elschot
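P.S. Since PForDelta came up above: as I understand the paper, the
core idea is that most values of a block are bit-packed at a fixed
small width b, and the few values that do not fit in b bits are stored
separately as exceptions and patched back in after unpacking. The
sketch below shows only that patching idea; the real format works on
fixed blocks of 128 values, picks b so that roughly 90% of the values
fit, and chains the exceptions through the unused packed slots, none
of which is done here:

import java.util.ArrayList;
import java.util.List;

/** Much-simplified PForDelta-style sketch: bit-pack most values at a
 *  fixed width b, store values needing more than b bits as exceptions. */
public class PatchedFrameDemo {

  final int bits;          // width b used for the packed values
  final long[] packed;     // lower b bits of every value
  final int[] excPos;      // positions of values that did not fit
  final int[] excVal;      // their full values
  final int size;

  PatchedFrameDemo(int[] values, int bits) {
    this.bits = bits;
    this.size = values.length;
    this.packed = new long[(values.length * bits + 63) / 64];
    List<int[]> exceptions = new ArrayList<int[]>();
    int mask = (1 << bits) - 1;
    for (int i = 0; i < values.length; i++) {
      int v = values[i];
      if ((v & ~mask) != 0) {               // does not fit in b bits
        exceptions.add(new int[] { i, v });
        v = 0;                              // placeholder in the packed area
      }
      setBits(i, v);
    }
    excPos = new int[exceptions.size()];
    excVal = new int[exceptions.size()];
    for (int j = 0; j < exceptions.size(); j++) {
      excPos[j] = exceptions.get(j)[0];
      excVal[j] = exceptions.get(j)[1];
    }
  }

  private void setBits(int index, int v) {
    long bitIndex = (long) index * bits;
    int word = (int) (bitIndex >>> 6);
    int shift = (int) (bitIndex & 63);
    packed[word] |= ((long) v) << shift;
    if (shift + bits > 64) {                // value straddles two words
      packed[word + 1] |= ((long) v) >>> (64 - shift);
    }
  }

  int[] decode() {
    int[] values = new int[size];
    long mask = (1L << bits) - 1;
    for (int i = 0; i < size; i++) {
      long bitIndex = (long) i * bits;
      int word = (int) (bitIndex >>> 6);
      int shift = (int) (bitIndex & 63);
      long v = packed[word] >>> shift;
      if (shift + bits > 64) {
        v |= packed[word + 1] << (64 - shift);
      }
      values[i] = (int) (v & mask);
    }
    for (int j = 0; j < excPos.length; j++) {
      values[excPos[j]] = excVal[j];        // patch the exceptions back in
    }
    return values;
  }

  public static void main(String[] args) {
    // docid deltas: mostly small, with a few large outliers
    int[] deltas = { 3, 1, 7, 2, 900, 4, 1, 6, 12000, 2, 5, 3 };
    PatchedFrameDemo frame = new PatchedFrameDemo(deltas, 4);
    System.out.println(java.util.Arrays.equals(deltas, frame.decode()));
  }
}

Because decoding is fixed-width shifts and masks plus a short patch
loop, iteration over the compressed form can stay close to array speed,
which would explain the numbers you report.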