On Wednesday 25 June 2008 17:05:17, John Wang wrote:
> Thanks Paul and Mike for the feedback.
>
> Paul, for us, sparsity of the docIds determines which data structure
> to use. Cardinality gives some of that, but min/max docId would also
> help. Example:
>
> say maxdoc=1000000, cardinality=7, and the docids are {0,1,...,6} or
> {99993,99994,...,99999}: an arrayDocIdSet would take 28 bytes, while
> a bitset covering just the [min,max] range would take only 1 byte.
>
> Furthermore, knowing min/maxDocId helps to predetermine the size
> needed when constructing a given DocIdSet data structure, avoiding
> regrowth.
>
> Thanks for pointing me to SortedVIntList, what is the underlying
> compression algorithm?

A SortedVIntList uses a byte array that stores the differences between
successive docids as a series of VInts. A VInt is a sequence of bytes
in which the high bit of each byte is a continuation bit and the
remaining seven bits are data for an unsigned integer, least
significant bits first. The same VInt encoding is used in various
places in a Lucene index.
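In code, the format looks about like this. This is a standalone sketch
for illustration, not the actual implementation in
org.apache.lucene.store; the class and method names here are made up:

import java.io.ByteArrayOutputStream;

public class VIntDemo {

  /** Append i as a VInt: 7 data bits per byte, low bits first;
   *  the high bit of each byte is set when more bytes follow. */
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {        // more than 7 bits remain
      out.write((i & 0x7F) | 0x80);   // 7 data bits plus continuation bit
      i >>>= 7;                       // unsigned shift
    }
    out.write(i);                     // last byte, continuation bit clear
  }

  /** Read one VInt from bytes, starting at pos[0] and advancing it. */
  static int readVInt(byte[] bytes, int[] pos) {
    byte b = bytes[pos[0]++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = bytes[pos[0]++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  public static void main(String[] args) {
    int[] docIds = { 5, 120, 121, 1000, 99993 };
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int docId : docIds) {        // store differences, as SortedVIntList does
      writeVInt(out, docId - previous);
      previous = docId;
    }
    byte[] bytes = out.toByteArray();
    System.out.println(docIds.length * 4 + " bytes as int[], "
        + bytes.length + " bytes as VInt deltas");

    int[] pos = { 0 };                // decode by prefix-summing the deltas
    int docId = 0;
    while (pos[0] < bytes.length) {
      docId += readVInt(bytes, pos);
      System.out.println(docId);
    }
  }
}

Each difference smaller than 128 takes a single byte, which is where
the compression comes from on docid sets without large gaps.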
> We have developed a DocIdSet implementation using a variation of the
> P4Delta compression algorithm
> (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to
> contribute sometime. From our benchmarks we get about 70% compression
> (30% of the original size) of arrays, and it also gives you iteration
> in the compressed format with performance similar to OpenBitSet.
> (Iterating over arrays is much faster than over OpenBitSet.)

Andrzej recently pointed to a paper on PForDelta, and since then I
have a java implementation rather low on my todo list. Needless to say
I'm interested to see it contributed. A much-simplified sketch of the
patching idea behind PForDelta is at the end of this message.

> I am not sure TermScorer serves the purpose here. TermScorer reads a
> batch of 32 docs at a time (I don't understand why 32 was picked, or
> whether it should be customizable), so we can't rely on "getting
> lucky" with the underlying OS cache. Many times we want to move the
> construction of some filters ahead, while the IndexReader loads.
> Here is an example: say we have a field called gender with only 2
> terms, M and F, and our query is always of the form "content:query
> text AND gender:M/F". Then it is ideal to keep the DocIdSets for M
> and F in memory for the life of the IndexReader. I can't imagine
> that constructing a TermScorer for each query gives similar
> performance.

Well, you can give TermScorer a try before writing other code. Adding
a DocIdSet as a required or prohibited clause to a BooleanQuery would
be nice, but that is not yet possible.
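In case you do want to cache, a per-term DocIdSet kept for the life of
the IndexReader could be built with something like the sketch below.
It uses OpenBitSet (which extends DocIdSet in the trunk) and the
TermDocs API; the class name and the cache layout are made up here:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;

/** Sketch: one cached DocIdSet per term, for the life of an IndexReader. */
public class TermDocIdSetCache {

  private final IndexReader reader;
  private final Map<Term, DocIdSet> byTerm = new HashMap<Term, DocIdSet>();

  public TermDocIdSetCache(IndexReader reader) {
    this.reader = reader;
  }

  /** Returns the DocIdSet for the given term, building it on first use. */
  public synchronized DocIdSet get(Term term) throws IOException {
    DocIdSet cached = byTerm.get(term);
    if (cached == null) {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs termDocs = reader.termDocs(term);
      try {
        while (termDocs.next()) {     // one bit per matching document
          bits.set(termDocs.doc());
        }
      } finally {
        termDocs.close();
      }
      cached = bits;
      byTerm.put(term, cached);
    }
    return cached;
  }
}

For your gender example that would mean one such cache per
IndexReader, with get(new Term("gender", "M")) filtering the content
query. For a sparse term, an int array behind the DocIdSet interface
would take less memory than the OpenBitSet used here, which is where
your min/maxDocId idea comes in.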
> Reading the trunk code for TermScorer, I don't see that the internal
> termDocs is closed in skipTo. skipTo returns a boolean that tells
> the caller whether the end is reached, so the caller may not, and
> should not, call next again to have it closed. Wouldn't this
> scenario leak?

Closing of Scorers has been discussed before; the only conclusion I
remember now is that there is no bug in the current code.

> Also, in explain(docid), what happens if the termDocs is already
> closed from the next() call?

When explain() is called on a Scorer, next() and skipTo() should not
be called. A Scorer can either explain, or search, but not both.

Regards,
Paul Elschot

> Thanks
>
> -John
>
> On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot
> <[EMAIL PROTECTED]> wrote:
> > On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > > Hi guys:
> > >    Perhaps I should have posted this to this list in the first
> > > place.
> > >
> > >    I am trying to work on a patch to expose, for each term,
> > > minDoc and maxDoc. These values can be retrieved while
> > > constructing the TermInfo.
> > >
> > >    Knowing these two values can be very helpful in caching a
> > > DocIdSet for a given Term. They would help to determine what
> > > type of underlying implementation to use, e.g. BitSet, HashSet,
> > > or ArraySet, etc.
> >
> > I suppose you know about
> > https://issues.apache.org/jira/browse/LUCENE-1296 ?
> >
> > But how about using TermScorer? In the trunk it's a subclass of
> > DocIdSetIterator (via Scorer), and the caching is already done by
> > Lucene and the underlying OS file cache.
> > TermScorer does some extra work for its scoring, but I don't think
> > that would affect performance.
> >
> > > The problem I am having is stated below: I don't know how to add
> > > the minDoc and maxDoc values to the index while keeping backward
> > > compatibility.
> >
> > I doubt they would help very much. The most important info for
> > this is maxDoc from the index reader and the document frequency of
> > the term, and these are easily determined.
> >
> > Btw, I've just started to add encoding of intervals of consecutive
> > doc ids to SortedVIntList. For very high document frequencies,
> > that might actually be faster than TermScorer and more compact
> > than the current index. Once I've got some working code I'll open
> > an issue for it.
> >
> > Regards,
> > Paul Elschot
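P.S. Since PForDelta came up above: as I understand the paper, the
core idea is that most values of a block are bit-packed at a fixed
small width b, and the few values that do not fit in b bits are stored
separately as exceptions and patched back in after unpacking. The
sketch below shows only that patching idea; the real format works on
fixed blocks of 128 values, picks b so that roughly 90% of the values
fit, and chains the exceptions through the unused packed slots, none
of which is done here:

import java.util.ArrayList;
import java.util.List;

/** Much-simplified PForDelta-style sketch: bit-pack most values at a
 *  fixed width b, store values needing more than b bits as exceptions. */
public class PatchedFrameDemo {

  final int bits;          // width b used for the packed values
  final long[] packed;     // lower b bits of every value
  final int[] excPos;      // positions of values that did not fit
  final int[] excVal;      // their full values
  final int size;

  PatchedFrameDemo(int[] values, int bits) {
    this.bits = bits;
    this.size = values.length;
    this.packed = new long[(values.length * bits + 63) / 64];
    List<int[]> exceptions = new ArrayList<int[]>();
    int mask = (1 << bits) - 1;
    for (int i = 0; i < values.length; i++) {
      int v = values[i];
      if ((v & ~mask) != 0) {               // does not fit in b bits
        exceptions.add(new int[] { i, v });
        v = 0;                              // placeholder in the packed area
      }
      setBits(i, v);
    }
    excPos = new int[exceptions.size()];
    excVal = new int[exceptions.size()];
    for (int j = 0; j < exceptions.size(); j++) {
      excPos[j] = exceptions.get(j)[0];
      excVal[j] = exceptions.get(j)[1];
    }
  }

  private void setBits(int index, int v) {
    long bitIndex = (long) index * bits;
    int word = (int) (bitIndex >>> 6);
    int shift = (int) (bitIndex & 63);
    packed[word] |= ((long) v) << shift;
    if (shift + bits > 64) {                // value straddles two words
      packed[word + 1] |= ((long) v) >>> (64 - shift);
    }
  }

  int[] decode() {
    int[] values = new int[size];
    long mask = (1L << bits) - 1;
    for (int i = 0; i < size; i++) {
      long bitIndex = (long) i * bits;
      int word = (int) (bitIndex >>> 6);
      int shift = (int) (bitIndex & 63);
      long v = packed[word] >>> shift;
      if (shift + bits > 64) {
        v |= packed[word + 1] << (64 - shift);
      }
      values[i] = (int) (v & mask);
    }
    for (int j = 0; j < excPos.length; j++) {
      values[excPos[j]] = excVal[j];        // patch the exceptions back in
    }
    return values;
  }

  public static void main(String[] args) {
    // docid deltas: mostly small, with a few large outliers
    int[] deltas = { 3, 1, 7, 2, 900, 4, 1, 6, 12000, 2, 5, 3 };
    PatchedFrameDemo frame = new PatchedFrameDemo(deltas, 4);
    System.out.println(java.util.Arrays.equals(deltas, frame.decode()));
  }
}

Because decoding is fixed-width shifts and masks plus a short patch
loop, iteration over the compressed form can stay close to array speed,
which would explain the numbers you report.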