Hi Paul:

Regarding your comment on adding a DocIdSet as required or prohibited to
BooleanQuery:

On top of the new DocIdSet and DocIdSetIterator abstractions, we have also
developed decorators such as AndDocIdSet, OrDocIdSet and NotDocIdSet, and
furthermore a DocIdSetQuery class that honors the Query API contracts. Given
these tools, we are able to build a customized, scored BooleanQuery-like
query infrastructure. We'd be happy to contribute them. A simplified sketch
of the AND case follows below.
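This is only a rough sketch against the DocIdSetIterator API as it stands on
the trunk (doc()/next()/skipTo()); the class and variable names here are
illustrative, not the actual code we'd contribute:

import java.io.IOException;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;

/**
 * Simplified sketch of an AND decorator over two DocIdSets.
 * Illustrative only -- names and details differ from our real code.
 */
public class AndDocIdSet extends DocIdSet {
  private final DocIdSet a, b;

  public AndDocIdSet(DocIdSet a, DocIdSet b) {
    this.a = a;
    this.b = b;
  }

  public DocIdSetIterator iterator() {
    final DocIdSetIterator ia = a.iterator();
    final DocIdSetIterator ib = b.iterator();
    return new DocIdSetIterator() {
      private int doc = -1;

      public int doc() {
        return doc;
      }

      public boolean next() throws IOException {
        return ia.next() && alignTo(ia.doc());
      }

      public boolean skipTo(int target) throws IOException {
        return ia.skipTo(target) && alignTo(ia.doc());
      }

      // Leapfrog: skip each iterator to the other's doc id until
      // both sit on the same document, or one of them is exhausted.
      private boolean alignTo(int candidate) throws IOException {
        while (true) {
          if (!ib.skipTo(candidate)) return false;
          if (ib.doc() == candidate) { doc = candidate; return true; }
          if (!ia.skipTo(ib.doc())) return false;
          candidate = ia.doc();
          if (candidate == ib.doc()) { doc = candidate; return true; }
        }
      }
    };
  }
}

OrDocIdSet follows the same pattern with a merge instead of a leapfrog, and
NotDocIdSet additionally needs the reader's maxDoc to enumerate the
complement.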
Thanks,

-John

(One follow-up note inline below, and a P.S. with a caching sketch at the
end.)

On Wed, Jun 25, 2008 at 9:29 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Wednesday 25 June 2008 17:05:17, John Wang wrote:
> > Thanks Paul and Mike for the feedback.
> >
> > Paul, for us, the sparsity of the doc ids determines which data
> > structure to use. Cardinality gives some of that, but min/max doc id
> > would also help. For example: say maxDoc = 1000000 and cardinality = 7,
> > with doc ids {0,1,...,6} or {99993,99994,...,99999}. An int-array
> > DocIdSet takes 28 bytes in either case, while a bitset over the range
> > [minDocId, maxDocId] would take only 1 byte -- a bitset sized by the
> > reader's maxDoc would take about 125 KB.
> >
> > Furthermore, knowing min/max doc id would help to predetermine the size
> > needed when constructing a given DocIdSet data structure, to avoid
> > growing it.
> >
> > Thanks for pointing me to SortedVIntList; what is the underlying
> > compression algorithm?
>
> A SortedVIntList uses a byte array to store the docid differences as
> a series of VInts, with a VInt being a series of bytes in which the
> high bit is a continuation bit, and the remaining bits are data for an
> unsigned integer. The same VInt is used in a Lucene index in various
> places.
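Thanks -- so, if I read IndexInput.readVInt on the trunk correctly, decoding
a single VInt from the byte array is essentially this (paraphrased, not the
actual SortedVIntList code):

// Decode one VInt starting at bytes[pos]: 7 data bits per byte,
// least-significant group first; a set high bit means another byte follows.
static int readVInt(byte[] bytes, int pos) {
  byte b = bytes[pos++];
  int value = b & 0x7F;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = bytes[pos++];
    value |= (b & 0x7F) << shift;
  }
  return value;
}

And the iterator would then keep a running sum of the decoded differences to
recover the absolute doc ids.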
> > We have developed a DocIdSet implementation using a variation of the
> > P4Delta compression algorithm
> > (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to
> > contribute sometime. From our benchmarks, we get about 70% compression
> > (30% of the original size) relative to int arrays, and it still allows
> > iteration in the compressed format with performance similar to
> > OpenBitSet. (Iterating over plain int arrays is much faster than over
> > OpenBitSet.)
>
> Andrzej recently pointed to a paper on PForDelta, and since then
> I have a Java implementation rather low on my todo list.
> Needless to say, I'm interested to see it contributed.
>
> > I am not sure TermScorer serves the purpose here. TermScorer reads a
> > batch of 32 docs at a time (I don't understand why 32 was picked, or
> > whether it should be customizable), and we can't rely on "getting
> > lucky" with the underlying OS cache. Many times we want to move the
> > construction of some filters ahead of the IndexReader's reads. Here is
> > an example: say we have a field called gender with only two terms, M
> > and F, and our query is always of the form "content:query text AND
> > gender:M/F". It is ideal to keep the DocIdSets for M and F in memory
> > for the life of the IndexReader (see the P.S. below for a sketch). I
> > can't imagine constructing a TermScorer for each query would perform
> > comparably.
>
> Well, you can give TermScorer a try before writing other code.
> Adding a DocIdSet as required or prohibited to a BooleanQuery
> would be nice, but that is not yet possible.
>
> > Reading the trunk code for TermScorer, I don't see that the internal
> > termDocs is closed in skipTo. skipTo returns a boolean which tells the
> > caller whether the end has been reached, and once it returns false the
> > caller may not (and should not) call next again to have it closed. So
> > wouldn't this scenario leak?
>
> Closing of Scorers has been discussed before; the only conclusion
> I remember now is that there is no bug in the current code.
>
> > Also, in explain(docid), what happens if termDocs has already been
> > closed by the next() call?
>
> When explain() is called on a Scorer, next() and skipTo() should
> not be called. A Scorer can either explain, or search, but not both.
>
> Regards,
> Paul Elschot
>
> >
> > Thanks
> >
> > -John
> >
> > On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot <[EMAIL PROTECTED]>
> > wrote:
> > > On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > > > Hi guys:
> > > >
> > > > Perhaps I should have posted this to this list in the first
> > > > place.
> > > >
> > > > I am trying to work on a patch to expose, for each term, minDoc
> > > > and maxDoc. These values can be retrieved while constructing the
> > > > TermInfo.
> > > >
> > > > Knowing these two values can be very helpful when caching a
> > > > DocIdSet for a given term. They would help to determine what type
> > > > of underlying implementation to use, e.g. BitSet, HashSet,
> > > > ArraySet, etc.
> > >
> > > I suppose you know about
> > > https://issues.apache.org/jira/browse/LUCENE-1296 ?
> > >
> > > But how about using TermScorer? In the trunk it's a subclass of
> > > DocIdSetIterator (via Scorer), and the caching is already done by
> > > Lucene and the underlying OS file cache.
> > > TermScorer does some extra work for its scoring, but I don't think
> > > that would affect performance.
> > >
> > > > The problem I am having is stated below: I don't know how to
> > > > add the minDoc and maxDoc values to the index while keeping
> > > > backward compatibility.
> > >
> > > I doubt they would help very much. The most important info for this
> > > is maxDoc from the index reader and the document frequency of the
> > > term, and these are easily determined.
> > >
> > > Btw, I've just started to add encoding of intervals of consecutive
> > > doc ids to SortedVIntList. For very high document frequencies, that
> > > might actually be faster than TermScorer and more compact than the
> > > current index. Once I've got some working code I'll open an issue
> > > for it.
> > >
> > > Regards,
> > > Paul Elschot
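P.S. To make the gender example above concrete: what we have in mind is one
cached DocIdSet per term, kept for the life of the IndexReader. A simplified
sketch follows -- class and method names are made up for illustration, and it
assumes OpenBitSet (as a DocIdSet) and the TermDocs API on the current trunk:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.OpenBitSet;

/**
 * Per-reader cache of one DocIdSet per term; discard it together with
 * its IndexReader. Illustrative sketch, not production code.
 */
public class TermDocIdSetCache {
  private final IndexReader reader;
  private final Map cache = new HashMap(); // Term -> DocIdSet

  public TermDocIdSetCache(IndexReader reader) {
    this.reader = reader;
  }

  // E.g. get(new Term("gender", "M")) builds the set on first use
  // and serves it from memory afterwards.
  public synchronized DocIdSet get(Term term) throws IOException {
    DocIdSet cached = (DocIdSet) cache.get(term);
    if (cached == null) {
      OpenBitSet bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(term);
      try {
        while (td.next()) {
          bits.fastSet(td.doc());
        }
      } finally {
        td.close();
      }
      cache.put(term, cached = bits);
    }
    return cached;
  }
}

With the M and F sets built once per reader, the gender clause of each query
then reduces to wrapping the cached set and the content query's doc ids in
the AND decorator sketched at the top.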