Thanks Paul and Mike for the feedback. Paul, for us, the sparsity of the docIds determines which data structure to use. While cardinality gives some of that information, knowing the min/max docId would help further. For example: say maxdoc=1000000, cardinality=7, and the docIds are {0,1,...,6} or {99993,99994,...,99999}. An ArrayDocIdSet would take 28 bytes, while a bitset sized from the min/max docId would take only 1 byte (a bitset spanning all of maxdoc would take 125,000 bytes). Furthermore, knowing the min/max docId helps predetermine the size needed when constructing a given DocIdSet data structure, avoiding growth during population. The sketch below illustrates the kind of selection logic we have in mind.
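To make that concrete, here is a minimal sketch of the heuristic; the class and method names are hypothetical illustrations, not existing Lucene APIs, and a real implementation would weigh iteration cost as well as raw size:

// Hypothetical sketch of the selection heuristic; names are illustrative.
public class DocIdSetSelector {

  // Cost of an int-array representation: one 4-byte int per docId.
  static long arrayCost(long cardinality) {
    return 4L * cardinality;
  }

  // Cost of a bit set that only spans [minDoc, maxDoc]: one bit per doc
  // in the range, rounded up to whole bytes.
  static long bitsetCost(long minDoc, long maxDoc) {
    long bits = maxDoc - minDoc + 1;
    return (bits + 7) / 8;
  }

  // Decide, before reading any postings, which structure to build.
  static boolean useBitSet(long cardinality, long minDoc, long maxDoc) {
    return bitsetCost(minDoc, maxDoc) <= arrayCost(cardinality);
  }

  public static void main(String[] args) {
    // The example above: maxdoc = 1,000,000, cardinality = 7.
    // Clustered docIds {99993..99999}: 1 byte as a ranged bit set
    // versus 28 bytes as an int array.
    System.out.println(useBitSet(7, 99993, 99999)); // true
    // Without min/max, a bit set must span all of maxdoc.
    System.out.println(bitsetCost(0, 999999));      // 125000
  }
}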
Thanks for pointing me to SortedVIntList; what is the underlying compression algorithm? We have developed a DocIdSet implementation using a variation of the P4Delta compression algorithm (http://cis.poly.edu/cs912/indexcomp.pdf) that we would like to contribute at some point. In our benchmarks we get about 70% compression (30% of the original size) relative to int arrays, and it still supports iteration in the compressed format with performance similar to OpenBitSet. (Iterating over plain arrays is much faster than iterating over OpenBitSet.)

I am not sure TermScorer serves the purpose here. TermScorer reads a batch of 32 docs at a time (I don't understand why 32 was picked, or whether it should be customizable), and we can't rely on "getting lucky" with the underlying OS cache. Often we want to move the construction of some filters ahead of query time, while the IndexReader loads. Here is an example: say we have a field called gender with only two terms, M and F, and our queries are always of the form "content:<query text> AND gender:M/F". It is ideal to keep the DocIdSets for M and F in memory for the life of the IndexReader; I can't imagine that constructing a TermScorer for each query would perform comparably. The sketch below shows the kind of per-reader caching we have in mind.
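As a rough illustration, here is a minimal sketch of such a per-reader cache, assuming Lucene 2.x-era APIs (IndexReader.termDocs, OpenBitSet); the cache class itself is hypothetical:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

// Hypothetical sketch: build the bit set for a low-cardinality term once
// and keep it for the life of the IndexReader. The WeakHashMap lets the
// per-reader entries be collected once the reader is no longer referenced.
public class TermDocIdSetCache {

  private final Map<IndexReader, Map<Term, OpenBitSet>> cache =
      new WeakHashMap<IndexReader, Map<Term, OpenBitSet>>();

  public synchronized OpenBitSet get(IndexReader reader, Term term)
      throws IOException {
    Map<Term, OpenBitSet> perReader = cache.get(reader);
    if (perReader == null) {
      perReader = new HashMap<Term, OpenBitSet>();
      cache.put(reader, perReader);
    }
    OpenBitSet bits = perReader.get(term);
    if (bits == null) {
      bits = new OpenBitSet(reader.maxDoc());
      TermDocs td = reader.termDocs(term);
      try {
        while (td.next()) {
          bits.set(td.doc());
        }
      } finally {
        td.close(); // always release the underlying TermDocs
      }
      perReader.put(term, bits);
    }
    return bits;
  }
}

In the gender example, get(reader, new Term("gender", "M")) is computed once per reader, and, if I read the trunk right, OpenBitSet is itself a DocIdSet, so the result can be handed to a filter directly instead of re-reading the gender postings on every request.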
Reading the trunk code for TermScorer, I don't see the internal TermDocs being closed in skipTo. skipTo returns a boolean that tells the caller whether the end has been reached, and the caller may not (and should not) call next again just to have it closed. So wouldn't that scenario leak? Also, in explain(docId), what happens if the TermDocs has already been closed by the next() call?

Thanks

-John

On Wed, Jun 25, 2008 at 12:45 AM, Paul Elschot <[EMAIL PROTECTED]> wrote:

> On Wednesday 25 June 2008 07:03:59, John Wang wrote:
> > Hi guys:
> >     Perhaps I should have posted this to this list in the first
> > place.
> >
> > I am trying to work on a patch to expose, for each term, minDoc
> > and maxDoc. These values can be retrieved while constructing the
> > TermInfo.
> >
> > Knowing these two values can be very helpful in caching a DocIdSet
> > for a given Term. This would help to determine what type of
> > underlying implementation to use, e.g. BitSet, HashSet, or ArraySet,
> > etc.
>
> I suppose you know about
> https://issues.apache.org/jira/browse/LUCENE-1296 ?
>
> But how about using TermScorer? In the trunk it's a subclass of
> DocIdSetIterator (via Scorer) and the caching is already done by
> Lucene and the underlying OS file cache.
> TermScorer does some extra work for its scoring, but I don't think
> that would affect performance.
>
> > The problem I am having is stated below: I don't know how to add
> > the minDoc and maxDoc values to the index while keeping backward
> > compatibility.
>
> I doubt they would help very much. The most important info for this
> is maxDoc from the index reader and the document frequency of the term,
> and these are easily determined.
>
> Btw, I've just started to add encoding of intervals of consecutive
> doc ids to SortedVIntList. For very high document frequencies, that
> might actually be faster than TermScorer and more compact than the
> current index. Once I've got some working code I'll open an issue
> for it.
>
> Regards,
> Paul Elschot