Given that you have 1M docs and about 1M terms, do you see very few docs per
term?
If your DocSet per term is very sparse, a BitSet is probably not a good
representation. A simple int array may be better for memory, and faster to
iterate.
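
To make the memory trade-off concrete, here is a minimal sketch using only
java.util.BitSet, not any Lucene class. The class name DocSetSizes and its
methods are hypothetical, and the byte counts cover the payload only, ignoring
object headers.

```java
import java.util.BitSet;

public class DocSetSizes {
    // Copies the set bits of a BitSet into a sorted int array of doc ids.
    public static int[] toSortedArray(BitSet bits) {
        int[] docs = new int[bits.cardinality()];
        int i = 0;
        for (int doc = bits.nextSetBit(0); doc >= 0; doc = bits.nextSetBit(doc + 1)) {
            docs[i++] = doc;
        }
        return docs;
    }

    // Payload bytes of a bit set covering maxDoc documents (one bit per doc).
    public static long bitSetBytes(int maxDoc) {
        return (maxDoc + 7L) / 8L;
    }

    // Payload bytes of an int array holding docCount doc ids (4 bytes each).
    public static long intArrayBytes(int docCount) {
        return 4L * docCount;
    }
}
```

With 1M docs a bit set costs about 125,000 bytes per term no matter how few
bits are set, so by this estimate the int array is smaller whenever a term
occurs in fewer than roughly 31,250 docs.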

-John

On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot <paul.elsc...@xs4all.nl> wrote:

> On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> > Hi,
> >
> > I have a question related to faceted search. My index contains more than 1
> > million documents, and nearly 1 million terms. My aim is to get a DocIdSet
> > for each term occurring in the result of a query. I use the approach
> > described on
> > <http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html>,
> > where a BitSet is built out of a QueryFilter for each term and intersected
> > with the BitSet representing the user query.
> > However, performance could be better. I guess it’s because the term filter
> > considers each document in the index, even if it’s not in the result. My
> > attempt to use a ChainedFilter, where the first filter (cached) is for the
> > user query, and the second one for the term (done for all terms), didn’t
> > speed things up, though.
> > Am I missing something? Is there a better way to get the DocIdSets for a
> > huge number of terms in a limited set of documents?
>
> Assuming you only need the number of documents within the original query
> that contain each term, one thing that can be saved is the allocation of the
> resulting BitSet for each term. To do this, use the cached BitSet (or the
> OpenBitSet in current Lucene) for the original Query as a filter for a
> TermQuery per term, and then count the matching documents by using a counting
> HitCollector on the IndexSearcher.
>
> Regards,
> Paul Elschot
>
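
The counting idea in Paul's quoted reply can be sketched in plain Java without
the Lucene classes. This is only an illustration under assumptions: the names
FilteredCount, CountingCollector and countTermInQuery are hypothetical, the
term's documents are given as an int array rather than read from an index, and
the collector is a stand-in for a counting HitCollector, not the Lucene API.

```java
import java.util.BitSet;

public class FilteredCount {
    // Minimal stand-in for a counting collector: it only increments a counter.
    public static final class CountingCollector {
        public int count;

        public void collect(int doc) {
            count++;
        }
    }

    // Walks one term's doc ids, skips those outside the query's BitSet,
    // and hands the rest to the collector. No per-term result set is built.
    public static int countTermInQuery(int[] termDocs, BitSet queryDocs) {
        CountingCollector collector = new CountingCollector();
        for (int doc : termDocs) {
            if (queryDocs.get(doc)) {
                collector.collect(doc);
            }
        }
        return collector.count;
    }
}
```

The only state kept per term is a single int, and the cached BitSet for the
user query is reused unchanged across all terms.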
