Given that you have 1M docs and about 1M terms, do you see very few docs per term? If your DocSet per term is very sparse, a BitSet is probably not a good representation. A simple int array may be better for memory, and faster to iterate.
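To put rough numbers on that, here is a minimal sketch in plain Java. It does not use Lucene's API: SparseDocSet and its methods are hypothetical names made up for illustration, and the sizes count payload only, ignoring object headers. With 1M docs, a bit set costs about 125,000 bytes per term no matter how few docs match, while a sorted int array costs 4 bytes per matching doc, so the array is smaller whenever a term matches fewer than roughly 31,250 docs (1M / 32).

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: a sparse doc set kept as a sorted int[].
final class SparseDocSet {
    private final int[] docs; // doc ids in ascending order

    SparseDocSet(int[] sortedDocs) {
        this.docs = sortedDocs.clone();
    }

    // Copy the set bits of a BitSet into a sorted int array.
    static SparseDocSet fromBitSet(BitSet bits) {
        int[] out = new int[bits.cardinality()];
        int i = 0;
        for (int d = bits.nextSetBit(0); d >= 0; d = bits.nextSetBit(d + 1)) {
            out[i++] = d;
        }
        return new SparseDocSet(out);
    }

    int size() {
        return docs.length;
    }

    // The i-th smallest doc id; iterating is a plain array walk.
    int get(int i) {
        return docs[i];
    }

    // Approximate payload size: 4 bytes per stored doc id.
    long approxBytes() {
        return 4L * docs.length;
    }

    // Approximate payload size of a bit set over maxDoc documents: 1 bit per doc.
    static long bitSetBytes(int maxDoc) {
        return (maxDoc + 7L) / 8;
    }
}
```

Iterating the array touches only the matching docs, whereas scanning a bit set has to skip over the empty words in between.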
-John

On Mon, Oct 12, 2009 at 8:45 AM, Paul Elschot <paul.elsc...@xs4all.nl> wrote:

> On Monday 12 October 2009 14:53:45 Christoph Boosz wrote:
> > Hi,
> >
> > I have a question related to faceted search. My index contains more than 1
> > million documents, and nearly 1 million terms. My aim is to get a DocIdSet
> > for each term occurring in the result of a query. I use the approach
> > described on
> > http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html,
> > where a BitSet is built out of a QueryFilter for each term and intersected
> > with the BitSet representing the user query.
> > However, performance could be better. I guess it’s because the term filter
> > considers each document in the index, even if it’s not in the result. My
> > attempt to use a ChainedFilter, where the first filter (cached) is for the
> > user query, and the second one for the term (done for all terms), didn’t
> > speed things up, though.
> > Am I missing something? Is there a better way to get the DocIdSets for a
> > huge number of terms in a limited set of documents?
>
> Assuming you only need the number of documents within the original query
> that contain each term, one thing that can be saved is the allocation of the
> resulting BitSet for each term. To do this, use the cached BitSet (or the
> OpenBitSet in current lucene) for the original Query as a filter for a
> TermQuery per term, and then count the matching documents by using a counting
> HitCollector on the IndexSearcher.
>
> Regards,
> Paul Elschot
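
The counting idea Paul describes can be sketched without Lucene at all. The following is a plain-Java stand-in, not the HitCollector or IndexSearcher API: FacetCounter, countMatches and countAll are hypothetical names, and it assumes each term's matching doc ids are already available as an int array. The point it illustrates is that the query's cached bit set is only probed, and no per-term result set is allocated.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a counting collector; not Lucene's API.
final class FacetCounter {

    // Count how many of a term's docs also match the query.
    // Only a counter is kept; no intersection set is built.
    static int countMatches(int[] termDocs, BitSet queryBits) {
        int count = 0;
        for (int doc : termDocs) {
            if (queryBits.get(doc)) {
                count++;
            }
        }
        return count;
    }

    // Facet counts for every term that occurs at least once in the query result.
    static Map<String, Integer> countAll(Map<String, int[]> postings, BitSet queryBits) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int c = countMatches(e.getValue(), queryBits);
            if (c > 0) {
                counts.put(e.getKey(), c);
            }
        }
        return counts;
    }
}
```

The work per term is proportional to the number of docs containing that term, not to the size of the index, which is where the saving comes from when most terms are sparse.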