Before I open a JIRA, I would like to run a sanity check here. My understanding:
The DocSetCollector is used e.g. when the ResponseBuilder.isNeedDocSet is true. This is the case for e.g. field faceting. The DocSetCollector is optimistic and works from the assumption that the result set will be less than 1/64th of maxDoc. It does this by allocating an int[maxDoc/64], which takes up maxDoc/16 bytes. docIDs are collected in this array and when the DocSet is to be returned, the array is wrapped in a SortedIntDocSet, which reduces it to int[hits]. If the result set exceeds maxDoc/64, a FixedBitSet, which takes up maxDoc/8 bytes, is created and all the values are copied from the int-array. The int-array is not freed, so temporary overhead during collection is now maxDoc/16 + maxDoc/8 bytes. Then the collection has finished, a BitDocSet (maxDoc/8 bytes) is created from the FixedBitSet. The JavaDoc indicates that the reason for this solution is to avoid very sparse bitmaps from small result sets with a FixedBitSet-only strategy. As the memory overhead for the current solution varies from 1/2 to 3/2 of the pure FixedBitSet-only solution, I assume that the problem with sparse bitmaps is not memory but processing time? An idea: The primary performance problem with BitDocSet (FixedBitSet really) is that all the empty areas (i.e. longs) of the bitmap must still be read for most operations. But we can avoid that. Tracking non-empty areas can be done fairly simple by having a bitmap of length maxDoc/64: One bit for every long != 0 in the internal array in FixedBitSet. The memory overhead of the tracker is thus maxDoc/64/8 bytes and the performance penalty for updates will be the addition of tracker[index >>> 12] |= 1L << ((index >>> 6) & 63) to all FixedBitSet.set-calls. Using the tracker would make operations such as iteration and intersection on sparse result sets cheap. For non-sparse results (we can define the cut-off point of sparseness where we want), the tracker could be discarded and the collected bits used with a standard FixedBitSet. Does this sound reasonable? Should I open a JIRA? Attempt a patch? - Toke Eskildsen --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org