Before I open a JIRA, I would like to run a sanity check here.

My understanding:

The DocSetCollector is used e.g. when the ResponseBuilder.isNeedDocSet is true. 
This is the case for e.g. field faceting. The DocSetCollector is optimistic and 
works from the assumption that the result set will be less than 1/64th of 
maxDoc. It does this by allocating an int[maxDoc/64], which takes up maxDoc/16 
bytes. docIDs are collected in this array and when the DocSet is to be 
returned, the array is wrapped in a SortedIntDocSet, which reduces it to 
int[hits].

If the result set exceeds maxDoc/64, a FixedBitSet, which takes up maxDoc/8 
bytes, is created and all the values are copied from the int-array. The 
int-array is not freed, so temporary overhead during collection is now 
maxDoc/16 + maxDoc/8 bytes. Then the collection has finished, a BitDocSet 
(maxDoc/8 bytes) is created from the FixedBitSet.

The JavaDoc indicates that the reason for this solution is to avoid very sparse 
bitmaps from small result sets with a FixedBitSet-only strategy. As the memory 
overhead for the current solution varies from 1/2 to 3/2 of the pure 
FixedBitSet-only solution, I assume that the problem with sparse bitmaps is not 
memory but processing time?


An idea:

The primary performance problem with BitDocSet (FixedBitSet really) is that all 
the empty areas (i.e. longs) of the bitmap must still be read for most 
operations. But we can avoid that.

Tracking non-empty areas can be done fairly simple by having a bitmap of length 
maxDoc/64: One bit for every long != 0 in the internal array in FixedBitSet. 
The memory overhead of the tracker is thus maxDoc/64/8 bytes and the 
performance penalty for updates will be the addition of 
tracker[index >>> 12] |= 1L << ((index >>> 6) & 63)
to all FixedBitSet.set-calls.

Using the tracker would make operations such as iteration and intersection on 
sparse result sets cheap. For non-sparse results (we can define the cut-off 
point of sparseness where we want), the tracker could be discarded and the 
collected bits used with a standard FixedBitSet.


Does this sound reasonable? Should I open a JIRA? Attempt a patch?

- Toke Eskildsen

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to