richardstartin commented on a change in pull request #8411:
URL: https://github.com/apache/pinot/pull/8411#discussion_r837850389
##########
File path:
pinot-core/src/main/java/org/apache/pinot/core/operator/filter/BitmapBasedFilterOperator.java
##########
@@ -100,6 +101,67 @@ protected FilterBlock getNextBlock() {
}
}
+ @Override
+ public boolean canOptimizeCount() {
+ return true;
+ }
+
+ @Override
+ public int getNumMatchingDocs() {
+ int count = 0;
+ if (_invertedIndexReader == null) {
+ count = _docIds.getCardinality();
+ } else {
+ int[] dictIds = _exclusive
+ ? _predicateEvaluator.getNonMatchingDictIds()
+ : _predicateEvaluator.getMatchingDictIds();
+ switch (dictIds.length) {
+ case 0:
+ break;
+ case 1: {
+ count = _invertedIndexReader.getDocIds(dictIds[0]).getCardinality();
+ break;
+ }
+ case 2: {
+ count =
ImmutableRoaringBitmap.orCardinality(_invertedIndexReader.getDocIds(dictIds[0]),
+ _invertedIndexReader.getDocIds(dictIds[1]));
+ break;
+ }
+ default: {
+ // this could be optimised if the bitmaps are known to be disjoint
(as in a single value bitmap index)
+ MutableRoaringBitmap bitmap = new MutableRoaringBitmap();
+ for (int dictId : dictIds) {
+ bitmap.or(_invertedIndexReader.getDocIds(dictId));
+ }
+ count = bitmap.getCardinality();
+ break;
+ }
+ }
+ }
+ return _exclusive ? _numDocs - count : count;
+ }
+
+ @Override
+ public boolean canProduceBitmaps() {
+ return true;
+ }
+
+ @Override
+ public BitmapCollection getBitmaps() {
+ if (_docIds == null) {
+ int[] dictIds = _exclusive
+ ? _predicateEvaluator.getNonMatchingDictIds()
+ : _predicateEvaluator.getMatchingDictIds();
+ ImmutableRoaringBitmap[] bitmaps = new
ImmutableRoaringBitmap[dictIds.length];
+ for (int i = 0; i < dictIds.length; i++) {
+ bitmaps[i] = (_invertedIndexReader.getDocIds(dictIds[i]));
Review comment:
Perhaps I should clarify the reasons not to merge bitmaps eagerly
1. `or` is expensive, its size increases monotonically
2. flipping bits is expensive, it maps sparse to dense by filling in holes
3. (not implemented yet) it's always better to distribute intersections
across unions because it increases sparsity before doing the union
4. it may never be necessary to do the `or` at all, it can be implemented
with `andNot` from a universe bitmap, this approach is generally 2x faster than
eagerly computing an `or`
5. these bitmaps are all just handles around mapped memory and may never
need to be copied
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]