tarun11Mavani commented on code in PR #18760:
URL: https://github.com/apache/pinot/pull/18760#discussion_r3441502114


##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/funnel/BitmapResultExtractionStrategy.java:
##########
@@ -42,14 +49,32 @@ public List<RoaringBitmap> 
extractIntermediateResult(DictIdsWrapper dictIdsWrapp
       }
       return result;
     }
-    Dictionary dictionary = dictIdsWrapper._dictionary;
     List<RoaringBitmap> result = new ArrayList<>(_numSteps);
-    for (RoaringBitmap dictIdBitmap : dictIdsWrapper._stepsBitmaps) {
-      result.add(convertToValueBitmap(dictionary, dictIdBitmap));
+    if (dictIdsWrapper.isMultiKey()) {
+      for (RoaringBitmap compositeIdBitmap : dictIdsWrapper._stepsBitmaps) {
+        result.add(convertCompositeToValueBitmap(dictIdsWrapper, 
compositeIdBitmap));
+      }
+    } else {
+      Dictionary dictionary = dictIdsWrapper._dictionaries[0];
+      for (RoaringBitmap dictIdBitmap : dictIdsWrapper._stepsBitmaps) {
+        result.add(convertToValueBitmap(dictionary, dictIdBitmap));
+      }
     }
     return result;
   }
 
+  private RoaringBitmap convertCompositeToValueBitmap(DictIdsWrapper wrapper, 
RoaringBitmap compositeIdBitmap) {
+    RoaringBitmap valueBitmap = new RoaringBitmap();
+    PeekableIntIterator iterator = compositeIdBitmap.getIntIterator();
+    int numKeys = wrapper._dictionaries.length;
+    int[] dictIds = new int[numKeys];
+    while (iterator.hasNext()) {
+      wrapper.reverseCompositeId(iterator.next(), dictIds);
+      valueBitmap.add(DictIdsWrapper.toCompositeString(wrapper._dictionaries, 
dictIds).hashCode());

Review Comment:
   This is actually an existing limitation of the bitmap strategy, not 
something new from multi-key. The single-key path in `convertToValueBitmap` 
already uses `.hashCode()` for LONG, FLOAT, DOUBLE, and STRING types — only INT 
gets exact values stored directly. The multi-key path is consistent: 
`toCompositeString` itself is collision-free (length-prefix encoding is 
injective), but the `.hashCode()` mapping to 32-bit int has the same collision 
properties as single-key STRING at line 109.
   
   I've updated the method Javadoc on `convertCompositeToValueBitmap` to call 
this out more explicitly, linking it to the existing single-key non-INT 
approximation.
   



##########
pinot-core/src/main/java/org/apache/pinot/core/query/aggregation/function/funnel/ThetaSketchAggregationStrategy.java:
##########
@@ -66,8 +75,14 @@ void add(Dictionary dictionary, UpdateSketch[] 
stepsSketches, int step, int corr
         sketch.update(dictionary.getStringValue(correlationId));
         break;
       default:
-        throw new IllegalStateException("Illegal CORRELATED_BY column data 
type for FUNNEL_COUNT aggregation function: "
-            + dictionary.getValueType());
+        throw new IllegalStateException(
+            "Illegal CORRELATED_BY column data type for FUNNEL_COUNT 
aggregation function: "
+                + dictionary.getValueType());
     }
   }
+
+  @Override
+  void addMultiKey(UpdateSketch[] stepsSketches, int step, Dictionary[] 
dictionaries, int[] correlationDictIds) {
+    stepsSketches[step].update(DictIdsWrapper.toCompositeString(dictionaries, 
correlationDictIds));

Review Comment:
   Fair point — `toCompositeString` does allocate a new `StringBuilder` + 
`String` per row. A couple of things to note though:
   
   - Theta sketch's `update()` only accepts primitives (`int`, `long`, 
`double`) or `String`/`byte[]`. Since a multi-key tuple has no single primitive 
representation, some form of serialization is unavoidable here.
   - The single-key STRING path (line 75) already allocates a string per row 
via `dictionary.getStringValue()`, so the cost pattern is structurally similar 
— just slightly more overhead from the length-prefix encoding.
   - JMH baseline for multi-key theta_sketch is 287 ops/s, which we can measure 
future optimizations against.
   
   One option would be switching to `update(byte[])` with a reusable 
`ByteBuffer` to avoid the String intermediate, but wanted to keep it simple for 
the initial implementation. Do you have other optimization ideas in mind?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to