praveenc7 opened a new pull request, #17411:
URL: https://github.com/apache/pinot/pull/17411

   ## Summary
   
   ISSUE=https://github.com/apache/pinot/issues/17336 
   
   For dictionary-encoded columns, DISTINCT_COUNT_SMART_HLL currently uses a 
RoaringBitmap to deduplicate dictionary IDs before feeding values into HLL. 
While efficient for low cardinality, this approach becomes CPU-intensive for 
high cardinality (hundreds of thousands to millions of distinct values), where 
RoaringBitmap insertions dominate query execution time and negate the benefits 
of HLL.
   
   <img width="1560" height="785" alt="Screenshot 2025-12-16 at 3 59 43 PM" 
src="https://github.com/user-attachments/assets/2e4ea2b2-a594-4a3b-9b16-5af726b8d953";
 />
   
   
   ## Proposal
   Introduce a cardinality-aware execution path for DISTINCT_COUNT_HLL:
   
   - Low cardinality → continue using RoaringBitmap (exact deduplication, 
memory-efficient)
   - High cardinality → bypass RoaringBitmap and update HLL directly
   
   ### Observed improvements
   
   - Reduces server-side CPU time by ~4x - 10× for high-cardinality queries 
(observed improvements from ~8s → ~700ms in prod benchmarks).
   
   
   ## Testing Done 
   
   Added JMH benchmark covering:
   
   This JMH benchmark isolates server-side aggregation cost for the 
DistinctCountHLLAggregationFunction under controlled parameters: Each variation 
was run for 10 minutes
   
   recordCount: {100K, 500K, 1M, 5M, 10M, 25M} 
   cardinalityRatioPercent: {1, 10, 30, 50, 80, 100} → Creates a record with 
configured cardinality
   useRoaringBitMap/HLL -> Controls on to run the test with useRoaringBitMap or 
HLL
   
   DictIds are pre-generated so benchmark timing includes only aggregation, not 
data generation.
   
   
   Sample plots : 
   <img width="628" height="391" alt="Screenshot 2025-12-16 at 3 58 39 PM" 
src="https://github.com/user-attachments/assets/fcb37396-1f21-4f71-8fef-fa12f0dad255";
 />
   
   Flame graph after optimization : Aggregate doesn't dominate CPU
   <img width="1537" height="836" alt="Screenshot 2025-12-16 at 4 01 42 PM" 
src="https://github.com/user-attachments/assets/8edcc39c-2539-479b-a180-f16b0e2b77f5";
 />
   
   
   ## Benchmark Results (Average Latency, ms/op)
   
   ### Record Count = 100,000
   | Cardinality  | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 1,000 | 0.6 | 0.80 |
   | 10,000 | 0.71 | 0.87 |
   | 30,000 | 0.89 | 0.90 |
   | 50,000 | 1.05 | 0.96 |
   | 80,000 | 1.79 | 1.00 |
   | 100,000 | 1.91 | 1.05 |
   
   ### Record Count = 500,000
   | Cardinality | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 5,000 | 1.45 | 2.85 |
   | 50,000 | 2.36 | 2.92 |
   | 150,000 | 5.53 | 3.16 |
   | 250,000 | 7.26 | 3.18 |
   | 400,000 | 9.59 | 3.18 |
   | 500,000 | 10.69 | 3.17 |
   
   ### Record Count = 1,000,000
   | Cardinality  | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 10,000 | 2.53 | 5.36 |
   | 100,000 | 6.69 | 5.44 |
   | 300,000 | 13.12 | 5.80 |
   | 500,000 | 15.92 | 5.78 |
   | 800,000 | 19.84 | 5.78 |
   | 1,000,000 | 22.11 | 5.71 |
   
   ### Record Count = 5,000,000
   | Cardinality  | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 50,000 | 11.51 | 25.12 |
   | 500,000 | 53.62 | 25.29 |
   | 1,500,000 | 75.60 | 26.13 |
   | 2,500,000 | 92.91 | 25.53 |
   | 4,000,000 | 113.21 | 25.24 |
   | 5,000,000 | 129.34 | 25.79 |
   
   ### Record Count = 10,000,000
   | Cardinality | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 100,000 | 52.98 | 50.64 |
   | 1,000,000 | 117.68 | 50.61 |
   | 3,000,000 | 161.56 | 50.08 |
   | 5,000,000 | 206.71 | 51.14 |
   | 8,000,000 | 248.77 | 50.01 |
   | 10,000,000 | 278.78 | 50.37 |
   
   ### Record Count = 25,000,000
   | Cardinality | RoaringBitmap | Direct HLL |
   |---:|---:|---:|
   | 250,000 | 199.06 | 125.82 |
   | 2,500,000 | 348.39 | 126.40 |
   | 7,500,000 | 466.14 | 124.74 |
   | 12,500,000 | 555.77 | 124.35 |
   | 20,000,000 | 679.43 | 124.99 |
   
   
   
   ### Recommendation:
   Based on the micro-benchmark results across record counts and cardinalities, 
100K distinct values is a good default threshold to start with for switching 
away from the RoaringBitmap path. At this scale, RoaringBitmap remains 
efficient for low-cardinality cases, while higher cardinalities already show 
clear benefits from using direct HLL updates. This threshold provides a safe 
balance between preserving deduplication benefits for low cardinality and 
avoiding excessive bitmap maintenance cost for high-cardinality workloads
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to