shauryachats opened a new issue, #14685:
URL: https://github.com/apache/pinot/issues/14685
While running some high-volume multi-stage engine queries on Pinot where the
join key had high cardinality, we recently observed a disproportionate latency
increase when the data volume grew on both sides of the join for the
following query shape:
```
SELECT
  count(*)
FROM
  table_A
WHERE (
  user_uuid IN (
    SELECT
      user_uuid
    FROM
      table_B
  )
)
AND (
  user_uuid NOT IN (
    SELECT
      user_uuid
    FROM
      table_B
  )
)
LIMIT
  100 option(useMultistageEngine=true, timeoutMs=120000, useColocatedJoin = true, maxRowsInJoin = 40000000)
```
Profiling on a server showed the following:
<img width="1800" alt="Screenshot 2024-12-18 at 4 36 12 PM"
src="https://github.com/user-attachments/assets/e5ccea53-baa1-4851-84a5-31532ddc4ddb"
/>
It turns out that the major cause of the latency increase is inefficient
groupId generation in
`org/apache/pinot/query/runtime/operator/MultistageGroupByExecutor.generateGroupByKeys`,
for a few reasons:
- `Object2IntOpenHashMap` resolves collisions with open addressing, which
performs poorly for high-cardinality use cases.
- The low default initial size of 16, combined with the default load factor of
0.75, causes many resizes and rehashes of existing keys for high-cardinality
use cases, which contributes significantly to the overall query runtime.
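To give a rough sense of the resize cost described above, the sketch below counts how many times a table that doubles on overflow would have to grow, starting from the defaults of 16 and 0.75. This is a simplified model, not fastutil's actual code: the class `ResizeEstimate` and its methods are hypothetical, the exact threshold check in `Object2IntOpenHashMap` may differ at the boundary, and the key count of 40,000,000 is borrowed from `maxRowsInJoin` in the query purely as an illustrative upper bound.

```java
// Hypothetical helper, not part of Pinot or fastutil.
// Models a hash table that doubles its capacity whenever the number of
// keys exceeds capacity * loadFactor.
final class ResizeEstimate {
  private ResizeEstimate() {
  }

  // Number of doublings needed before `keys` entries fit under the load factor.
  static int countResizes(long keys, int initialCapacity, double loadFactor) {
    long capacity = initialCapacity;
    int resizes = 0;
    while (capacity * loadFactor < keys) {
      capacity *= 2; // each doubling rehashes every key inserted so far
      resizes++;
    }
    return resizes;
  }

  // Smallest power-of-two capacity that holds `keys` entries without resizing.
  static long capacityFor(long keys, double loadFactor) {
    long capacity = 1;
    while (capacity * loadFactor < keys) {
      capacity *= 2;
    }
    return capacity;
  }

  public static void main(String[] args) {
    long keys = 40_000_000L; // assumption: illustrative key count only
    System.out.println("resizes from default size: " + countResizes(keys, 16, 0.75));
    System.out.println("capacity to avoid resizing: " + capacityFor(keys, 0.75));
  }
}
```

Under this model, passing an expected size when constructing the map would replace the repeated rehashing with a single up-front allocation, at the price of reserving memory that a low-cardinality query would not need.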
We are considering a few different strategies, such as better hash-map selection
(avoiding open addressing for high cardinality), generating groupIds in batches,
etc. We plan to use benchmarks to select the strategy with the best ROI.
This optimization can help boost performance for both the Pinot v1 and v2
engines simultaneously, since both engines rely on this logic. cc:
@Jackie-Jiang
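As a minimal sketch of two of the strategies mentioned above (presizing the map and assigning groupIds a batch at a time), the following uses `java.util.HashMap` as a stand-in. It is not the Pinot implementation: `GroupIdSketch`, `generateGroupIds`, and `expectedGroups` are hypothetical names, and `HashMap` uses chaining and boxed values, so it says nothing about how `Object2IntOpenHashMap` or any alternative would actually benchmark.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; names and structure are assumptions.
final class GroupIdSketch {
  private final Map<Object, Integer> groupIds;

  GroupIdSketch(int expectedGroups) {
    // Size the table so expectedGroups entries fit under HashMap's default
    // 0.75 load factor without triggering a resize.
    long capacity = (long) Math.ceil(expectedGroups / 0.75) + 1;
    groupIds = new HashMap<>((int) Math.min(capacity, 1 << 30));
  }

  // Assigns a dense groupId to every key in the batch, reusing ids for keys
  // seen in earlier batches and allocating the next free id for new keys.
  int[] generateGroupIds(Object[] keys) {
    int[] ids = new int[keys.length];
    for (int i = 0; i < keys.length; i++) {
      Integer id = groupIds.get(keys[i]);
      if (id == null) {
        id = groupIds.size();
        groupIds.put(keys[i], id);
      }
      ids[i] = id;
    }
    return ids;
  }

  int numGroups() {
    return groupIds.size();
  }
}
```

A caller would construct one `GroupIdSketch` per query with an estimate of the number of distinct keys and feed it each block of keys in turn. Where that estimate would come from is left open here.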