ahmed-mez commented on PR #19562:
URL: https://github.com/apache/datafusion/pull/19562#issuecomment-3728479697

   > > This is the comparison output and it's not great 😞 At first glance 
mainly topk queries are impacted by the perf regression
   > 
   > I suspect the queries that are most impacted are ones with high group 
counts (high cardinality grouping).
   > 
   > I think we would need to actually change the storage to be chunked rather 
than just the emission part
   
   Thanks for the insight @alamb , I did some profiling to validate the 
hypothesis: I focused on ClickBench Q32 (partitioned dataset) which is ~12x 
slower on the current branch [profiles comparison (main vs current 
branch)](https://share.firefox.dev/4qoYS8Y)
   
   The hotspots in this PR trace back to multiple `emit` → `take_n`, with the 
cost distributed across allocating new vectors and copying null bitmaps
   
   Example profile (zoomed in on one of multiple 
`GroupedHashAggregateStream::emit`)
   <img width="1906" height="542" alt="image" 
src="https://github.com/user-attachments/assets/2d34fb7e-f9bd-4b36-8e94-e15a7e0b0ec4";
 />
   
   This confirms that with high-cardinality groups and frequent emissions, 
we're doing massive memory operations repeatedly.
   
   That said, it looks like chunked storage is needed and it'd be great to 
agree on a path forward. This is a larger change touching all `GroupColumn` 
implementations. Should we create an epic to track all this (incremental 
emission + chunked storage)? Happy to help scope it out.
   
   Thanks for your help and patience guiding me through this investigation! 🙏


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to