[GitHub] [arrow-datafusion] tustvold commented on issue #4973: Improve the performance of `Aggregator`, grouping, aggregation

via GitHub Tue, 07 Mar 2023 04:37:02 -0800


tustvold commented on issue #4973:
URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1458099098


   > introduce much more random memory access state update
   
   The cache locality is actually likely better, as the aggregated values can 
be stored inline within the hash table since their types are known at compile 
time. Similarly, the row format stores consecutive rows contiguously in memory, 
so you get very good locality from that perspective as well.
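
   To illustrate the inline-storage point, here is a minimal sketch using only 
the Rust standard library. It is not DataFusion code: `SumCount` and 
`aggregate` are hypothetical names, and the key and value types are fixed to 
`u32` and `i64` purely for the example. Because the state type is known at 
compile time, each group's state sits directly in its hash table entry, so an 
update touches one entry instead of following a pointer to a boxed 
accumulator.

```rust
use std::collections::HashMap;

/// Hypothetical fixed-size aggregate state (SUM and COUNT over i64).
/// Its size is known at compile time, so the map stores it inline.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SumCount {
    sum: i64,
    count: u64,
}

/// Groups `values` by `keys`, updating each group's state in place
/// inside its hash table entry (no per-group heap allocation).
fn aggregate(keys: &[u32], values: &[i64]) -> HashMap<u32, SumCount> {
    assert_eq!(keys.len(), values.len());
    let mut groups: HashMap<u32, SumCount> = HashMap::new();
    for (key, value) in keys.iter().zip(values) {
        // Looks up (or default-initialises) the entry, then mutates it inline.
        let state = groups.entry(*key).or_default();
        state.sum += *value;
        state.count += 1;
    }
    groups
}
```

   Calling `aggregate(&[1, 2, 1], &[10, 20, 30])` yields two groups, with 
group `1` holding `sum = 40, count = 2`. A design built on 
`Box<dyn Accumulator>`-style state would instead store a pointer per group 
and pay an extra indirection on every update.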
   
   > I would like to try JIT
   
   If we can get competitive performance without having to reach for a JIT, I 
think that is a massive win for maintainability. I originally reached for a JIT 
when creating the row format, and came to the conclusion that it is a huge 
amount of additional complexity that needs to yield order-of-magnitude 
performance improvements to justify itself. In the case of the row format, the 
JIT approach was actually slower than a correctly vectorised version. That's 
not to say we couldn't use a JIT, just that it should definitely not be the 
first thing we try.
   
   > is a fundamental change
   
   I see no reason to be fundamentalist here; we can support multiple 
approaches so long as they provide benefits to certain workloads. I suggest we 
move ahead with a POC and get some real performance data.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

