tustvold commented on issue #4973: URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1458099098
> introduce much more random memory access state update

The cache locality is actually likely better, as the aggregated values can be stored inline within the hash table as they are known at compile time. Similarly, the row format stores consecutive rows contiguously in memory, so you have very good locality from that perspective as well.

> I would like to try JIT

If we can get competitive performance without having to reach for a JIT, I think that is a massive win for maintainability. I originally reached for a JIT when creating the row format, and came to the conclusion that it is a huge additional complexity that needs to yield order-of-magnitude performance improvements to justify itself. In the case of the row format, the JIT approach was actually slower than a correctly vectorised version. That's not to say we couldn't use a JIT, just that it should definitely not be the first thing we try.

> is a fundamental change

I see no reason to be fundamentalist here; we can support multiple approaches so long as they provide benefits to certain workloads. I suggest we move ahead with a POC and get some real performance data.
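
To make the locality point about inline aggregate state concrete, here is a minimal sketch. It is not DataFusion code: `SumCount` and `aggregate` are hypothetical names, and it assumes that "known at compile time" refers to the aggregate state having a fixed-size layout. It uses Rust's standard `HashMap`, so one probe finds the group and updates its state in place, with no second lookup into a separate state buffer.

```rust
use std::collections::HashMap;

/// Hypothetical fixed-size aggregate state (assumption: this is what
/// "known at compile time" refers to). Because the layout is fixed,
/// the state can be stored directly as the hash table's value.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SumCount {
    sum: i64,
    count: u64,
}

/// Hypothetical helper: group `values` by `keys`, updating each group's
/// state in place through a single hash table probe per input row.
fn aggregate(keys: &[u32], values: &[i64]) -> HashMap<u32, SumCount> {
    assert_eq!(keys.len(), values.len());
    let mut groups: HashMap<u32, SumCount> = HashMap::new();
    for (&key, &value) in keys.iter().zip(values) {
        // `entry` locates (or creates) the group; the state it returns
        // is the table's own value, not a pointer into another buffer.
        let state = groups.entry(key).or_default();
        state.sum += value;
        state.count += 1;
    }
    groups
}
```

Calling `aggregate(&[1, 2, 1], &[10, 20, 30])` yields a sum of 40 and a count of 2 for key `1`. Whether this layout beats a columnar accumulator for a given workload is exactly the kind of question the proposed POC would need to measure.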
