rtpsw commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625532994
> What I am proposing as an alternative is to have key hasher class produces an immutable tuple ... Yes, this was my understanding of your proposal when I wrote [this post](https://github.com/apache/arrow/pull/36499#issuecomment-1624849882), where I noted performance concerns. > The extra cost of what I proposing is basically allocating one tuple for each input batches, maybe + some shared pointer copies per batch, both seems fairly cheap. There's also additional deallocations, additional memory needed, additional costs due to computing in one thread and using the results in another thread, and probably more CPU cache misses. It's conceptually simpler but I don't think we can be sure it'd be more performant. It's hard to predict performance impact of changes to complex concurrent code. Ultimately, if we want to know, there's no escaping from implementing the alternative and trying it out. It's going to take some effort to change `RecordBatch` to a tuple in a bunch of places, so whether to go down this path is a decision, I guess for you, to make - let me know your call. Note that in this PR, I was aiming for a minimal change; in fact, it's not even new code but a reversion of part of a [previous PR](https://github.com/apache/arrow/pull/36094). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org