rtpsw commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625532994

   > What I am proposing as an alternative is to have key hasher class produces 
an immutable tuple  ...
   
   Yes, this was my understanding of your proposal when I wrote [this 
post](https://github.com/apache/arrow/pull/36499#issuecomment-1624849882), 
where I noted performance concerns.
   
   > The extra cost of what I proposing is basically allocating one tuple for 
each input batches, maybe + some shared pointer copies per batch, both seems 
fairly cheap.
   
   There's also additional deallocations, additional memory needed, additional 
costs due to computing in one thread and using the results in another thread, 
and probably more CPU cache misses. It's conceptually simpler but I don't think 
we can be sure it'd be more performant.
   
   It's hard to predict performance impact of changes to complex concurrent 
code.  Ultimately, if we want to know, there's no escaping from implementing 
the alternative and trying it out. It's going to take some effort to change 
`RecordBatch` to a tuple in a bunch of places, so whether to go down this path 
is a decision, I guess for you, to make - let me know your call. Note that in 
this PR, I was aiming for a minimal change; in fact, it's not even new code but 
a reversion of part of a [previous 
PR](https://github.com/apache/arrow/pull/36094).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to