ctsk opened a new issue, #16206:
URL: https://github.com/apache/datafusion/issues/16206

   ### Describe the bug
   
   A pattern in the hash join implementation leads to excessive 
Arc-cloning. Assume the build side carries a string-view column as a payload, 
and let N be the number of batches seen on the build side.
   
   1. In the build phase, DataFusion concatenates the batches on the build 
side. The string-view column now holds references to at least N data buffers in 
a `Vec`.
   
   2. When constructing the output batch, the `take` implementation for 
string views clones the build side's buffer vector, thereby incrementing the 
reference count on all N Arcs.
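
   The two steps above can be modelled with a small standard-library-only sketch. It does not use arrow-rs or DataFusion: `ToyViewColumn`, `concat`, `take`, and `arc_clones_per_take` are hypothetical names, and `Arc<Vec<u8>>` stands in for an Arrow data buffer. It assumes one data buffer per build-side batch and only illustrates how the number of reference-count increments per `take` scales with N, not how the real kernel is written.

```rust
use std::sync::Arc;

/// Toy stand-in for a string-view column (NOT the arrow-rs type).
/// Each view is (buffer index, offset, length) into a shared buffer.
struct ToyViewColumn {
    buffers: Vec<Arc<Vec<u8>>>,
    views: Vec<(usize, usize, usize)>,
}

impl ToyViewColumn {
    /// Step 1: concatenating batches appends their buffers, so the result
    /// references one buffer per input batch in this model.
    fn concat(batches: &[ToyViewColumn]) -> ToyViewColumn {
        let mut buffers = Vec::new();
        let mut views = Vec::new();
        for batch in batches {
            let base = buffers.len();
            buffers.extend(batch.buffers.iter().cloned());
            // Re-base buffer indices so views still point at the right buffer.
            views.extend(batch.views.iter().map(|&(b, off, len)| (base + b, off, len)));
        }
        ToyViewColumn { buffers, views }
    }

    /// Step 2: `take` copies only the selected views, but clones the whole
    /// buffer vector, which bumps the refcount of every Arc in it.
    fn take(&self, indices: &[usize]) -> ToyViewColumn {
        ToyViewColumn {
            buffers: self.buffers.clone(),
            views: indices.iter().map(|&i| self.views[i]).collect(),
        }
    }
}

/// Counts how many Arc refcount increments a single-row `take` causes
/// on a build side assembled from `n_batches` batches.
fn arc_clones_per_take(n_batches: usize) -> usize {
    let batches: Vec<ToyViewColumn> = (0..n_batches)
        .map(|_| ToyViewColumn {
            buffers: vec![Arc::new(b"a string too long to inline".to_vec())],
            views: vec![(0, 0, 8)],
        })
        .collect();
    let build = ToyViewColumn::concat(&batches);

    let before: usize = build.buffers.iter().map(Arc::strong_count).sum();
    let output = build.take(&[0]); // a single output row
    let after: usize = build.buffers.iter().map(Arc::strong_count).sum();
    drop(output);
    after - before
}
```

   In this model, `arc_clones_per_take(8)` returns 8 even though only one row is selected: the cost of each `take` grows with the number of build-side batches rather than with the number of output rows.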
   
   ### To Reproduce
   
   I noticed this issue when executing and profiling TPC-H query 18.
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   - The concat during build:
   
https://github.com/apache/datafusion/blob/7002a0027876a17e5bdf275e63d2a25373331943/datafusion/physical-plan/src/joins/hash_join.rs#L1013-L1015
   
   - The take call during batch construction:
   
https://github.com/apache/datafusion/blob/7002a0027876a17e5bdf275e63d2a25373331943/datafusion/physical-plan/src/joins/utils.rs#L918
   
   - The relevant bit of arrow-rs:
   
https://github.com/apache/arrow-rs/blob/7e85b48dc8f929afa82f2878b17db7b2df240b8b/arrow-select/src/take.rs#L565-L567

