ctsk opened a new issue, #16206: URL: https://github.com/apache/datafusion/issues/16206
### Describe the bug

An unfortunate pattern in the hash join implementation leads to excessive Arc-cloning:

Assume the build-side carries a string-view column as a payload. Let N be the number of batches seen on the build side.

1. In the build phase, datafusion concatenates the batches on the build side. The string-view column now holds references to at least N data buffers in a vec.
2. When constructing the output batch, the `take` implementation for string-views clones the build-side's buffer vector, thus incrementing the references on all N Arcs.

### To Reproduce

I noticed this issue when executing and profiling tpch query 18.

### Expected behavior

_No response_

### Additional context

- The concat during build: https://github.com/apache/datafusion/blob/7002a0027876a17e5bdf275e63d2a25373331943/datafusion/physical-plan/src/joins/hash_join.rs#L1013-L1015
- The take call during batch construction: https://github.com/apache/datafusion/blob/7002a0027876a17e5bdf275e63d2a25373331943/datafusion/physical-plan/src/joins/utils.rs#L918
- The relevant bit of arrow-rs: https://github.com/apache/arrow-rs/blob/7e85b48dc8f929afa82f2878b17db7b2df240b8b/arrow-select/src/take.rs#L565-L567
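
### Illustrative sketch

The two steps described under "Describe the bug" can be modeled with the standard library alone. This is a simplified stand-in, not the arrow-rs or DataFusion code: `ViewColumn`, `build_side`, and `strong_counts` are hypothetical names, and the real string-view layout is more involved. It only assumes that concatenation keeps one shared buffer per input batch and that `take` clones the whole buffer vector.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a string-view column: each view points into
/// one of several reference-counted data buffers.
struct ViewColumn {
    /// (buffer index, offset, length) for each row.
    views: Vec<(usize, usize, usize)>,
    buffers: Vec<Arc<Vec<u8>>>,
}

impl ViewColumn {
    /// Models the build-side concat: buffers are appended, not merged,
    /// so N input batches yield at least N buffers.
    fn concat(parts: Vec<ViewColumn>) -> ViewColumn {
        let mut views = Vec::new();
        let mut buffers = Vec::new();
        for part in parts {
            let base = buffers.len();
            views.extend(part.views.iter().map(|&(b, off, len)| (base + b, off, len)));
            buffers.extend(part.buffers);
        }
        ViewColumn { views, buffers }
    }

    /// Models `take`: only the selected views are copied, but the entire
    /// buffer vector is cloned, which bumps every Arc's reference count.
    fn take(&self, indices: &[usize]) -> ViewColumn {
        ViewColumn {
            views: indices.iter().map(|&i| self.views[i]).collect(),
            buffers: self.buffers.clone(),
        }
    }

    /// Resolves one row to its string value.
    fn value(&self, row: usize) -> &str {
        let (b, off, len) = self.views[row];
        std::str::from_utf8(&self.buffers[b][off..off + len]).expect("valid UTF-8")
    }
}

/// Builds a column from `n` single-row batches, one buffer per batch.
fn build_side(n: usize) -> ViewColumn {
    let parts = (0..n)
        .map(|i| {
            let data = format!("payload-{i}").into_bytes();
            let len = data.len();
            ViewColumn { views: vec![(0, 0, len)], buffers: vec![Arc::new(data)] }
        })
        .collect();
    ViewColumn::concat(parts)
}

/// Strong reference count of every buffer held by the column.
fn strong_counts(col: &ViewColumn) -> Vec<usize> {
    col.buffers.iter().map(Arc::strong_count).collect()
}

fn main() {
    let build = build_side(4);
    println!("after concat: {:?}", strong_counts(&build));
    // A single-row take still touches all N buffers.
    let out = build.take(&[2]);
    println!("after take:   {:?}", strong_counts(&build));
    println!("selected value: {}", out.value(0));
}
```

In this model, a take that selects a single row still performs N atomic increments, and another N decrements when the output batch is dropped, so the cost per output batch grows with the number of build-side batches rather than with the number of rows selected.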