Re: [PR] GH-45917: [C++][Acero] Join result materialize in parallel [arrow]

via GitHub Sat, 05 Apr 2025 10:00:11 -0700


zanmato1984 commented on PR #45918:
URL: https://github.com/apache/arrow/pull/45918#issuecomment-2750144540


   Hi @uchenily, thank you for opening the PR.
   
   I would like to add some more detail about the problem this PR is trying to 
address:
   By the time `JoinProbeProcessor::OnFinished` is invoked, there will be 
at most `1 << 15` (32k) pending rows (that is, rows that should have been, but 
have not yet been, emitted to downstream nodes) in each `JoinResultMaterialize` 
(`num_threads` of them in total), and we are going to `Flush` them. Executing 
these `Flush` calls serially not only slows down the flushing itself, but also 
disables parallelism for any downstream processing, which in some cases might 
be computationally intensive (consider an aggregation like `select sum(c) from 
t1 right join t2 on a = b group by d`). Thus I think this PR makes sense.
   
   Just wondering whether you have encountered any case where the above 
problem causes a real performance issue, and if so, how bad is it? And how much 
does this PR improve it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
