2010YOUY01 commented on issue #17267: URL: https://github.com/apache/datafusion/issues/17267#issuecomment-3209685859
> Yes that is a bit difficult as well, we don't know the size up front and usually the number of partitions is based on the memory size and memory limit. It will just be a configurable variable.
>
> > I recommend getting sort merge join working reliably before experimenting with HJ spilling (i.e. benchmarks should be able to finish under a modest memory limit, perhaps also more tests). The existing solution is not production ready yet, but I think SMJ should have lower maintenance overhead, since its core reuses the external sort implementation.
>
> Is there a reason we shouldn't do it in parallel? I will only start after [#17260](https://github.com/apache/datafusion/issues/17260) is finished.

I don’t think we’re ready to merge another external join implementation right now. The main reason is that the existing SMJ is still flaky. Adding another potentially flaky implementation (given the lack of tests to ensure reliability) won’t really help users, while it will increase maintenance overhead.

To make this implementation mergeable, I think it would be better to:

1. Strengthen external join tests, e.g., allow the TPC-H benchmark to run under fuzzed memory limit configurations.
2. Fix issues in SMJ and improve its performance. Letting HJ fall back to SMJ when the memory limit is hit could both be simpler and avoid introducing additional configuration (like partition numbers).
3. Do a PoC with this approach. If it provides significant performance improvements, then it would make sense to include it.

So at this stage, I think it’s a good idea to implement it as a PoC, but there’s still quite a bit of work to do before it’s mergeable.
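
To make point 2 concrete, here is a minimal sketch of the fallback idea in plain Python. It is only an illustration under stated assumptions: `MemoryPool`, `hash_join`, `sort_merge_join`, and `join_with_fallback` are hypothetical names, not DataFusion APIs, the fixed `row_bytes` cost is made up, and an in-memory sort stands in for the external sort that a real SMJ would use.

```python
from dataclasses import dataclass


class MemoryLimitExceeded(Exception):
    """Raised when a reservation would exceed the pool's limit."""


@dataclass
class MemoryPool:
    """Toy memory pool (hypothetical; not DataFusion's MemoryPool trait)."""
    limit: int
    used: int = 0

    def try_grow(self, nbytes: int) -> None:
        if self.used + nbytes > self.limit:
            raise MemoryLimitExceeded(f"cannot reserve {nbytes} bytes")
        self.used += nbytes

    def shrink(self, nbytes: int) -> None:
        self.used -= nbytes


def hash_join(build, probe, pool, row_bytes=16):
    """Inner equi-join; reserves memory for every build-side row."""
    table = {}
    reserved = 0
    try:
        for key, value in build:
            pool.try_grow(row_bytes)  # may raise MemoryLimitExceeded
            reserved += row_bytes
            table.setdefault(key, []).append(value)
        return [(k, b, p) for k, p in probe for b in table.get(k, [])]
    finally:
        pool.shrink(reserved)  # release the reservation on success or failure


def sort_merge_join(build, probe):
    """Inner equi-join by sorting both sides and merging runs of equal keys."""
    left = sorted(build, key=lambda r: r[0])
    right = sorted(probe, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            out.extend((lk, b, p) for _, b in left[i:i_end] for _, p in right[j:j_end])
            i, j = i_end, j_end
    return out


def join_with_fallback(build, probe, pool):
    """Try the hash join first; fall back to sort-merge if memory runs out."""
    try:
        return "hash", hash_join(build, probe, pool)
    except MemoryLimitExceeded:
        return "sort_merge", sort_merge_join(build, probe)
```

In this sketch the only tuning input is the pool's limit. No partition count has to be configured, which is the simplification point 2 is after. The two strategies may return rows in a different order, so a test comparing them should sort the results first.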
