mingmwang commented on PR #4620: URL: https://github.com/apache/arrow-datafusion/pull/4620#issuecomment-1354468789
> > I plan to review this PR carefully. In general, I think DP join reordering can cover more cases, with algorithms such as `DpSize`, `DpSub`, and `DpCcp/DpHyp`. `DpSub` is implemented in `GPORCA` and isn't complex. `fact-dimension joins` correspond to a star join in the query graph; I think we can cover more join shapes, such as chain joins.
>
> I'd be interested to see this. Prior to this PR I tried implementing join reordering in Spark RAPIDS based on a DP approach, using Spark's `CostBasedJoinReorder` as a starting point (based on [Access Path Selection in a Relational Database Management System](https://courses.cs.duke.edu/compsci516/cps216/spring03/papers/selinger-etal-1979.pdf)). It did work well for a few queries in TPC-DS but also introduced regressions. The challenge with this approach was that it only works well with good cardinality estimates, and that requires good column-level statistics to be available (which was not the case in Spark). I don't know anything about GPORCA and whether they were able to solve that problem.
>
> > I also notice that this PR only handles inner joins. I think we can also optimize SEMI/ANTI joins with a trick (pushing them down), because a SEMI/ANTI join can't increase the cardinality; in that respect it is similar to a Filter.
>
> I agree. I am planning on trying this out.
>
> > BTW, I noticed `join_selection` in the physical optimizer; the two look similar.
>
> I don't think they are. I think `join_selection` is just changing build/probe order in hash joins rather than rebuilding join trees?

Yes, they serve different purposes, and as you mentioned, the two rules can work together.
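
For readers following the DP discussion quoted above, here is a minimal Python sketch of a `DpSub`-style enumerator. It is not taken from this PR, from GPORCA, or from Spark. The function name `dp_sub`, the input shapes, and the toy cost model (sum of estimated intermediate row counts, with independent selectivities) are all assumptions made for illustration. It also shows why the approach depends on good cardinality estimates: the plan choice is driven entirely by the `card` and `sel` inputs.

```python
def dp_sub(card, sel):
    """Hypothetical DPsub-style join enumeration (illustrative only).

    card: list of estimated base-table row counts.
    sel:  dict mapping a join edge (i, j) to its estimated selectivity.
    Returns (cost, rows, plan) for the cheapest bushy plan without
    cross products, or None if the query graph is disconnected.
    """
    n = len(card)
    # best[mask] = (cost, rows, plan) for the set of tables encoded in mask
    best = {1 << i: (0.0, float(card[i]), i) for i in range(n)}

    def crossing_selectivity(left, right):
        # Product of selectivities of all edges joining the two sides;
        # None means the sides are not connected (a cross product).
        factor, found = 1.0, False
        for (i, j), s in sel.items():
            a, b = 1 << i, 1 << j
            if (left & a and right & b) or (left & b and right & a):
                factor *= s
                found = True
        return factor if found else None

    # Increasing mask order guarantees every subset is solved first.
    for mask in range(1, 1 << n):
        if mask & (mask - 1) == 0:
            continue  # single table, already initialised
        left = (mask - 1) & mask
        while left:
            right = mask ^ left
            # The toy cost is symmetric, so visit each split only once.
            if left < right and left in best and right in best:
                factor = crossing_selectivity(left, right)
                if factor is not None:
                    lcost, lrows, lplan = best[left]
                    rcost, rrows, rplan = best[right]
                    rows = lrows * rrows * factor
                    cost = lcost + rcost + rows
                    if mask not in best or cost < best[mask][0]:
                        best[mask] = (cost, rows, (lplan, rplan))
            left = (left - 1) & mask  # next proper subset of mask
    return best.get((1 << n) - 1)
```

With a star-shaped input such as `dp_sub([1000, 10, 100], {(0, 1): 0.01, (0, 2): 0.01})`, the sketch joins the fact table (index 0) with the more selective dimension first. Changing the selectivity estimates changes the chosen tree, which is the sensitivity to statistics described in the quoted reply.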