[GitHub] [arrow-datafusion] mingmwang commented on pull request #4620: Implement optimizer rule for reordering fact-dimension joins

GitBox Fri, 16 Dec 2022 01:43:29 -0800


mingmwang commented on PR #4620:
URL: 
https://github.com/apache/arrow-datafusion/pull/4620#issuecomment-1354468789


   > > I plan to review this PR carefully. In general, I think DP join 
reordering can cover more cases, e.g. `DpSize`, `DpSub`, `DpCcp`/`DpHyp`. DpSub 
is implemented in `GPORCA` and isn't complex. `fact-dimension joins` are star 
joins in the query graph; I think we can also cover other join shapes such as chain joins/....
   > 
   > I'd be interested to see this. Prior to this PR I tried implementing join 
reordering in Spark RAPIDS based on a DP approach, using Spark's 
`CostBasedJoinReorder` as a starting point (based on [Access Path Selection in 
a Relational Database Management 
System](https://courses.cs.duke.edu/compsci516/cps216/spring03/papers/selinger-etal-1979.pdf)).
 It did work well for a few queries in TPC-DS but also introduced regressions. 
The challenge with this approach was that it only works well with good 
cardinality estimates, and that requires good column-level statistics to be 
available (which was not the case in Spark). I don't know anything about GPORCA 
or whether they were able to solve that problem.
   > 
   > > I also noticed that this PR only handles inner joins. I think we can also 
optimize SEMI/ANTI joins with a trick (pushing them down): a SEMI/ANTI join 
cannot increase cardinality, so it behaves much like a Filter.
   > 
   > I agree. I am planning on trying this out.
   > 
   > > BTW, I noticed `join_selection` in the physical optimizer; the two 
look similar.
   > 
   > I don't think they are. I think `join_selection` is just changing 
build/probe order in hash joins rather than rebuilding join trees?
   
   Yes, they serve different purposes, and as you mentioned, the two rules 
can work together.
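
To make the distinction concrete, here is a minimal Python sketch. It is not DataFusion code: the function names, the cardinalities, the selectivities, and the "sum of intermediate result sizes" cost model are all assumptions chosen for illustration. `dp_join_order` decides which tables are joined first (it rebuilds the join tree, in the spirit of the subset-based DP discussed above), while `put_smaller_side_left` keeps the tree shape and only swaps the two inputs of each join, assuming the left input is the hash-join build side.

```python
from itertools import combinations


def est_rows(tables, card, sel):
    """Estimated rows after joining `tables` (independence assumption)."""
    rows = 1.0
    for t in tables:
        rows *= card[t]
    for pair, s in sel.items():
        if pair <= tables:  # both ends of the join predicate are present
            rows *= s
    return rows


def dp_join_order(card, sel):
    """Subset-based DP over connected table sets; returns (cost, plan).

    Cost is the sum of intermediate result sizes (an assumed cost model).
    Cross products are not considered, so the join graph must be connected.
    """
    tables = sorted(card)
    best = {frozenset([t]): (0.0, t) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            s = frozenset(subset)
            out_rows = est_rows(s, card, sel)
            for k in range(1, size):
                for left in combinations(subset, k):
                    l, r = frozenset(left), s - frozenset(left)
                    if l not in best or r not in best:
                        continue
                    # Require at least one join predicate between the sides.
                    if not any(frozenset((a, b)) in sel for a in l for b in r):
                        continue
                    cost = best[l][0] + best[r][0] + out_rows
                    if s not in best or cost < best[s][0]:
                        best[s] = (cost, (best[l][1], best[r][1]))
    return best[frozenset(tables)]


def leaves(plan):
    """Set of table names under a plan node (a name or a (left, right) pair)."""
    return {plan} if isinstance(plan, str) else leaves(plan[0]) | leaves(plan[1])


def put_smaller_side_left(plan, card, sel):
    """Keep the join tree shape; only swap inputs so the smaller one is left."""
    if isinstance(plan, str):
        return plan
    l = put_smaller_side_left(plan[0], card, sel)
    r = put_smaller_side_left(plan[1], card, sel)
    if est_rows(leaves(r), card, sel) < est_rows(leaves(l), card, sel):
        l, r = r, l
    return (l, r)


# Hypothetical star schema: one fact table, two dimensions (d2 is filtered).
card = {"fact": 1_000_000, "d1": 1000, "d2": 10}
sel = {frozenset(("fact", "d1")): 0.001, frozenset(("fact", "d2")): 0.001}

cost, plan = dp_join_order(card, sel)
print(plan, cost)
print(put_smaller_side_left((("fact", "d2"), "d1"), card, sel))
```

With these made-up statistics the reordering step joins `fact` with the filtered `d2` before `d1`, and the swap step then only changes which input of each join comes first. Both steps depend on the row estimates, which is why poor statistics hurt the DP approach as described above.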


