Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/17286#discussion_r106583013 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala --- @@ -203,64 +205,46 @@ object JoinReorderDP extends PredicateHelper { private def buildJoin( oneJoinPlan: JoinPlan, otherJoinPlan: JoinPlan, - conf: CatalystConf, + conf: SQLConf, conditions: Set[Expression], - topOutput: AttributeSet): JoinPlan = { + topOutput: AttributeSet): Option[JoinPlan] = { val onePlan = oneJoinPlan.plan val otherPlan = otherJoinPlan.plan - // Now both onePlan and otherPlan become intermediate joins, so the cost of the - // new join should also include their own cardinalities and sizes. - val newCost = if (isCartesianProduct(onePlan) || isCartesianProduct(otherPlan)) { - // We consider cartesian product very expensive, thus set a very large cost for it. - // This enables to plan all the cartesian products at the end, because having a cartesian - // product as an intermediate join will significantly increase a plan's cost, making it - // impossible to be selected as the best plan for the items, unless there's no other choice. - Cost( - rows = BigInt(Long.MaxValue) * BigInt(Long.MaxValue), - size = BigInt(Long.MaxValue) * BigInt(Long.MaxValue)) - } else { - val onePlanStats = onePlan.stats(conf) - val otherPlanStats = otherPlan.stats(conf) - Cost( - rows = oneJoinPlan.cost.rows + onePlanStats.rowCount.get + - otherJoinPlan.cost.rows + otherPlanStats.rowCount.get, - size = oneJoinPlan.cost.size + onePlanStats.sizeInBytes + - otherJoinPlan.cost.size + otherPlanStats.sizeInBytes) - } - - // Put the deeper side on the left, tend to build a left-deep tree. 
- val (left, right) = if (oneJoinPlan.itemIds.size >= otherJoinPlan.itemIds.size) { - (onePlan, otherPlan) - } else { - (otherPlan, onePlan) - } val joinConds = conditions .filterNot(l => canEvaluate(l, onePlan)) .filterNot(r => canEvaluate(r, otherPlan)) .filter(e => e.references.subsetOf(onePlan.outputSet ++ otherPlan.outputSet)) - // We use inner join whether join condition is empty or not. Since cross join is - // equivalent to inner join without condition. - val newJoin = Join(left, right, Inner, joinConds.reduceOption(And)) - val collectedJoinConds = joinConds ++ oneJoinPlan.joinConds ++ otherJoinPlan.joinConds - val remainingConds = conditions -- collectedJoinConds - val neededAttr = AttributeSet(remainingConds.flatMap(_.references)) ++ topOutput - val neededFromNewJoin = newJoin.outputSet.filter(neededAttr.contains) - val newPlan = - if ((newJoin.outputSet -- neededFromNewJoin).nonEmpty) { - Project(neededFromNewJoin.toSeq, newJoin) + if (joinConds.isEmpty) { + // Cartesian product is very expensive, so we exclude them from candidate plans. + // This also significantly reduces the search space. --- End diff -- great! now we can safely apply this optimization :)
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes it to be, or if the feature is enabled but not working, please contact infrastructure at infrastructure@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org For additional commands, e-mail: reviews-help@spark.apache.org