Tanel Kiis created SPARK-33935:
----------------------------------

             Summary: Fix CBOs cost function
                 Key: SPARK-33935
                 URL: https://issues.apache.org/jira/browse/SPARK-33935
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Tanel Kiis
The parameter spark.sql.cbo.joinReorder.card.weight is documented as:

{code:title=spark.sql.cbo.joinReorder.card.weight}
The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
{code}

But the implementation uses a different formula:

{code:title=Current implementation}
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
  if (other.planCost.card == 0 || other.planCost.size == 0) {
    false
  } else {
    val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
    val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
    relativeRows * conf.joinReorderCardWeight + relativeSize * (1 - conf.joinReorderCardWeight) < 1
  }
}
{code}

This ratio-based comparison has an unfortunate consequence: given two plans A and B, both "A betterThan B" and "B betterThan A" can return false, so neither plan is considered better than the other. This happens when one plan has many rows with a small size and the other has few rows with a large size. Example values that exhibit this behavior with the default weight value (0.7):

A.card = 500, B.card = 300
A.size = 30, B.size = 80

In both directions the weighted score is above 1 (roughly 1.28 for A vs. B and 1.22 for B vs. A), so both comparisons return false.

A new implementation is proposed that matches the documentation: compute the weighted cost of each plan directly and compare the two costs.

{code:title=Proposed implementation}
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
  val thisCost = BigDecimal(this.planCost.card) * conf.joinReorderCardWeight +
    BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
  val otherCost = BigDecimal(other.planCost.card) * conf.joinReorderCardWeight +
    BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
  thisCost < otherCost
}
{code}
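The asymmetry can be reproduced outside of Spark. The following standalone sketch re-implements both comparison functions with the example values above; `Cost`, `weight`, and the function names are hypothetical stand-ins for Spark's internal `JoinPlan`/`SQLConf` machinery, not Spark APIs.

```scala
// Minimal sketch (not Spark code): demonstrates that the current ratio-based
// comparison can declare neither plan better, while the proposed weighted-sum
// comparison always orders the two plans consistently.
object CostComparisonDemo {
  // Stand-in for Spark's plan cost: number of rows and size in bytes.
  case class Cost(card: BigDecimal, size: BigDecimal)

  // Default value of spark.sql.cbo.joinReorder.card.weight.
  val weight = BigDecimal("0.7")

  // Current implementation: score built from ratios, compared against 1.
  def betterThanCurrent(a: Cost, b: Cost): Boolean = {
    if (b.card == 0 || b.size == 0) {
      false
    } else {
      val relativeRows = a.card / b.card
      val relativeSize = a.size / b.size
      relativeRows * weight + relativeSize * (1 - weight) < 1
    }
  }

  // Proposed implementation: weighted cost per plan, compared directly.
  def cost(c: Cost): BigDecimal = c.card * weight + c.size * (1 - weight)
  def betterThanProposed(a: Cost, b: Cost): Boolean = cost(a) < cost(b)

  def main(args: Array[String]): Unit = {
    val planA = Cost(card = 500, size = 30)
    val planB = Cost(card = 300, size = 80)

    // Current: both directions score above 1, so neither plan "wins".
    println(betterThanCurrent(planA, planB)) // false (score ~1.279)
    println(betterThanCurrent(planB, planA)) // false (score 1.22)

    // Proposed: cost(A) = 359, cost(B) = 234, so exactly one direction holds.
    println(betterThanProposed(planA, planB)) // false
    println(betterThanProposed(planB, planA)) // true
  }
}
```

With the proposed formula the relation is a strict ordering on costs, so for any two plans with different costs exactly one of the two comparisons returns true.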