[GitHub] spark pull request #14866: [SPARK-17298][SQL] Require explicit CROSS join fo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r98751522 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- If we do it in the optimizer, it does not cover all the cases, right? Cartesian Product might be chosen based on the statistics. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
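The check this rule performs — "is there any conjunct of the join condition that references columns from both sides?" — can be sketched independently of Catalyst. The following is a toy model, not the actual Spark implementation: a plan is reduced to its output column set and each conjunct to the set of columns it references (Catalyst uses `AttributeSet` and `splitConjunctivePredicates` instead).

```scala
// Toy sketch of the cartesian-product check described in the quoted rule.
// A join is a cartesian product when NO conjunct of its condition
// references columns from both the left and the right side.
object CartesianCheck {
  def isCartesianProduct(
      leftOutput: Set[String],
      rightOutput: Set[String],
      conjuncts: Seq[Set[String]]): Boolean = {
    // No conjunct touching both sides => cartesian product.
    !conjuncts.exists(refs =>
      refs.exists(leftOutput.contains) && refs.exists(rightOutput.contains))
  }

  def main(args: Array[String]): Unit = {
    val r = Set("r")
    val s = Set("s")
    // SELECT * FROM R, S WHERE R.r = S.s -> conjunct {r, s}: not a cartesian product
    println(CartesianCheck.isCartesianProduct(r, s, Seq(Set("r", "s")))) // false
    // SELECT * FROM R, S -> no conjuncts: cartesian product
    println(CartesianCheck.isCartesianProduct(r, s, Seq.empty))          // true
    // SELECT * FROM R, S WHERE R.r = 1 -> conjunct {r} only: still cartesian
    println(CartesianCheck.isCartesianProduct(r, s, Seq(Set("r"))))      // true
  }
}
```

The third case illustrates why the rule must run after ReorderJoin: a single-sided predicate like `R.r = 1` does not connect the two relations, so the join remains a cartesian product.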
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r98751322 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- If we do it in the optimizer, it does not cover all the cases, right? Cartesian Product might be chosen based on the statistics.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r98750834 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- If we do it here, it does not cover all the cases, right? `Cartesian Product` might be chosen based on the statistics.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14866
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77383666 --- Diff: R/pkg/R/DataFrame.R --- @@ -2265,7 +2265,7 @@ setMethod("join", signature(x = "SparkDataFrame", y = "SparkDataFrame"), function(x, y, joinExpr = NULL, joinType = NULL) { if (is.null(joinExpr)) { - sdf <- callJMethod(x@sdf, "join", y@sdf) + sdf <- callJMethod(x@sdf, "crossJoin", y@sdf) } else { if (class(joinExpr) != "Column") stop("joinExpr must be a Column") if (is.null(joinType)) { --- End diff -- could you add "cross" to the API doc - L2246
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77380304 --- Diff: R/pkg/R/DataFrame.R --- @@ -2265,7 +2265,7 @@ setMethod("join", signature(x = "SparkDataFrame", y = "SparkDataFrame"), function(x, y, joinExpr = NULL, joinType = NULL) { if (is.null(joinExpr)) { - sdf <- callJMethod(x@sdf, "join", y@sdf) + sdf <- callJMethod(x@sdf, "crossJoin", y@sdf) } else { if (class(joinExpr) != "Column") stop("joinExpr must be a Column") if (is.null(joinType)) { --- End diff -- I've added cross to allowable join types. With a joinExpr I think inner is appropriate
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77290718 --- Diff: R/pkg/R/DataFrame.R --- @@ -2265,7 +2265,7 @@ setMethod("join", signature(x = "SparkDataFrame", y = "SparkDataFrame"), function(x, y, joinExpr = NULL, joinType = NULL) { if (is.null(joinExpr)) { - sdf <- callJMethod(x@sdf, "join", y@sdf) + sdf <- callJMethod(x@sdf, "crossJoin", y@sdf) } else { if (class(joinExpr) != "Column") stop("joinExpr must be a Column") if (is.null(joinType)) { --- End diff -- should the next line be "crossJoin" too?
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77219760 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- As discussed, while it is probably best for this to be in analysis (does throw an analysis exception after all), leaving it here until we want to reorganize the analysis/optimizer rules.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77155243 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- Yeah, it is the earliest possible place. It is just not something I would expect from optimization. If we are to add it in QueryExecution then we should do this after optimization and before planning: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L76-L81
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77099268 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -39,39 +38,46 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * * The joined plan are picked from left to right, prefer those has at least one join condition. * - * @param input a list of LogicalPlans to join. + * @param input a list of LogicalPlans to inner join and the type of inner join. * @param conditions a list of condition for join. */ @tailrec - def createOrderedJoin(input: Seq[LogicalPlan], conditions: Seq[Expression]): LogicalPlan = { + def createOrderedJoin(input: Seq[(LogicalPlan, InnerLike)], conditions: Seq[Expression]) +: LogicalPlan = { assert(input.size >= 2) if (input.size == 2) { --- End diff -- Sameer helped me interpret this comment. I *think* I've addressed this.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77098768 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala --- @@ -28,6 +28,7 @@ object JoinType { case "rightouter" | "right" => RightOuter case "leftsemi" => LeftSemi case "leftanti" => LeftAnti +case "cross" => Cross case _ => val supported = Seq( --- End diff -- done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77098708 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- This is just the earliest possible place to do the check. (In fact it ideally belongs in analysis.) Could you elaborate on triggering this in QueryExecution?
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77097569 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right. + */ + def isCartesianProduct(join: Join) + : Boolean = { --- End diff -- Done.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77089271 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -39,39 +38,46 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * * The joined plan are picked from left to right, prefer those has at least one join condition. * - * @param input a list of LogicalPlans to join. + * @param input a list of LogicalPlans to inner join and the type of inner join. * @param conditions a list of condition for join. */ @tailrec - def createOrderedJoin(input: Seq[LogicalPlan], conditions: Seq[Expression]): LogicalPlan = { + def createOrderedJoin(input: Seq[(LogicalPlan, InnerLike)], conditions: Seq[Expression]) +: LogicalPlan = { assert(input.size >= 2) if (input.size == 2) { --- End diff -- using a pattern match might be simpler here.
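The pattern-match style suggested in this comment could look roughly as follows. This is a toy sketch only: plans are reduced to strings, the condition-aware ordering heuristics and left-deep folding of the real `createOrderedJoin` are omitted, and only the match structure replacing the `input.size == 2` check is shown.

```scala
// Toy sketch of the suggestion: replace `if (input.size == 2)` with a
// pattern match on the (plan, join type) sequence.
sealed trait InnerLike
case object Inner extends InnerLike
case object Cross extends InnerLike

object OrderedJoin {
  def createOrderedJoin(input: Seq[(String, InnerLike)]): String = {
    require(input.size >= 2, "need at least two plans to join")
    input match {
      // Base case: exactly two plans left.
      case Seq((left, _), (right, _)) =>
        s"Join($left, $right)"
      // Recursive case: peel off the head and join it with the rest.
      case Seq((left, _), rest @ _*) =>
        s"Join($left, ${createOrderedJoin(rest)})"
    }
  }
}
```

For example, `OrderedJoin.createOrderedJoin(Seq(("R", Inner), ("S", Cross), ("T", Inner)))` yields the nested string `Join(R, Join(S, T))`.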
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77086020 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala --- @@ -28,6 +28,7 @@ object JoinType { case "rightouter" | "right" => RightOuter case "leftsemi" => LeftSemi case "leftanti" => LeftAnti +case "cross" => Cross case _ => val supported = Seq( --- End diff -- also add cross to the list.
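The review point here is that "cross" must be added in two places: the match case and the `supported` list used in the error message. A simplified stand-in for Catalyst's `JoinType.apply` (only a few join types modeled, names illustrative) shows the shape:

```scala
// Simplified sketch of JoinType string parsing with "cross" added both
// to the match and to the supported-types list (per the review comment).
sealed trait JoinType
case object Inner extends JoinType
case object Cross extends JoinType
case object LeftOuter extends JoinType

object JoinType {
  private val supported = Seq("inner", "cross", "left", "leftouter")

  def apply(typ: String): JoinType =
    typ.toLowerCase.replace("_", "") match {
      case "inner"              => Inner
      case "cross"              => Cross // the new case from the diff
      case "leftouter" | "left" => LeftOuter
      case other =>
        // "cross" now appears in this message as well.
        throw new IllegalArgumentException(
          s"Unsupported join type '$other'. Supported join types include: " +
            supported.mkString("'", "', '", "'") + ".")
    }
}
```

Parsing is case-insensitive and ignores underscores, so `JoinType("CROSS")` and `JoinType("left_outer")` both resolve as expected.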
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77085681 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) --- End diff -- More general remark. Why is this part of optimization? We could for instance also trigger this check in `QueryExecution`.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77084851 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,47 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right. + */ + def isCartesianProduct(join: Join) + : Boolean = { --- End diff -- nit: put Boolean on the same line as the def.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77069424 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join specified by left, right and condition is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right. + */ + def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: Option[Expression]) + : Boolean = { +val conditions = condition.map(splitConjunctivePredicates).getOrElse(Nil) +!conditions.map(_.references).exists(refs => refs.exists(left.outputSet.contains) +&& refs.exists(right.outputSet.contains)) + } + + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | FullOuter, condition) +if isCartesianProduct(left, right, condition) => + throw new AnalysisException( +s"""Detected cartesian product for ${j.joinType.sql} join between logical plans + |${left.treeString(false).trim} + |and + |${right.treeString(false).trim} + |Use a CROSS JOIN to allow cartesian products between these relations""".stripMargin) --- End diff -- done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77067766 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL) override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL) + + override def allowCartesianProduct: Boolean = getConf(CROSS_JOINS_ENABLED) --- End diff -- As discussed, I only have the crossJoinEnabled flag, which has been exposed in CatalystConf. Also as discussed, deprecating crossJoinEnabled and switching the conf flag to requireCrossJoinSyntax is a good idea, will address in a follow-up PR.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77067849 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala --- @@ -353,7 +353,7 @@ case class BroadcastNestedLoopJoinExec( val broadcastedRelation = broadcast.executeBroadcast[Array[InternalRow]]() --- End diff -- The physical plan checks have been removed.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77049987 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -39,39 +38,44 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * * The joined plan are picked from left to right, prefer those has at least one join condition. * - * @param input a list of LogicalPlans to join. + * @param input a list LogicalPlans to inner join, and a bool speciying if an explicit cross join --- End diff -- fixed
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77049867 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala --- @@ -303,7 +303,7 @@ package object dsl { def join( otherPlan: LogicalPlan, -joinType: JoinType = Inner, +joinType: JoinType = Inner(false), --- End diff -- Now renamed.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r77049426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join specified by left, right and condition is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right. + */ + def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: Option[Expression]) --- End diff -- done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76885747 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala --- @@ -39,39 +38,44 @@ object ReorderJoin extends Rule[LogicalPlan] with PredicateHelper { * * The joined plan are picked from left to right, prefer those has at least one join condition. * - * @param input a list of LogicalPlans to join. + * @param input a list LogicalPlans to inner join, and a bool speciying if an explicit cross join --- End diff -- nit: typo
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76885372 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala --- @@ -303,7 +303,7 @@ package object dsl { def join( otherPlan: LogicalPlan, -joinType: JoinType = Inner, +joinType: JoinType = Inner(false), --- End diff -- nit: not a big deal either way but explicitly naming the boolean parameters (`Inner(explicitCartesian = false)`) would be great for readability.
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76884744 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join specified by left, right and condition is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right.
+ */ + def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: Option[Expression]) + : Boolean = { +val conditions = condition.map(splitConjunctivePredicates).getOrElse(Nil) +!conditions.map(_.references).exists(refs => refs.exists(left.outputSet.contains) +&& refs.exists(right.outputSet.contains)) + } + + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | FullOuter, condition) +if isCartesianProduct(left, right, condition) => + throw new AnalysisException( +s"""Detected cartesian product for ${j.joinType.sql} join between logical plans + |${left.treeString(false).trim} + |and + |${right.treeString(false).trim} + |Use a CROSS JOIN to allow cartesian products between these relations""".stripMargin) --- End diff -- Should we also talk about the conf flag here?
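For readers following the diff above, the predicate in `isCartesianProduct` can be sketched outside Catalyst as a small toy model: plain string sets stand in for Catalyst's `AttributeSet`, and each join-condition conjunct is reduced to the set of attribute names it references (all names here are illustrative, not Spark APIs):

```scala
// Toy model of the cartesian-product check quoted above.
// A join is a cartesian product when no single conjunct of the
// join condition touches attributes from both the left and right side.
def isCartesianProduct(
    leftOutput: Set[String],
    rightOutput: Set[String],
    conditionRefs: Seq[Set[String]]): Boolean =
  !conditionRefs.exists(refs =>
    refs.exists(leftOutput.contains) && refs.exists(rightOutput.contains))
```

So `SELECT * FROM R, S WHERE R.r = S.s` yields one conjunct referencing both sides and is not flagged, while an empty condition list, or a filter touching only one side (`R.r = 1`), leaves the join a cartesian product.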
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76884680 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -933,6 +936,45 @@ object CombineLimits extends Rule[LogicalPlan] { } /** + * Check if there any cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + * + * This rule must be run AFTER the ReorderJoin rule since the join conditions for each join must be + * collected before checking if it is a cartesian product. If you have + * SELECT * from R, S where R.r = S.s, + * the join between R and S is not a cartesian product and therefore should be allowed. + * The predicate R.r = S.s is not recognized as a join condition until the ReorderJoin rule. + */ +case class CheckCartesianProducts(conf: CatalystConf) +extends Rule[LogicalPlan] with PredicateHelper { + /** + * Check if a join specified by left, right and condition is a cartesian product. Returns true if + * there are no join conditions involving references from both left and right. + */ + def isCartesianProduct(left: LogicalPlan, right: LogicalPlan, condition: Option[Expression]) --- End diff -- nit: might be cleaner to just pass in the `Join` node directly as a parameter
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76878769 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala --- @@ -353,7 +353,7 @@ case class BroadcastNestedLoopJoinExec( val broadcastedRelation = broadcast.executeBroadcast[Array[InternalRow]]() --- End diff -- per offline discussion, depending on the semantics we decide, we should revisit whether we'd like to keep the physical plan check here L344
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76874085 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL) override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL) + + override def allowCartesianProduct: Boolean = getConf(CROSS_JOINS_ENABLED) --- End diff -- Also, on a related note, do you think it'd make sense to rename this (and the underlying conf) to `requireExplicitCrossJoinSyntax` to better convey the implied semantics?
Github user sameeragarwal commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76873765 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -698,6 +698,8 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging { override def orderByOrdinal: Boolean = getConf(ORDER_BY_ORDINAL) override def groupByOrdinal: Boolean = getConf(GROUP_BY_ORDINAL) + + override def allowCartesianProduct: Boolean = getConf(CROSS_JOINS_ENABLED) --- End diff -- The `crossJoinEnabled` above also references the same underlying conf and is being used at multiple places in the code. Can we unify the two?
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76857357 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -642,7 +642,7 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkColumnNames( """SELECT x.a, y.a, x.b, y.b |FROM (SELECT 1 AS a, 2 AS b) x -|INNER JOIN (SELECT 1 AS a, 2 AS b) y +|CROSS JOIN (SELECT 1 AS a, 2 AS b) y --- End diff -- The join condition is eliminated because it is "trivial" according to the foldable propagation rule. This is annoying, but an expected by-product of detecting joins in the optimization phase. I can check to see if there is a way to split up the operator optimization batch so the cartesian check can be inserted before the constant propagation.
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76720134 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/catalyst/LogicalPlanToSQLSuite.scala --- @@ -642,7 +642,7 @@ class LogicalPlanToSQLSuite extends SQLBuilderTest with SQLTestUtils { checkColumnNames( """SELECT x.a, y.a, x.b, y.b |FROM (SELECT 1 AS a, 2 AS b) x -|INNER JOIN (SELECT 1 AS a, 2 AS b) y +|CROSS JOIN (SELECT 1 AS a, 2 AS b) y --- End diff -- this is not a cross join, is it?
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76716719 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -159,23 +159,30 @@ object ExtractEquiJoinKeys extends Logging with PredicateHelper { */ object ExtractFiltersAndInnerJoins extends PredicateHelper { - // flatten all inner joins, which are next to each other - def flattenJoin(plan: LogicalPlan): (Seq[LogicalPlan], Seq[Expression]) = plan match { -case Join(left, right, Inner, cond) => - val (plans, conditions) = flattenJoin(left) - (plans ++ Seq(right), conditions ++ cond.toSeq) + /** + * Flatten all inner joins, which are next to each other. + * Return a list of logical plans to be joined with a boolean for each plan indicating if it + * was involved in an explicit cross join. Also returns the entire list of join conditions for + * the left-deep tree. + */ + def flattenJoin(plan: LogicalPlan, parentJoinType: Inner = Inner(false)) --- End diff -- Done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76716740 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. --- End diff -- Done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76716676 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, joinType, conditions) --- End diff -- Done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76716689 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, joinType, conditions) + if Seq(Inner(false), LeftOuter, RightOuter, FullOuter).contains(joinType) => +val hasJoinCondition = --- End diff -- Done.
Github user srinathshankar commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76683461 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { --- End diff -- As discussed, must be done after ReorderJoin. Added a comment explaining why.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76670132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. --- End diff -- As discussed offline. Please document that we need to run this after join reordering.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76665743 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -159,23 +159,30 @@ object ExtractEquiJoinKeys extends Logging with PredicateHelper { */ object ExtractFiltersAndInnerJoins extends PredicateHelper { - // flatten all inner joins, which are next to each other - def flattenJoin(plan: LogicalPlan): (Seq[LogicalPlan], Seq[Expression]) = plan match { -case Join(left, right, Inner, cond) => - val (plans, conditions) = flattenJoin(left) - (plans ++ Seq(right), conditions ++ cond.toSeq) + /** + * Flatten all inner joins, which are next to each other. + * Return a list of logical plans to be joined with a boolean for each plan indicating if it + * was involved in an explicit cross join. Also returns the entire list of join conditions for + * the left-deep tree. + */ + def flattenJoin(plan: LogicalPlan, parentJoinType: Inner = Inner(false)) --- End diff -- just pass in `explicitCartesian`?
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76665431 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, joinType, conditions) + if Seq(Inner(false), LeftOuter, RightOuter, FullOuter).contains(joinType) => +val hasJoinCondition = --- End diff -- move this to a function and call that function in the case guard.
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76665332 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = +if (conf.allowCartesianProduct) { + plan +} else plan transform { + case j @ Join(left, right, joinType, conditions) --- End diff -- `j @ Join(left, right, Inner(false) | LeftOuter | RightOuter | FullOuter, condition) =>`?
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/14866#discussion_r76665008 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1534,6 +1542,34 @@ object SimplifyCaseConversionExpressions extends Rule[LogicalPlan] { } /** + * Check if ther eany cartesian products between joins of any type in the optimized plan tree. + * Throw an error if a cartesian product is found without an explicit cross join specified. + * This rule is effectively disabled if the CROSS_JOINS_ENABLED flag is true. + */ +case class CheckCartesianProducts(conf: CatalystConf) extends Rule[LogicalPlan] { --- End diff -- Shouldn't this be a part of CheckAnalysis?
GitHub user srinathshankar opened a pull request: https://github.com/apache/spark/pull/14866 [SPARK-17298][SQL] Require explicit CROSS join for cartesian products ## What changes were proposed in this pull request? Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the "spark.sql.crossJoin.enabled" configuration flag will disable this check and allow cartesian products without an explicit CROSS join. The new crossJoin DataFrame API must be used to specify explicit cross joins. The existing join(DataFrame) method will produce a INNER join that will require a subsequent join condition. That is df1.join(df2) is equivalent to select * from df1, df2. ## How was this patch tested? Added cross-join.sql to the SQLQueryTestSuite to test the check for cartesian products. Added a couple of tests to the DataFrameJoinSuite to test the crossJoin API. Modified various other test suites to explicitly specify a cross join where an INNER join or a comma-separated list was previously used. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/srinathshankar/spark crossjoin Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14866.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14866 commit 250d96dcf43c09dbff821e5b20f3f0ea86f08887 Author: Srinath Shankar Date: 2016-08-25T16:43:08Z [SPARK-17298][SQL] Require explicit CROSS join for cartesian products Require the use of CROSS join syntax in SQL (and a new crossJoin DataFrame API) to specify explicit cartesian products between relations. By cartesian product we mean a join between relations R and S where there is no join condition involving columns from both R and S. If a cartesian product is detected in the absence of an explicit CROSS join, an error must be thrown. Turning on the "spark.sql.crossJoin.enabled" configuration flag will disable this check and allow cartesian products without an explicit CROSS join. The new crossJoin DataFrame API must be used to specify explicit cross joins. The existing join(DataFrame) method will produce a INNER join that will require a subsequent join condition. That is df1.join(df2) == select * from df1, df2.
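Putting the pieces of the PR description together, the decision of when to raise the error can be sketched as a pure function (a simplification for illustration; the real rule pattern-matches on Catalyst's `Join` node and reads the conf, and `Inner(true)` here models an explicit CROSS JOIN):

```scala
sealed trait JoinType
case class Inner(explicitCartesian: Boolean) extends JoinType // Inner(true) = CROSS JOIN
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

// Fail only when: the conf flag is off, the join really is a cartesian
// product, and the user did not opt in with an explicit CROSS JOIN.
def shouldFail(joinType: JoinType,
               isCartesianProduct: Boolean,
               crossJoinsEnabled: Boolean): Boolean =
  !crossJoinsEnabled && isCartesianProduct && (joinType match {
    case Inner(explicitCartesian) => !explicitCartesian
    case LeftOuter | RightOuter | FullOuter => true
  })
```

Note how gatorsmile's earlier point fits in: this gate only sees the logical plan, so a cartesian product chosen later by the physical planner based on statistics is outside its reach.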