Re: [PR] [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition [spark]
cloud-fan closed pull request #42725: [SPARK-45009][SQL] Decorrelate predicate subqueries in join condition
URL: https://github.com/apache/spark/pull/42725

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
cloud-fan commented on PR #42725:
URL: https://github.com/apache/spark/pull/42725#issuecomment-1763675871

yea they are unrelated, thanks, merging to master!
andylam-db commented on PR #42725:
URL: https://github.com/apache/spark/pull/42725#issuecomment-1761961814

@cloud-fan I think the build is failing because of unrelated test failures (timeouts and the like). Could you take a look and see if we can merge this?
jchen5 commented on code in PR #42725:
URL: https://github.com/apache/spark/pull/42725#discussion_r1355693885

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -4378,6 +4378,25 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled")
+      .internal()
+      .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " +
+        "predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
+
+  val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled")
+      .internal()
+      .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " +
+        s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " +
+        "can rewrite IN predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
We can have this true by default now - LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR was flipped to the new behavior by default recently.
andylam-db commented on code in PR #42725:
URL: https://github.com/apache/spark/pull/42725#discussion_r1355858190

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -4378,6 +4378,25 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled")
+      .internal()
+      .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " +
+        "predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
+
+  val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled")
+      .internal()
+      .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " +
+        s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " +
+        "can rewrite IN predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
Is it easier to just rely on LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR instead of creating this new flag?
andylam-db commented on code in PR #42725:
URL: https://github.com/apache/spark/pull/42725#discussion_r1355441657

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ##
@@ -159,6 +160,66 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         Project(p.output, Filter(newCond.get, inputPlan))
       }

+    // This case takes care of predicate subqueries in join conditions that are not pushed down
+    // to the children nodes by [[PushDownPredicates]].

Review Comment:
Yes -- pushdown rules are in `operatorOptimizationBatch`, and the `RewriteSubquery` batch runs after those (it is almost the last batch in the Optimizer).

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -4378,6 +4378,25 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled")
+      .internal()
+      .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " +
+        "predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
+
+  val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled")
+      .internal()
+      .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " +
+        s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " +

Review Comment:
This happens because with this change, IN subqueries are rewritten to joins and therefore force the "correct" NULL-in-empty-list behavior. This config is added in case users want to use the legacy behavior. This has nothing to do with EXISTS.
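Editor's note: the comment above says that rewriting IN subqueries to joins forces the "correct" NULL-in-empty-list behavior. The sketch below is a toy model of that interaction, not Spark's implementation. It assumes the legacy behavior evaluates `NULL IN (empty list)` to NULL while the new behavior evaluates it to FALSE; the object and function names (`NullInEmptyListSketch`, `inStandard`, `inLegacy`, `existsViaJoin`) are hypothetical and exist only for illustration.

```scala
// Toy model of SQL three-valued logic for IN; this is NOT Spark's implementation.
// None stands for SQL NULL, Some(true)/Some(false) for TRUE/FALSE.
object NullInEmptyListSketch {
  type SqlBool = Option[Boolean]

  // New ("correct") semantics: IN over an empty list is FALSE, even for a NULL value.
  def inStandard(value: Option[Int], list: Seq[Option[Int]]): SqlBool =
    if (list.isEmpty) Some(false)
    else value match {
      case None => None
      case Some(v) =>
        if (list.contains(Some(v))) Some(true)
        else if (list.contains(None)) None
        else Some(false)
    }

  // Legacy semantics (assumed shape): a NULL value yields NULL before the
  // list is inspected, so an empty list still produces NULL.
  def inLegacy(value: Option[Int], list: Seq[Option[Int]]): SqlBool =
    if (value.isEmpty) None else inStandard(value, list)

  // Join-style evaluation: a left row matches only if some right row equals it.
  // With an empty right side nothing can match, so the answer is always false.
  def existsViaJoin(value: Option[Int], rightSide: Seq[Option[Int]]): Boolean =
    value.exists(v => rightSide.contains(Some(v)))

  def main(args: Array[String]): Unit = {
    println(inLegacy(None, Seq.empty))      // legacy: NULL
    println(inStandard(None, Seq.empty))    // new behavior: FALSE
    println(existsViaJoin(None, Seq.empty)) // join rewrite: no match
  }
}
```

The join-style function cannot return NULL at all, which is why a join rewrite cannot reproduce the legacy result for an empty list. The toy deliberately ignores NULLs on the right side; it only targets the empty-list case discussed in this thread.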
cloud-fan commented on code in PR #42725:
URL: https://github.com/apache/spark/pull/42725#discussion_r1353833831

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala: ##
@@ -4378,6 +4378,25 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val DECORRELATE_PREDICATE_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.decorrelatePredicateSubqueriesInJoinPredicate.enabled")
+      .internal()
+      .doc("Decorrelate predicate (in and exists) subqueries with correlated references in join " +
+        "predicates.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)
+
+  val OPTIMIZE_UNCORRELATED_IN_SUBQUERIES_IN_JOIN_CONDITION =
+    buildConf("spark.sql.optimizer.optimizeUncorrelatedInSubqueriesInJoinCondition.enabled")
+      .internal()
+      .doc("When true, optimize uncorrelated IN subqueries in join predicates by rewriting them " +
+        s"to joins. This interacts with ${LEGACY_NULL_IN_EMPTY_LIST_BEHAVIOR.key} because it " +

Review Comment:
we don't do this for EXISTS because it's already fast to evaluate non-correlated EXISTS?
cloud-fan commented on code in PR #42725:
URL: https://github.com/apache/spark/pull/42725#discussion_r1353829912

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala: ##
@@ -159,6 +160,66 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
         Project(p.output, Filter(newCond.get, inputPlan))
       }

+    // This case takes care of predicate subqueries in join conditions that are not pushed down
+    // to the children nodes by [[PushDownPredicates]].

Review Comment:
how is this guaranteed? we run this rule in a later batch after the batch containing pushdown rules?
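Editor's note: the reply to this question elsewhere in the thread is that the pushdown rules live in `operatorOptimizationBatch` and the `RewriteSubquery` batch runs after it. The sketch below is a toy model of that ordering argument only. Everything in it (`BatchOrderSketch`, `Plan`, `Batch`, `pushDown`, `rewriteSubqueryInJoin`, the string-tagged predicates) is hypothetical and is not Spark's rule executor; in particular, fixed-point iteration within a batch is omitted.

```scala
// Toy model of an optimizer that runs batches in a fixed order.
// All names are hypothetical; this is not Spark's RuleExecutor.
object BatchOrderSketch {
  // A toy plan: predicates still in the join condition, and predicates
  // that have already been pushed down to the join's children.
  final case class Plan(joinCondition: List[String], pushedToChildren: List[String])

  type Rule = Plan => Plan
  final case class Batch(name: String, rules: List[Rule])

  // Toy pushdown: predicates tagged "one-side:" reference a single child
  // and can therefore leave the join condition.
  val pushDown: Rule = plan => {
    val (movable, remaining) = plan.joinCondition.partition(_.startsWith("one-side:"))
    Plan(remaining, plan.pushedToChildren ++ movable)
  }

  // Toy subquery rewrite: marks every IN-subquery predicate it still finds
  // in the join condition.
  val rewriteSubqueryInJoin: Rule = plan =>
    plan.copy(joinCondition = plan.joinCondition.map { p =>
      if (p.contains("IN (subquery)")) s"rewritten[$p]" else p
    })

  // Batches run strictly in list order; each rule runs once per batch here.
  def optimize(batches: List[Batch], plan: Plan): Plan =
    batches.foldLeft(plan) { (current, batch) =>
      batch.rules.foldLeft(current)((p, rule) => rule(p))
    }

  val batches: List[Batch] = List(
    Batch("operator optimization (toy)", List(pushDown)),
    Batch("rewrite subquery (toy)", List(rewriteSubqueryInJoin))
  )

  def main(args: Array[String]): Unit = {
    val plan = Plan(
      joinCondition = List("one-side: a.x IN (subquery)", "a.y + b.y IN (subquery)"),
      pushedToChildren = Nil)
    println(optimize(batches, plan))
  }
}
```

Because the rewrite batch comes second, it only ever sees the predicate that pushdown could not move, which mirrors the guarantee the code comment relies on.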
andylam-db commented on PR #42725:
URL: https://github.com/apache/spark/pull/42725#issuecomment-1747794562

Pinging for reviews! @allisonwang-db @cloud-fan