[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-168112377 I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3249
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-136913305 Also cc @chenghao-intel, who wrote the similar patch #4812.
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-136913268 @ravipesala can you answer @marmbrus' question and/or rebase this to master so we can decide how to proceed with it?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-114973063 Build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-114973023 Build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-114973373 [Test build #35706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35706/consoleFull) for PR 3249 at commit [`7653eee`](https://github.com/apache/spark/commit/7653eee3d8ded847ceeb3a03e11a507b827c5066). * This patch **does not merge cleanly**.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-114999781 Build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-114999674 [Test build #35706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35706/consoleFull) for PR 3249 at commit [`7653eee`](https://github.com/apache/spark/commit/7653eee3d8ded847ceeb3a03e11a507b827c5066). * This patch **passes all tests**. * This patch **does not merge cleanly**. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression`
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-89090615 Sorry for letting this languish. What is the status here and how does this relate to #4812?
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76770343 @chenghao-intel Thank you for reviewing it. I will go through your comments and fix them. Regarding the `not in` case, we can use a `left outer join`. I will try to add it to the same PR.
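A toy model (plain Python, not Spark or Catalyst code; table names and values are invented for illustration) of the rewrite ravipesala suggests above: `T1.x NOT IN (SELECT T2.y FROM T2)` can be evaluated as a left outer join of T1 against T2 on `x = y`, keeping only the rows that found no match (an anti-join):

```python
# Toy demonstration of NOT IN via "left outer join + keep unmatched rows".
# Real SQL semantics need extra care when the right side contains NULLs;
# this sketch assumes no NULLs on either side.

t1 = [{"x": 1}, {"x": 2}, {"x": 3}]
t2 = [{"y": 2}]

def not_in_via_left_outer_join(left, right):
    """Keep each left row whose join key found no partner on the right."""
    result = []
    for row in left:
        matched = any(row["x"] == r["y"] for r in right)
        if not matched:
            result.append(row)
    return result

print(not_in_via_left_outer_join(t1, t2))  # rows with x = 1 and x = 3 survive
```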
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25552305

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -422,6 +424,108 @@ class Analyzer(catalog: Catalog,
       Generate(g, join = false, outer = false, None, child)
     }
   }
+
+  /**
+   * Transforms a query which has subquery expressions in the where clause to a left semi join.
+   * select T1.x from T1 where T1.x in (select T2.y from T2) is transformed to
+   * select T1.x from T1 left semi join T2 on T1.x = T2.y.
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = conditions.collect {
+          case In(exp, Seq(SubqueryExpression(subquery))) => (exp, subquery)
+        }
+        // Replace subqueries with a dummy true literal since they are evaluated separately now.
+        val transformedConds = conditions.transform {
+          case In(_, Seq(SubqueryExpression(_))) => Literal(true)
+        }
+        subqueryExprs match {
+          case Seq() => filter // No subqueries.
+          case Seq((exp, subquery)) =>
+            createLeftSemiJoin(child, exp, subquery, transformedConds)
+          case _ =>
+            throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        }
+    }
+
+    /**
+     * Create a LeftSemi join of the parent query with the subquery mentioned in the 'IN'
+     * predicate, and combine the subquery conditions and the parent query conditions.
+     */
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression,
+        subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery)
+      // Add both parent query conditions and subquery conditions as join conditions
+      val allPredicates = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(allPredicates))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan: add the expressions which are used as filters to the
+     * projection, and also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(
+        value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      // TODO: we only decorrelate subqueries in very specific cases like the cases mentioned
+      // above in the documentation. More complex queries, like subqueries inside subqueries,
+      // can be supported in future.
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than one item in the select list of the subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(
+              project, "SubQuery can contain only one item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions which are used as filters in the subquery to the projections
+          val toBeAddedExprs = f.references.filter { a =>
+            resolvedChild.resolve(a.name, resolver) != None && !project.outputSet.contains(a)
+          }
+          val nameToExprMap = collection.mutable.Map[String, Alias]()
+          // Create aliases for all projection expressions.
+          val witAliases = (projectList ++ toBeAddedExprs).zipWithIndex.map {
+            case (exp, index) =>
+              nameToExprMap.put(exp.name, Alias(exp, s"ssqc$index")())
+              Alias(exp, s"ssqc$index")()
+          }
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {
+            case a: Attribute if resolvedChild.resolve(a.name, resolver) != None =>
+              nameToExprMap.get(a.name).get.toAttribute
+          }
+          // Join the first projection column of the subquery to the main query and add it as a
+          // condition. TODO: we can avoid this if the parent condition already has it.
+          expr += EqualTo(value, witAliases(0).toAttribute)
+          expr += transformedConds

--- End diff --

Connecting the subquery with the join condition doesn't make sense to me, as we will transform the whole logical plan as
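A toy model (plain Python on made-up data, not the Catalyst code above) of the core equivalence the rule relies on: `SELECT T1.x FROM T1 WHERE T1.x IN (SELECT T2.y FROM T2)` keeps exactly the rows that a left semi join of T1 against T2 on `T1.x = T2.y` keeps:

```python
# Toy demonstration that an IN-subquery predicate and a left semi join
# select the same rows, each outer row emitted at most once.

t1 = [{"x": 1}, {"x": 2}, {"x": 3}]
t2 = [{"y": 2}, {"y": 3}, {"y": 3}]

def left_semi_join(left, right, key_l, key_r):
    """Emit each left row at most once if any right row matches on the key."""
    return [row for row in left if any(row[key_l] == r[key_r] for r in right)]

in_result = [row for row in t1 if row["x"] in {r["y"] for r in t2}]
semi_result = left_semi_join(t1, t2, "x", "y")
assert in_result == semi_result  # both keep x = 2 and x = 3, once each
```

Note that the duplicate `y = 3` on the right does not duplicate the output row, which is what distinguishes the semi join from an inner join here.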
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76503896 Thank you @ravipesala for implementing this; however, this PR probably involves some unnecessary join condition transformation. You probably need to understand the rules for pushing down join filters / conditions first. Sorry, please correct me if I misunderstood something.
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25552219

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -422,6 +424,108 @@ class Analyzer(catalog: Catalog,
       Generate(g, join = false, outer = false, None, child)
     }
   }
+
+  /**
+   * Transforms a query which has subquery expressions in the where clause to a left semi join.
+   * select T1.x from T1 where T1.x in (select T2.y from T2) is transformed to
+   * select T1.x from T1 left semi join T2 on T1.x = T2.y.
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = conditions.collect {
+          case In(exp, Seq(SubqueryExpression(subquery))) => (exp, subquery)
+        }
+        // Replace subqueries with a dummy true literal since they are evaluated separately now.
+        val transformedConds = conditions.transform {
+          case In(_, Seq(SubqueryExpression(_))) => Literal(true)
+        }
+        subqueryExprs match {
+          case Seq() => filter // No subqueries.
+          case Seq((exp, subquery)) =>
+            createLeftSemiJoin(child, exp, subquery, transformedConds)
+          case _ =>
+            throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        }
+    }
+
+    /**
+     * Create a LeftSemi join of the parent query with the subquery mentioned in the 'IN'
+     * predicate, and combine the subquery conditions and the parent query conditions.
+     */
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression,
+        subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery)
+      // Add both parent query conditions and subquery conditions as join conditions
+      val allPredicates = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(allPredicates))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan: add the expressions which are used as filters to the
+     * projection, and also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(
+        value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      // TODO: we only decorrelate subqueries in very specific cases like the cases mentioned
+      // above in the documentation. More complex queries, like subqueries inside subqueries,
+      // can be supported in future.
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than one item in the select list of the subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(
+              project, "SubQuery can contain only one item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions which are used as filters in the subquery to the projections
+          val toBeAddedExprs = f.references.filter { a =>
+            resolvedChild.resolve(a.name, resolver) != None && !project.outputSet.contains(a)
+          }
+          val nameToExprMap = collection.mutable.Map[String, Alias]()
+          // Create aliases for all projection expressions.
+          val witAliases = (projectList ++ toBeAddedExprs).zipWithIndex.map {
+            case (exp, index) =>
+              nameToExprMap.put(exp.name, Alias(exp, s"ssqc$index")())
+              Alias(exp, s"ssqc$index")()
+          }
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {

--- End diff --

I am not so sure why you care about the subquery condition, as the Hive wiki says:
```
As of Hive 0.13 some types of subqueries are supported in the WHERE clause. Those are queries where the result of the query can be treated as a constant for IN and NOT IN statements (called uncorrelated subqueries because the subquery does not reference columns from the parent query):
```
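A toy illustration (plain Python on invented data) of the distinction the Hive wiki quote draws: an uncorrelated subquery references no columns of the outer query, so its result can be computed once and treated as a constant set for IN / NOT IN:

```python
# Uncorrelated subquery: the inner query is evaluated once, up front, and
# its result acts as a constant set for the whole outer scan.

orders = [{"id": 1, "cust": 10}, {"id": 2, "cust": 20}]
vip = [{"cust": 10}]

vip_set = {r["cust"] for r in vip}            # computed once, no outer columns
uncorrelated = [o for o in orders if o["cust"] in vip_set]

# A correlated subquery would instead re-evaluate the inner query per outer
# row, referencing outer columns -- the harder case not targeted here.
assert uncorrelated == [{"id": 1, "cust": 10}]
```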
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25552386

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -422,6 +424,108 @@ class Analyzer(catalog: Catalog,
       Generate(g, join = false, outer = false, None, child)
     }
   }
+
+  /**
+   * Transforms a query which has subquery expressions in the where clause to a left semi join.
+   * select T1.x from T1 where T1.x in (select T2.y from T2) is transformed to
+   * select T1.x from T1 left semi join T2 on T1.x = T2.y.
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = conditions.collect {
+          case In(exp, Seq(SubqueryExpression(subquery))) => (exp, subquery)
+        }
+        // Replace subqueries with a dummy true literal since they are evaluated separately now.
+        val transformedConds = conditions.transform {
+          case In(_, Seq(SubqueryExpression(_))) => Literal(true)
+        }
+        subqueryExprs match {
+          case Seq() => filter // No subqueries.
+          case Seq((exp, subquery)) =>
+            createLeftSemiJoin(child, exp, subquery, transformedConds)
+          case _ =>
+            throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        }
+    }
+
+    /**
+     * Create a LeftSemi join of the parent query with the subquery mentioned in the 'IN'
+     * predicate, and combine the subquery conditions and the parent query conditions.
+     */
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression,
+        subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery)
+      // Add both parent query conditions and subquery conditions as join conditions
+      val allPredicates = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(allPredicates))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan: add the expressions which are used as filters to the
+     * projection, and also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(
+        value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      // TODO: we only decorrelate subqueries in very specific cases like the cases mentioned
+      // above in the documentation. More complex queries, like subqueries inside subqueries,
+      // can be supported in future.
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than one item in the select list of the subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(
+              project, "SubQuery can contain only one item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions which are used as filters in the subquery to the projections
+          val toBeAddedExprs = f.references.filter { a =>
+            resolvedChild.resolve(a.name, resolver) != None && !project.outputSet.contains(a)
+          }
+          val nameToExprMap = collection.mutable.Map[String, Alias]()
+          // Create aliases for all projection expressions.
+          val witAliases = (projectList ++ toBeAddedExprs).zipWithIndex.map {
+            case (exp, index) =>
+              nameToExprMap.put(exp.name, Alias(exp, s"ssqc$index")())
+              Alias(exp, s"ssqc$index")()
+          }
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {
+            case a: Attribute if resolvedChild.resolve(a.name, resolver) != None =>
+              nameToExprMap.get(a.name).get.toAttribute
+          }
+          // Join the first projection column of the subquery to the main query and add it as a
+          // condition. TODO: we can avoid this if the parent condition already has it.
+          expr += EqualTo(value, witAliases(0).toAttribute)
+          expr += transformedConds

--- End diff --
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76455136 [Test build #28083 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28083/consoleFull) for PR 3249 at commit [`7653eee`](https://github.com/apache/spark/commit/7653eee3d8ded847ceeb3a03e11a507b827c5066). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76467964 [Test build #28083 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28083/consoleFull) for PR 3249 at commit [`7653eee`](https://github.com/apache/spark/commit/7653eee3d8ded847ceeb3a03e11a507b827c5066). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression` * `logError("User class threw exception: " + cause.getMessage, cause)`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76467971 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28083/
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76358262 @ravipesala I've created another PR for `[NOT] EXISTS`. I believe `IN` can be transformed into `EXISTS`, and the implementation can be much simpler; can you review the code for me? Please see #4812
Github user chenghao-intel commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-76501277 @ravipesala, do you have any idea how to implement `NOT IN`? I believe we should consider how to implement `NOT IN` while doing `IN`, or should they come within the same PR? BTW, can you also enable the Hive compatibility suites like `subquery_in.q` or `subquery_in_having.q` if you think they are also supported by this PR?
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25551625

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+
+/**
+ * Evaluates whether the `subquery` result contains `value`.
+ * For example: 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)'
+ * @param subquery In the above example 'SELECT b.key FROM src b' is the `subquery`
+ */
+case class SubqueryExpression(subquery: LogicalPlan) extends Expression {

--- End diff --

Instead of making the subquery a fake expression, a better idea is probably to create a new logical plan like
```
SubQueryIn(left: LogicalPlan, nested: LogicalPlan, isNotIn: Boolean)
```
That's also how I implement the `EXISTS` at https://github.com/apache/spark/pull/4812/files#diff-9a11e98e8f4bd1c4bb18ca6a7a7b8948R262
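A hedged sketch (plain Python dataclasses, not actual Catalyst classes) of the shape chenghao-intel proposes above: represent the IN-subquery as a dedicated binary logical plan node rather than as an expression wrapping a plan, so resolution checks can see both children:

```python
# Toy model of the proposed node shape; class and field names follow the
# suggestion in the comment, everything else is invented for illustration.
from dataclasses import dataclass

@dataclass
class LogicalPlan:
    name: str           # stand-in for a real plan tree

@dataclass
class SubQueryIn:
    """Binary node: 'left' is the outer query, 'nested' the IN-subquery."""
    left: LogicalPlan
    nested: LogicalPlan
    is_not_in: bool

node = SubQueryIn(LogicalPlan("T1 scan"), LogicalPlan("T2 scan"), is_not_in=False)
assert node.nested.name == "T2 scan"
```

The design point is that a plan-level node exposes the subquery as an ordinary child, whereas a fake expression hides it from plan-level machinery such as `childrenResolved`.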
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25551740

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -422,6 +424,108 @@ class Analyzer(catalog: Catalog,
       Generate(g, join = false, outer = false, None, child)
     }
   }
+
+  /**
+   * Transforms a query which has subquery expressions in the where clause to a left semi join.
+   * select T1.x from T1 where T1.x in (select T2.y from T2) is transformed to
+   * select T1.x from T1 left semi join T2 on T1.x = T2.y.
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>

--- End diff --

We are not going to handle the non-`Subquery` case here, right? How about
```
case filter @ Filter(In(expr, SubqueryExpression(subquery)), child) =>
```
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user chenghao-intel commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r25551872 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -422,6 +424,108 @@ class Analyzer(catalog: Catalog, Generate(g, join = false, outer = false, None, child) } } + + /** + * Transforms the query which has subquery expressions in where clause to left semi join. + * select T1.x from T1 where T1.x in (select T2.y from T2) transformed to + * select T1.x from T1 left semi join T2 on T1.x = T2.y. + */ + object SubQueryExpressions extends Rule[LogicalPlan] { + +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case p: LogicalPlan if !p.childrenResolved = p + case filter @ Filter(conditions, child) = +val subqueryExprs = conditions.collect { + case In(exp, Seq(SubqueryExpression(subquery))) = (exp, subquery) +} +// Replace subqueries with a dummy true literal since they are evaluated separately now. +val transformedConds = conditions.transform { + case In(_, Seq(SubqueryExpression(_))) = Literal(true) +} +subqueryExprs match { + case Seq() = filter // No subqueries. + case Seq((exp, subquery)) = +createLeftSemiJoin( + child, + exp, + subquery, + transformedConds) + case _ = +throw new TreeNodeException(filter, Only one SubQuery expression is supported.) +} +} + +/** + * Create LeftSemi join with parent query to the subquery which is mentioned in 'IN' predicate + * And combine the subquery conditions and parent query conditions. 
+ */ +def createLeftSemiJoin(left: LogicalPlan, +value: Expression, +subquery: LogicalPlan, +parentConds: Expression) : LogicalPlan = { + val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery) + // Add both parent query conditions and subquery conditions as join conditions + val allPredicates = And(parentConds, subqueryConds) + Join(left, transformedPlan, LeftSemi, Some(allPredicates)) +} + +/** + * Transform the subquery LogicalPlan and add the expressions which are used as filters to the + * projection. And also return filter conditions used in subquery + */ +def transformAndGetConditions(value: Expression, + subquery: LogicalPlan): (LogicalPlan, Expression) = { + val expr = new scala.collection.mutable.ArrayBuffer[Expression]() + // TODO : we only decorelate subqueries in very specific cases like the cases mentioned above + // in documentation. The more complex queries like using of subqueries inside subqueries can + // be supported in future. + val transformedPlan = subquery transform { +case project @ Project(projectList, f @ Filter(condition, child)) = + // Don't support more than one item in select list of subquery + if(projectList.size 1) { +throw new TreeNodeException( +project, +SubQuery can contain only one item in Select List) + } + val resolvedChild = ResolveRelations(child) --- End diff -- Why do you resolve the relation here? I believe the subquery should be resolved already before entering the rule of `SubQueryExpressions`, is that the limitation of using the `SubQueryExpression`, which can not check if children logical plan if it's resolved? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
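The rule quoted above rewrites an IN-subquery predicate into a LEFT SEMI JOIN. A minimal Python sketch (not Catalyst code; the table and column names are made up for illustration) of why that rewrite preserves semantics — a semi join emits each left row at most once, with only left-side columns, exactly like an IN membership test:

```python
def in_subquery(t1, t2):
    # SELECT T1.x FROM T1 WHERE T1.x IN (SELECT T2.y FROM T2)
    ys = [r["y"] for r in t2]
    return [r["x"] for r in t1 if r["x"] in ys]

def left_semi_join(t1, t2, cond):
    # SELECT T1.x FROM T1 LEFT SEMI JOIN T2 ON cond(left, right):
    # emit each left row at most once if ANY right row satisfies the condition;
    # only left-side columns survive the join.
    return [left["x"] for left in t1 if any(cond(left, right) for right in t2)]

t1 = [{"x": 1}, {"x": 2}, {"x": 3}]
t2 = [{"y": 2}, {"y": 3}, {"y": 3}, {"y": 5}]  # duplicate y=3 checks "at most once"

assert in_subquery(t1, t2) == left_semi_join(t1, t2, lambda l, r: l["x"] == r["y"])
```

The duplicate `y = 3` row is the interesting case: an inner join would emit `x = 3` twice, while both the IN form and the semi join emit it once — which is why the rule must use `LeftSemi` rather than a plain `Join`.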
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74372645 [Test build #27482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27482/consoleFull) for PR 3249 at commit [`9e404c0`](https://github.com/apache/spark/commit/9e404c053a9c43f5d5d6b2b696dd3cbd4f2c7c8f). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74373266 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27483/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74373265 [Test build #27483 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27483/consoleFull) for PR 3249 at commit [`e70bc79`](https://github.com/apache/spark/commit/e70bc798337fbff98e2b7389d7143453c37b7054). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74372730 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27482/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74372729 [Test build #27482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27482/consoleFull) for PR 3249 at commit [`9e404c0`](https://github.com/apache/spark/commit/9e404c053a9c43f5d5d6b2b696dd3cbd4f2c7c8f). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74373203 [Test build #27483 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27483/consoleFull) for PR 3249 at commit [`e70bc79`](https://github.com/apache/spark/commit/e70bc798337fbff98e2b7389d7143453c37b7054). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74378694 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27486/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74378691 [Test build #27486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27486/consoleFull) for PR 3249 at commit [`dbd0b87`](https://github.com/apache/spark/commit/dbd0b873acc6cb928140b3044ac3b989e9e54240). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74378995 [Test build #27487 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27487/consoleFull) for PR 3249 at commit [`036f3c6`](https://github.com/apache/spark/commit/036f3c60bc1e190fddf3c4593b6b6f8f28e5d62b). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74378280 [Test build #27486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27486/consoleFull) for PR 3249 at commit [`dbd0b87`](https://github.com/apache/spark/commit/dbd0b873acc6cb928140b3044ac3b989e9e54240). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74377584 [Test build #27484 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27484/consoleFull) for PR 3249 at commit [`0a90b21`](https://github.com/apache/spark/commit/0a90b218b37ab3618bc074956c0b9165a1cb5ce2). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74377896 [Test build #27484 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27484/consoleFull) for PR 3249 at commit [`0a90b21`](https://github.com/apache/spark/commit/0a90b218b37ab3618bc074956c0b9165a1cb5ce2). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74377898 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27484/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74381608 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27487/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-74381599 [Test build #27487 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27487/consoleFull) for PR 3249 at commit [`036f3c6`](https://github.com/apache/spark/commit/036f3c60bc1e190fddf3c4593b6b6f8f28e5d62b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73280560 @marmbrus Please check whether it is ok.
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73187723 LGTM.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73187137 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26897/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73187135 [Test build #26897 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26897/consoleFull) for PR 3249 at commit [`eb0a13b`](https://github.com/apache/spark/commit/eb0a13b2d2fb87b04899c05b62ce82c237dff750). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73182749 [Test build #26897 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26897/consoleFull) for PR 3249 at commit [`eb0a13b`](https://github.com/apache/spark/commit/eb0a13b2d2fb87b04899c05b62ce82c237dff750). * This patch merges cleanly.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73100763 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26847/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73100753 [Test build #26847 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26847/consoleFull) for PR 3249 at commit [`e49452b`](https://github.com/apache/spark/commit/e49452be83f0f7f1ca2cf252348e03d454137ed0). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-73098157 [Test build #26847 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26847/consoleFull) for PR 3249 at commit [`e49452b`](https://github.com/apache/spark/commit/e49452be83f0f7f1ca2cf252348e03d454137ed0). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70603746 [Test build #25788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25788/consoleFull) for PR 3249 at commit [`0347f02`](https://github.com/apache/spark/commit/0347f027cef940d987cd079357059c7695e9400f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70603747 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25788/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70599526 [Test build #25788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25788/consoleFull) for PR 3249 at commit [`0347f02`](https://github.com/apache/spark/commit/0347f027cef940d987cd079357059c7695e9400f). * This patch merges cleanly.
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70528755 @marmbrus Please review it.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70356119 [Test build #25697 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25697/consoleFull) for PR 3249 at commit [`86e98d5`](https://github.com/apache/spark/commit/86e98d5484d69167d1de7b1bc264016989924a59). * This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70357598 [Test build #25697 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25697/consoleFull) for PR 3249 at commit [`86e98d5`](https://github.com/apache/spark/commit/86e98d5484d69167d1de7b1bc264016989924a59). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-70357600 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25697/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-68677425 [Test build #25049 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25049/consoleFull) for PR 3249 at commit [`09e471a`](https://github.com/apache/spark/commit/09e471aa670298bcb1e8e135d260da54a11de88e). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-68677427 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25049/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-68673730 [Test build #25049 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/25049/consoleFull) for PR 3249 at commit [`09e471a`](https://github.com/apache/spark/commit/09e471aa670298bcb1e8e135d260da54a11de88e). * This patch merges cleanly.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22448850

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
         Generate(g, join = false, outer = false, None, child)
       }
     }
+
+  /**
+   * Transforms a query that has subquery expressions in its where clause into join queries.
+   * Case 1: Uncorrelated queries
+   *   -- original query
+   *   select C from R1 where R1.A in (select B from R2)
+   *   -- rewritten query
+   *   select C from R1 left semi join (select B as sqc0 from R2) subquery on R1.A = subquery.sqc0
+   *
+   * Case 2: Correlated queries
+   *   -- original query
+   *   select C from R1 where R1.A in (select B from R2 where R1.X = R2.Y)
+   *   -- rewritten query
+   *   select C from R1 left semi join (select B as sqc0, R2.Y as sqc1 from R2) subquery
+   *   on R1.X = subquery.sqc1 and R1.A = subquery.sqc0
+   *
+   * Refer: https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = new scala.collection.mutable.ArrayBuffer[In]()
+        val nonSubQueryConds = new scala.collection.mutable.ArrayBuffer[Expression]()
+        conditions.collect {
+          case s @ In(exp, Seq(SubqueryExpression(subquery))) =>
+            subqueryExprs += s
+        }
+        val transformedConds = conditions.transform {
+          // Replace with a dummy
+          case s @ In(exp, Seq(SubqueryExpression(subquery))) =>
+            Literal(true)
+        }
+        if (subqueryExprs.size == 1) {
+          val subqueryExpr = subqueryExprs.remove(0)
+          createLeftSemiJoin(
+            child,
+            subqueryExpr.value,
+            subqueryExpr.list(0).asInstanceOf[SubqueryExpression].subquery,
+            transformedConds)
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        } else {
+          filter
+        }
+    }
+
+    /**
+     * Create a LeftSemi join of the parent query with the subquery mentioned in the 'IN'
+     * predicate, and combine the subquery conditions with the parent query conditions.
+     */
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression,
+        subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery)
+      // Combine the parent query conditions and subquery conditions into the join conditions
+      val unifyConds = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(unifyConds))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan, adding the expressions that are used as filters to the
+     * projection. Also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(
+        value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than one item in the select list of the subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(
+              project,
+              "SubQuery can contain only one item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions that are used as filters in the subquery to the projections
+          val toBeAddedExprs = f.references.filter { a =>
+            resolvedChild.resolve(a.name, resolver) != None && !project.outputSet.contains(a)
+          }
+          val nameToExprMap = collection.mutable.Map[String, Alias]()
+          // Create aliases for all projection expressions.
+          val withAliases = (projectList ++ toBeAddedExprs).zipWithIndex.map {
+            case (exp, index) =>
+              nameToExprMap.put(exp.name, Alias(exp, s"sqc$index")())
+              Alias(exp, s"sqc$index")()
+          }
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {
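The rewrite that the hunk above implements — evaluating `R1.A in (select B from R2)` as a left semi join — can be mimicked on plain Scala collections. This is an illustrative sketch with toy data, not Catalyst's actual operators:

```scala
object SemiJoinDemo {
  // A "left semi join" keeps each left row that has at least one match on the
  // right side — exactly the semantics of `where R1.A in (select B from R2)`.
  def leftSemiJoin[A, B](left: Seq[A], right: Seq[B])(cond: (A, B) => Boolean): Seq[A] =
    left.filter(l => right.exists(r => cond(l, r)))

  def main(args: Array[String]): Unit = {
    val r1 = Seq(("a", 1), ("b", 2), ("c", 3)) // rows of R1 as (C, A)
    val r2 = Seq(2, 3, 4)                      // B values produced by the subquery
    // select C from R1 where R1.A in (select B from R2)
    println(leftSemiJoin(r1, r2)((l, r) => l._2 == r).map(_._1)) // List(b, c)
  }
}
```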
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22448848

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+
+/**
+ * Evaluates whether the `subquery` result contains `value`.
+ * For example: 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)'
+ * @param subquery In the above example, 'SELECT b.key FROM src b' is the `subquery`
+ */
+case class SubqueryExpression(subquery: LogicalPlan) extends Expression {
+
+  type EvaluatedType = Any
+  def dataType = subquery.output.head.dataType
+  override def foldable = false
+  def nullable = true
+  override def toString = s"SubqueryExpression($subquery)"

--- End diff --

Ok. I will change it.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well.
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-68673618

Thank you for reviewing it. I fixed the review comments and added a TODO for future expansion to more complex queries.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22448842

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        } else {
+          filter
+        }

--- End diff --

Thanks for the code snippet. It looks good; I will add it like this.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22448876

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk and helper methods as quoted in the first diff above, up to:]
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>

--- End diff --

Yes, this type of query cannot be evaluated in the present code. Maybe we can expand it in the future; I will add the TODO. Hive also does not seem to work for this type of query.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22363285

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+        conditions.collect {

--- End diff --

Collect returns the matching expressions:

```scala
val subqueryExprs = conditions.collect {
  case s @ In(exp, Seq(SubqueryExpression(subquery))) => s
}
```
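The behaviour marmbrus describes — `collect` returning the matching nodes directly, so no mutable `ArrayBuffer` is needed — can be seen with plain Scala collections. A sketch with a toy `Expr` hierarchy standing in for Catalyst's expression classes:

```scala
// Toy stand-ins for Catalyst expressions, for illustration only.
sealed trait Expr
case class In(value: String, list: Seq[Expr]) extends Expr
case class Subquery(sql: String) extends Expr
case class EqualTo(l: String, r: String) extends Expr

object CollectDemo {
  def main(args: Array[String]): Unit = {
    val conditions: Seq[Expr] = Seq(
      EqualTo("a.key", "b.key"),
      In("a.key", Seq(Subquery("SELECT b.key FROM src b"))))

    // collect filters and gathers in one pass: only the In(...) nodes
    // that wrap a subquery are returned, as an immutable Seq.
    val subqueryExprs = conditions.collect {
      case s @ In(_, Seq(Subquery(_))) => s
    }
    println(subqueryExprs.size) // 1
  }
}
```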
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22363305

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+        val nonSubQueryConds = new scala.collection.mutable.ArrayBuffer[Expression]()

--- End diff --

This is unused?
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22363362

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+          case s @ In(exp, Seq(SubqueryExpression(subquery))) =>

--- End diff --

Also I'd consider extracting to a tuple `(exp, subquery)` here to avoid the weirdness of `.list(0)` below.
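The tuple extraction marmbrus suggests binds both parts in the pattern itself, so the later `.list(0).asInstanceOf[...]` cast disappears. A minimal sketch with hypothetical stand-in types, not Catalyst's real classes:

```scala
sealed trait Expr
case class Subquery(sql: String) extends Expr
case class In(value: String, list: Seq[Expr]) extends Expr

object ExtractDemo {
  def main(args: Array[String]): Unit = {
    val cond: Expr = In("a.key", Seq(Subquery("SELECT b.key FROM src b")))
    // Bind (value, subquery) directly in the pattern instead of keeping
    // the whole In node and reaching into .list(0) with a cast afterwards.
    val extracted = Seq(cond).collect {
      case In(exp, Seq(Subquery(q))) => (exp, q)
    }
    println(extracted) // List((a.key,SELECT b.key FROM src b))
  }
}
```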
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22363511

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+      // Combine the parent query conditions and subquery conditions into the join conditions
+      val unifyConds = And(parentConds, subqueryConds)

--- End diff --

This is a bit of a nit, but unify has [a pretty specific meaning](http://en.wikipedia.org/wiki/Unification_%28computer_science%29) when talking about predicates, and I don't think that is what is happening here. Maybe `allPredicates` or something?
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22365235

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk and helper methods as quoted in the first diff above, up to:]
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>

--- End diff --

This works for the two queries you specified but is going to fail as soon as things get even a little more complicated. For example:

```sql
SELECT a.key FROM src a
WHERE a.key in (SELECT b FROM
  (SELECT b.key FROM src b WHERE b.key in (230) and a.value = b.value) a)
```
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22365363

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk and helper methods as quoted in the first diff above, ending at:]
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22365419

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
[same `SubQueryExpressions` hunk as quoted in the first diff above, up to:]
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        } else {
+          filter
+        }

--- End diff --

Maybe simpler as:

```scala
val subqueryExprs = conditions.collect {
  case In(exp, Seq(SubqueryExpression(subquery))) => (exp, subquery)
}

// Replace subqueries with a dummy true literal since they are evaluated separately now.
val transformedConds = conditions.transform {
  case In(_, Seq(SubqueryExpression(_))) => Literal(true)
}

subqueryExprs match {
  case Seq() => filter  // No subqueries.
  case Seq((exp, subquery)) =>
    createLeftSemiJoin(child, exp, subquery, transformedConds)
  case _ =>
    throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
}
```
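The suggested rewrite hinges on matching the arity of the collected sequence with `Seq` patterns: empty, exactly one, or more. The dispatch behaves like this (toy values, not Catalyst types):

```scala
object ArityDemo {
  def describe(xs: Seq[Int]): String = xs match {
    case Seq()  => "no subqueries"        // keep the original filter unchanged
    case Seq(_) => "rewrite to semi join" // exactly one subquery: decorrelate it
    case _      => "unsupported"          // more than one: raise an analysis error
  }

  def main(args: Array[String]): Unit = {
    println(describe(Seq()))     // no subqueries
    println(describe(Seq(1)))    // rewrite to semi join
    println(describe(Seq(1, 2))) // unsupported
  }
}
```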
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22365551

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -414,6 +418,123 @@ class Analyzer(catalog: Catalog,
         Generate(g, join = false, outer = false, None, child)
       }
   }
+
+  /**
+   * Transforms a query that has subquery expressions in its where clause into join queries.
+   * Case 1: Uncorrelated queries
+   * -- original query
+   * select C from R1 where R1.A in (select B from R2)
+   * -- rewritten query
+   * select C from R1 left semi join (select B as sqc0 from R2) subquery on R1.A = subquery.sqc0
+   *
+   * Case 2: Correlated queries
+   * -- original query
+   * select C from R1 where R1.A in (select B from R2 where R1.X = R2.Y)
+   * -- rewritten query
+   * select C from R1 left semi join (select B as sqc0, R2.Y as sqc1 from R2) subquery
+   * on R1.X = subquery.sqc1 and R1.A = subquery.sqc0
+   *
+   * Refer: https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case p: LogicalPlan if !p.childrenResolved => p
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = new scala.collection.mutable.ArrayBuffer[In]()
+        val nonSubQueryConds = new scala.collection.mutable.ArrayBuffer[Expression]()
+        conditions.collect {
+          case s @ In(exp, Seq(SubqueryExpression(subquery))) =>
+            subqueryExprs += s
+        }
+        val transformedConds = conditions.transform {
+          // Replace with dummy
+          case s @ In(exp, Seq(SubqueryExpression(subquery))) =>
+            Literal(true)
+        }
+        if (subqueryExprs.size == 1) {
+          val subqueryExpr = subqueryExprs.remove(0)
+          createLeftSemiJoin(
+            child,
+            subqueryExpr.value,
+            subqueryExpr.list(0).asInstanceOf[SubqueryExpression].subquery,
+            transformedConds)
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only one SubQuery expression is supported.")
+        } else {
+          filter
+        }
+    }
+
+    /**
+     * Create a LeftSemi join between the parent query and the subquery referenced in the
+     * 'IN' predicate, and combine the subquery conditions with the parent query conditions.
+     */
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression,
+        subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(value, subquery)
+      // Unify the parent query conditions and subquery conditions and add these as join conditions
+      val unifyConds = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(unifyConds))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan, adding the expressions that are used as filters to the
+     * projection, and also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(
+        value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
--- End diff --

Okay, so this is a pretty hard problem. Perhaps it's okay to add these simple cases and then try to expand support in the future. Out of curiosity, does Hive support the query above? We should at least leave a TODO here that says we only decorrelate subqueries in very specific cases.
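The rewrite described in the rule's scaladoc can be checked on toy data. The sketch below is plain Python over lists of dicts (not Spark or Catalyst code; the table contents are invented) showing that an IN-subquery filter and the left-semi-join form it is rewritten to produce the same rows, for both the uncorrelated and the correlated case:

```python
# Toy relations; the rows are made up for illustration only.
R1 = [{"A": 1, "C": "x", "X": 10},
      {"A": 2, "C": "y", "X": 20},
      {"A": 3, "C": "z", "X": 30}]
R2 = [{"B": 1, "Y": 10}, {"B": 3, "Y": 99}]

def in_subquery_uncorrelated(r1, r2):
    # select C from R1 where R1.A in (select B from R2)
    return [row["C"] for row in r1 if any(row["A"] == s["B"] for s in r2)]

def left_semi_join_uncorrelated(r1, r2):
    # select C from R1 left semi join (select B as sqc0 from R2) subquery
    #   on R1.A = subquery.sqc0
    sub = [{"sqc0": s["B"]} for s in r2]
    return [row["C"] for row in r1 if any(row["A"] == s["sqc0"] for s in sub)]

def in_subquery_correlated(r1, r2):
    # select C from R1 where R1.A in (select B from R2 where R1.X = R2.Y)
    return [row["C"] for row in r1
            if any(row["A"] == s["B"] and row["X"] == s["Y"] for s in r2)]

def left_semi_join_correlated(r1, r2):
    # The correlated filter column R2.Y is pulled into the subquery's
    # projection (as sqc1) so it can become a join condition.
    sub = [{"sqc0": s["B"], "sqc1": s["Y"]} for s in r2]
    return [row["C"] for row in r1
            if any(row["A"] == s["sqc0"] and row["X"] == s["sqc1"] for s in sub)]
```

The correlated case shows why R2.Y must be added to the subquery's projection: after the rewrite it has to be visible to the outer query as a join key.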
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22365598

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+
+/**
+ * Evaluates whether `subquery` result contains `value`.
+ * For example: 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)'
+ * @param subquery In the above example 'SELECT b.key FROM src b' is 'subquery'
+ */
+case class SubqueryExpression(subquery: LogicalPlan) extends Expression {
+
+  type EvaluatedType = Any
+  def dataType = subquery.output.head.dataType
+  override def foldable = false
+  def nullable = true
+  override def toString = s"SubqueryExpression($subquery)"
--- End diff --

This doesn't print nicely since `subquery` has newlines in it. Perhaps just `${subquery.output.mkString(",")}`.
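The printing problem raised above can be seen in a minimal plain-Python illustration (the plan string and column name below are invented, not real Catalyst output): embedding the whole multi-line plan in toString yields ragged output, while joining only the output column names keeps it on one line, as the review suggests.

```python
# A made-up multi-line plan string, standing in for a LogicalPlan's toString.
plan = "Project [b#1]\nFilter (key#0 > 10)\nRelation src"
output_columns = ["b#1"]  # hypothetical output attributes of the plan

# Embedding the plan directly: the result spans several lines.
bad = f"SubqueryExpression({plan})"

# Joining only the output column names: a single tidy line.
good = f"SubqueryExpression({','.join(output_columns)})"
```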
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-68400682

Okay, I think this is getting closer. Thanks for working on this. As I commented, I'm worried that this only handles a few very specific cases. However, since this is pretty clean and isolated, perhaps that is okay and we can expand it in the future.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148838

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+        val transformedConds = conditions.transform {
--- End diff --

Ok. Done in two steps.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148840

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only 1 SubQuery expression is supported.")
--- End diff --

Ok
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148841

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+        if (subqueryExprs.size == 1) {
+          val subqueryExpr = subqueryExprs.remove(0)
+          createLeftSemiJoin(
+            child, subqueryExpr.value,
+            subqueryExpr.subquery, transformedConds)
--- End diff --

Ok
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148845

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+    def createLeftSemiJoin(
+        left: LogicalPlan,
+        value: Expression, subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(
+        value, subquery)
--- End diff --

Ok.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-6306

[Test build #24684 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24684/consoleFull) for PR 3249 at commit [`8cae35b`](https://github.com/apache/spark/commit/8cae35bc70094011d9e62f465cdae87aacbf9240).
* This patch merges cleanly.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148859

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than 1 item in the select list of a subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(project, "SubQuery can contain only 1 item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
--- End diff --

Here guarding may not work because SubqueryExpression does not resolve against the main query, so I guess we need to resolve it on an as-needed basis.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r2214

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions to the projections which are used as filters in subquery
+          val toBeAddedExprs = f.references.filter(
+            a => resolvedChild.resolve(a.name, resolver) != None && !projectList.contains(a))
--- End diff --

Here guarding may not work because SubqueryExpression does not resolve against the main query, so I guess we need to resolve it on an as-needed basis. And I used ```project.outputSet``` to filter.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148896

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@
[... quoted SubQueryExpressions hunk trimmed ...]
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than 1 item in the select list of a subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(project, "SubQuery can contain only 1 item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions to the projections which are used as filters in subquery
+          val toBeAddedExprs = f.references.filter(
+            a => resolvedChild.resolve(a.name, resolver) != None && !projectList.contains(a))
+          val cache = collection.mutable.Map[String, String]()
+          // Create aliases for all projection expressions.
+          val withAliases = (projectList ++ toBeAddedExprs).zipWithIndex.map {
+            case (exp, index) =>
+              cache.put(exp.name, s"sqc$index")
+              Alias(exp, s"sqc$index")()
+          }
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform {
+            case a: Attribute if resolvedChild.resolve(a.name, resolver) != None =>
+              UnresolvedAttribute("subquery." +
+                cache.get(a.name).get)
--- End diff --

Ok. Added ```Alias``` in all places and removed
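The aliasing scheme discussed in this hunk (a fresh `sqc<i>` name for the select-list item plus any correlated filter column, with the filter rewritten against `subquery.sqc<i>`) can be sketched as follows. This is a hypothetical plain-Python illustration, not the Catalyst implementation; columns and expressions are passed as plain strings:

```python
def rewrite_correlated_in_subquery(select_list, correlated_filter):
    """select_list: subquery select-list columns, e.g. ["B"].
    correlated_filter: (outer_expr, subquery_column) pairs, e.g. [("R1.X", "Y")]."""
    # Every projected column, plus any filter column not already projected,
    # gets a fresh alias sqc<i>.
    needed = list(select_list) + [c for _, c in correlated_filter
                                  if c not in select_list]
    cache = {col: f"sqc{i}" for i, col in enumerate(needed)}
    projection = [f"{col} as {cache[col]}" for col in needed]
    # The correlated filter becomes a join condition on the aliased columns.
    join_conds = [f"{outer} = subquery.{cache[col]}"
                  for outer, col in correlated_filter]
    return projection, join_conds
```

For Case 2 in the rule's scaladoc this yields the projection `["B as sqc0", "Y as sqc1"]` and the join condition `["R1.X = subquery.sqc1"]`; the IN predicate itself contributes `R1.A = subquery.sqc0` separately.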
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-6608

[Test build #24685 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24685/consoleFull) for PR 3249 at commit [`11a238d`](https://github.com/apache/spark/commit/11a238d06320940f0278eae621df36d05436970f).
* This patch merges cleanly.
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148903

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---
@@ -0,0 +1,40 @@
[... Apache license header and imports trimmed ...]
+/**
+ * Evaluates whether `subquery` result contains `value`.
+ * For example: 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)'
+ * @param value In the above example 'a.key' is 'value'
+ * @param subquery In the above example 'SELECT b.key FROM src b' is 'subquery'
+ */
+case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression {
--- End diff --

Ok. Added like ```In('a.key, SubqueryExpression(...))```
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148913 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala --- @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan + +/** + * Evaluates whether `subquery` result contains `value`. + * For example : 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)' + * @param value In the above example 'a.key' is 'value' + * @param subquery In the above example 'SELECT b.key FROM src b' is 'subquery' + */ +case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression { + + type EvaluatedType = Any + def dataType = value.dataType + override def foldable = value.foldable + def nullable = value.nullable + override def toString = sSubqueryExpression($value, $subquery) + override lazy val resolved = childrenResolved --- End diff -- Yes, it is always unresolved. 
It will only be converted, not executed.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22148917

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---

+  def nullable = value.nullable

--- End diff --

Yes. I guess it is nullable.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22149004

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

@@ -314,6 +318,113 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool
     protected def containsStar(exprs: Seq[Expression]): Boolean =
       exprs.collect { case _: Star => true }.nonEmpty
   }
+
+  /**
+   * Transforms a query that has subquery expressions in its where clause into join queries.
+   * Case 1: Uncorrelated queries
+   *   -- original query
+   *   select C from R1 where R1.A in (Select B from R2)
+   *   -- rewritten query
+   *   Select C from R1 left semi join (select B as sqc0 from R2) subquery on R1.A = subquery.sqc0
+   *
+   * Case 2: Correlated queries
+   *   -- original query
+   *   select C from R1 where R1.A in (Select B from R2 where R1.X = R2.Y)
+   *   -- rewritten query
+   *   select C from R1 left semi join (select B as sqc0, R2.Y as sqc1 from R2) subquery
+   *   on R1.X = subquery.sqc1 and R1.A = subquery.sqc0
+   *
+   * Refer: https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf
+   */
+  object SubQueryExpressions extends Rule[LogicalPlan] {
+
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case filter @ Filter(conditions, child) =>
+        val subqueryExprs = new scala.collection.mutable.ArrayBuffer[SubqueryExpression]()
+        val nonSubQueryConds = new scala.collection.mutable.ArrayBuffer[Expression]()
+        val transformedConds = conditions.transform{
+          // Replace with dummy
+          case s @ SubqueryExpression(exp, subquery) =>
+            subqueryExprs += s
+            Literal(true)
+        }
+        if (subqueryExprs.size == 1) {
+          val subqueryExpr = subqueryExprs.remove(0)
+          createLeftSemiJoin(
+            child, subqueryExpr.value,
+            subqueryExpr.subquery, transformedConds)
+        } else if (subqueryExprs.size > 1) {
+          // Only one subquery expression is supported.
+          throw new TreeNodeException(filter, "Only 1 SubQuery expression is supported.")
+        } else {
+          filter
+        }
+    }
+
+    /**
+     * Create a LeftSemi join of the parent query with the subquery mentioned in the 'IN' predicate,
+     * and combine the subquery conditions with the parent query conditions.
+     */
+    def createLeftSemiJoin(left: LogicalPlan,
+        value: Expression, subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {
+      val (transformedPlan, subqueryConds) = transformAndGetConditions(
+        value, subquery)
+      // Unify the parent query conditions and subquery conditions and add these as join conditions
+      val unifyConds = And(parentConds, subqueryConds)
+      Join(left, transformedPlan, LeftSemi, Some(unifyConds))
+    }
+
+    /**
+     * Transform the subquery LogicalPlan and add the expressions that are used as filters to the
+     * projection; also return the filter conditions used in the subquery.
+     */
+    def transformAndGetConditions(value: Expression,
+        subquery: LogicalPlan): (LogicalPlan, Expression) = {
+      val expr = new scala.collection.mutable.ArrayBuffer[Expression]()
+      val transformedPlan = subquery transform {
+        case project @ Project(projectList, f @ Filter(condition, child)) =>
+          // Don't support more than 1 item in the select list of the subquery
+          if (projectList.size > 1) {
+            throw new TreeNodeException(project, "SubQuery can contain only 1 item in Select List")
+          }
+          val resolvedChild = ResolveRelations(child)
+          // Add the expressions to the projections which are used as filters in the subquery
+          val toBeAddedExprs = f.references.filter(
+            a => resolvedChild.resolve(a.name, resolver) != None && !projectList.contains(a))
+          val cache = collection.mutable.Map[String, String]()

--- End diff --

I may not be able to use ```AttributeMap``` without resolving the subquery completely.
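The uncorrelated rewrite (Case 1 in the rule's scaladoc) preserves the semantics of `IN`. A self-contained plain-Scala sketch over in-memory collections — not Catalyst, and with purely illustrative names like `R1Row` — shows that a left semi join over the subquery's output keeps exactly the rows the `IN` predicate keeps:

```scala
// Plain-Scala sketch (not Catalyst) of why the uncorrelated rewrite is sound:
// `select C from R1 where R1.A in (select B from R2)` keeps exactly the R1
// rows whose A value appears among R2's B values -- the same rows a left
// semi join on R1.A = subquery.sqc0 keeps.
object SemiJoinSketch {
  case class R1Row(a: Int, c: String)
  case class R2Row(b: Int)

  // IN-predicate semantics
  def viaIn(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    r1.filter(row => r2.map(_.b).contains(row.a)).map(_.c)

  // Left-semi-join semantics: keep a left row iff at least one right row matches
  def viaSemiJoin(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    r1.filter(l => r2.exists(r => l.a == r.b)).map(_.c)

  def main(args: Array[String]): Unit = {
    val r1 = Seq(R1Row(1, "x"), R1Row(2, "y"), R1Row(3, "z"))
    val r2 = Seq(R2Row(2), R2Row(3), R2Row(3))
    assert(viaIn(r1, r2) == viaSemiJoin(r1, r2))
    println(viaIn(r1, r2)) // List(y, z)
  }
}
```

Note that the semi join, like `IN`, emits each left row at most once even when the subquery contains duplicates (`R2Row(3)` appears twice above).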
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user ravipesala commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-67778115 Thank you for reviewing it. I have addressed the review comments; please take another look. I guess the ```SubqueryExpression``` may not be resolved along with the main query, and we also may not be able to resolve it separately, since it may contain references to the main query, as in ```select C from R1 where R1.A in (Select B from R2 where R1.X = R2.Y)```. Here R1.X is used inside the subquery. So I guess we can resolve it on a need basis. Please comment.
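The correlated case discussed here (Case 2 of the rewrite) additionally projects the correlated column out of the subquery and joins on it. Another plain-Scala sketch — not Catalyst, names illustrative only — of why joining on both `sqc0` and `sqc1` reproduces the correlated `IN`:

```scala
// Plain-Scala sketch (not Catalyst) of the correlated rewrite:
// `select C from R1 where R1.A in (select B from R2 where R1.X = R2.Y)`.
// The rewrite projects (B as sqc0, Y as sqc1) out of the subquery and
// semi-joins on R1.X = sqc1 and R1.A = sqc0.
object CorrelatedSketch {
  case class R1Row(a: Int, x: Int, c: String)
  case class R2Row(b: Int, y: Int)

  // Direct correlated-IN semantics
  def viaCorrelatedIn(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] =
    r1.filter(l => r2.exists(r => l.x == r.y && l.a == r.b)).map(_.c)

  // Rewritten form: the subquery projects (sqc0, sqc1); join on both columns
  def viaRewrite(r1: Seq[R1Row], r2: Seq[R2Row]): Seq[String] = {
    val subquery = r2.map(r => (r.b, r.y)) // (B as sqc0, Y as sqc1)
    r1.filter(l => subquery.exists { case (sqc0, sqc1) =>
      l.x == sqc1 && l.a == sqc0
    }).map(_.c)
  }

  def main(args: Array[String]): Unit = {
    val r1 = Seq(R1Row(1, 10, "p"), R1Row(2, 20, "q"), R1Row(2, 30, "r"))
    val r2 = Seq(R2Row(2, 20), R2Row(3, 30))
    assert(viaCorrelatedIn(r1, r2) == viaRewrite(r1, r2))
    println(viaRewrite(r1, r2)) // List(q)
  }
}
```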
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-67779525 [Test build #24684 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24684/consoleFull) for PR 3249 at commit [`8cae35b`](https://github.com/apache/spark/commit/8cae35bc70094011d9e62f465cdae87aacbf9240). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-67779527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24684/
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-67779852 [Test build #24685 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24685/consoleFull) for PR 3249 at commit [`11a238d`](https://github.com/apache/spark/commit/11a238d06320940f0278eae621df36d05436970f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class SubqueryExpression(subquery: LogicalPlan) extends Expression `
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3249#issuecomment-67779856 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24685/
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004038

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+        val transformedConds = conditions.transform{

--- End diff --

Space before `{`. Also I would consider doing this in two steps to avoid depending on transform for side effects: a collect to get the list and then a transform to replace with `true`.
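The collect-then-transform suggestion can be sketched on a toy expression tree. This is not Catalyst's `TreeNode` API — the types below are hypothetical stand-ins — but it shows the shape of the two side-effect-free passes:

```scala
// Toy expression tree (not Catalyst) showing the two-pass approach:
// first collect the subquery nodes, then transform to replace each with
// a literal true -- no mutation from inside the transform.
sealed trait Expr
case class AndExpr(left: Expr, right: Expr) extends Expr
case class Subquery(id: Int) extends Expr
case object TrueLit extends Expr

object TwoPassRewrite {
  // Pass 1: gather the subquery expressions
  def collectSubqueries(e: Expr): Seq[Subquery] = e match {
    case s: Subquery   => Seq(s)
    case AndExpr(l, r) => collectSubqueries(l) ++ collectSubqueries(r)
    case TrueLit       => Seq.empty
  }

  // Pass 2: replace every subquery with a dummy `true` literal
  def replaceSubqueries(e: Expr): Expr = e match {
    case _: Subquery   => TrueLit
    case AndExpr(l, r) => AndExpr(replaceSubqueries(l), replaceSubqueries(r))
    case TrueLit       => TrueLit
  }

  def main(args: Array[String]): Unit = {
    val conds = AndExpr(Subquery(0), AndExpr(TrueLit, Subquery(1)))
    assert(collectSubqueries(conds) == Seq(Subquery(0), Subquery(1)))
    assert(replaceSubqueries(conds) == AndExpr(TrueLit, AndExpr(TrueLit, TrueLit)))
  }
}
```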
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004108

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+          throw new TreeNodeException(filter, "Only 1 SubQuery expression is supported.")

--- End diff --

"Only one subquery expression is supported"
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004302

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+          createLeftSemiJoin(
+            child, subqueryExpr.value,
+            subqueryExpr.subquery, transformedConds)

--- End diff --

If you need to wrap then I'd put each arg on its own line.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004338

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+      val (transformedPlan, subqueryConds) = transformAndGetConditions(
+        value, subquery)

--- End diff --

Does this need to wrap? If so you should break at the `=`.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004386

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+    def createLeftSemiJoin(left: LogicalPlan,
+        value: Expression, subquery: LogicalPlan,
+        parentConds: Expression): LogicalPlan = {

--- End diff --

Same as above, if you need to wrap then put each thing on its own line.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004443

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---

+          val resolvedChild = ResolveRelations(child)

--- End diff --

Instead of doing this, I'd propose you guard the rule at the top to only apply once the child plan is resolved.
[GitHub] spark pull request: [SPARK-4226][SQL] SparkSQL - Add support for s...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22004962 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -314,6 +318,113 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool protected def containsStar(exprs: Seq[Expression]): Boolean = exprs.collect { case _: Star = true }.nonEmpty } + + /** + * Transforms the query which has subquery expressions in where clause to join queries. + * Case 1 Uncorelated queries + * -- original query + * select C from R1 where R1.A in (Select B from R2) + * -- rewritten query + * Select C from R1 left semi join (select B as sqc0 from R2) subquery on R1.A = subquery.sqc0 + * + * Case 2 Corelated queries + * -- original query + * select C from R1 where R1.A in (Select B from R2 where R1.X = R2.Y) + * -- rewritten query + * select C from R1 left semi join (select B as sqc0, R2.Y as sqc1 from R2) subquery + * on R1.X = subquery.sqc1 and R1.A = subquery.sqc0 + * + * Refer: https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf + */ + object SubQueryExpressions extends Rule[LogicalPlan] { + +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case filter @ Filter(conditions, child) = +val subqueryExprs = new scala.collection.mutable.ArrayBuffer[SubqueryExpression]() +val nonSubQueryConds = new scala.collection.mutable.ArrayBuffer[Expression]() +val transformedConds = conditions.transform{ + // Replace with dummy + case s @ SubqueryExpression(exp,subquery) = +subqueryExprs += s +Literal(true) +} +if (subqueryExprs.size == 1) { + val subqueryExpr = subqueryExprs.remove(0) + createLeftSemiJoin( +child, subqueryExpr.value, +subqueryExpr.subquery, transformedConds) +} else if (subqueryExprs.size 1) { + // Only one subquery expression is supported. + throw new TreeNodeException(filter, Only 1 SubQuery expression is supported.) 
+} else { + filter +} +} + +/** + * Create LeftSemi join with parent query to the subquery which is mentioned in 'IN' predicate + * And combine the subquery conditions and parent query conditions. + */ +def createLeftSemiJoin(left: LogicalPlan, +value: Expression, subquery: LogicalPlan, +parentConds: Expression) : LogicalPlan = { + val (transformedPlan, subqueryConds) = transformAndGetConditions( + value, subquery) + // Unify the parent query conditions and subquery conditions and add these as join conditions + val unifyConds = And(parentConds, subqueryConds) + Join(left, transformedPlan, LeftSemi, Some(unifyConds)) +} + +/** + * Transform the subquery LogicalPlan and add the expressions which are used as filters to the + * projection. And also return filter conditions used in subquery + */ +def transformAndGetConditions(value: Expression, + subquery: LogicalPlan): (LogicalPlan, Expression) = { + val expr = new scala.collection.mutable.ArrayBuffer[Expression]() + val transformedPlan = subquery transform { +case project @ Project(projectList, f @ Filter(condition, child)) = + // Don't support more than 1 item in select list of subquery + if(projectList.size 1) { +throw new TreeNodeException(project, SubQuery can contain only 1 item in Select List) + } + val resolvedChild = ResolveRelations(child) + // Add the expressions to the projections which are used as filters in subquery + val toBeAddedExprs = f.references.filter( + a=resolvedChild.resolve(a.name, resolver) != None !projectList.contains(a)) --- End diff -- A few things here - I'd add the guard above so we don't have to resolve here. Hand resolution like you are doing here will work for simple cases but doesn't run to fixed point. - Never compare attributes by name, and don't inspect the project list to see what the output will be. The reasons are as follows: names can be ambiguous and the project list could have aliases and other things that confuse the inspection. 
Instead use `outputSet`, which returns an `AttributeSet` that does comparisons correctly.
- If you need multi-line lambdas, prefer the following syntax:
```
list.filter { a =>
  a
}
```
All together I think you want something like:
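The reviewer's full suggestion is truncated in the archive, but the point about name-based comparison can be sketched with a toy model. Everything below is an illustrative stand-in, not Catalyst's actual classes: real Catalyst attributes carry an `ExprId` and compare by it, which is what `AttributeSet` relies on, so two attributes that happen to share a name but come from different relations stay distinct.

```scala
// Toy model of Catalyst attribute identity (illustrative only).
// Equality is by exprId, never by name.
case class ExprId(id: Long)
case class Attribute(name: String, exprId: ExprId)

object AttributeSetDemo {
  def main(args: Array[String]): Unit = {
    val a1 = Attribute("key", ExprId(1))
    val a2 = Attribute("key", ExprId(2)) // same name, different source relation

    // Name-based comparison wrongly conflates the two:
    assert(a1.name == a2.name)

    // Identity-based membership (what AttributeSet provides) keeps them apart:
    val attrSet = Set(a1) // a plain Set stands in for AttributeSet here
    assert(!attrSet.contains(a2))      // same name is not enough
    assert(attrSet.contains(a1.copy())) // same exprId => contained
    println("ok")
  }
}
```

This is why the review asks for `outputSet` instead of inspecting `projectList` by name: aliases change names but preserve expression identity.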
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22005225

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@ [... same hunk as above ...]
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform{
--- End diff --

Space before `{`.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22005471

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@ [... same hunk as above ...]
+          // Replace the condition column names with alias names.
+          val transformedConds = condition.transform{
+            case a: Attribute if resolvedChild.resolve(a.name, resolver) != None =>
+              UnresolvedAttribute("subquery." +
+                cache.get(a.name).get)
--- End diff --

We should avoid creating more unresolved attributes.
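One way to avoid round-tripping through unresolved attributes, sketched with toy stand-ins (these are not Catalyst's real classes, and this is not the patch's actual code): keep a map from each original attribute's identity to the attribute produced by its `Alias`, and substitute that resolved reference directly when rewriting the condition, instead of rebuilding an `UnresolvedAttribute` from a string like `"subquery." + name`.

```scala
// Toy sketch: substitute already-resolved alias references by identity,
// rather than re-creating unresolved attributes from strings.
case class ExprId(id: Long)
case class AttributeRef(name: String, exprId: ExprId)

object RewriteDemo {
  def main(args: Array[String]): Unit = {
    val orig = AttributeRef("Y", ExprId(1))
    // Attribute produced by the Alias created for the projection ("Y" AS "sqc1"):
    val aliased = AttributeRef("sqc1", ExprId(2))

    // Map original attribute identity -> the alias's own output attribute.
    val substitution: Map[ExprId, AttributeRef] = Map(orig.exprId -> aliased)

    // Rewriting the condition stays fully resolved: look up by exprId,
    // fall back to the original reference when no alias was created.
    val rewritten = substitution.getOrElse(orig.exprId, orig)
    assert(rewritten == aliased)
    println("ok")
  }
}
```

The design point is that the rewritten condition never has to be re-resolved by name, so ambiguous or shadowed names cannot break it.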
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22005513

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -314,6 +318,113 @@ [... same hunk as above ...]
+            case a: Attribute if resolvedChild.resolve(a.name, resolver) != None =>
+              UnresolvedAttribute("subquery." +
+                cache.get(a.name).get)
+          }
+          // Join the first projection column of subquery
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3249#discussion_r22005662

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SubqueryExpression.scala ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions
+
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+
+/**
+ * Evaluates whether `subquery` result contains `value`.
+ * For example: 'SELECT * FROM src a WHERE a.key in (SELECT b.key FROM src b)'
+ * @param value In the above example 'a.key' is 'value'
+ * @param subquery In the above example 'SELECT b.key FROM src b' is 'subquery'
+ */
+case class SubqueryExpression(value: Expression, subquery: LogicalPlan) extends Expression {
--- End diff --

Why do we need both the `value` and the `subquery`? Could we do something like `subquery.output.head.dataType`?
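A hedged sketch of what the question may be driving at, using toy stand-ins (not Catalyst's real classes and not the patch's code): once the subquery plan is resolved, the expression's result type can be read off the subquery's single output column, so a separately stored type is redundant.

```scala
// Hypothetical sketch: derive dataType from the subquery's first output
// column instead of carrying a separate typed slot. Toy stand-ins only.
sealed trait DataType
case object IntegerType extends DataType
case class OutputAttr(name: String, dataType: DataType)
case class LogicalPlan(output: Seq[OutputAttr])

case class SubqueryExpr(value: String, subquery: LogicalPlan) {
  // An IN-subquery yields exactly one column, so the expression's type
  // is the type of the first (and only) output attribute.
  def dataType: DataType = subquery.output.head.dataType
}

object SubqueryTypeDemo {
  def main(args: Array[String]): Unit = {
    val sub = LogicalPlan(Seq(OutputAttr("key", IntegerType)))
    assert(SubqueryExpr("a.key", sub).dataType == IntegerType)
    println("ok")
  }
}
```

The catch, and presumably why the question is phrased tentatively, is that `subquery.output` is only meaningful after the subquery has been resolved, so such a derived type must not be consulted before analysis completes.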