[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176686941 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176686945 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50368/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176686339 **[Test build #50368 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50368/consoleFull)** for PR 10630 at commit [`b600089`](https://github.com/apache/spark/commit/b600089d4736e9768a8968fb4dead08d28014ac1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176653934 Thank you! Just cleaned the codes. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176659772 **[Test build #50368 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50368/consoleFull)** for PR 10630 at commit [`b600089`](https://github.com/apache/spark/commit/b600089d4736e9768a8968fb4dead08d28014ac1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176656084 LGTM, pending test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176921008 Thanks - I'm going to merge this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10630 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176628957 LGTM. we can merge it first and @gatorsmile can address remaining comments in a follow-up PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176630203 This is not that big. Let's just do it together here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51233504 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceOperatorSuite.scala --- @@ -0,0 +1,46 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.dsl.expressions._ +import org.apache.spark.sql.catalyst.dsl.plans._ +import org.apache.spark.sql.catalyst.plans.{LeftSemi, PlanTest} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules.RuleExecutor + +class ReplaceOperatorSuite extends PlanTest { + + object Optimize extends RuleExecutor[LogicalPlan] { +val batches = + Batch("Intersect", FixedPoint(100), +ReplaceIntersectWithSemiJoin) :: Nil + } + + test("replace Intersect with Left-semi Join") { --- End diff -- also move the test for `ReplaceDistinctWithAggregate` here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51233589 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisErrorSuite.scala --- @@ -270,7 +270,8 @@ class AnalysisErrorSuite extends AnalysisTest { val error = intercept[AnalysisException] { SimpleAnalyzer.checkAnalysis(join) } -assert(error.message.contains("Failure when resolving conflicting references in Join")) +assert(error.message.contains("Failure when resolving conflicting references\n" + --- End diff -- we can revert this after revert the error message change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51233135 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -214,12 +214,22 @@ trait CheckAnalysis { s"""Only a single table generating function is allowed in a SELECT clause, found: | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) + case j: Join if !j.duplicateResolved => +val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet) failAnalysis( s""" - |Failure when resolving conflicting references in Join: --- End diff -- now we can keep this message as it only checks join :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51233162 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala --- @@ -214,12 +214,22 @@ trait CheckAnalysis { s"""Only a single table generating function is allowed in a SELECT clause, found: | ${exprs.map(_.prettyString).mkString(",")}""".stripMargin) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if left.outputSet.intersect(right.outputSet).nonEmpty => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) + case j: Join if !j.duplicateResolved => +val conflictingAttributes = j.left.outputSet.intersect(j.right.outputSet) failAnalysis( s""" - |Failure when resolving conflicting references in Join: + |Failure when resolving conflicting references + |in operator ${operator.simpleString}: + |$plan + |Conflicting attributes: ${conflictingAttributes.mkString(",")} + |""".stripMargin) + + case i: Intersect if !i.duplicateResolved => +val conflictingAttributes = i.left.outputSet.intersect(i.right.outputSet) +failAnalysis( + s""" + |Failure when resolving conflicting references + |in operator ${operator.simpleString}: --- End diff -- same here, we could say `Intersect` directly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51233323 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -315,4 +315,7 @@ abstract class BinaryNode extends LogicalPlan { def right: LogicalPlan override def children: Seq[LogicalPlan] = Seq(left, right) + + override lazy val resolved: Boolean = +expressions.forall(_.resolved) && childrenResolved --- End diff -- why override this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176441985 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176441989 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50298/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176487231 **[Test build #50313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50313/consoleFull)** for PR 10630 at commit [`e51de8f`](https://github.com/apache/spark/commit/e51de8f98b25281b338ca57f0f13645b626c7673). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176507818 **[Test build #50313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50313/consoleFull)** for PR 10630 at commit [`e51de8f`](https://github.com/apache/spark/commit/e51de8f98b25281b338ca57f0f13645b626c7673). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class SetOperation(left: LogicalPlan, right: LogicalPlan) extends BinaryNode` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176508356 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50313/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176482358 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176508353 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51091050 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- ah makes sense, let's add some comment to explain this :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51163511 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- Will do. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51061765 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- For other operators, can we construct test cases that can make them fail without de-duplication? If we can, then we should create a JIRA and fix it in another PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51059167 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- In this case, it should work! Let me know if we should deduplicate the expression IDs for the other operators. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51084606 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- @cloud-fan After code changes, I just realized the problem. If we replace intersect with left-semi join during analysis, it does not work. We have to de-duplicate the left and the right children at first; otherwise, the tree will be like ``` == Analyzed Logical Plan == id: int Distinct +- Join LeftSemi, Some((id#1 <=> id#1)) :- Sample 0.0, 0.4, false, 1 : +- Sort [id#1 ASC], false : +- Project [_1#0 AS id#1] :+- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22 +- Sample 0.4, 1.0, false, 1 +- Sort [id#4 ASC], false +- Project [_1#0 AS id#4] +- LogicalRDD [_1#0], MapPartitionsRDD[2] at apply at Transformer.scala:22 ``` The Join condition is invalid. Should we keep the rule in the optimizer? BTW, I will remove the `Except` and `CoGroup`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51074011 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- So far, I am unable to create such test cases. Let me revert it back and enable de-duplication for Join only. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176013909 **[Test build #50257 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50257/consoleFull)** for PR 10630 at commit [`3be78c4`](https://github.com/apache/spark/commit/3be78c4197519edb4659d8064468125b5223dd29). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176036668 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176036575 **[Test build #50257 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50257/consoleFull)** for PR 10630 at commit [`3be78c4`](https://github.com/apache/spark/commit/3be78c4197519edb4659d8064468125b5223dd29). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-176036669 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50257/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51040321 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] { } /** +<<< HEAD --- End diff -- remove this --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51041095 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- do we have test cases to prove that duplicated expression IDs in set operations will cause problems? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51054458 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- After we convert Intersect to Join, the query can return a wrong answer due to the pushdown of Join predicate in Optimizer. That is why I did the change. Based on the comments https://github.com/apache/spark/pull/10630#discussion_r49227379 , I also tried to resolve this issue in the other operators. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51055023 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -111,6 +113,7 @@ object SamplePushDown extends Rule[LogicalPlan] { } /** +<<< HEAD --- End diff -- will do. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r51057617 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +445,18 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. - case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} + // To resolve duplicate expression IDs for all the BinaryNode + case b: BinaryNode if !b.duplicateResolved => b match { +case j @ Join(left, right, _, _) => + j.copy(right = dedupRight(left, right)) +case i @ Intersect(left, right) => + i.copy(right = dedupRight(left, right)) +case e @ Except(left, right) => + e.copy(right = dedupRight(left, right)) +case cg: CoGroup => --- End diff -- If we replace intersect with left-semi join during analysis, then we can get de-duplication for free(the rule for join can do the work), will it help here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-175469365 **[Test build #50176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50176/consoleFull)** for PR 10630 at commit [`e566d79`](https://github.com/apache/spark/commit/e566d79b12ab1d9ed9e3124fec8290cdba99549b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-175432369 **[Test build #50176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50176/consoleFull)** for PR 10630 at commit [`e566d79`](https://github.com/apache/spark/commit/e566d79b12ab1d9ed9e3124fec8290cdba99549b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-175469557 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-175469558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50176/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r50897086 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] { /** * Pushes certain operations to both sides of a Union, Intersect or Except operator. +=== + * Pushes certain operations to both sides of a Union or Except operator. +>>> IntersectBySemiJoinMerged --- End diff -- Seems we need to remove this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r50899426 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1055,6 +1045,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] { } /** + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator. + * {{{ + * SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 + * ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 + * }}} + */ +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] { --- End diff -- I think we need to add a comment at here to mention that this rewrite is just for `INTERSECT` (i.e. `INTERSECT DISTINCT`) and is not applicable to `INTERSECT ALL`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r50901811 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -125,17 +128,15 @@ object EliminateSerialization extends Rule[LogicalPlan] { /** * Pushes certain operations to both sides of a Union, Intersect or Except operator. +=== + * Pushes certain operations to both sides of a Union or Except operator. +>>> IntersectBySemiJoinMerged --- End diff -- yeah, sure, will do. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r50901898 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1055,6 +1045,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] { } /** + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator. + * {{{ + * SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 + * ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 + * }}} + */ +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] { --- End diff -- Yeah, will do it. Actually, I will also implement `INTERSECT ALL` after this one. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174630543 I don't think its a problem for there to be conflicting attribute ids for set operations, this is because only on childs attribute references need to be propagated up (unlike with a join). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174665556 Yeah, agree! Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174165179 When resolving the conflicts, I realized the multi-children `Union` might have duplicate `exprId`. So far, I did not add a function to de-duplicate them. This is not a trivial work, if needed. When the children has hundreds of nodes, it is infeasible to use the current per-pair de-duplication. That means, we need to rewrite the whole function `dedup`. Let me know if we need to open a separate PR to do it now. So far, unlike `Intersect`, we did not hit any issue even if there exist duplicate `exprId` values. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174165441 **[Test build #49936 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49936/consoleFull)** for PR 10630 at commit [`6a7979d`](https://github.com/apache/spark/commit/6a7979d35d5cf89d9cd14eb6d6a2a4418a44a669). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174171695 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49936/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174171694 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174171631 **[Test build #49936 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49936/consoleFull)** for PR 10630 at commit [`6a7979d`](https://github.com/apache/spark/commit/6a7979d35d5cf89d9cd14eb6d6a2a4418a44a669). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174148184 sure, let me do it tonight. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-174144971 @gatorsmile there is a conflict now - mind bringing it up to date? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170220929 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170208865 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170208817 **[Test build #49038 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf). * This patch **fails from timeout after a configured wait of \`250m\`**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170208866 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49038/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170209280 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170209903 **[Test build #49044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49044/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170220857 **[Test build #49044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49044/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170220930 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49044/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169926993 **[Test build #49008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49008/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169927020 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49008/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169927573 Both failure are not related to this PR. They are caused by the test case `randomSplit on reordered partitions` in https://github.com/apache/spark/pull/10626. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169931210 **[Test build #49012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49012/consoleFull)** for PR 10630 at commit [`a932cdb`](https://github.com/apache/spark/commit/a932cdb05c889ded8de72024b2eef19723bdb0f4). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169936735 **[Test build #49012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49012/consoleFull)** for PR 10630 at commit [`a932cdb`](https://github.com/apache/spark/commit/a932cdb05c889ded8de72024b2eef19723bdb0f4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class JavaSampleActorReceiver extends JavaActorReceiver ` * `public class JavaActorWordCount ` * `abstract class ActorReceiver extends Actor ` * `abstract class JavaActorReceiver extends UntypedActor ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169936845 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49012/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169936843 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169927019 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170082147 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170081871 **[Test build #49020 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49020/consoleFull)** for PR 10630 at commit [`04a26bd`](https://github.com/apache/spark/commit/04a26bd3beaab9d8ea3a4b300eb9f413b5eaa704). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49227379 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +441,14 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. + // Special handling for cases when self-join introduces duplicate expression IDs. case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} +j.copy(right = buildRightChild4Deduplication(left, right)) + + // Special handling for cases when self-intersect introduces duplicate expression IDs. + // In Optimizer, Intersect will be converted to semi join. + case i @ Intersect(left, right) if !i.duplicateResolved => +i.copy(right = buildRightChild4Deduplication(left, right)) --- End diff -- Do we need to do this for all non-leaf/unary nodes? I think the only reason this wasn't a problem in the past was there were no optimizer rules for those nodes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49239333 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -344,6 +344,59 @@ class Analyzer( } } +def buildRightChild4Deduplication (left: LogicalPlan, right: LogicalPlan): LogicalPlan = { --- End diff -- Sure. : ) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49239389 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -116,7 +116,18 @@ case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(lef } } -case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) +case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) { + def duplicateResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty + + // Intersect is only resolved if they don't introduce ambiguous expression ids, + // since it will be converted to semi Join by optimizer. --- End diff -- Will update the comments in the code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49227764 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -116,7 +116,18 @@ case class Union(left: LogicalPlan, right: LogicalPlan) extends SetOperation(lef } } -case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) +case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation(left, right) { + def duplicateResolved: Boolean = left.outputSet.intersect(right.outputSet).isEmpty + + // Intersect is only resolved if they don't introduce ambiguous expression ids, + // since it will be converted to semi Join by optimizer. --- End diff -- This check might belong more generally in `childrenResolved` as I mention above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49239278 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -388,57 +441,14 @@ class Analyzer( .map(_.asInstanceOf[NamedExpression]) a.copy(aggregateExpressions = expanded) - // Special handling for cases when self-join introduce duplicate expression ids. + // Special handling for cases when self-join introduces duplicate expression IDs. case j @ Join(left, right, _, _) if !j.selfJoinResolved => -val conflictingAttributes = left.outputSet.intersect(right.outputSet) -logDebug(s"Conflicting attributes ${conflictingAttributes.mkString(",")} in $j") - -right.collect { - // Handle base relations that might appear more than once. - case oldVersion: MultiInstanceRelation - if oldVersion.outputSet.intersect(conflictingAttributes).nonEmpty => -val newVersion = oldVersion.newInstance() -(oldVersion, newVersion) - - // Handle projects that create conflicting aliases. - case oldVersion @ Project(projectList, _) - if findAliases(projectList).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(projectList = newAliases(projectList))) - - case oldVersion @ Aggregate(_, aggregateExpressions, _) - if findAliases(aggregateExpressions).intersect(conflictingAttributes).nonEmpty => -(oldVersion, oldVersion.copy(aggregateExpressions = newAliases(aggregateExpressions))) - - case oldVersion: Generate - if oldVersion.generatedSet.intersect(conflictingAttributes).nonEmpty => -val newOutput = oldVersion.generatorOutput.map(_.newInstance()) -(oldVersion, oldVersion.copy(generatorOutput = newOutput)) - - case oldVersion @ Window(_, windowExpressions, _, _, child) - if AttributeSet(windowExpressions.map(_.toAttribute)).intersect(conflictingAttributes) -.nonEmpty => -(oldVersion, oldVersion.copy(windowExpressions = newAliases(windowExpressions))) -} -// Only handle first case, others will be fixed on the next pass. -.headOption match { - case None => -/* - * No result implies that there is a logical plan node that produces new references - * that this rule cannot handle. When that is the case, there must be another rule - * that resolves these conflicts. Otherwise, the analysis will fail. - */ -j - case Some((oldRelation, newRelation)) => -val attributeRewrites = AttributeMap(oldRelation.output.zip(newRelation.output)) -val newRight = right transformUp { - case r if r == oldRelation => newRelation -} transformUp { - case other => other transformExpressions { -case a: Attribute => attributeRewrites.get(a).getOrElse(a) - } -} -j.copy(right = newRight) -} +j.copy(right = buildRightChild4Deduplication(left, right)) + + // Special handling for cases when self-intersect introduces duplicate expression IDs. + // In Optimizer, Intersect will be converted to semi join. + case i @ Intersect(left, right) if !i.duplicateResolved => +i.copy(right = buildRightChild4Deduplication(left, right)) --- End diff -- I see, will do a change to make it more general. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170082149 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49020/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49227485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -344,6 +344,59 @@ class Analyzer( } } +def buildRightChild4Deduplication (left: LogicalPlan, right: LogicalPlan): LogicalPlan = { --- End diff -- extra space. Also I'm not a huge fan of `4`. How about `dedupRight` with some scala doc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49039/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200645 **[Test build #49039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170200676 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170193612 **[Test build #49039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49039/consoleFull)** for PR 10630 at commit [`4372170`](https://github.com/apache/spark/commit/4372170f600eb25996c3aa4f09d569312c263686). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189725 **[Test build #49038 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49038/consoleFull)** for PR 10630 at commit [`f820c61`](https://github.com/apache/spark/commit/f820c616fe217494ccaed0bf74a0a7410ce503cf). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170189804 In this update, the code changes include - Fixed a bug in `Range` by inheriting the trait `MultiInstanceRelation`. - Added a de-duplication resolution for all the binary nodes: `Except`, `Union` and `Co-Group`, besides `Intersect` and `Join`. - Added a new function `duplicateResolved` for all the binary nodes. - Improved the analysis exception message when failure to resolve conflicting references - Resolved all the other comments. The analysis procedure is kind of tricky. I am unable to directly include `duplicateResolved` into `childrenResolved`. `resolve` is lazy evaluated. The resolution procedure need to follow the order: children at first, then itself, and then deduplicate the attributes' expression IDs in tis children. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170013936 Found the root cause of this failure. We also need to change the analyzer to handle duplicate expression ids. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170052294 The latest update is for resolving the ambiguous attributes, which could trigger invalid predicate pushdown by the Optimizer rule `PushPredicateThroughJoin`. The solution is the same as what we are doing for self join in analyzer. Thank you! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-170053763 **[Test build #49020 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49020/consoleFull)** for PR 10630 at commit [`04a26bd`](https://github.com/apache/spark/commit/04a26bd3beaab9d8ea3a4b300eb9f413b5eaa704). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169591104 **[Test build #48916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48916/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169618123 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169635223 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48912/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169618125 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48916/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169719510 **[Test build #48947 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48947/consoleFull)** for PR 10630 at commit [`e4c34f0`](https://github.com/apache/spark/commit/e4c34f01acbb626a8084c99a5f0ad26fbfb175e2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/10630#discussion_r49094926 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -966,6 +952,22 @@ object ReplaceDistinctWithAggregate extends Rule[LogicalPlan] { } /** + * Replaces logical [[Intersect]] operator with a left-semi [[Join]] operator. + * {{{ + * SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 + * ==> SELECT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 + * }}} + */ +object ReplaceIntersectWithSemiJoin extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case Intersect(left, right) => + assert(left.output.size == right.output.size) + val joinCond = left.output.zip(right.output).map { case (l, r) => EqualNullSafe(l, r) } + Join(left, right, LeftSemi, joinCond.reduceLeftOption(And)) --- End diff -- You are right! It should remove duplicates. I just tried it in DB2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169913011 **[Test build #49007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169916605 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49007/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169916531 **[Test build #49007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49007/consoleFull)** for PR 10630 at commit [`27192be`](https://github.com/apache/spark/commit/27192be27708265aba6707d1100eefca537fb2b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169916600 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169913342 In this update, the code changes include - Added `DISTINCT` on top of `Semi Join` for `Intersect` - Added a test case for detecting if the duplicate values are removed after `Intersect` - Threw an exception if the `Intersect` is not converted before conversion from logical plan to physical plan - Merge two rules `ReplaceIntersectWithSemiJoin` and `ReplaceDistinctWithAggregate` in the same batch `"Replace Operators"` for ensuring `ReplaceIntersectWithSemiJoin` should be called before `ReplaceDistinctWithAggregate` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12656] [SQL] Implement Intersect with L...
Github user gatorsmile commented on the pull request: https://github.com/apache/spark/pull/10630#issuecomment-169917411 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org