[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-68402804 Thanks, merged to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3442
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-68024274 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24753/ Test PASSed.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-68024272

[Test build #24753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24753/consoleFull) for PR 3442 at commit [`a4a43c9`](https://github.com/apache/spark/commit/a4a43c99b49156ef90fa3b9493b008823dbc01d3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-68021287 [Test build #24753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24753/consoleFull) for PR 3442 at commit [`a4a43c9`](https://github.com/apache/spark/commit/a4a43c99b49156ef90fa3b9493b008823dbc01d3). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-67283254

[Test build #24533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24533/consoleFull) for PR 3442 at commit [`3a58191`](https://github.com/apache/spark/commit/3a58191f3ec01724dc43ec73d8ee0815d58dd0e2).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class BroadcastLeftSemiJoinHash(`
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-67283248 [Test build #24533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24533/consoleFull) for PR 3442 at commit [`3a58191`](https://github.com/apache/spark/commit/3a58191f3ec01724dc43ec73d8ee0815d58dd0e2). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-67283256 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24533/ Test FAILed.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65187645 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24025/ Test PASSed.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65187643

[Test build #24025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24025/consoleFull) for PR 3442 at commit [`d410f67`](https://github.com/apache/spark/commit/d410f675091685288e3965a2e9e78ace3f244d04).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class BroadcastLeftSemiJoinHash(`
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65183880 [Test build #24025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24025/consoleFull) for PR 3442 at commit [`d410f67`](https://github.com/apache/spark/commit/d410f675091685288e3965a2e9e78ace3f244d04). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21139300

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
 (null, 10) :: Nil)
 }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
+Seq(
+ ("SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a", classOf[BroadcastLeftSemiJoinHash])
+).foreach {
+ case (query, joinClass) => assertJoin(query, joinClass)
+}
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-1""")
+
+Seq(
+ ("SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a", classOf[LeftSemiJoinHash])
+).foreach {
+ case (query, joinClass) => assertJoin(query, joinClass)
+}
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-$tmp""")
--- End diff --

`-$tmp`: typo? And we can just use `setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, tmp.toString)` here.
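The review point above is about restoring the saved threshold exactly as it was. One way to make that hard to get wrong is to wrap the temporary change in a helper that restores the previous value in a `finally` block. The following is only a sketch in plain Scala, not Spark code: the `settings` map, the `withSetting` helper, and the value `"1000"` are hypothetical stand-ins for the SQL configuration.

```scala
import scala.collection.mutable

// Hypothetical in-memory settings store standing in for the SQL configuration.
// The key name and the value "1000" are illustrative only.
val settings = mutable.Map("autoBroadcastJoinThreshold" -> "1000")

// Run `body` with `key` temporarily set to `value`, then restore the previous
// state (old value, or absence of the key) even if `body` throws.
def withSetting[A](key: String, value: String)(body: => A): A = {
  val previous = settings.get(key)
  settings(key) = value
  try body
  finally {
    previous match {
      case Some(old) => settings(key) = old
      case None      => settings.remove(key)
    }
  }
}

// The setting is overridden only inside the block.
val seenInside = withSetting("autoBroadcastJoinThreshold", "-1") {
  settings("autoBroadcastJoinThreshold")
}
println(seenInside)
println(settings("autoBroadcastJoinThreshold"))
```

Because the old value is captured once and written back unchanged, there is no place for a stray sign such as `-$tmp` to creep in.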
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user wangxiaojing commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21136073

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
 (null, 10) :: Nil)
 }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
--- End diff --

Because the testData2 size is more than `SQLConf.AUTO_BROADCASTJOIN_THRESHOLD` 1.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65160209 Thanks for working on this!
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21129847

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -193,4 +193,70 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll {
 )
 }
+ test("auto converts to broadcast left semi join, by size estimate of a relation") {
+def mkTest(
+before: () => Unit,
+after: () => Unit,
+query: String,
+expectedAnswer: Seq[Any],
+ct: ClassTag[_]) = {
--- End diff --

Only indent arguments 4 spaces. Also, this function seems like overkill given that it is only called once and the first two arguments are always no-ops.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21129765

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -193,4 +193,70 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll {
 )
 }
+ test("auto converts to broadcast left semi join, by size estimate of a relation") {
+def mkTest(
+before: () => Unit,
+after: () => Unit,
+query: String,
+expectedAnswer: Seq[Any],
+ct: ClassTag[_]) = {
+ before()
+
+ var rdd = sql(query)
+
+ // Assert src has a size smaller than the threshold.
+ val sizes = rdd.queryExecution.analyzed.collect {
+case r if ct.runtimeClass.isAssignableFrom(r.getClass) => r.statistics.sizeInBytes
+ }
+ assert(sizes.size === 2 && sizes(1) <= autoBroadcastJoinThreshold
+&& sizes(0) <= autoBroadcastJoinThreshold,
+s"query should contain two relations, each of which has size smaller than autoConvertSize")
+
+ // Using `sparkPlan` because for relevant patterns in HashJoin to be
+ // matched, other strategies need to be applied.
+ var bhj = rdd.queryExecution.sparkPlan.collect {
+case j: BroadcastLeftSemiJoinHash => j
+ }
+ assert(bhj.size === 1,
+s"actual query plans do not contain broadcast join: ${rdd.queryExecution}")
+
+ checkAnswer(rdd, expectedAnswer) // check correctness of output
+
+ TestHive.settings.synchronized {
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-1""")
+rdd = sql(query)
+bhj = rdd.queryExecution.sparkPlan.collect {
+ case j: BroadcastLeftSemiJoinHash => j
+}
+assert(bhj.isEmpty, "BroadcastHashJoin still planned even though it is switched off")
+
+val shj = rdd.queryExecution.sparkPlan.collect {
+ case j: LeftSemiJoinHash => j
+}
+assert(shj.size === 1,
+ "LeftSemiJoinHash should be planned when BroadcastHashJoin is turned off")
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=$tmp""")
+ }
+
+ after()
+}
+
+/** Tests for MetastoreRelation */
+ val leftSemiJoinQuery =
+ """SELECT * FROM src a
+|left semi JOIN src b ON a.key=86 and a.key = b.key""".stripMargin
+val Answer =(86, "val_86") ::Nil
--- End diff --

Indention is off and use lowercase letters for the start of variable names. Also, space after "=".
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21129679

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.joins
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions.{Expression, Row}
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
+
+/**
+ * :: DeveloperApi ::
+ * Build the right table's join keys into a HashSet, and iteratively go through the left
+ * table, to find the if join keys are in the Hash set.
+ */
+@DeveloperApi
+case class BroadcastLeftSemiJoinHash(
+leftKeys: Seq[Expression],
+rightKeys: Seq[Expression],
+left: SparkPlan,
+right: SparkPlan) extends BinaryNode with HashJoin {
+
+ override val buildSide = BuildRight
+
+ override def output = left.output
+
+ override def execute() = {
+
+val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
+val hashSet = new java.util.HashSet[Row]()
+var currentRow: Row = null
+
+// Create a Hash set of buildKeys
+while (buildIter.hasNext) {
+ currentRow = buildIter.next()
+ val rowKey = buildSideKeyGenerator(currentRow)
+ if (!rowKey.anyNull) {
+val keyExists = hashSet.contains(rowKey)
+if (!keyExists) {
+ hashSet.add(rowKey)
+}
+ }
+}
+
+val broadcastedRelation = sparkContext.broadcast(hashSet)
+
+streamedPlan.execute().mapPartitions { streamIter =>
+
+ val joinKeys = streamSideKeyGenerator()
--- End diff --

Remove blank line.
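The operator under review builds a set of the right side's non-null join keys and then probes it for each row of the left side. As a rough illustration of that idea only, here is a sketch using plain Scala collections rather than Spark. `Record`, `buildKeySet`, and `leftSemiJoin` are hypothetical names, a null key is modeled as `None`, and the broadcast step is merely simulated by sharing one immutable set across all left-side partitions. Dropping left rows with a null key is an assumption based on usual SQL join semantics, since the quoted diff is cut off before the probe logic.

```scala
// Hypothetical row type for illustration; a null join key is modeled as None.
final case class Record(key: Option[Int], value: String)

// Build side: collect the distinct non-null join keys of the right relation.
def buildKeySet(right: Seq[Record]): Set[Int] =
  right.flatMap(_.key).toSet

// Stream side: keep a left row only if its key is non-null and present in the set.
// Each left row is emitted at most once, however many right rows share its key.
def leftSemiJoin(leftPartitions: Seq[Seq[Record]], keySet: Set[Int]): Seq[Record] =
  leftPartitions.flatMap { partition =>
    partition.filter(row => row.key.exists(keySet.contains))
  }

val right = Seq(Record(Some(1), "r1"), Record(Some(1), "r1-dup"), Record(None, "r-null"))
val left = Seq(
  Seq(Record(Some(1), "a"), Record(Some(2), "b")),
  Seq(Record(None, "c"), Record(Some(1), "d")))

val keys = buildKeySet(right)
val joined = leftSemiJoin(left, keys)
println(joined.map(_.value).mkString(", ")) // prints "a, d"
```

Only the key set needs to reach every partition, which is why a small right side makes this cheaper than shuffling both relations.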
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21129717

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
 (null, 10) :: Nil)
 }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
--- End diff --

Why do we need to up the threshold? Can we just write the tests against relations that we have statistics on?
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/3442#discussion_r21129672

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.joins
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions.{Expression, Row}
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
+
+/**
+ * :: DeveloperApi ::
+ * Build the right table's join keys into a HashSet, and iteratively go through the left
+ * table, to find the if join keys are in the Hash set.
+ */
+@DeveloperApi
+case class BroadcastLeftSemiJoinHash(
+leftKeys: Seq[Expression],
+rightKeys: Seq[Expression],
+left: SparkPlan,
+right: SparkPlan) extends BinaryNode with HashJoin {
+
+ override val buildSide = BuildRight
+
+ override def output = left.output
+
+ override def execute() = {
+
+val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
--- End diff --

Remove blank line
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65041697 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23983/ Test PASSed.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65041688

[Test build #23983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23983/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class BroadcastLeftSemiJoinHash(`
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65035043 [Test build #23983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23983/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65034719 retest this please
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user wangxiaojing commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65030850 @liancheng
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65025832 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23972/ Test FAILed.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65025829

[Test build #23972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23972/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `case class BroadcastLeftSemiJoinHash(`
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65022514 [Test build #23972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23972/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65022451 ok to test
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user wangxiaojing commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-65011633 @liancheng
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3442#issuecomment-64307902 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash
GitHub user wangxiaojing opened a pull request:

    https://github.com/apache/spark/pull/3442

    [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)

We are planning to create a `BroadcastLeftSemiJoinHash` to implement the broadcast join for `left semijoin`.

In a left semijoin, if the size of the data from the right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`, the planner marks it as the `broadcast` relation and marks the other relation as the stream side. The broadcast table is sent to all of the executors involved in the join as an `org.apache.spark.broadcast.Broadcast` object, and the join uses `joins.BroadcastLeftSemiJoinHash`. Otherwise it uses `joins.LeftSemiJoinHash`.

The benchmark suggests this makes the optimized version more than 4x faster for `left semijoin`:

    Original:  left semi join : 9288 ms
    Optimized: left semi join : 1963 ms

The micro benchmark loads `data1/kv3.txt` into a normal Hive table.

Benchmark code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Returns the wall-clock time, in milliseconds, taken to evaluate `f`.
    def benchmark(f: => Unit) = {
      val begin = System.currentTimeMillis()
      f
      val end = System.currentTimeMillis()
      end - begin
    }

    val sc = new SparkContext(
      new SparkConf()
        .setMaster("local")
        .setAppName(getClass.getSimpleName.stripSuffix("$")))
    val hiveContext = new HiveContext(sc)
    import hiveContext._

    sql("drop table if exists left_table")
    sql("drop table if exists right_table")
    sql(
      """create table left_table (key int, value string)
      """.stripMargin)
    sql(
      s"""load data local inpath "/data1/kv3.txt" into table left_table""")
    sql(
      """create table right_table (key int, value string)
      """.stripMargin)
    // Copy every row of left_table into right_table.
    sql(
      """
        |from left_table
        |insert overwrite table right_table
        |select left_table.key, left_table.value
      """.stripMargin)

    val leftSemiJoin = sql(
      """select a.key from left_table a
        |left semi join right_table b on a.key = b.key""".stripMargin)
    val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
    println(s"left semi join : $leftSemiJoinDuration ms ")

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangxiaojing/spark SPARK-4570

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3442.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3442

commit 5d58772aa0bd7fd55a9b9495efbff5cc0b36aeae
Author: wangxiaojing
Date: 2014-11-25T04:04:05Z

    add BroadcastLeftSemiJoinHash
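The operator selection described in the pull request can be pictured as a single comparison against the threshold. The sketch below is not the planner's code: `SemiJoinChoice`, its two case objects, and `chooseSemiJoin` are hypothetical names, and treating the boundary as "at most the threshold" is an assumption (the description says "smaller than", while the test quoted in the review compares with `<=`).

```scala
// Hypothetical stand-ins for the two physical operators named in the description.
sealed trait SemiJoinChoice
case object UseBroadcastLeftSemiJoinHash extends SemiJoinChoice
case object UseLeftSemiJoinHash extends SemiJoinChoice

// Assumption: the right side is broadcast when its estimated size is at most the
// threshold. Sizes are non-negative, so a threshold of -1 never selects broadcast.
def chooseSemiJoin(rightSizeInBytes: Long, threshold: Long): SemiJoinChoice =
  if (rightSizeInBytes <= threshold) UseBroadcastLeftSemiJoinHash
  else UseLeftSemiJoinHash

println(chooseSemiJoin(rightSizeInBytes = 500L, threshold = 1000L))
println(chooseSemiJoin(rightSizeInBytes = 500L, threshold = -1L))
```

Under this reading, setting the threshold to -1, as the tests in this thread do, is enough to force the non-broadcast operator.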