[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-30 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-68402804
  
Thanks, merged to master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3442





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-68024274
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24753/
Test PASSed.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-68024272
  
[Test build #24753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24753/consoleFull) for PR 3442 at commit [`a4a43c9`](https://github.com/apache/spark/commit/a4a43c99b49156ef90fa3b9493b008823dbc01d3).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-68021287
  
[Test build #24753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24753/consoleFull) for PR 3442 at commit [`a4a43c9`](https://github.com/apache/spark/commit/a4a43c99b49156ef90fa3b9493b008823dbc01d3).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-67283254
  
[Test build #24533 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24533/consoleFull) for PR 3442 at commit [`3a58191`](https://github.com/apache/spark/commit/3a58191f3ec01724dc43ec73d8ee0815d58dd0e2).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BroadcastLeftSemiJoinHash(`






[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-67283248
  
[Test build #24533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24533/consoleFull) for PR 3442 at commit [`3a58191`](https://github.com/apache/spark/commit/3a58191f3ec01724dc43ec73d8ee0815d58dd0e2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-67283256
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24533/
Test FAILed.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65187645
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24025/
Test PASSed.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65187643
  
[Test build #24025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24025/consoleFull) for PR 3442 at commit [`d410f67`](https://github.com/apache/spark/commit/d410f675091685288e3965a2e9e78ace3f244d04).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BroadcastLeftSemiJoinHash(`






[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65183880
  
[Test build #24025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24025/consoleFull) for PR 3442 at commit [`d410f67`](https://github.com/apache/spark/commit/d410f675091685288e3965a2e9e78ace3f244d04).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21139300
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
   (null, 10) :: Nil)
   }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
+Seq(
+  ("SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a", classOf[BroadcastLeftSemiJoinHash])
+).foreach {
+  case (query, joinClass) => assertJoin(query, joinClass)
+}
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-1""")
+
+Seq(
+  ("SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a", classOf[LeftSemiJoinHash])
+).foreach {
+  case (query, joinClass) => assertJoin(query, joinClass)
+}
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-$tmp""")
--- End diff --

`-$tmp`: is this a typo? We can also just use `setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, tmp.toString)` here.
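
The restore step suggested here can be sketched as a save/set/restore helper. This is a minimal, self-contained sketch and not Spark's API: `MiniConf`, `ConfTestUtil`, and `withTempConf` are hypothetical names, and only the `setConf(key, value)` call shape is taken from the comment above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a SQL configuration object; this is not Spark's SQLConf.
final class MiniConf {
  private val settings = mutable.Map.empty[String, String]
  def setConf(key: String, value: String): Unit = settings.update(key, value)
  def getConf(key: String, default: String): String = settings.getOrElse(key, default)
}

object ConfTestUtil {
  // Runs `body` with `key` temporarily set to `value`, then writes back the
  // value that was visible beforehand, even if `body` throws.
  // Note: if `key` was unset, it ends up explicitly set to `default`.
  def withTempConf[T](conf: MiniConf, key: String, value: String, default: String)(body: => T): T = {
    val saved = conf.getConf(key, default)
    conf.setConf(key, value)
    try body
    finally conf.setConf(key, saved)
  }
}
```

Wrapping the restore in `finally` means a failing assertion inside the test body would not leak the temporary threshold into later tests, which a trailing `SET` statement cannot guarantee.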





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread wangxiaojing
Github user wangxiaojing commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21136073
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
   (null, 10) :: Nil)
   }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
--- End diff --

Because the size of testData2 is larger than `SQLConf.AUTO_BROADCASTJOIN_THRESHOLD` 1.
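
For context, the operator choice this test exercises can be sketched as a size check against the threshold. This is an assumption-laden sketch rather than the planner code from this PR: the rule (broadcast only when the threshold is positive and the right side's estimated size does not exceed it) is assumed, `JoinSelectionSketch` and `select` are hypothetical names, and the two case objects are stand-ins that only borrow the operator names from the diff.

```scala
object JoinSelectionSketch {
  // Stand-ins for the two physical operators named in the test; not Spark classes.
  sealed trait SemiJoinOperator
  case object BroadcastLeftSemiJoinHash extends SemiJoinOperator
  case object LeftSemiJoinHash extends SemiJoinOperator

  // Assumed rule: a non-positive threshold disables broadcasting; otherwise
  // broadcast only when the right side's size estimate fits under the threshold.
  def select(rightSizeInBytes: BigInt, threshold: Long): SemiJoinOperator =
    if (threshold > 0 && rightSizeInBytes <= BigInt(threshold)) BroadcastLeftSemiJoinHash
    else LeftSemiJoinHash
}
```

Under this assumed rule, a threshold of `-1` always yields the non-broadcast operator, which matches the second half of the test in the diff.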





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65160209
  
Thanks for working on this!





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21129847
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -193,4 +193,70 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll {
 )
   }
 
+  test("auto converts to broadcast left semi join, by size estimate of a relation") {
+def mkTest(
+before: () => Unit,
+after: () => Unit,
+query: String,
+expectedAnswer: Seq[Any],
+ct: ClassTag[_]) = {
--- End diff --

Only indent arguments 4 spaces. Also, this function seems like overkill given that it is only called once and the first two arguments are always no-ops.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21129765
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -193,4 +193,70 @@ class StatisticsSuite extends QueryTest with BeforeAndAfterAll {
 )
   }
 
+  test("auto converts to broadcast left semi join, by size estimate of a relation") {
+def mkTest(
+before: () => Unit,
+after: () => Unit,
+query: String,
+expectedAnswer: Seq[Any],
+ct: ClassTag[_]) = {
+  before()
+
+  var rdd = sql(query)
+
+  // Assert src has a size smaller than the threshold.
+  val sizes = rdd.queryExecution.analyzed.collect {
+case r if ct.runtimeClass.isAssignableFrom(r.getClass) => r.statistics.sizeInBytes
+  }
+  assert(sizes.size === 2 && sizes(1) <= autoBroadcastJoinThreshold
+&& sizes(0) <= autoBroadcastJoinThreshold,
+s"query should contain two relations, each of which has size smaller than autoConvertSize")
+
+  // Using `sparkPlan` because for relevant patterns in HashJoin to be
+  // matched, other strategies need to be applied.
+  var bhj = rdd.queryExecution.sparkPlan.collect {
+case j: BroadcastLeftSemiJoinHash => j
+  }
+  assert(bhj.size === 1,
+s"actual query plans do not contain broadcast join: ${rdd.queryExecution}")
+
+  checkAnswer(rdd, expectedAnswer) // check correctness of output
+
+  TestHive.settings.synchronized {
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=-1""")
+rdd = sql(query)
+bhj = rdd.queryExecution.sparkPlan.collect {
+  case j: BroadcastLeftSemiJoinHash => j
+}
+assert(bhj.isEmpty, "BroadcastHashJoin still planned even though it is switched off")
+
+val shj = rdd.queryExecution.sparkPlan.collect {
+  case j: LeftSemiJoinHash => j
+}
+assert(shj.size === 1,
+  "LeftSemiJoinHash should be planned when BroadcastHashJoin is turned off")
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=$tmp""")
+  }
+
+  after()
+}
+
+/** Tests for MetastoreRelation */
+   val leftSemiJoinQuery =
+  """SELECT * FROM src a
+|left semi JOIN src b ON a.key=86 and a.key = b.key""".stripMargin
+val Answer =(86, "val_86") ::Nil 
--- End diff --

Indentation is off, and variable names should start with a lowercase letter. Also, add a space after "=".





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21129679
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.joins
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions.{Expression, Row}
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
+
+/**
+ * :: DeveloperApi ::
+ * Build the right table's join keys into a HashSet, and iteratively go through the left
+ * table, to find the if join keys are in the Hash set.
+ */
+@DeveloperApi
+case class BroadcastLeftSemiJoinHash(
+leftKeys: Seq[Expression],
+rightKeys: Seq[Expression],
+left: SparkPlan,
+right: SparkPlan) extends BinaryNode with HashJoin {
+
+  override val buildSide = BuildRight
+
+  override def output = left.output
+
+  override def execute() = {
+
+val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
+val hashSet = new java.util.HashSet[Row]()
+var currentRow: Row = null
+
+// Create a Hash set of buildKeys
+while (buildIter.hasNext) {
+  currentRow = buildIter.next()
+  val rowKey = buildSideKeyGenerator(currentRow)
+  if (!rowKey.anyNull) {
+val keyExists = hashSet.contains(rowKey)
+if (!keyExists) {
+  hashSet.add(rowKey)
+}
+  }
+}
+
+val broadcastedRelation = sparkContext.broadcast(hashSet)
+
+streamedPlan.execute().mapPartitions { streamIter =>
+
+  val joinKeys = streamSideKeyGenerator()
--- End diff --

Remove blank line.
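
The Scaladoc in this diff describes the operator as a build phase followed by a probe phase. A minimal sketch of that idea on plain Scala collections, with no Spark types, might look as follows. `LeftSemiJoinSketch` and `leftSemiJoin` are hypothetical names, `Option` stands in for nullable join keys, and skipping null keys on the probe side is an assumption, since the diff is cut off before the probe loop.

```scala
object LeftSemiJoinSketch {
  // Returns the rows of `left` whose key appears among the keys of `right`.
  // Each left row is emitted at most once and only left-side values are kept.
  def leftSemiJoin[L, R, K](left: Seq[L], right: Seq[R])(
      leftKey: L => Option[K],
      rightKey: R => Option[K]): Seq[L] = {
    // Build phase: collect the distinct non-null keys of the right side.
    val buildKeys: Set[K] = right.flatMap(r => rightKey(r)).toSet
    // Probe phase: walk the left side and keep rows whose key is in the set.
    // A missing (None) key never matches, mirroring SQL equality on NULL.
    left.filter(l => leftKey(l).exists(buildKeys.contains))
  }
}
```

Because only key membership matters for a semi join, a set of keys is enough on the build side; in the operator under review that set is what gets broadcast to the tasks scanning the left table.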





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21129717
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala ---
@@ -377,4 +378,39 @@ class JoinSuite extends QueryTest with BeforeAndAfterEach {
 """.stripMargin),
   (null, 10) :: Nil)
   }
+ test("broadcasted left semi join operator selection") {
+clearCache()
+sql("CACHE TABLE testData")
+val tmp = autoBroadcastJoinThreshold
+
+sql( s"""SET ${SQLConf.AUTO_BROADCASTJOIN_THRESHOLD}=10""")
--- End diff --

Why do we need to up the threshold? Can we just write the tests against relations that we have statistics on?





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/3442#discussion_r21129672
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastLeftSemiJoinHash.scala ---
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.joins
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.catalyst.expressions.{Expression, Row}
+import org.apache.spark.sql.catalyst.plans.physical.ClusteredDistribution
+import org.apache.spark.sql.execution.{BinaryNode, SparkPlan}
+
+/**
+ * :: DeveloperApi ::
+ * Build the right table's join keys into a HashSet, and iteratively go through the left
+ * table, to find the if join keys are in the Hash set.
+ */
+@DeveloperApi
+case class BroadcastLeftSemiJoinHash(
+leftKeys: Seq[Expression],
+rightKeys: Seq[Expression],
+left: SparkPlan,
+right: SparkPlan) extends BinaryNode with HashJoin {
+
+  override val buildSide = BuildRight
+
+  override def output = left.output
+
+  override def execute() = {
+
+val buildIter= buildPlan.execute().map(_.copy()).collect().toIterator
--- End diff --

Remove blank line





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65041697
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23983/
Test PASSed.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65041688
  
[Test build #23983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23983/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BroadcastLeftSemiJoinHash(`






[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65035043
  
[Test build #23983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23983/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65034719
  
retest this please





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-12-01 Thread wangxiaojing
Github user wangxiaojing commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65030850
  
@liancheng 





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65025832
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23972/
Test FAILed.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65025829
  
[Test build #23972 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23972/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class BroadcastLeftSemiJoinHash(`






[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65022514
  
[Test build #23972 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23972/consoleFull) for PR 3442 at commit [`3a63ecb`](https://github.com/apache/spark/commit/3a63ecb81aa02a02dc53d014ed3358927a95a376).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-30 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65022451
  
ok to test





[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-30 Thread wangxiaojing
Github user wangxiaojing commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-65011633
  
@liancheng


[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3442#issuecomment-64307902
  
Can one of the admins verify this patch?


[GitHub] spark pull request: [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

2014-11-24 Thread wangxiaojing
GitHub user wangxiaojing opened a pull request:

https://github.com/apache/spark/pull/3442

[SPARK-4570][SQL]add BroadcastLeftSemiJoinHash

JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
This patch adds `BroadcastLeftSemiJoinHash`, which implements a broadcast 
join for `left semi join`.

For a left semi join, if the data on the right side is smaller than the 
user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`, the planner marks it 
as the `broadcast` relation and marks the other relation as the stream side. 
The broadcast table is sent to all of the executors involved in the join as an 
`org.apache.spark.broadcast.Broadcast` object, and the planner uses 
`joins.BroadcastLeftSemiJoinHash`. Otherwise it uses `joins.LeftSemiJoinHash`.
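
The selection rule and the join itself can be sketched without any Spark dependency. This is a minimal illustration, not the code in this patch: `SemiJoinSketch`, `chooseOperator`, and `leftSemiJoin` are hypothetical names, plain Scala collections stand in for relations, and treating a size equal to the threshold as broadcastable is an assumption.

```scala
// Illustrative sketch only; none of these definitions are Spark's actual classes.
object SemiJoinSketch {
  sealed trait SemiJoinOperator
  case object BroadcastLeftSemiJoinHash extends SemiJoinOperator
  case object LeftSemiJoinHash extends SemiJoinOperator

  // Broadcast the right side only when its estimated size fits the threshold.
  // Assumption: a size exactly equal to the threshold still qualifies.
  def chooseOperator(rightSizeInBytes: Long, threshold: Long): SemiJoinOperator =
    if (rightSizeInBytes <= threshold) BroadcastLeftSemiJoinHash else LeftSemiJoinHash

  // Left semi join on plain collections: build a hash set of right-side keys
  // (the part that would be broadcast), then stream the left side and keep
  // each left row whose key is in the set. A left row is emitted once no
  // matter how many right rows match.
  def leftSemiJoin[K, V](left: Seq[(K, V)], rightKeys: Seq[K]): Seq[(K, V)] = {
    val broadcastKeys: Set[K] = rightKeys.toSet
    left.filter { case (key, _) => broadcastKeys.contains(key) }
  }
}
```

For example, `SemiJoinSketch.leftSemiJoin(Seq(1 -> "a", 2 -> "b"), Seq(2, 2))` keeps only the row with key `2`, and keeps it once. Only the key set needs to reach each executor, which is why a small right side makes broadcasting attractive.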

The benchmark suggests the optimized version is about 4.7x faster for 
`left semi join`:

Original:
left semi join : 9288 ms 
Optimized:
left semi join : 1963 ms 
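
As a quick check on these numbers, the speedup is the ratio of the two timings. The snippet below only restates that arithmetic; `SpeedupCheck` and its member names are made up for illustration.

```scala
object SpeedupCheck {
  val originalMs = 9288.0  // "Original" timing reported above
  val optimizedMs = 1963.0 // "Optimized" timing reported above

  // Ratio of the original to the optimized wall-clock time.
  val speedup: Double = originalMs / optimizedMs

  def main(args: Array[String]): Unit =
    println(f"speedup: $speedup%.2fx")
}
```

The ratio is a little over 4.7, from a single run on one machine, so it is best read as indicative rather than as a precise measurement.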

The micro benchmark loads `data1/kv3.txt` into a normal Hive table.
Benchmark code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Runs `f` once and returns the elapsed wall-clock time in milliseconds.
    def benchmark(f: => Unit): Long = {
      val begin = System.currentTimeMillis()
      f
      val end = System.currentTimeMillis()
      end - begin
    }

    val sc = new SparkContext(
      new SparkConf()
        .setMaster("local")
        .setAppName(getClass.getSimpleName.stripSuffix("$")))
    val hiveContext = new HiveContext(sc)
    import hiveContext._

    // Drop any tables left over from a previous run.
    sql("drop table if exists left_table")
    sql("drop table if exists right_table")
    sql("create table left_table (key int, value string)")
    sql("""load data local inpath "/data1/kv3.txt" into table left_table""")
    sql("create table right_table (key int, value string)")
    // Copy every row of left_table into right_table.
    sql(
      """
        |from left_table
        |insert overwrite table right_table
        |select left_table.key, left_table.value
      """.stripMargin)

    val leftSemiJoin = sql(
      """select a.key from left_table a
        |left semi join right_table b on a.key = b.key""".stripMargin)
    // count() forces the query to execute.
    val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
    println(s"left semi join : $leftSemiJoinDuration ms ")


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangxiaojing/spark SPARK-4570

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3442.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3442


commit 5d58772aa0bd7fd55a9b9495efbff5cc0b36aeae
Author: wangxiaojing 
Date:   2014-11-25T04:04:05Z

add BroadcastLeftSemiJoinHash



