[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/20384

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163755332

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---

```diff
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
   }

   test("broadcast hint is retained after using the cached data") {
-    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
-      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
-      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
-      df2.cache()
-      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
-      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
-        case b: BroadcastHashJoinExec => b
-      }.size
-      assert(numBroadCastHashJoin === 1)
+    try {
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+        df2.cache()
+        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+          case b: BroadcastHashJoinExec => b
+        }.size
+        assert(numBroadCastHashJoin === 1)
+      }
+    } finally {
+      spark.catalog.clearCache()
```
--- End diff ---

Yeah. That should be a separate bug.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163750722

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

do you mean we can remove the cache cleaning here after you fix that bug?
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163602179

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

This test suite is not for testing cache. Thus, it is fine to do it for these two test cases.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163586713

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

If we have to clear the cache, can we add `clearCache` into `afterEach` in general instead of adding this case-by-case?
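For reference, the `afterEach` approach suggested in the comment above could look roughly like the following — a sketch only, assuming `BroadcastJoinSuite` mixes in ScalaTest's `BeforeAndAfterEach` and has a `spark` session in scope; the real suite may wire its teardown differently:

```scala
import org.scalatest.BeforeAndAfterEach

// Hypothetical override inside BroadcastJoinSuite: clear the SQL cache
// after every test so that no individual test needs its own try/finally.
protected override def afterEach(): Unit = {
  try {
    spark.catalog.clearCache() // drop all cached relations left by the test
  } finally {
    super.afterEach()          // keep the normal teardown chain intact
  }
}
```

The `try`/`finally` ensures the parent teardown still runs even if `clearCache()` throws.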
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163581427

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

We have to clear the cache. We have another bug to fix in the cache. Will submit another PR to fix the cache plan matching.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20384

[SPARK-23195] [SQL] Keep the Hint of Cached Data

## What changes were proposed in this pull request?

The broadcast hint on a plan is lost once that plan is cached. This PR corrects it, so the hint survives caching:

```Scala
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
broadcast(df2).cache()
df2.collect()
val df3 = df1.join(df2, Seq("key"), "inner")
```

## How was this patch tested?

Added a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark test33

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20384.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20384

commit 4ef18b76b13c35dbec0d9e8ecfa753c9b4a80d2c
Author: gatorsmile
Date: 2018-01-24T15:27:52Z

    fix

commit 3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8
Author: gatorsmile
Date: 2018-01-24T15:28:06Z

    fix
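The bug in the PR description can be reproduced end to end with a sketch like the one below — hedged: it assumes a local `SparkSession`, disables the size-based broadcast threshold so only the hint can trigger a broadcast, and checks the physical plan the way the test in this thread does:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

val spark = SparkSession.builder()
  .master("local[2]")
  // Disable size-based auto-broadcast so a broadcast join can only
  // come from the explicit broadcast() hint.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, "4"), (2, "2")).toDF("key", "value")
val df2 = Seq((1, "1"), (2, "2")).toDF("key", "value")
df2.cache()
df2.collect() // materialize the cached plan

// Before this fix, substituting the cached plan dropped the broadcast
// hint, so no BroadcastHashJoinExec appeared in the executed plan.
val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
val numBroadcastJoins = df3.queryExecution.executedPlan.collect {
  case b: BroadcastHashJoinExec => b
}.size
assert(numBroadcastJoins == 1)
```

With the fix applied the assertion passes; on an affected build the join falls back to a non-broadcast strategy and it fails.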