[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163610413
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
+val df2 = spark.createDataFrame(Seq((1, "1"), (2, 
"2"))).toDF("key", "value")
+broadcast(df2).cache()
+if (materialized) df2.collect()
+val df3 = df1.join(df2, Seq("key"), "inner")
--- End diff --

All the other cases are creating Dataframes like this. Anyway, I changed 
all of them in the new PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163442656
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
+val df2 = spark.createDataFrame(Seq((1, "1"), (2, 
"2"))).toDF("key", "value")
+broadcast(df2).cache()
+if (materialized) df2.collect()
+val df3 = df1.join(df2, Seq("key"), "inner")
--- End diff --

I tend to agree that tests are also examples for Spark users, we should 
pick the recommended usages.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20368


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163404464
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -63,7 +63,7 @@ case class InMemoryRelation(
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
 val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
-statsOfPlanToCache: Statistics = null)
+statsOfPlanToCache: Statistics)
--- End diff --

Setting `null` by default is risky, because we might hit 
`NullPointerException `. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163402210
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -63,7 +63,7 @@ case class InMemoryRelation(
 tableName: Option[String])(
 @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
 val batchStats: LongAccumulator = 
child.sqlContext.sparkContext.longAccumulator,
-statsOfPlanToCache: Statistics = null)
+statsOfPlanToCache: Statistics)
--- End diff --

leave no default value is fine, we do not any default value actually


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163377179
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
+val df2 = spark.createDataFrame(Seq((1, "1"), (2, 
"2"))).toDF("key", "value")
+broadcast(df2).cache()
+if (materialized) df2.collect()
+val df3 = df1.join(df2, Seq("key"), "inner")
--- End diff --

That should not matter.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163377142
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
--- End diff --

That should not matter. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163377070
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -77,7 +77,7 @@ case class InMemoryRelation(
   // Underlying columnar RDD hasn't been materialized, use the stats 
from the plan to cache
   statsOfPlanToCache
 } else {
-  Statistics(sizeInBytes = batchStats.value.longValue)
+  Statistics(sizeInBytes = batchStats.value.longValue, hints = 
statsOfPlanToCache.hints)
--- End diff --

That is misleading. Conceptually, that is wrong.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163376001
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
+val df2 = spark.createDataFrame(Seq((1, "1"), (2, 
"2"))).toDF("key", "value")
+broadcast(df2).cache()
+if (materialized) df2.collect()
+val df3 = df1.join(df2, Seq("key"), "inner")
--- End diff --

`val df3 = df1.join(df2, "key")`? `inner` is implied, isn't it? (I'm 
proposing the change as this and other tests could be easily used as a learning 
tool to master Spark SQL's API)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163375534
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
--- End diff --

Is `spark.createDataFrame(...)` wrapper really required? I thought `Seq((1, 
"4"), (2, "2")).toDF("key", "value")` would just work fine.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163375216
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -77,7 +77,7 @@ case class InMemoryRelation(
   // Underlying columnar RDD hasn't been materialized, use the stats 
from the plan to cache
   statsOfPlanToCache
 } else {
-  Statistics(sizeInBytes = batchStats.value.longValue)
+  Statistics(sizeInBytes = batchStats.value.longValue, hints = 
statsOfPlanToCache.hints)
--- End diff --

Why don't you simply `statsOfPlanToCache.copy(sizeInBytes = 
batchStats.value.longValue)`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20368#discussion_r163373026
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala
 ---
@@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("broadcast hint is retained in a cached plan") {
+Seq(true, false).foreach { materialized =>
+  withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+val df1 = spark.createDataFrame(Seq((1, "4"), (2, 
"2"))).toDF("key", "value")
+val df2 = spark.createDataFrame(Seq((1, "1"), (2, 
"2"))).toDF("key", "value")
+broadcast(df2).cache()
+if (materialized) df2.collect()
--- End diff --

This PR https://github.com/apache/spark/pull/19864 accidentally fixes the 
issue when the plan is not materialized. However, it does not resolve the issue 
when the cached plan is materialized. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20368

[SPARK-23195] [SQL] Keep the Hint of Cached Data

## What changes were proposed in this pull request?
The broadcast hint of the cached plan is lost if we cache the plan. This PR 
is to correct it.

```Scala
  val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", 
"value")
  val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", 
"value")
  broadcast(df2).cache()
  df2.collect()
  val df3 = df1.join(df2, Seq("key"), "inner")
```

## How was this patch tested?
Added a test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark cachedBroadcastHint

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20368.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20368


commit 21e5321d072c312e243407af08eeb9c1a796ab4d
Author: gatorsmile 
Date:   2018-01-23T20:40:04Z

fix




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org