[GitHub] spark pull request #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163610413

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        broadcast(df2).cache()
    +        if (materialized) df2.collect()
    +        val df3 = df1.join(df2, Seq("key"), "inner")
    --- End diff --

All the other cases create DataFrames like this. Anyway, I changed all of them in the new PR.

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163442656

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        broadcast(df2).cache()
    +        if (materialized) df2.collect()
    +        val df3 = df1.join(df2, Seq("key"), "inner")
    --- End diff --

I tend to agree; tests are also examples for Spark users, so we should pick the recommended usage.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20368
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163404464

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -63,7 +63,7 @@ case class InMemoryRelation(
         tableName: Option[String])(
         @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
         val batchStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator,
    -    statsOfPlanToCache: Statistics = null)
    +    statsOfPlanToCache: Statistics)
    --- End diff --

Setting `null` by default is risky, because we might hit a `NullPointerException`.
Github user CodingCat commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163402210

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -63,7 +63,7 @@ case class InMemoryRelation(
         tableName: Option[String])(
         @transient var _cachedColumnBuffers: RDD[CachedBatch] = null,
         val batchStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator,
    -    statsOfPlanToCache: Statistics = null)
    +    statsOfPlanToCache: Statistics)
    --- End diff --

Leaving out the default value is fine; we do not need any default value here, actually.
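The risk the reviewers are discussing can be shown with a minimal, self-contained sketch (these are hypothetical stand-in classes, not Spark's actual `InMemoryRelation`/`Statistics`): a constructor parameter that defaults to `null` compiles fine, but the mistake of omitting the argument only surfaces at runtime; dropping the default turns it into a compile-time error.

```scala
// Hypothetical stand-ins for illustration only.
final case class Stats(sizeInBytes: Long)

// Risky: callers may silently rely on the null default.
final case class RelationWithDefault(name: String, stats: Stats = null) {
  // Dereferencing stats throws NullPointerException when the default was used.
  def describe: String = s"$name: ${stats.sizeInBytes} bytes"
}

// Safer: no default, so forgetting the argument fails to compile.
final case class RelationRequired(name: String, stats: Stats) {
  def describe: String = s"$name: ${stats.sizeInBytes} bytes"
}

object NullDefaultDemo {
  // Calling describe without real stats blows up only at runtime.
  def riskyDescribe(): String =
    try RelationWithDefault("t").describe
    catch { case _: NullPointerException => "NPE" }
}
```

With `RelationRequired`, a call site such as `RelationRequired("t")` simply does not compile, which is the behavior the patch opts for by removing the `= null` default.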
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163377179

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        broadcast(df2).cache()
    +        if (materialized) df2.collect()
    +        val df3 = df1.join(df2, Seq("key"), "inner")
    --- End diff --

That should not matter.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163377142

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    --- End diff --

That should not matter.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163377070

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -77,7 +77,7 @@ case class InMemoryRelation(
           // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
           statsOfPlanToCache
         } else {
    -      Statistics(sizeInBytes = batchStats.value.longValue)
    +      Statistics(sizeInBytes = batchStats.value.longValue, hints = statsOfPlanToCache.hints)
    --- End diff --

That is misleading. Conceptually, that is wrong.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163376001

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        broadcast(df2).cache()
    +        if (materialized) df2.collect()
    +        val df3 = df1.join(df2, Seq("key"), "inner")
    --- End diff --

`val df3 = df1.join(df2, "key")`? `inner` is implied, isn't it? (I'm proposing the change as this and other tests could easily be used as a learning tool to master Spark SQL's API.)
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163375534

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    --- End diff --

Is the `spark.createDataFrame(...)` wrapper really required? I thought `Seq((1, "4"), (2, "2")).toDF("key", "value")` would work just fine.
Github user jaceklaskowski commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163375216

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
    @@ -77,7 +77,7 @@ case class InMemoryRelation(
           // Underlying columnar RDD hasn't been materialized, use the stats from the plan to cache
           statsOfPlanToCache
         } else {
    -      Statistics(sizeInBytes = batchStats.value.longValue)
    +      Statistics(sizeInBytes = batchStats.value.longValue, hints = statsOfPlanToCache.hints)
    --- End diff --

Why don't you simply `statsOfPlanToCache.copy(sizeInBytes = batchStats.value.longValue)`?
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20368#discussion_r163373026

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
    @@ -126,6 +126,22 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
         }
       }

    +  test("broadcast hint is retained in a cached plan") {
    +    Seq(true, false).foreach { materialized =>
    +      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
    +        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
    +        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
    +        broadcast(df2).cache()
    +        if (materialized) df2.collect()
    --- End diff --

PR https://github.com/apache/spark/pull/19864 accidentally fixes the issue when the plan is not materialized. However, it does not resolve the issue when the cached plan is materialized.
GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/20368

    [SPARK-23195] [SQL] Keep the Hint of Cached Data

## What changes were proposed in this pull request?

The broadcast hint of a plan is lost after we cache the plan. This PR corrects it, so that the hint on `df2` below still takes effect in the join:

```Scala
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
broadcast(df2).cache()
df2.collect()
val df3 = df1.join(df2, Seq("key"), "inner")
```

## How was this patch tested?

Added a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark cachedBroadcastHint

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20368.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20368

commit 21e5321d072c312e243407af08eeb9c1a796ab4d
Author: gatorsmile
Date: 2018-01-23T20:40:04Z

    fix