[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile closed the pull request at: https://github.com/apache/spark/pull/20384

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163755332

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---

```diff
@@ -110,15 +110,39 @@ class BroadcastJoinSuite extends QueryTest with SQLTestUtils {
   }

   test("broadcast hint is retained after using the cached data") {
-    withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
-      val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
-      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
-      df2.cache()
-      val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
-      val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
-        case b: BroadcastHashJoinExec => b
-      }.size
-      assert(numBroadCastHashJoin === 1)
+    try {
+      withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
+        val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
+        val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
+        df2.cache()
+        val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
+        val numBroadCastHashJoin = df3.queryExecution.executedPlan.collect {
+          case b: BroadcastHashJoinExec => b
+        }.size
+        assert(numBroadCastHashJoin === 1)
+      }
+    } finally {
+      spark.catalog.clearCache()
```
--- End diff ---

Yeah. That should be a separate bug.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163750722

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

do you mean we can remove the cache cleaning here after you fix that bug?
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163602179

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

This test suite is not for testing cache. Thus, it is fine to do it for these two test cases.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163586713

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

If we have to clear the cache, can we add `clearCache` into `afterEach` in general instead of adding this case-by-case?
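For reference, the `afterEach` approach suggested in the comment above could look roughly like the following — a sketch only, assuming `BroadcastJoinSuite` mixes in ScalaTest's `BeforeAndAfterEach` and has a `spark` session in scope; the real suite may wire its teardown differently:

```scala
import org.scalatest.BeforeAndAfterEach

// Hypothetical override inside BroadcastJoinSuite: clear the SQL cache
// after every test so that no individual test needs its own try/finally.
protected override def afterEach(): Unit = {
  try {
    spark.catalog.clearCache() // drop all cached relations left by the test
  } finally {
    super.afterEach()          // keep the normal teardown chain intact
  }
}
```

The `try`/`finally` ensures the parent teardown still runs even if `clearCache()` throws.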
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20384#discussion_r163581427

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/joins/BroadcastJoinSuite.scala ---
(quotes the same `BroadcastJoinSuite.scala` hunk shown above, at the added `spark.catalog.clearCache()` line)
--- End diff ---

We have to clear the cache. We have another bug to fix in the cache. Will submit another PR to fix the cache plan matching.
[GitHub] spark pull request #20384: [SPARK-23195] [SQL] Keep the Hint of Cached Data
GitHub user gatorsmile opened a pull request: https://github.com/apache/spark/pull/20384

[SPARK-23195] [SQL] Keep the Hint of Cached Data

## What changes were proposed in this pull request?

The broadcast hint on a plan is lost once that plan is cached. This PR corrects it, so the hint survives caching:

```Scala
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value")
broadcast(df2).cache()
df2.collect()
val df3 = df1.join(df2, Seq("key"), "inner")
```

## How was this patch tested?

Added a test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark test33

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20384.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20384

commit 4ef18b76b13c35dbec0d9e8ecfa753c9b4a80d2c
Author: gatorsmile
Date: 2018-01-24T15:27:52Z

    fix

commit 3bbec210bcfd759d36a6f7e809ed0f1e3d0a03d8
Author: gatorsmile
Date: 2018-01-24T15:28:06Z

    fix
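The bug in the PR description can be reproduced end to end with a sketch like the one below — hedged: it assumes a local `SparkSession`, disables the size-based broadcast threshold so only the hint can trigger a broadcast, and checks the physical plan the way the test in this thread does:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.execution.joins.BroadcastHashJoinExec

val spark = SparkSession.builder()
  .master("local[2]")
  // Disable size-based auto-broadcast so a broadcast join can only
  // come from the explicit broadcast() hint.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()
import spark.implicits._

val df1 = Seq((1, "4"), (2, "2")).toDF("key", "value")
val df2 = Seq((1, "1"), (2, "2")).toDF("key", "value")
df2.cache()
df2.collect() // materialize the cached plan

// Before this fix, substituting the cached plan dropped the broadcast
// hint, so no BroadcastHashJoinExec appeared in the executed plan.
val df3 = df1.join(broadcast(df2), Seq("key"), "inner")
val numBroadcastJoins = df3.queryExecution.executedPlan.collect {
  case b: BroadcastHashJoinExec => b
}.size
assert(numBroadcastJoins == 1)
```

With the fix applied the assertion passes; on an affected build the join falls back to a non-broadcast strategy and it fails.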