[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/13419

I would prefer refreshing the dataset every time it is reloaded, while keeping existing cached datasets unchanged:

~~~scala
val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000

sqlContext.range(10).write.mode("overwrite").parquet(dir)

val df2 = sqlContext.read.parquet(dir)
df2.count() // outputs 10
df1.count() // still outputs 1000 because it was cached
~~~

Neither approach is perfectly safe, so I don't have a strong preference either way.
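Until one of these semantics is implemented, the closest user-level workaround is to drop the stale cache entry explicitly before re-reading. The sketch below is illustrative only and not part of this PR; `dir` is a hypothetical temp path, as in the snippet above.

~~~scala
// Minimal sketch of a user-level workaround (not part of this PR):
// explicitly unpersist the stale cache entry before re-reading the path.
val dir = "/tmp/test" // hypothetical path for illustration
sqlContext.range(1000).write.mode("overwrite").parquet(dir)

val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // 1000, now cached

sqlContext.range(10).write.mode("overwrite").parquet(dir)

// Without this, a fresh read of the same path may still be answered
// from the stale in-memory cache.
df1.unpersist()
sqlContext.read.parquet(dir).count() // 10
~~~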
Github user sameeragarwal commented on the pull request: https://github.com/apache/spark/pull/13419

@dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)
Github user dongjoon-hyun commented on the pull request: https://github.com/apache/spark/pull/13419

Hi, @sameeragarwal. Is there any reason to use `SQLContext` instead of `SparkSession` in this PR?
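For context, a hedged sketch of the two entry points being discussed; `path` is a placeholder and `spark`/`sqlContext` are the session objects available in Spark shells and test suites:

```scala
// SparkSession is the unified entry point introduced in Spark 2.0;
// SQLContext still works but is the older API.
val path = "/tmp/test" // hypothetical path for illustration
val viaSession    = spark.read.parquet(path)      // preferred in Spark 2.0+
val viaSqlContext = sqlContext.read.parquet(path) // legacy entry point
```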
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13419

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59668/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13419

Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13419

**[Test build #59668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).

* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13419#discussion_r65251560

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       TableIdentifier("tmp"), ignoreIfNotExists = true)
   }

+  test("drop cache on overwrite") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("overwrite").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("overwrite").parquet(path)
--- End diff --

sqlContext -> spark
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/13419#discussion_r65251574

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       TableIdentifier("tmp"), ignoreIfNotExists = true)
   }

+  test("drop cache on overwrite") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("overwrite").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("overwrite").parquet(path)
+      assert(sqlContext.read.parquet(path).count() == 10)
+    }
+  }
+
+  test("drop cache on append") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("append").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("append").parquet(path)
--- End diff --

sqlContext -> spark
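For reference, a sketch of how the overwrite test reads with the reviewer's suggestion (`sqlContext` -> `spark`) applied throughout; the final committed version may differ:

```scala
// Sketch of the test from the diff above with `spark` used consistently,
// as suggested in the review; not necessarily the final committed code.
test("drop cache on overwrite") {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").parquet(path)
    val df = spark.read.parquet(path).cache()
    assert(df.count() == 1000)
    spark.range(10).write.mode("overwrite").parquet(path)
    // With the fix, the stale cache is dropped and the fresh read sees 10 rows.
    assert(spark.read.parquet(path).count() == 10)
  }
}
```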
Github user sameeragarwal commented on the pull request: https://github.com/apache/spark/pull/13419

@yhuai @mengxr what are your thoughts on this approach?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/13419

**[Test build #59668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).
GitHub user sameeragarwal opened a pull request: https://github.com/apache/spark/pull/13419

[SPARK-15678][SQL] Drop cache on appends and overwrites

## What changes were proposed in this pull request?

Spark SQL currently doesn't drop cached data if the underlying data is overwritten. This PR fixes that behavior.

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000

sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we are still using the cached dataset
```

## How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark drop-cache-on-write

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13419.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13419

commit ee631d2d98f72d99da00d8922fc4cf6a66cf063c
Author: Sameer Agarwal
Date: 2016-05-31T18:27:41Z

    Drop cache on appends and overwrites
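A brief note on the likely mechanism behind the stale result (my reading, not stated in the PR itself): cached data is matched by logical plan, so a fresh read of the same path is rewritten to reuse the in-memory relation built before the overwrite. The hedged sketch below, using the same hypothetical path as the reproduction, makes this visible with `explain()`:

```scala
// Hedged illustration of why the second read returns stale data: the cache is
// matched against the logical plan, so re-reading the same path is answered
// from the in-memory relation built before the overwrite.
val dir = "/tmp/test" // hypothetical temp path, as in the description above
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).cache().count() // 1000, now cached

sqlContext.range(10).write.mode("overwrite").parquet(dir)

// The physical plan typically shows an in-memory table scan rather than a
// fresh Parquet scan, which is why the count below is still 1000.
sqlContext.read.parquet(dir).explain()
sqlContext.read.parquet(dir).count() // 1000 before this PR's fix
```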