[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/13419
  
I would prefer refreshing a dataset every time it is reloaded, while keeping 
existing cached ones unchanged:

~~~scala
val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
val df2 = sqlContext.read.parquet(dir) // reload picks up the new data
df2.count() // outputs 10
df1.count() // still outputs 1000 because it was cached
~~~

Neither approach is perfectly safe, so I don't have a strong preference for 
either.
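
For contrast, here is a sketch of what this PR's approach (dropping the cache 
on write) would mean for the same sequence, assuming the overwrite invalidates 
`df1`'s cache:

~~~scala
val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir) // drops df1's cache
df1.count() // now outputs 10: a previously cached result changes under the user
~~~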



[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread sameeragarwal
Github user sameeragarwal commented on the pull request:

https://github.com/apache/spark/pull/13419
  
@dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the pull request:

https://github.com/apache/spark/pull/13419
  
Hi, @sameeragarwal.
Is there any reason to use `SQLContext` instead of `SparkSession` in this PR?
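
For reference, a sketch of the same calls made through `SparkSession` (exposed 
as `spark` in Spark 2.0):

```scala
// Equivalent reads/writes via SparkSession instead of SQLContext.
val df = spark.read.parquet(path).cache()
spark.range(10).write.mode("overwrite").parquet(path)
```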


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/13419
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59668/
Test FAILed.


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/13419
  
Merged build finished. Test FAILed.


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/13419
  
**[Test build #59668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13419#discussion_r65251560
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       TableIdentifier("tmp"), ignoreIfNotExists = true)
   }
 
+  test("drop cache on overwrite") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("overwrite").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("overwrite").parquet(path)
--- End diff --

sqlContext -> spark
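
With that rename applied throughout, the test would read as follows (a sketch; in these suites `spark` and `sqlContext` wrap the same underlying session):

```scala
test("drop cache on overwrite") {
  withTempDir { dir =>
    val path = dir.toString
    spark.range(1000).write.mode("overwrite").parquet(path)
    val df = spark.read.parquet(path).cache()
    assert(df.count() == 1000)
    spark.range(10).write.mode("overwrite").parquet(path)
    assert(spark.read.parquet(path).count() == 10)
  }
}
```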


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13419#discussion_r65251574
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -67,6 +67,28 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       TableIdentifier("tmp"), ignoreIfNotExists = true)
   }
 
+  test("drop cache on overwrite") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("overwrite").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("overwrite").parquet(path)
+      assert(sqlContext.read.parquet(path).count() == 10)
+    }
+  }
+
+  test("drop cache on append") {
+    withTempDir { dir =>
+      val path = dir.toString
+      spark.range(1000).write.mode("append").parquet(path)
+      val df = sqlContext.read.parquet(path).cache()
+      assert(df.count() == 1000)
+      sqlContext.range(10).write.mode("append").parquet(path)
--- End diff --

sqlContext -> spark


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread sameeragarwal
Github user sameeragarwal commented on the pull request:

https://github.com/apache/spark/pull/13419
  
@yhuai @mengxr what are your thoughts on this approach?


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/13419
  
**[Test build #59668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59668/consoleFull)** for PR 13419 at commit [`ee631d2`](https://github.com/apache/spark/commit/ee631d2d98f72d99da00d8922fc4cf6a66cf063c).


[GitHub] spark pull request: [SPARK-15678][SQL] Drop cache on appends and overwrites

2016-05-31 Thread sameeragarwal
GitHub user sameeragarwal opened a pull request:

https://github.com/apache/spark/pull/13419

[SPARK-15678][SQL] Drop cache on appends and overwrites

## What changes were proposed in this pull request?

Spark SQL currently doesn't drop cached datasets when the underlying data is 
overwritten. This PR fixes that behavior:

```scala
val dir = "/tmp/test"
sqlContext.range(1000).write.mode("overwrite").parquet(dir)
val df = sqlContext.read.parquet(dir).cache()
df.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 <-- we are still using the cached dataset
```

## How was this patch tested?

Unit tests for overwrites and appends in `ParquetQuerySuite`.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sameeragarwal/spark drop-cache-on-write

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13419


commit ee631d2d98f72d99da00d8922fc4cf6a66cf063c
Author: Sameer Agarwal 
Date:   2016-05-31T18:27:41Z

Drop cache on appends and overwrites



