[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-05 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/14039
  
I haven't got anything more concrete to offer at this time than the
descriptions in the relevant JIRAs, but I do have this running in production
with 1.6, and it does work.  Essentially, you build a cache in your application
whose keys are a canonicalization of query fragments and whose values are the
RDDs associated with that fragment of the logical plan, i.e. the RDDs that
produce the shuffle files.  For as long as you hold the references to those
RDDs in your cache, Spark won't remove the shuffle files.  For as long as you
have sufficient memory available to the OS, those shuffle files will be
accessed via the OS buffer cache, which is actually pretty quick and doesn't
require any Java heap management or garbage collection.  That was the original
motivation for using shuffle files in this way, before off-heap caching and
unified memory management were available.  It's less necessary now (at least
once I figure out how to do the mapping between logical plan fragments and
tables cached off-heap), but it is still a valid alternative caching mechanism.
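
A minimal sketch of the idea described above, in plain Python rather than Spark, so none of it is the production code. `FakeShuffledRDD`, `materialize`, `canonicalize`, and `FragmentCache` are hypothetical names invented for illustration, and the whitespace/case normalization stands in for a real canonicalization of logical plan fragments. The set `live_shuffle_files` plays the role of shuffle files on disk, and a weak-reference finalizer plays the role of the cleanup that runs once an RDD is no longer referenced.

```python
import weakref

# Stand-in for shuffle files on disk, keyed by a hypothetical shuffle id.
live_shuffle_files = set()


class FakeShuffledRDD:
    """Hypothetical stand-in for an RDD whose shuffle output is on disk."""

    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id


def materialize(shuffle_id):
    """Create the fake RDD and register cleanup for when it is collected."""
    rdd = FakeShuffledRDD(shuffle_id)
    live_shuffle_files.add(shuffle_id)
    # When nothing references `rdd` any more, drop its "shuffle files".
    weakref.finalize(rdd, live_shuffle_files.discard, shuffle_id)
    return rdd


def canonicalize(fragment):
    """Toy canonicalization: lower-case and collapse whitespace."""
    return " ".join(fragment.lower().split())


class FragmentCache:
    """Maps canonicalized query fragments to the RDDs that computed them."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def get_or_build(self, fragment):
        key = canonicalize(fragment)
        rdd = self._entries.get(key)
        if rdd is None:
            rdd = materialize(self._next_id)
            self._next_id += 1
            self._entries[key] = rdd  # strong reference keeps files alive
        return rdd

    def evict(self, fragment):
        # Dropping the last strong reference allows the cleanup to run.
        self._entries.pop(canonicalize(fragment), None)
```

Two fragments that differ only in case or spacing map to the same entry, and the fake shuffle files stay in `live_shuffle_files` until `evict` drops the cache's reference.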


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61738/
Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61738 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61738/consoleFull)**
 for PR 14039 at commit 
[`55c8e03`](https://github.com/apache/spark/commit/55c8e034f9a4e231d49c79a77631da58e6130afd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/14039
  
@markhamstra Thanks for the comment. I think whether fragments can be reused
depends heavily on the user's queries, the Catalyst optimizer, cluster
resources, and so on. Reusing `ShuffledRowRDD` shuffle data within a single job
is a good idea, but it seems difficult to keep the data across multiple jobs,
because Spark cannot know when the data should be garbage-collected and it
could consume a lot of disk space. I think a caching mechanism is a better way
to reuse fragments across multiple jobs. Or do you have a smart, concrete idea
for reusing the shuffle data?





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/14039
  
@srowen My understanding is that shuffle data produced by stages may be shared
within a job. However, once the job has finished, the current implementation
cannot reuse the shuffle data anymore, so we can safely remove it. Is this
incorrect? Can Spark reuse it across different jobs?





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61738 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61738/consoleFull)**
 for PR 14039 at commit 
[`55c8e03`](https://github.com/apache/spark/commit/55c8e034f9a4e231d49c79a77631da58e6130afd).





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61717/
Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61717 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61717/consoleFull)**
 for PR 14039 at commit 
[`daa859a`](https://github.com/apache/spark/commit/daa859aaa47d1fba502c8621751d7e49fe55c9fe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61717 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61717/consoleFull)**
 for PR 14039 at commit 
[`daa859a`](https://github.com/apache/spark/commit/daa859aaa47d1fba502c8621751d7e49fe55c9fe).





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61715/
Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61715 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61715/consoleFull)**
 for PR 14039 at commit 
[`891a100`](https://github.com/apache/spark/commit/891a1007a9bf8afdc9b1945ff597ccc458123ed7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61715 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61715/consoleFull)**
 for PR 14039 at commit 
[`891a100`](https://github.com/apache/spark/commit/891a1007a9bf8afdc9b1945ff597ccc458123ed7).





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread markhamstra
Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/14039
  
Actually, they can be reused -- not in Spark as distributed, but it is an 
open question whether reusing shuffle files within Spark SQL is something that 
we should be doing and want to support.  It can be an effective alternative 
means of caching.  https://issues.apache.org/jira/browse/SPARK-13756

Until that issue is definitively decided, we should not pre-empt the 
possibility with this PR.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/14039
  
@srowen Thanks for the comment. Yes, I noticed that, and I'm fixing this to
remove only shuffle files generated by `ShuffleExchange`.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/14039
  
I don't think we do this in general. The shuffle files are supposed to 
remain to potentially be reused if the stage needs to be re-executed.
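
As a rough illustration of that point, here is a toy model in plain Python, not Spark's scheduler. `ToyScheduler`, `run_job`, and `remove_shuffle` are hypothetical names made up for this sketch. It assumes a map stage whose output is recorded under a shuffle id: while the output is still registered, a later run that needs the same shuffle skips the map stage, and once the output is removed the map stage has to run again.

```python
class ToyScheduler:
    """Toy model of reusing map-stage output that is still available."""

    def __init__(self):
        self.map_outputs = {}    # shuffle id -> recorded map-stage output
        self.map_stage_runs = 0  # how many times a map stage actually ran

    def run_job(self, shuffle_id, map_fn, reduce_fn):
        # Run the map stage only if its output is not already available.
        if shuffle_id not in self.map_outputs:
            self.map_outputs[shuffle_id] = map_fn()
            self.map_stage_runs += 1
        return reduce_fn(self.map_outputs[shuffle_id])

    def remove_shuffle(self, shuffle_id):
        # Models deleting the shuffle files: the next use must recompute.
        self.map_outputs.pop(shuffle_id, None)
```

Calling `run_job` twice with the same shuffle id runs `map_fn` once. Calling `remove_shuffle` in between forces a second run, which is the cost an eager cleanup would impose whenever the stage is needed again.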





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14039
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61702/
Test FAILed.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61702 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61702/consoleFull)**
 for PR 14039 at commit 
[`4e56d5b`](https://github.com/apache/spark/commit/4e56d5bb596954349093de3702420e51194ffa42).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14039: [SPARK-15896][SQL] Clean up shuffle files just after job...

2016-07-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14039
  
**[Test build #61702 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61702/consoleFull)**
 for PR 14039 at commit 
[`4e56d5b`](https://github.com/apache/spark/commit/4e56d5bb596954349093de3702420e51194ffa42).

