[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14137 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62464/

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14137 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14137 **[Test build #62464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62464/consoleFull)** for PR 14137 at commit

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14137 **[Test build #62464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62464/consoleFull)** for PR 14137 at commit

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 LGTM, will leave open for a bit for comments

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Jenkins retest this please

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Is it sufficient right now or should I do something more?

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-15 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Added a comment explaining the if.

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-15 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 I don't think you can reason about whether something's evicted as it's up to the runtime. Here the RDD has to be materialized before the method returns because its predecessors will have been

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-15 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I removed the counts at the end of the outer loop. I had added them before because without them I still encountered problems, but I guess something else must have been wrong. I reasoned that despite the fact

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I agree it is possible, however it is quite hard in this case. You lose the reference to workGraph a few times before materialization occurs, and then a few graphs get materialized at the same time. Testing

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Is it now not necessary to cache the 'work graph' as well?

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 I was thinking it should be perfectly possible to unpersist the scWorkGraph RDDs that are persisted along the way, in the same way? It should be the same pattern. It's possible to leave the final result

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I don't know if I understand you correctly. The last work RDD is explicitly materialized because iterations depend on the number of vertices left in the graph. My aim with the latest solution was to leave

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Ack, right, I mean does it need to be explicitly materialized? I suppose the new code also has the effect of materializing the last work RDD, but then I imagine we need to unpersist it too.

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 'work graph' is already cached and it is indeed necessary. To summarize: do you propose to replace `sccGraph.triplets.take(1)` with `sccGraph.vertices.count()` and `sccGraph.edges.count()`?

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I added unpersists to the additionally created caches and checked performance. `scc.run` takes slightly longer, but the returned graph is cached (both vertices and edges). To fully optimize it and

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-12 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Yes, elsewhere in graphx I remember trying this and it was tricky because you had to hold a reference to the old RDD, evaluate and persist the next one that depends on it, and _then_ unpersist the
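The ordering srowen describes can be sketched with a toy model in plain Scala (not actual Spark API; `Step`, `CacheTracker`, and `run` are names invented here for illustration): each new step is persisted and materialized while its predecessor is still cached, and only then is the predecessor released, so at return only the final result remains cached.

```scala
// Toy model (plain Scala, not Spark) of the persist/materialize/unpersist
// ordering; Step stands in for an RDD and CacheTracker for the block manager.
final case class Step(id: Int)

class CacheTracker {
  private val cached = scala.collection.mutable.LinkedHashSet[Int]()
  def persist(s: Step): Unit = cached += s.id
  def unpersist(s: Step): Unit = cached -= s.id
  def stillCached: List[Int] = cached.toList
}

def run(iterations: Int, tracker: CacheTracker): Step = {
  var prev = Step(0)
  tracker.persist(prev)
  for (i <- 1 to iterations) {
    val next = Step(i)
    tracker.persist(next)    // persist the new RDD first
    // ...materialize `next` here (e.g. a count) while `prev` is still cached,
    // so computing `next` never needs an already-evicted predecessor...
    tracker.unpersist(prev)  // only now is the predecessor safe to release
    prev = next
  }
  prev                       // only the final result is still cached
}
```

After `run(3, tracker)`, `tracker.stillCached` holds only the last step's id, which is the invariant being discussed here: no leftover intermediate RDDs in the cache when the method returns.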

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-12 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Great, we're on the same page finally. I can try to add unpersists, however it is indeed tricky; on my first try I wasn't able to unpersist every RDD and at the return had a few RDDs still in cache.

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-12 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 Oh good point, it does get materialized after `cache()` already because `numVertices` will call `count`. That does mean there's more than one call to evaluate the RDD, and that quite changes things.

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 That's how I see it: first you have workGraph defined and marked to cache: `var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()` then it is

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 I don't think that's true. Each iteration makes a new RDD that depends on the previous iteration's RDD. The execution is actually just a long chain of RDDs if you 'unrolled' it. That's why I think

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 The whole computation is iterative and depends on the previous state of sccWorkGraph. It iterates on `sccWorkGraph.numVertices`, which is an action under the hood, and without caching the whole algorithm would be

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 You are right, the code already had a `.cache()` call without `.unpersist()`. It eventually gets cleaned up, but it's not good form and probably didn't actually do much either. Except perhaps break the

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Let me introduce some data first: 1. An SCC run computed on a randomly generated graph, just like the one provided by me in the Databricks notebook, takes about 120s. 2. When doing

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/14137 You have introduced a new problem though -- you're not unpersisting the RDDs you cache, and you're doing a needless count (minor). Of course it's faster to operate on the final RDD at the end: you

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14137 Can one of the admins verify this patch?