[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-18 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Is it sufficient right now or should I do something more? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-15 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Added comment explaining if.

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-15 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I removed the counts at the end of the outer loop. I added them earlier because without them I still ran into problems, but I guess something else must have been wrong. I reasoned that despite the fact

[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70665554 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -106,6 +116,16 @@ object

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I agree it is possible, though quite hard in this case. You lose the reference to workGraph a few times before materialization occurs, and then several graphs get materialized at the same time. Testing

[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70643061 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I don't know if I understand you correctly. The last work RDD is explicitly materialized because the iterations depend on the number of vertices left in the graph. My aim with latest solution wa

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 'work graph' is already cached and it is indeed necessary. To summarize - do you propose to replace sccGraph.triplets.take(1) with sccGraph.vertices.count() and sccGraph.e

[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70637549 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -64,11 +69,20 @@ object

[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70638560 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 I added unpersist calls for the additionally created caches and checked performance. scc.run takes slightly longer, but the returned graph is cached (both vertices and edges). To fully optimize it and

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-12 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Great, we're finally on the same page. I can try to add unpersists; however, it is indeed tricky, and on my first try I wasn't able to unpersist every RDD and at the return had a few RDDs stil

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 That's how I see it: First you have workGraph defined and marked to cache: `var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()` th

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 The whole computation is iterative and depends on the previous state of sccWorkGraph. It iterates on sccWorkGraph.numVertices, which is an action under the hood, and without caching the whole algorithm would be

[GitHub] spark issue #14137: SPARK-16478 graphX (added graph caching in strongly conn...

2016-07-11 Thread wesolowskim
Github user wesolowskim commented on the issue: https://github.com/apache/spark/pull/14137 Let me introduce some data first: 1. An SCC run computed on a randomly generated graph, just like the one I provided in the Databricks notebook, takes about 120s 2. When doing

[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-11 Thread wesolowskim
GitHub user wesolowskim opened a pull request: https://github.com/apache/spark/pull/14137 SPARK-16478 graphX (added graph caching in strongly connected components) ## What changes were proposed in this pull request? I added caching in every iteration for sccGraph that is