Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Is it sufficient right now or should I do something more?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Added a comment explaining the `if`.
---
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I removed the counts at the end of the outer loop. I added them earlier because
without them I still encountered problems, but I guess something else must have
been wrong. I reasoned that despite the fact
Github user wesolowskim commented on a diff in the pull request:
https://github.com/apache/spark/pull/14137#discussion_r70665554
--- Diff:
graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
---
@@ -106,6 +116,16 @@ object
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I agree it is possible, however it is quite hard in this case. You lose the
reference to workGraph a few times before materialization occurs, and then a
few graphs get materialized at the same time. Testing
Github user wesolowskim commented on a diff in the pull request:
https://github.com/apache/spark/pull/14137#discussion_r70643061
--- Diff:
graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I don't know if I understand you correctly. The last work RDD is explicitly
materialized because iterations depend on the number of vertices left in the
graph. My aim with the latest solution wa
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
'work graph' is already cached and it is indeed necessary.
To summarize - do you propose to replace sccGraph.triplets.take(1) with
sccGraph.vertices.count() and sccGraph.e
Github user wesolowskim commented on a diff in the pull request:
https://github.com/apache/spark/pull/14137#discussion_r70637549
--- Diff:
graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
---
@@ -64,11 +69,20 @@ object
Github user wesolowskim commented on a diff in the pull request:
https://github.com/apache/spark/pull/14137#discussion_r70638560
--- Diff:
graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I added unpersists to the additionally created caches and checked performance.
scc.run is slightly longer, but the returned graph is cached (both vertices and
edges).
To fully optimize it and
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Great, we're finally on the same page. I can try to add unpersists, however it
is indeed tricky; on my first try I wasn't able to unpersist every RDD and at
the return had a few RDDs stil
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
That's how I see it:
First you have workGraph defined and marked to cache:
`var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false)
}.cache()`
th
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
The whole computation is iterative and depends on the previous state of
sccWorkGraph. It iterates on sccWorkGraph.numVertices, which is an action under
the hood, and without caching the whole algorithm would be
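The cache-materialize-unpersist loop being debated in this thread can be sketched as follows. This is a generic illustration with hypothetical names (`iterate`, `step`), not the actual SCC code; it assumes `step` shrinks the graph each round, as the SCC algorithm does, so the loop terminates.

```scala
import org.apache.spark.graphx.Graph

// Sketch: cache each iteration's graph, materialize it via an action,
// then unpersist the previous one so old blocks don't pile up in memory.
def iterate[VD, ED](initial: Graph[VD, ED],
                    step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var work = initial.cache()
  var remaining = work.numVertices   // numVertices is an action: it runs a job
  while (remaining > 0) {
    val next = step(work).cache()
    remaining = next.numVertices     // materialize `next` before dropping `work`
    work.unpersist(blocking = false) // safe: `next` is already computed and cached
    work = next
  }
  work
}
```

The ordering matters: unpersisting `work` before `next` is materialized can force recomputation through the uncached lineage, which matches the difficulty described above of unpersisting every RDD without losing a needed reference.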
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Let me introduce some data first:
1. An SCC run computed on a randomly generated graph, just like the one
provided by me in the Databricks notebook, takes about 120s
2. When doing
GitHub user wesolowskim opened a pull request:
https://github.com/apache/spark/pull/14137
SPARK-16478 graphX (added graph caching in strongly connected components)
## What changes were proposed in this pull request?
I added caching in every iteration for sccGraph that is