Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14137
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62464/
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14137
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14137
**[Test build #62464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62464/consoleFull)** for PR 14137 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/14137
**[Test build #62464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62464/consoleFull)** for PR 14137 at commit
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
LGTM, will leave open for a bit for comments
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
Jenkins retest this please
---
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Is it sufficient right now or should I do something more?
---
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Added comment explaining if.
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
I don't think you can reason about whether something's evicted as it's up
to the runtime. Here the RDD has to be materialized before the method returns
because its predecessors will have been
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I removed the counts at the end of the outer loop. I added them before because
without them I still encountered problems, but I guess something else must have
been wrong. I reasoned that despite the fact
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I agree it is possible, however it is quite hard in this case. You lose the
reference to workGraph a few times before materialization occurs, and then a
few graphs get materialized at the same time. Testing
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
Is it now not necessary to cache the 'work graph' as well?
---
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
I was thinking it should be perfectly possible to unpersist the sccWorkGraph
RDDs that are persisted along the way, in the same way? It should be the same
pattern. It's possible to leave the final result
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I don't know if I understand you correctly. The last work RDD is explicitly
materialized because the iterations depend on the number of vertices left in
the graph. My aim with the latest solution was to leave
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
Ack, right. I mean, does it need to be explicitly materialized? I suppose the
new code also has the effect of materializing the last work RDD, but then I
imagine we need to unpersist it too.
---
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
'work graph' is already cached and it is indeed necessary.
To summarize: do you propose to replace `sccGraph.triplets.take(1)` with
`sccGraph.vertices.count()` and `sccGraph.edges.count()`?
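For context, a sketch of the difference between the two (assuming `sccGraph` is the cached result graph, as above): `take(1)` only evaluates as many partitions as it needs to return one element, while the two counts make a full pass over both RDDs and so fully populate the cache.

```scala
// Sketch only; sccGraph stands for the cached result graph from the PR.
sccGraph.triplets.take(1)  // runs just enough tasks to return one triplet
sccGraph.vertices.count()  // full pass over the vertex RDD, fills its cache
sccGraph.edges.count()     // full pass over the edge RDD, fills its cache
```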
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
I added unpersists to the additionally created caches and checked performance.
scc.run takes slightly longer, but the returned graph is cached (both vertices
and edges).
To fully optimize it and
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
Yes, elsewhere in GraphX I remember trying this and it was tricky because
you had to hold a reference to the old RDD, evaluate and persist the next one
that depends on it, and _then_ unpersist the
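A minimal sketch of that pattern (the names `prev`, `next`, and `step` are illustrative, not from the PR):

```scala
import org.apache.spark.graphx.Graph

// Hold a reference to the old graph, materialize and cache the new one,
// and only then unpersist the predecessor so nothing is recomputed.
def stepAndSwap[VD, ED](prev: Graph[VD, ED],
                        step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  val next = step(prev).cache()    // mark the new iteration's RDDs for caching
  next.vertices.count()            // actions force materialization first...
  next.edges.count()
  prev.unpersist(blocking = false) // ...then it is safe to evict the old RDDs
  next
}
```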
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Great, we're on the same page finally. I can try to add unpersists, however it
is indeed tricky; on my first try I wasn't able to unpersist every RDD and at
the return still had a few RDDs in the cache.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
Oh good point, it does get materialized after `cache()` already because
`numVertices` will call `count`. That does mean there's more than one call to
evaluate the RDD, and that quite changes things.
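(For reference, `numVertices` is just a count over the vertex RDD, so the existing sequence already runs a job and populates the cache; a sketch, with `sccWorkGraph` as in the PR:)

```scala
val g = sccWorkGraph.cache() // lazy: only marks the graph's RDDs for caching
val n = g.numVertices        // calls vertices.count(), an action: runs a job
                             // and leaves the cached RDDs materialized
```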
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
That's how I see it:
First you have workGraph defined and marked to cache:
`var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()`
then it is
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
I don't think that's true. Each iteration makes a new RDD that depends on
the previous iteration's RDD. The execution is actually just a long chain of
RDDs if you 'unrolled' it. That's why I think
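(The growing chain is easy to see by printing the lineage of the vertex RDD after a few iterations; illustrative only:)

```scala
// Each pass adds another layer of dependencies on top of the previous
// iteration's RDDs, so the debug string keeps getting longer.
println(sccWorkGraph.vertices.toDebugString)
```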
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
The whole computation is iterative and depends on the previous state of
sccWorkGraph. It iterates on sccWorkGraph.numVertices, which is an action under
the hood, and without caching the whole algorithm would be
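Roughly the shape of the loop in question (a simplified sketch, not the actual SCC implementation; `onePass` is a hypothetical stand-in for the per-iteration transformation):

```scala
import org.apache.spark.graphx.Graph

def iterate[VD, ED](start: Graph[VD, ED],
                    onePass: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var sccWorkGraph = start.cache()
  var numVertices = sccWorkGraph.numVertices
  var prevNumVertices = numVertices + 1
  while (numVertices < prevNumVertices) {   // stop when no vertex was removed
    sccWorkGraph = onePass(sccWorkGraph).cache()
    prevNumVertices = numVertices
    numVertices = sccWorkGraph.numVertices  // an action (count) under the hood;
                                            // without cache() this would replay
                                            // the whole lineage every pass
  }
  sccWorkGraph
}
```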
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
You are right, the code already had a `.cache()` call without
`.unpersist()`. It eventually gets cleaned up, but it's not good form and
probably didn't actually do much either, except perhaps break the
Github user wesolowskim commented on the issue:
https://github.com/apache/spark/pull/14137
Let me introduce some data first:
1. An SCC run computed on a randomly generated graph, just like the one
provided by me in the Databricks notebook, takes about 120s.
2. When doing
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/14137
You have introduced a new problem though -- you're not unpersisting the
RDDs you cache, and you're doing a needless count (minor). Of course it's
faster to operate on the final RDD at the end: you
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/14137
Can one of the admins verify this patch?
---