[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/14137 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70672178 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -106,6 +116,11 @@ object StronglyConnectedComponents { } }, (final1, final2) => final1 || final2) + +// ensure sccGraph's rdd are marked as recently used to not be evicted between iterations --- End diff -- OK, the rest of the changes much looks like an improvement, at the least. I am not as sure why this is necessary. It's already materialized at this point. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70665554 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -106,6 +116,16 @@ object StronglyConnectedComponents { } }, (final1, final2) => final1 || final2) + +// ensure sccGraph's RDDs are marked as recently used +sccGraph.vertices.count() +sccGraph.edges.count() +// sccGraph is materialized, so the previous one can be unpersisted +if(prevSccGraph != sccGraph) --- End diff -- I forgot to push one commit. Neither the if statement nor cache() was necessary.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70664836 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -106,6 +116,16 @@ object StronglyConnectedComponents { } }, (final1, final2) => final1 || final2) + +// ensure sccGraph's RDDs are marked as recently used +sccGraph.vertices.count() +sccGraph.edges.count() +// sccGraph is materialized, so the previous one can be unpersisted +if(prevSccGraph != sccGraph) --- End diff -- I think this will fail style checks, but let's see. Hm, I think it's straightforward to follow the same pattern for the other RDD, but yes, it's updated in enough places that it's a fair bit of extra code. It might be possible to refactor this nicely into some method. Not essential, maybe. This code isn't used much anyway. A partial improvement may be a sufficient win.
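The refactoring suggested above could look something like the following hypothetical helper (this method is not part of the PR; the name `materializeAndReplace` and its exact shape are assumptions for illustration):

```scala
import org.apache.spark.graphx.Graph

// Hypothetical helper: cache and fully materialize `next`, then drop the
// cache of the previous iteration's graph. Returns `next` for reassignment.
def materializeAndReplace[VD, ED](next: Graph[VD, ED],
                                  prev: Graph[VD, ED]): Graph[VD, ED] = {
  next.cache()
  next.vertices.count()  // forces every vertex partition to be computed and cached
  next.edges.count()     // forces every edge partition to be computed and cached
  if (prev != next) {
    // safe now: `next` no longer needs prev's cached partitions
    prev.unpersist(blocking = false)
  }
  next
}
```

A caller in the iteration loop would then do `sccGraph = materializeAndReplace(newSccGraph, sccGraph)`, keeping the materialize-then-unpersist bookkeeping in one place.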
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70643061 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents { // graph we are going to work with in our iterations var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache() +// helper variables to unpersist cached graphs +var prevSccGraph1 = sccGraph +var prevSccGraph2 = sccGraph --- End diff -- You're totally right. I'll correct it.
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70637549 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -64,11 +69,20 @@ object StronglyConnectedComponents { // write values to sccGraph sccGraph = sccGraph.outerJoinVertices(finalVertices) { (vid, scc, opt) => opt.getOrElse(scc) -} +}.cache() +// take a triplet to materialize vertices and edges (faster than count on both) +sccGraph.triplets.take(1) --- End diff -- I wasn't sure about this take, to be honest. I checked both solutions: take on triplets, and count on vertices and edges separately. Counting the vertices and edges after returning from SCC showed it was generally faster with take, but that could be due to the small number of partitions. Counting both vertices and edges on a randomly generated graph of 500k vertices and 2 million edges adds about 20s to SCC, but it may be worth it (it didn't show during my tests, but I guess you could be right). It could also be that one partition of triplets depends on the whole vertex and edge RDDs, so take(1) would be sufficient.
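The two materialization strategies being compared above can be sketched side by side. As noted earlier in the thread, `take(1)` only computes the partitions needed to produce one element, so only the full counts guarantee that every cached partition is actually populated:

```scala
// Option A (faster, but incomplete): take(1) on triplets only computes
// the partition(s) needed to return one triplet, so the remaining
// partitions of the vertex and edge RDDs may stay uncomputed and uncached.
sccGraph.triplets.take(1)

// Option B (slower, but complete): count() is an action over every
// partition, so both underlying RDDs are fully computed and cached.
sccGraph.vertices.count()
sccGraph.edges.count()
```

Option B is the safer choice here, since an unpersist of the previous graph assumes the new graph no longer depends on the old cached partitions.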
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70638869 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents { // graph we are going to work with in our iterations var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache() +// helper variables to unpersist cached graphs +var prevSccGraph1 = sccGraph +var prevSccGraph2 = sccGraph --- End diff -- I guess my logic is that it's easier to have one variable following this single graph throughout rather than two -- which of the two refers to which previous RDD? Are they different? The effect of this as well is that you unpersist and then re-persist the RDD once during each loop.
Github user wesolowskim commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70638560 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents { // graph we are going to work with in our iterations var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache() +// helper variables to unpersist cached graphs +var prevSccGraph1 = sccGraph +var prevSccGraph2 = sccGraph --- End diff -- I guess I could avoid having these two vars, but it was hard to reason about the code with only one. One sccGraph is persisted inside the do-while loop, and the second at the end of the outer while loop. With two variables it is clearer which graph is unpersisted. For example, when the do-while loop exits you have to unpersist prevSccGraph1, but not necessarily prevSccGraph2.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70636044 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -44,6 +44,11 @@ object StronglyConnectedComponents { // graph we are going to work with in our iterations var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache() +// helper variables to unpersist cached graphs +var prevSccGraph1 = sccGraph +var prevSccGraph2 = sccGraph --- End diff -- Do you need two vars? It seems like they both track the same variable.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14137#discussion_r70635683 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala --- @@ -64,11 +69,20 @@ object StronglyConnectedComponents { // write values to sccGraph sccGraph = sccGraph.outerJoinVertices(finalVertices) { (vid, scc, opt) => opt.getOrElse(scc) -} +}.cache() +// take a triplet to materialize vertices and edges (faster than count on both) +sccGraph.triplets.take(1) --- End diff -- It's faster because it only materializes a single partition. I think you have to call something like count.
GitHub user wesolowskim opened a pull request: https://github.com/apache/spark/pull/14137 SPARK-16478 graphX (added graph caching in strongly connected components) ## What changes were proposed in this pull request? I added caching in every iteration for the sccGraph that is returned in strongly connected components. Without this cache, strongly connected components returned a graph that needed to be recomputed from scratch when some intermediate caches no longer existed. ## How was this patch tested? I tested it by running code similar to the one [on Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html). Basically I generated a large graph and computed strongly connected components with the changed code, then simply ran count on the vertices and edges. Count after this update takes a few seconds instead of 20 minutes. ## Statement This contribution is my original work and I license the work to the project under the project's open source license. You can merge this pull request into a Git repository by running: $ git pull https://github.com/wesolowskim/spark SPARK-16478 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14137.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14137 commit f9e28830755535926f7792375cedadb0518ecb31 Author: Michał Wesołowski Date: 2016-07-11T11:20:26Z corrected SPARK-16478 (caching in strongly connected components)
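The fix described in this pull request follows a common Spark pattern for iterative algorithms: cache the graph produced in each iteration, force its materialization with actions, and only then unpersist the previous iteration's graph, so the final result never depends on evicted intermediate caches. A condensed sketch of that pattern (variable names follow the PR's diff; surrounding loop structure omitted):

```scala
// one iteration of the SCC loop, as modified by this PR
sccGraph = sccGraph.outerJoinVertices(finalVertices) {
  (vid, scc, opt) => opt.getOrElse(scc)
}.cache()
// materialize both RDDs fully before dropping the old cache
sccGraph.vertices.count()
sccGraph.edges.count()
// sccGraph is materialized, so the previous graph can be unpersisted
prevSccGraph.unpersist(blocking = false)
prevSccGraph = sccGraph
```

Using `blocking = false` lets the unpersist happen asynchronously rather than stalling the iteration while cached blocks are removed.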