[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14137


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70672178
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -106,6 +116,11 @@ object StronglyConnectedComponents {
   }
 },
 (final1, final2) => final1 || final2)
+
+// ensure sccGraph's rdd are marked as recently used to not be evicted between iterations
--- End diff --

OK, the rest of the changes look like an improvement, at the least. I am not as sure why this is necessary. It's already materialized at this point.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70665554
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -106,6 +116,16 @@ object StronglyConnectedComponents {
   }
 },
 (final1, final2) => final1 || final2)
+
+// ensure sccGraph's rdd are marked as recently used
+sccGraph.vertices.count()
+sccGraph.edges.count()
+// sccGraph materialized so, unpersist can be done on previous
+if(prevSccGraph != sccGraph)
--- End diff --

I forgot to push one commit. Neither the if statement nor cache() was necessary.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70664836
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -106,6 +116,16 @@ object StronglyConnectedComponents {
   }
 },
 (final1, final2) => final1 || final2)
+
+// ensure sccGraph's rdd are marked as recently used
+sccGraph.vertices.count()
+sccGraph.edges.count()
+// sccGraph materialized so, unpersist can be done on previous
+if(prevSccGraph != sccGraph)
--- End diff --

I think this will fail style checks, but let's see.
Hm, I think it's straightforward to follow the same pattern for the other 
RDD, but yes, it's updated in enough places that it's a fair bit of extra code. 
It might be possible to refactor this nicely into a method.
Not essential, maybe. This code isn't used much anyway. A partial 
improvement may be a sufficient win.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70643061
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents {
 // graph we are going to work with in our iterations
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
 
+// helper variables to unpersist cached graphs
+var prevSccGraph1 = sccGraph
+var prevSccGraph2 = sccGraph
--- End diff --

You're totally right. I'll correct it. 





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70637549
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -64,11 +69,20 @@ object StronglyConnectedComponents {
 // write values to sccGraph
 sccGraph = sccGraph.outerJoinVertices(finalVertices) {
   (vid, scc, opt) => opt.getOrElse(scc)
-}
+}.cache()
+// take triplet to materialize vertices and edges (faster than count on both)
+sccGraph.triplets.take(1)
--- End diff --

I wasn't sure about this take, to be honest. I checked both solutions - take 
on triplets, and count on vertices and edges separately. After returning from 
SCC, a count on vertices and edges showed it was generally faster with take, 
but that could be due to the small number of partitions. 
Counting both vertices and edges on a randomly generated graph of 500k 
vertices and 2M edges adds about 20s to SCC, but it may be worth it (it didn't 
show up during my tests, but I guess you could be right). It could also be that one 
partition of the triplets depends on the whole vertices RDD and edges RDD, so 
take(1) would be sufficient.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70638869
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents {
 // graph we are going to work with in our iterations
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
 
+// helper variables to unpersist cached graphs
+var prevSccGraph1 = sccGraph
+var prevSccGraph2 = sccGraph
--- End diff --

I guess my logic is that it's easier to have one variable tracking this 
throughout rather than two -- which of the two refers to which 
previous RDD? Are they different? The effect of this as well is that you 
unpersist and then re-persist the RDD once during each loop.
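
The single-variable pattern suggested here could be sketched as follows (an illustration only, not the exact patch; the loop condition and `finalVertices` are stand-ins for the surrounding algorithm):

```scala
// Sketch: cache the new sccGraph, materialize it fully, then drop the
// previous iteration's copy. One variable tracks the previous graph.
var prevSccGraph = sccGraph
while (numIter > 0) {  // stand-in loop condition
  sccGraph = sccGraph.outerJoinVertices(finalVertices) {
    (vid, scc, opt) => opt.getOrElse(scc)
  }.cache()
  // force both RDDs before unpersisting the previous graph
  sccGraph.vertices.count()
  sccGraph.edges.count()
  // safe now: nothing still depends on the old cached copy
  prevSccGraph.unpersist(blocking = false)
  prevSccGraph = sccGraph
  numIter -= 1
}
```

With a single tracking variable, each iteration unpersists exactly one previous graph, sidestepping the "which of the two refers to which previous RDD" question.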





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread wesolowskim
Github user wesolowskim commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70638560
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents {
 // graph we are going to work with in our iterations
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
 
+// helper variables to unpersist cached graphs
+var prevSccGraph1 = sccGraph
+var prevSccGraph2 = sccGraph
--- End diff --

I guess I could avoid having these two vars, but it was hard to reason 
about the code with only one. One sccGraph is persisted inside the do-while loop, 
and the second at the end of the outer while loop. With two variables it is 
clearer which graph is unpersisted; for example, when the do-while loop is left you 
have to unpersist prevSccGraph1, but not necessarily prevSccGraph2.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70636044
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -44,6 +44,11 @@ object StronglyConnectedComponents {
 // graph we are going to work with in our iterations
var sccWorkGraph = graph.mapVertices { case (vid, _) => (vid, false) }.cache()
 
+// helper variables to unpersist cached graphs
+var prevSccGraph1 = sccGraph
+var prevSccGraph2 = sccGraph
--- End diff --

Do you need two vars? It seems like they both track the same variable.





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-13 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14137#discussion_r70635683
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala ---
@@ -64,11 +69,20 @@ object StronglyConnectedComponents {
 // write values to sccGraph
 sccGraph = sccGraph.outerJoinVertices(finalVertices) {
   (vid, scc, opt) => opt.getOrElse(scc)
-}
+}.cache()
+// take triplet to materialize vertices and edges (faster than count on both)
+sccGraph.triplets.take(1)
--- End diff --

It's faster because it only materializes a single partition. I think you 
have to call something like count.
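
The difference between the two strategies can be made concrete (a sketch, not part of the patch):

```scala
// take(1) runs a job over as few partitions as needed to return one
// element, so most partitions of the cached graph stay unmaterialized
// and would be recomputed on later access.
sccGraph.triplets.take(1)

// count() visits every partition, fully populating the cache for both
// the vertex and edge RDDs.
sccGraph.vertices.count()
sccGraph.edges.count()
```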





[GitHub] spark pull request #14137: SPARK-16478 graphX (added graph caching in strong...

2016-07-11 Thread wesolowskim
GitHub user wesolowskim opened a pull request:

https://github.com/apache/spark/pull/14137

SPARK-16478 graphX (added graph caching in strongly connected components)

## What changes were proposed in this pull request?

I added caching in every iteration for the sccGraph that is returned by 
strongly connected components. Without this cache, strongly connected components 
returned a graph that had to be computed from scratch once some intermediate 
caches no longer existed.

## How was this patch tested?
I tested it by running code similar to the one [on 
Databricks](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html).
 Basically, I generated a large graph and computed strongly connected components 
with the changed code, then simply ran count on the vertices and edges. The count 
after this update takes a few seconds instead of 20 minutes.
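
A hypothetical way to reproduce that test with GraphX's built-in generator (graph size and iteration count are illustrative, not the exact benchmark code):

```scala
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("scc-bench").getOrCreate()
val sc = spark.sparkContext

// generate a random graph and compute strongly connected components
val graph = GraphGenerators.logNormalGraph(sc, numVertices = 500000)
val scc = graph.stronglyConnectedComponents(numIter = 5)

// without per-iteration caching these counts would recompute the whole
// SCC lineage; with it they should return in seconds
val t0 = System.nanoTime()
println(s"vertices: ${scc.vertices.count()}, edges: ${scc.edges.count()}")
println(s"counts took ${(System.nanoTime() - t0) / 1e9} s")
```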

## Statement
The contribution is my original work and I license the work to the project 
under the project's open source license.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wesolowskim/spark SPARK-16478

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14137.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14137


commit f9e28830755535926f7792375cedadb0518ecb31
Author: Michał Wesołowski 
Date:   2016-07-11T11:20:26Z

corrected SPARK-16478 (caching in strongly connected components)



