[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16483



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-17 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106746316
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

I agree that the new code is easier to follow in that respect.



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106548090
  
--- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
   val nVertices = 100
   val starGraph = GraphGenerators.starGraph(sc, nVertices).cache()
   val resetProb = 0.15
+  val tol = 0.0001
+  val numIter = 2
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache()
+  val staticRanks = starGraph.staticPageRank(numIter, resetProb).vertices.cache()
+  val staticRanks2 = starGraph.staticPageRank(numIter + 1, resetProb).vertices
 
-  // Static PageRank should only take 3 iterations to converge
-  val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
+  // Static PageRank should only take 2 iterations to converge
--- End diff --

It didn't change; we're still comparing the output of the 2nd and 3rd iterations.
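
For context, here is a self-contained sketch of that comparison (assuming an existing SparkContext `sc`; the names mirror the partial diff above, and the suite's exact assertion may differ):

```scala
import org.apache.spark.graphx.util.GraphGenerators

// Star graph with 100 vertices; static PageRank on it converges quickly.
val starGraph = GraphGenerators.starGraph(sc, 100).cache()
val resetProb = 0.15
val staticRanks = starGraph.staticPageRank(2, resetProb).vertices.cache()
val staticRanks2 = starGraph.staticPageRank(3, resetProb).vertices

// Count vertices whose rank still changes between the 2nd and 3rd iteration;
// zero mismatches means the ranks have already converged after 2 iterations.
val notMatching = staticRanks.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
  if (pr1 != pr2) 1 else 0
}.map { case (_, mismatch) => mismatch }.sum()
assert(notMatching == 0)
```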



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106546448
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

I'm guessing you mean the `if (src==id)` check? I'm honestly not sure what was going on with this code; it's just wrong. The results do not match igraph/networkx at all. Furthermore, the code is nonsensical: it declares `var teleport = oldPR` and then unconditionally overwrites it two lines later with `teleport = oldPR*delta` without ever reading it in between.

The revised implementation is much easier to follow and is now tested against 3 sets of reference values computed by igraph/networkx. Please let me know if you think I'm missing something.
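
For reference, the update rule being described can be sketched as follows (illustrative only, not a quote of the patched code): only the personalization source receives the teleport mass `resetProb`, and every other vertex accumulates rank solely from incoming messages.

```scala
// Sketch of the delta-based personalized PageRank vertex update.
// `src` is the personalization source; only it gets the teleport term.
def personalizedUpdate(
    id: Long,           // id of the vertex being updated
    src: Long,          // personalization source vertex
    oldPR: Double,      // rank from the previous superstep
    msgSum: Double,     // sum of incoming rank contributions
    resetProb: Double): (Double, Double) = {
  val teleport = if (id == src) resetProb else 0.0
  val newPR = teleport + (1.0 - resetProb) * msgSum
  (newPR, newPR - oldPR) // new rank plus the delta propagated next round
}
```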



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106529377
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -353,9 +365,19 @@ object PageRank extends Logging {
 vertexProgram(id, attr, msgSum)
 }
 
-Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)(
+val rankGraph = Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)(
   vp, sendMessage, messageCombiner)
   .mapVertices((vid, attr) => attr._1)
-  } // end of deltaPageRank
+
+// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct
--- End diff --

This is the same code as above; please factor it into a function.
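
One possible shape for such a shared helper (a sketch under assumed names and signature, not the PR's actual refactoring): rescale the final ranks so they sum to the expected total, i.e. the number of vertices for standard PageRank and 1.0 for the personalized variant.

```scala
import org.apache.spark.graphx.Graph

// Hypothetical helper, sketched for illustration only.
def normalizeRankSum[ED](ranks: Graph[Double, ED], personalized: Boolean): Graph[Double, ED] = {
  val rankSum = ranks.vertices.values.sum()
  if (personalized) {
    ranks.mapVertices((_, r) => r / rankSum)         // total becomes 1.0
  } else {
    val correction = ranks.numVertices / rankSum     // total becomes n
    ranks.mapVertices((_, r) => r * correction)
  }
}
```

Calling a helper like this once at the end of both the static and the Pregel-based implementations would remove the duplication flagged here.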



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106532078
  
--- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala ---
@@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext {
   val nVertices = 100
   val starGraph = GraphGenerators.starGraph(sc, nVertices).cache()
   val resetProb = 0.15
+  val tol = 0.0001
+  val numIter = 2
   val errorTol = 1.0e-5
 
-  val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices
-  val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache()
+  val staticRanks = starGraph.staticPageRank(numIter, resetProb).vertices.cache()
+  val staticRanks2 = starGraph.staticPageRank(numIter + 1, resetProb).vertices
 
-  // Static PageRank should only take 3 iterations to converge
-  val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) =>
+  // Static PageRank should only take 2 iterations to converge
--- End diff --

Why does it take only two iterations to converge now?



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106535595
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -322,13 +335,12 @@ object PageRank extends Logging {
 def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
   msgSum: Double): (Double, Double) = {
   val (oldPR, lastDelta) = attr
-  var teleport = oldPR
-  val delta = if (src==id) resetProb else 0.0
-  teleport = oldPR*delta
-
-  val newPR = teleport + (1.0 - resetProb) * msgSum
-  val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
-  (newPR, newDelta)
+  val newPR = if (lastDelta == Double.NegativeInfinity) {
--- End diff --

My memory of the algorithm is a bit rusty. Why don't you need to check for 
self-loops here anymore?



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-03-16 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/16483#discussion_r106528007
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala ---
@@ -162,7 +162,15 @@ object PageRank extends Logging {
   iteration += 1
 }
 
-rankGraph
+// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct
--- End diff --

Put the name of the ticket here as well.



[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

2017-01-05 Thread aray
GitHub user aray opened a pull request:

https://github.com/apache/spark/pull/16483

[SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks

## What changes were proposed in this pull request?

Graphs with sinks (vertices with no outgoing edges) don't have the expected 
rank sum of n (or 1 for personalized). We fix this by normalizing to the 
expected sum at the end of each implementation.
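
As a minimal illustration of the issue (assuming an existing SparkContext `sc`; this snippet is not part of the patch), take a two-vertex graph 1 -> 2 where vertex 2 is a sink: the rank mass the sink should redistribute is simply dropped each iteration, so the ranks no longer sum to n = 2.

```scala
import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(Edge(1L, 2L, 1)))   // 1 -> 2; vertex 2 is a sink
val graph = Graph.fromEdges(edges, defaultValue = 1)
val ranks = graph.staticPageRank(numIter = 10).vertices

// Without normalization the total rank leaks below n = 2.
println(ranks.values.sum())
```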

Additionally, this fixes the dynamic version of personalized PageRank, which gave incorrect answers that were not detected by existing unit tests.

## How was this patch tested?

Revamped existing unit tests and added new ones, with reference values (and reproduction code) from igraph and NetworkX.

Note that for the personalized PageRank comparison we use the arpack algorithm in igraph, because prpack (the current default) redistributes rank to all vertices uniformly instead of just to the personalization source. We could adopt that alternate convention (redistributing rank to all vertices uniformly), but it would require more extensive changes to the algorithms (the dynamic version would no longer be able to use Pregel).


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aray/spark pagerank-sink2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16483.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16483


commit 41178a3f7310eb30b2eebbac1fac532196ad3432
Author: Andrew Ray 
Date:   2017-01-06T06:24:21Z

page rank sink fixes and unit tests



