[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/16483 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106746316 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def personalizedVertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { val (oldPR, lastDelta) = attr - var teleport = oldPR - val delta = if (src==id) resetProb else 0.0 - teleport = oldPR*delta - - val newPR = teleport + (1.0 - resetProb) * msgSum - val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR - (newPR, newDelta) + val newPR = if (lastDelta == Double.NegativeInfinity) { --- End diff -- I agree that the new code is easier to follow in that respect. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106548090 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext { val nVertices = 100 val starGraph = GraphGenerators.starGraph(sc, nVertices).cache() val resetProb = 0.15 + val tol = 0.0001 + val numIter = 2 val errorTol = 1.0e-5 - val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices - val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache() + val staticRanks = starGraph.staticPageRank(numIter, resetProb).vertices.cache() + val staticRanks2 = starGraph.staticPageRank(numIter + 1, resetProb).vertices - // Static PageRank should only take 3 iterations to converge - val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) => + // Static PageRank should only take 2 iterations to converge --- End diff -- It didn't change, were still comparing the output of the 2nd and 3rd iteration. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106546448 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def personalizedVertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { val (oldPR, lastDelta) = attr - var teleport = oldPR - val delta = if (src==id) resetProb else 0.0 - teleport = oldPR*delta - - val newPR = teleport + (1.0 - resetProb) * msgSum - val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR - (newPR, newDelta) + val newPR = if (lastDelta == Double.NegativeInfinity) { --- End diff -- I'm guessing you mean the `if (src==id)` check? I'm honestly not sure what was going on with this code its just wrong. The results do not match up with igraph/networkx at all. Furthermore the code is just nonsensical -- definition of `var teleport = oldPR` that is then unconditionally set two lines down to `teleport = oldPR*delta` without being used prior. This revised implementation is much easier to follow and is now tested against 3 sets of reference values computed by igraph/networkx. Please let me know if you thing I'm missing something. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106529377 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -353,9 +365,19 @@ object PageRank extends Logging { vertexProgram(id, attr, msgSum) } -Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)( +val rankGraph = Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)( vp, sendMessage, messageCombiner) .mapVertices((vid, attr) => attr._1) - } // end of deltaPageRank + +// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct --- End diff -- This is the same code as above, please factor it into a function. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106532078 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite with LocalSparkContext { val nVertices = 100 val starGraph = GraphGenerators.starGraph(sc, nVertices).cache() val resetProb = 0.15 + val tol = 0.0001 + val numIter = 2 val errorTol = 1.0e-5 - val staticRanks1 = starGraph.staticPageRank(numIter = 2, resetProb).vertices - val staticRanks2 = starGraph.staticPageRank(numIter = 3, resetProb).vertices.cache() + val staticRanks = starGraph.staticPageRank(numIter, resetProb).vertices.cache() + val staticRanks2 = starGraph.staticPageRank(numIter + 1, resetProb).vertices - // Static PageRank should only take 3 iterations to converge - val notMatching = staticRanks1.innerZipJoin(staticRanks2) { (vid, pr1, pr2) => + // Static PageRank should only take 2 iterations to converge --- End diff -- Why does it take only two iterations to converge now? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106535595 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def personalizedVertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { val (oldPR, lastDelta) = attr - var teleport = oldPR - val delta = if (src==id) resetProb else 0.0 - teleport = oldPR*delta - - val newPR = teleport + (1.0 - resetProb) * msgSum - val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR - (newPR, newDelta) + val newPR = if (lastDelta == Double.NegativeInfinity) { --- End diff -- My memory of the algorithm is a bit rusty. Why don't you need to check for self-loops here anymore? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106528007 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -162,7 +162,15 @@ object PageRank extends Logging { iteration += 1 } -rankGraph +// If the graph has sinks (vertices with no outgoing edges) the sum of ranks will not be correct --- End diff -- put the name of the ticket as well --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16483 [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks ## What changes were proposed in this pull request? Graphs with sinks (vertices with no outgoing edges) don't have the expected rank sum of n (or 1 for personalized). We fix this by normalizing to the expected sum at the end of each implementation. Additionally this fixes the dynamic version of personal pagerank which gave incorrect answers that were not detected by existing unit tests. ## How was this patch tested? Revamped existing and additional unit tests with reference values (and reproduction code) from igraph and NetworkX. Note that for comparison on personal pagerank we use the arpack algorithm in igraph as prpack (the current default) redistributes rank to all vertices uniformly instead of just to the personalization source. We could take the alternate convention (redistribute rank to all vertices uniformly) but that would involve more extensive changes to the algorithms (the dynamic version would no longer be able to use Pregel). You can merge this pull request into a Git repository by running: $ git pull https://github.com/aray/spark pagerank-sink2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16483.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16483 commit 41178a3f7310eb30b2eebbac1fac532196ad3432 Author: Andrew RayDate: 2017-01-06T06:24:21Z page rank sink fixes and unit tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org