GitHub user aray commented on the issue:

https://github.com/apache/spark/pull/16271

**References**

[PageRank paper](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf)

> We need to make an initial assignment of the ranks. This assignment can be made by one of several strategies. If it is going to iterate until convergence, in general the initial values will not affect final values, just the rate of convergence. But we can speed up convergence by choosing a good initial assignment.

Since they are more focused on updating values for one evolving graph (the internet), they don't really talk about starting from scratch. But this does emphasize that the initial values do not change the answers, just the rate of convergence.

A more direct statement would be [Wikipedia](https://en.wikipedia.org/wiki/PageRank):

> PageRank is initialized to the same value for all pages. In the original form of PageRank, the sum of PageRank over all pages was the total number of pages on the web at that time, so each page in this example would have an initial value of 1.

Note that there are two variants of PageRank whose outputs differ by a constant multiple but are otherwise determined by the damping factor. We use the version that sums to N (most other implementations use the other). More Wikipedia:

> The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank is multiplied by N and the sum becomes N.

Essentially, starting with the correct sum puts you closer to the actual fixed point and thus gets you faster convergence.

The [NetworkX implementation](https://github.com/networkx/networkx/blob/master/networkx/algorithms/link_analysis/pagerank_alg.py#L122) uses the variant that sums to 1, hence its initialization values are all 1/N.
igraph is unfortunately not comparable, as they use a [more complex linear solver approach](https://github.com/igraph/igraph/blob/master/src/prpack/prpack_solver.cpp).

Additional credentials (if it matters): PhD in Mathematics with a dissertation in Graph Theory.
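To make the convergence argument concrete, here is a minimal sketch of power iteration for the sum-to-N variant. It is not code from this PR or from GraphX: the function name `pagerank_sum_to_n`, the toy graph, and the tolerance are all made up for illustration, and it assumes every node has at least one out-link (no dangling nodes). It runs the same iteration from two starting points, one where each rank is 1 (total N) and one where each rank is 1/N (total 1).

```python
def pagerank_sum_to_n(out_links, init, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the PageRank variant whose ranks sum to N.

    out_links maps each node to the list of nodes it links to. Every node
    is assumed to have at least one out-link (no dangling nodes).
    Returns (ranks, iterations_used).
    """
    ranks = {node: init for node in out_links}
    for iteration in range(1, max_iter + 1):
        # Every node starts each round with the (1 - d) "teleport" term.
        new_ranks = {node: 1.0 - damping for node in out_links}
        # Each node splits its damped rank evenly among its out-links.
        for node, targets in out_links.items():
            share = damping * ranks[node] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # L1 distance between successive iterates is the stopping criterion.
        delta = sum(abs(new_ranks[n] - ranks[n]) for n in out_links)
        ranks = new_ranks
        if delta < tol:
            return ranks, iteration
    return ranks, max_iter


# Hypothetical toy graph: node 3 has no in-links, so the ranks are not uniform.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(graph)

ranks_one, iters_one = pagerank_sum_to_n(graph, init=1.0)        # starts at sum N
ranks_frac, iters_frac = pagerank_sum_to_n(graph, init=1.0 / n)  # starts at sum 1

print("init 1  :", iters_one, "iterations")
print("init 1/N:", iters_frac, "iterations")
print("largest difference in final ranks:",
      max(abs(ranks_one[k] - ranks_frac[k]) for k in graph))
```

On this toy graph both runs should end at the same ranks (up to the tolerance), and the run that starts with the correct total should need fewer iterations. A single small graph is only an illustration of the argument, not a benchmark.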