Hi,

My colleagues and I have found that the PageRank implementation bundled with Spark is incorrect in several ways. The code in question is in the Apache Spark 1.2 distribution's "examples" directory, in the file "SparkPageRank.scala".
Consider the example graph presented in the colorful figure on the Wikipedia page for "PageRank"; below is an edge-list representation, where vertex "A" is "1", "B" is "2", etc.:

- - - - - begin
2 3
3 2
4 1
4 2
5 2
5 4
5 6
6 2
6 5
7 2
7 5
8 2
8 5
9 2
9 5
10 5
11 5
- - - - - end

Here's the output we get from Spark's PageRank after 100 iterations:

B has rank: 1.9184837009011475
C has rank: 1.7807113697064196
E has rank: 0.24301279014684984
A has rank: 0.24301279014684984
D has rank: 0.21885362387494078
F has rank: 0.21885362387494078

There are three problems with this output:

1. Only six of the eleven vertices are represented; by definition, PageRank assigns a value to every vertex.

2. The values do not sum to 1.0; by definition, PageRank is a probability vector with one element per vertex, and the elements of that vector must sum to 1.0.

3. Vertices E and A receive the same PageRank, whereas other means of computing PageRank, e.g., our own homebrew code and the method used on the Wikipedia page, assign these vertices different values. We have checked our own code against the PageRank implementation in the "NetworkX" package, and the two agree.

It looks like bug #1 occurs because the Spark implementation emits no output for vertices with no incoming edges, and bug #3 occurs because it does not correctly handle vertices with no outgoing edges (dangling vertices). Once #1 and #3 are fixed, normalization might be all that's required to fix #2 (maybe).

We currently rely on the Spark PageRank for tests we're conducting; when do you think a fix might be ready?

Thanks.

-- Terence Kelly, HP Labs
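P.S. For anyone who wants to reproduce the comparison, below is a minimal power-iteration PageRank sketch in Python that illustrates the three fixes. This is not the Spark examples code; the damping factor of 0.85, the uniform teleport, and the uniform redistribution of dangling mass are our assumptions about a conventional formulation.

```python
def pagerank(edges, d=0.85, iters=100):
    # Collect every vertex appearing on either end of an edge, so vertices
    # with no incoming edges (G..K in the example) still receive a rank
    # (addresses bug #1).
    nodes = sorted({u for e in edges for u in e})
    n = len(nodes)
    out = {u: [] for u in nodes}
    for u, v in edges:
        out[u].append(v)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        # Mass sitting on dangling vertices (no out-edges, e.g. A) is
        # redistributed uniformly rather than lost, which keeps the vector
        # summing to 1.0 and separates A from E (addresses bugs #2 and #3).
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += d * rank[u] / len(out[u])
        rank = new
    return rank

# Edge list from the Wikipedia example graph (A = 1, B = 2, ...).
edges = [(2, 3), (3, 2), (4, 1), (4, 2), (5, 2), (5, 4), (5, 6),
         (6, 2), (6, 5), (7, 2), (7, 5), (8, 2), (8, 5), (9, 2),
         (9, 5), (10, 5), (11, 5)]
ranks = pagerank(edges)
```

With this formulation all eleven vertices appear in the result, the ranks sum to 1.0, and A and E receive distinct values (A's is smaller), consistent with what we see from NetworkX.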