Hi,

Colleagues and I have found that the PageRank implementation bundled
with Spark is incorrect in several ways.  The code in question is
"SparkPageRank.scala" in the "examples" directory of the Apache
Spark 1.2 distribution.

Consider the example graph presented in the colorful figure on the
Wikipedia page for "PageRank"; below is an edge list representation,
where vertex "A" is "1", "B" is "2", etc.:

- - - - - begin
2 3
3 2
4 1
4 2
5 2
5 4
5 6
6 2
6 5
7 2
7 5
8 2
8 5
9 2
9 5
10 5
11 5
- - - - - end

Here's the output we get from Spark's PageRank after 100 iterations:

B has rank: 1.9184837009011475.
C has rank: 1.7807113697064196.
E has rank: 0.24301279014684984.
A has rank: 0.24301279014684984.
D has rank: 0.21885362387494078.
F has rank: 0.21885362387494078.

There are three problems with the output:

1. Only six of the eleven vertices are represented in the output;
   by definition, PageRank assigns a value to each vertex.

2. The values do not sum to 1.0; by definition, PageRank is a
   probability distribution with one element per vertex, so the
   elements must sum to 1.0 (see the formula after this list).

3. Vertices E and A receive the same PageRank, whereas other means of
   computing PageRank, e.g., our own homebrew code and the method
   used by Wikipedia, assign different values to these vertices.  Our
   own code agrees with the PageRank implementation in the "NetworkX"
   package.

It looks like bug #1 is due to the Spark implementation of PageRank
not emitting output for vertices with no incoming edges, and bug #3
is due to the code not correctly handling vertices with no outgoing
edges.  Once #1 and #3 are fixed, normalization might be all that's
required to fix #2; a sketch of the kind of fix we have in mind
appears below.
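
To make this concrete, here is a rough sketch of the kind of fix we
have in mind.  It is a standalone guess written by us, not a patch
against SparkPageRank.scala; the names and structure are ours, and we
have only assumed the example's general shape (an RDD of
(source, neighbors) pairs and damping factor 0.85):

- - - - - begin
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair/double RDD implicits (Spark 1.2)

object PageRankFixSketch {
  // Usage: PageRankFixSketch <edge-list file> <iterations>
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankFixSketch"))
    val iters = if (args.length > 1) args(1).toInt else 100
    val d = 0.85

    val edges = sc.textFile(args(0)).map { s =>
      val parts = s.split("\\s+")
      (parts(0), parts(1))
    }.distinct().cache()
    val links = edges.groupByKey().cache()

    // Bug #1: build the vertex set from BOTH edge endpoints, so that
    // vertices with no incoming edges still receive a rank.
    val vertices = edges.flatMap { case (u, v) => Seq(u, v) }.distinct().cache()
    val n = vertices.count()

    var ranks = vertices.map(v => (v, 1.0 / n))

    for (_ <- 1 to iters) {
      // Contributions along outgoing edges, as in the original example.
      val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
        urls.map(url => (url, rank / urls.size))
      }.reduceByKey(_ + _)

      // Bug #3: rank sitting on dangling vertices (no outgoing edges)
      // is silently dropped by the current code; spread it uniformly.
      val danglingMass = ranks.subtractByKey(links).values.sum()

      ranks = vertices.map(v => (v, ()))
        .leftOuterJoin(contribs)
        .mapValues { case (_, c) =>
          (1 - d) / n + d * (c.getOrElse(0.0) + danglingMass / n)
        }
    }

    ranks.collect().sortBy(-_._2).foreach { case (v, r) =>
      println(s"$v has rank: $r.")
    }
    sc.stop()
  }
}
- - - - - end

With the full vertex set and the dangling mass folded back in, the
ranks sum to 1.0 by construction at every iteration, so bug #2 should
disappear without a separate normalization pass.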

We currently rely on the Spark PageRank for tests we're
conducting; when do you think a fix might be ready?

Thanks.

-- Terence Kelly, HP Labs

