Re: [igraph] PageRank implementations give vastly different results

Tim Althoff Tue, 13 May 2014 19:38:17 -0700

Hi Tamas,

thanks for looking into it and your comprehensive response.


Is there any way to tell PRPACK what "epsilon" I want in the optimization.
I verified whether the actual values satisfy the PR equation. They are not
totally off but the difference between what they should be (given their
neighbors) and what it actually is is 10^-5 but the PR values themselves
are of the order 10^-5. So the relative errors seem pretty high.
Can I influence that precision? It's super fast anyway (100k nodes).

Thanks,
Tim


On Tue, May 13, 2014 at 3:22 PM, Tamás Nepusz <[email protected]> wrote:

> Hi Tim,
>
> Thanks for the graph; I'm not done with the investigations, but I already
> checked some basic things and I have a few comments to make here.
>
> First, don't use Graph.Read_Edgelist to read your file; use
> Graph.Read_NCOL instead. The problem is that igraph assumes that the
> edgelist format uses vertex IDs from zero to |V|-1, so you end up creating
> thousands of isolated vertices. If you use Graph.Read_Ncol, igraph will
> consider the numbers in the input file as symbolic vertex names and "make
> up" vertex IDs on its own, assigning the original IDs from the file in the
> "name" attribute of each vertex.
>
> The reason why I am mentioning this is that ARPACK is thrown off course
> very easily if there are lots of isolated vertices in the graph (and also
> this distorts the PageRank results anyway because random teleportations
> usually end up in one of these isolated vertices). If you use Read_Ncol
> instead of Read_Edgelist, the results are much more favourable; ARPACK and
> PRPACK agree in about 2/3 of the cases such that the maximum difference
> between the PageRank vectors calculated by ARPACK and PRPACK is in the
> order of 10^{-15}. (Ignore the power method -- it is deprecated and
> unsupported, it is left there for sake of compatibility with older versions
> only). In the remaining ~1/3 of the cases, I suspect that ARPACK fails to
> converge because the sum of the PageRank scores is not exactly 1.0 for
> ARPACK (sometimes it is 0.99, but I have even seen 1.02), and the maximum
> difference between ARPACK and PRPACK is *huge* -- in the order of 10^{12}.
> Further investigation shows that it is probably ARPA
> CK to blame because the PageRank vector in ARPACK's case contains negative
> values -- which clearly should never happen.
>
> I think I also know the reason why ARPACK fails to converge in some cases.
> We have seen it before many times that ARPACK is unreliable if the
> underlying graph is not strongly connected. PRPACK works around this by
> calculating PageRank scores separately for each strongly connected
> component and then combining them in a clever way to get the final PageRank
> vector. This conjecture seems to be confirmed by the fact that ARPACK and
> PRPACK agree completely in 1000 out of 1000 trials if I extract the largest
> strongly connected component of your graph -- but that contains only 20
> vertices so that's not a big thing on its own.
>
> The only way to actually validate whether PRPACK is correct is to see if
> the vector generated by PRPACK satisfies the PageRank equation. I have high
> confidence that this is indeed the case, but I will nevertheless try to
> confirm it numerically tomorrow or the day after.
>
> The bottom line is:
>
> - ignore the power method completely
> - if you must choose one method out of the three right now, definitely
> choose PRPACK
> - if you can wait a bit more, I'll let you know whether I find any issues
> with PRPACK's results or not
>
> All the best,
> Tamas
>
>
>
>

_______________________________________________
igraph-help mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/igraph-help

Re: [igraph] PageRank implementations give vastly different results

Reply via email to