At 2014-08-04 20:52:26 +0800, Bin wrote:
> I wonder how spark parameters, e.g., number of paralellism, affect Pregel
> performance? Specifically, sendmessage, mergemessage, and vertexprogram?
>
> I have tried label propagation on a 300,000 edges graph, and I found that no
> paralellism is much faster than 5 or 500 paralellism.
Increasing the level of parallelism will increase storage overhead (because
each vertex will need to be replicated to more edge partitions to form the
triplets) and will also increase communication. Unless there's something to be
gained from higher parallelism, this will worsen performance. Additionally,
going from no parallelism to some parallelism will incur the extra cost of task
communication via shuffles.
Parallelism has two benefits: it allows edge scans and aggregations to proceed
in parallel, and it enables the graph to be stored across many machines.
For small graphs, the slight performance gain due to parallelism is vastly
outweighed by the cost of inter-process communication and shuffling to disk,
and distributed storage is not necessary since the graph fits on a single
machine. There are single-machine graph processing systems such as X-Stream [1]
and GraphChi [2] that optimize performance for these kinds of graphs.
However, parallelism becomes necessary for larger graphs with hundreds of
millions of edges or large amounts of associated vertex and edge data. GraphX
is designed for this scale of data.
Ankur
[1] http://infoscience.epfl.ch/record/188535/files/paper.pdf
[2] http://graphlab.org/projects/graphchi.html
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org