Re: [GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Ankur Dave
At 2014-08-04 20:52:26 +0800, Bin  wrote:
> I wonder how spark parameters, e.g., number of paralellism, affect Pregel 
> performance? Specifically, sendmessage, mergemessage, and vertexprogram?
>
> I have tried label propagation on a 300,000 edges graph, and I found that no 
> paralellism is much faster than 5 or 500 paralellism.

Increasing the level of parallelism will increase storage overhead (because 
each vertex will need to be replicated to more edge partitions to form the 
triplets) and will also increase communication. Unless there's something to be 
gained from higher parallelism, this will worsen performance. Additionally, 
going from no parallelism to some parallelism will incur the extra cost of task 
communication via shuffles.

Parallelism has two benefits: it allows edge scans and aggregations to proceed 
in parallel, and it enables the graph to be stored across many machines.

For small graphs, the slight performance gain due to parallelism is vastly 
outweighed by the cost of inter-process communication and shuffling to disk, 
and distributed storage is not necessary since the graph fits on a single 
machine. There are single-machine graph processing systems such as X-Stream [1] 
and GraphChi [2] that optimize performance for these kinds of graphs.

However, parallelism becomes necessary for larger graphs with hundreds of 
millions of edges or large amounts of associated vertex and edge data. GraphX 
is designed for this scale of data.

Ankur

[1] http://infoscience.epfl.ch/record/188535/files/paper.pdf
[2] http://graphlab.org/projects/graphchi.html

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[GraphX] How spark parameters relate to Pregel implementation

2014-08-04 Thread Bin
Hi all,


I wonder how spark parameters, e.g., number of paralellism, affect Pregel 
performance? Specifically, sendmessage, mergemessage, and vertexprogram?


I have tried label propagation on a 300,000 edges graph, and I found that no 
paralellism is much faster than 5 or 500 paralellism.


Looking for advice!


Thanks a lot!


Best,
Bin