Re: "the default GraphX graph-partition strategy on multicore machine"?

Yifan LI Tue, 15 Jul 2014 03:06:41 -0700

Dear Ankur,

Thanks so much!


Btw, is it possible to customise the partition strategy to suit our own needs?


Best,
Yifan
On Jul 11, 2014, at 10:20 PM, Ankur Dave <ankurd...@gmail.com> wrote:

> Hi Yifan,
> 
> When you run Spark on a single machine, it uses a local mode where one task 
> per core can be executed at a time -- that is, the level of parallelism is 
> the same as the number of cores.
> 
> To take advantage of this, when you load a file using sc.textFile, you should 
> set the minPartitions argument to be the number of cores (available from 
> sc.defaultParallelism) or a multiple thereof. This will split up your local 
> edge file and allow you to take advantage of all the machine's cores.
> 
> Once you've loaded the edge RDD with the appropriate number of partitions and 
> constructed a graph using it, GraphX will leave the edge partitioning alone. 
> During graph computation, each vertex will automatically be copied to the 
> edge partitions where it is needed, and the computation will execute in 
> parallel on each of the edge partitions (cores).
> 
> If you later call Graph.partitionBy, it will by default preserve the number 
> of edge partitions, but shuffle around the edges according to the partition 
> strategy. This won't change the level of parallelism, but it might decrease 
> the amount of inter-core communication.
> 
> Hope that helps! By the way, do continue to post your GraphX questions to the 
> Spark user list if possible. I'll probably still be the one answering them, 
> but that way others can benefit as well.
> 
> Ankur
> 
> 
> On Fri, Jul 11, 2014 at 3:05 AM, Yifan LI <iamyifa...@gmail.com> wrote:
> Hi Ankur,
> 
> I am doing graph computation using GraphX on a single multicore machine (not a 
> cluster).
> But it seems that I can't find enough docs on how GraphX partitions a 
> graph on a multicore machine.
> Could you give me an introduction or point me to some docs?
> 
> For instance, I have a single edge file (not on HDFS, etc.), which follows the 
> "srcID, dstID, edgeProperties" format, maybe 100MB or 500GB in size,
> and the latest Spark 1.0.0 (with GraphX) has been installed on a 64-bit, 8-CPU 
> machine.
> I plan to implement my own algorithm on top of it.
> 
> - By default, how is the edge data partitioned? Per CPU? Or per 
> process?
> 
> - If I later specify a partition strategy in partitionBy(), e.g. 
> PartitionStrategy.EdgePartition2D,
> what will happen? Will it work?
> 
> 
> Thanks in advance! :)
> 
> Best, 
> Yifan LI
> Univ. Paris-Sud/ Inria, Paris, France
>
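
For readers following the thread, the advice in the quoted reply can be sketched roughly as below. This is only a minimal sketch, not code from either author. It assumes an edge file at the hypothetical path "edges.txt" with one "srcID, dstID, edgeProperties" record per line, a local master of "local[*]" (one worker thread per core), and edge properties kept as plain strings. The object name and the default vertex value of 0 are also made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

object LocalGraphPartitioning {
  def main(args: Array[String]): Unit = {
    // Assumption: run in local mode, using as many worker threads as cores.
    val conf = new SparkConf()
      .setAppName("LocalGraphPartitioning")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical input path; one "srcID, dstID, edgeProperties" record per line.
    val path = "edges.txt"

    // Request at least one partition per core (or use a multiple of this value).
    val minPartitions = sc.defaultParallelism

    val edges = sc.textFile(path, minPartitions).map { line =>
      // Split into at most 3 fields so commas inside the properties are kept.
      val fields = line.split(",", 3).map(_.trim)
      Edge(fields(0).toLong, fields(1).toLong, fields(2))
    }

    // GraphX keeps the edge partitioning of the RDD it is given.
    // Every vertex gets the placeholder attribute 0 (an arbitrary choice).
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    println(s"edge partitions after loading: ${graph.edges.partitions.length}")

    // Optional: shuffle edges between the existing partitions according to
    // a partition strategy; by default the partition count is preserved.
    val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
    println(s"edge partitions after partitionBy: ${repartitioned.edges.partitions.length}")

    sc.stop()
  }
}
```

Printing the partition counts before and after partitionBy is just a quick way to check that the level of parallelism has not changed. Whether EdgePartition2D actually reduces communication will depend on the graph.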
