Dear Ankur,

Thanks so much!

Btw, is it possible to customise the partition strategy ourselves?

Best,
Yifan

On Jul 11, 2014, at 10:20 PM, Ankur Dave <ankurd...@gmail.com> wrote:

> Hi Yifan,
>
> When you run Spark on a single machine, it uses a local mode where one task
> per core can be executed at a time -- that is, the level of parallelism is
> the same as the number of cores.
>
> To take advantage of this, when you load a file using sc.textFile, you should
> set the minPartitions argument to be the number of cores (available from
> sc.defaultParallelism) or a multiple thereof. This will split up your local
> edge file and allow you to take advantage of all the machine's cores.
>
> Once you've loaded the edge RDD with the appropriate number of partitions and
> constructed a graph using it, GraphX will leave the edge partitioning alone.
> During graph computation, each vertex will automatically be copied to the
> edge partitions where it is needed, and the computation will execute in
> parallel on each of the edge partitions (cores).
>
> If you later call Graph.partitionBy, it will by default preserve the number
> of edge partitions, but shuffle around the edges according to the partition
> strategy. This won't change the level of parallelism, but it might decrease
> the amount of inter-core communication.
>
> Hope that helps! By the way, do continue to post your GraphX questions to the
> Spark user list if possible. I'll probably still be the one answering them,
> but that way others can benefit as well.
>
> Ankur
>
>
> On Fri, Jul 11, 2014 at 3:05 AM, Yifan LI <iamyifa...@gmail.com> wrote:
> Hi Ankur,
>
> I am doing graph computation using GraphX on a single multicore machine (not a
> cluster).
> But it seems that I can't find enough docs on "how GraphX partitions a
> graph on a multicore machine".
> Could you give me an introduction or point me to some docs?
>
> For instance, I have a single edge file (not on HDFS, etc.), which follows the
> "srcID, dstID, edgeProperties" format, maybe 100MB or 500GB in size,
> and the latest Spark 1.0.0 (with GraphX) has been installed on a 64-bit, 8-CPU
> machine.
> I plan to implement my own algorithm, and have two questions:
>
> - By default, how is the edge data partitioned? Per CPU, or per
> process?
>
> - If I later specify a partition strategy in partitionBy(), e.g.
> PartitionStrategy.EdgePartition2D, what will happen? Will it work?
>
>
> Thanks in advance! :)
>
> Best,
> Yifan LI
> Univ. Paris-Sud/ Inria, Paris, France
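
A minimal Scala sketch of the steps described in Ankur's reply: load the edge file with one partition per core, build the graph, then repartition the edges. Several details here are assumptions rather than anything stated in the thread: the file path is hypothetical, each line is assumed to be comma-separated as "srcID, dstID, edgeProperties" with the property kept as a plain String, the master is set to local[*], every vertex gets a placeholder attribute of 0, and the object name is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Hypothetical driver object, for illustration only.
object LocalGraphPartitioning {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] runs Spark in local mode with one worker thread per core.
    val conf = new SparkConf()
      .setAppName("LocalGraphPartitioning")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical path to a local edge file.
    val path = "/path/to/edges.txt"

    // Request at least as many partitions as sc.defaultParallelism
    // (the number of cores in local mode), as suggested in the reply.
    val lines = sc.textFile(path, sc.defaultParallelism)

    // Assumed line format: "srcID, dstID, edgeProperties".
    // Lines with fewer than three fields would fail here.
    val edges = lines.map { line =>
      val fields = line.split(",", 3).map(_.trim)
      Edge(fields(0).toLong, fields(1).toLong, fields(2))
    }

    // Build the graph; vertices only get a placeholder attribute (0).
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // Shuffle edges according to the 2D strategy; by default the number
    // of edge partitions is preserved.
    val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)

    println(s"edge partitions: ${repartitioned.edges.partitions.length}")
    sc.stop()
  }
}
```

Passing a multiple of sc.defaultParallelism as the second argument to sc.textFile is the other option mentioned in the reply; which value works better for a given file size is not something this sketch tries to settle.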