How to avoid the repartitioning in graph construction

Yifan LI Fri, 27 Mar 2015 10:31:26 -0700

Hi,

I have 10 edge data files in an HDFS directory, e.g. edges_part00, 
edges_part01, …, edges_part09
Format: srcId tarId
(They already form a good partitioning of the whole graph, so I do not want 
any changes (repartitioning operations) applied to them during graph building.)


————

I am wondering how to use them to construct a graph with the GraphX API, 
without any repartitioning.

My idea:
1) Build an RDD, edgeTupleRDD, using sc.textFile("hdfs://myDirectory"),
where each file is smaller than 64MB (i.e. smaller than an HDFS block),
so normally I should get 1 partition per file, right?

2) Then build the graph using Graph.fromEdgeTuples(edgeTupleRDD,..).
According to the GraphX documentation, this operation keeps those partitions 
unchanged, right?
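
A minimal sketch of the two steps above, to make the idea concrete. The 
object name BuildGraphSketch and the helper parseEdge are hypothetical names 
used only for illustration, defaultValue = 1 is an arbitrary vertex 
attribute, and the lines are assumed to be whitespace-separated "srcId tarId" 
pairs. It assumes the job is launched with spark-submit, so that the master 
is supplied externally. The partition counts are printed rather than assumed, 
since they depend on the file sizes and the HDFS block size.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object BuildGraphSketch {
  // Hypothetical helper: parse one "srcId tarId" line into an edge tuple.
  def parseEdge(line: String): (VertexId, VertexId) = {
    val fields = line.trim.split("\\s+")
    (fields(0).toLong, fields(1).toLong)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: the master URL is provided by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("BuildGraphSketch"))

    // Step 1: read the whole directory. filter and map are narrow
    // transformations, so they keep the partitions produced by textFile.
    val edgeTupleRDD: RDD[(VertexId, VertexId)] =
      sc.textFile("hdfs://myDirectory")
        .filter(_.trim.nonEmpty)
        .map(parseEdge)
    println(s"input partitions: ${edgeTupleRDD.partitions.length}")

    // Step 2: build the graph. uniqueEdges is left at its default (None),
    // i.e. no PartitionStrategy is passed to the builder.
    val graph = Graph.fromEdgeTuples(edgeTupleRDD, defaultValue = 1)
    println(s"edge partitions: ${graph.edges.partitions.length}")

    sc.stop()
  }
}
```

Comparing the two printed counts with the number of input files (10 here) 
would show whether the one-partition-per-file assumption actually holds.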

——— — 
- Is there any other approach, or anything I missed?
- If a file is larger than 64MB (the default size of an HDFS block), will 
repartitioning be inevitable?



Thanks in advance!

Best,
Yifan LI
