Hi,

I have 10 edge data files in my HDFS directory, e.g. edges_part00, edges_part01, ..., edges_part09, each in the format: srcId tarId. (They already form a good partitioning of the whole graph, so I do not want any change, i.e. any repartitioning, to be applied to them during graph building.)

I am thinking about how to use them to construct a graph with the GraphX API, without any repartitioning.

My idea:

1) Build an RDD, edgeTupleRDD, using sc.textFile("hdfs://myDirectory"), where each file is kept below 64MB (i.e. smaller than an HDFS block). Normally I should then get 1 partition per file, right?

2) Build the graph using Graph.fromEdgeTuples(edgeTupleRDD, ..). From the GraphX documentation, this operation should keep those partitions unchanged, right?

Questions:

- Is there any other approach, or anything I have missed?
- If a file is larger than 64MB (the default size of an HDFS block), is repartitioning inevitable?

Thanks in advance!

Best,
Yifan LI
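
P.S. To make the two steps above concrete, here is a minimal sketch of what I have in mind. It is untested on my side and rests on a few assumptions: the fields on each line are separated by whitespace, the vertex attribute (1) is an arbitrary placeholder, the object name BuildGraphFromParts is made up for illustration, and the job is launched through spark-submit so that the master is set externally. It also prints the partition counts before and after graph construction, which is how I would check whether the number of partitions is kept. Note that an equal count alone does not prove that no edge was moved between partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical driver object; the name is only for illustration.
object BuildGraphFromParts {
  def main(args: Array[String]): Unit = {
    // Master URL is assumed to be supplied by spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("BuildGraphFromParts"))

    // Step 1: read every file in the directory; each line is "srcId tarId".
    // Assumption: fields are separated by whitespace.
    val edgeTupleRDD: RDD[(VertexId, VertexId)] =
      sc.textFile("hdfs://myDirectory")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map { line =>
          val fields = line.split("\\s+")
          (fields(0).toLong, fields(1).toLong)
        }

    // map and filter are narrow transformations, so this count is the
    // number of input splits chosen by textFile.
    println(s"input partitions: ${edgeTupleRDD.partitions.length}")

    // Step 2: build the graph. uniqueEdges is left at its default (None),
    // so no PartitionStrategy is passed in here.
    val graph = Graph.fromEdgeTuples(edgeTupleRDD, defaultValue = 1)

    // Compare with the input partition count printed above.
    println(s"edge partitions:  ${graph.edges.partitions.length}")

    sc.stop()
  }
}
```

If the two printed numbers are both 10, the partition count at least matches the 10 input files; whether the edges themselves stay where they were is exactly what I am asking about.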