Re: [GraphX] Preserving Partitions when reading from HDFS
If I understand correctly, this would set the split size in the Hadoop
configuration when reading a file. I can see that being useful when you want
to create more partitions than the block size in HDFS might dictate. Instead,
what I want to do is create a single partition for each file written by a task
(from, say, a previous job), i.e. the data in part-0 forms partition 1, part-1
forms partition 2, and so on.

- Bilal

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size", "33554432")`
> to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
>> files in the output folder, with the number of files equal to the number
>> of partitions.
>>
>> However, reading the edges back creates a number of partitions equal to
>> the number of blocks in the HDFS folder. Is there a way to instead create
>> the same number of partitions as the number of files written to HDFS,
>> preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
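One way to get the one-partition-per-file behavior Bilal describes above is to raise the *minimum* split size above the size of the largest part-file: Hadoop's FileInputFormat never lets a split span file boundaries, so each file then yields exactly one split and hence one Spark partition. A minimal sketch, assuming the edges were saved as text files (the HDFS path is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("one-partition-per-file"))

// Force the minimum split size above any part-file so no file is ever split.
// "mapred.min.split.size" is the legacy key used by sc.textFile's old-API
// input format; "mapreduce.input.fileinputformat.split.minsize" is the
// mapreduce-era equivalent.
sc.hadoopConfiguration.set("mapred.min.split.size", Long.MaxValue.toString)

val edges = sc.textFile("hdfs:///path/to/edges")  // hypothetical path
println(edges.getNumPartitions)  // one partition per part-file
```

Note this restores the partition count and per-file grouping, but Spark has no way of knowing the layout matches the original custom partitioner, so any co-location guarantee still has to be reattached downstream.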
Re: [GraphX] Preserving Partitions when reading from HDFS
You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size", "33554432")`
to tune the partition size when reading from HDFS.

Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs in
> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
> files in the output folder, with the number of files equal to the number
> of partitions.
>
> However, reading the edges back creates a number of partitions equal to
> the number of blocks in the HDFS folder. Is there a way to instead create
> the same number of partitions as the number of files written to HDFS,
> preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
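Manu's suggestion above caps the maximum split size, so files larger than the cap are broken into additional partitions. A short sketch of how the setting would be used, with the 32 MB value (33554432 bytes) from the message; the path is illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("split-size-demo"))

// 33554432 bytes = 32 MB: no input split (and hence no read partition)
// will be larger than this when reading the files below.
sc.hadoopConfiguration.set("mapred.max.split.size", "33554432")

val edges = sc.textFile("hdfs:///path/to/edges")  // hypothetical path
println(edges.getNumPartitions)
```

This increases the partition count relative to the HDFS block size; it does not merge each part-file into a single partition, which is why it does not address the original question directly.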
[GraphX] Preserving Partitions when reading from HDFS
Hi,

I have implemented a custom partitioning algorithm to partition graphs in
GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
files in the output folder, with the number of files equal to the number of
partitions.

However, reading the edges back creates a number of partitions equal to the
number of blocks in the HDFS folder. Is there a way to instead create the
same number of partitions as the number of files written to HDFS, preserving
the original partitioning?

I would like to avoid repartitioning.

Thanks.
- Bilal
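The mismatch described above can be reproduced in a few lines: `saveAsTextFile` writes one part-file per RDD partition, while `textFile` derives its partition count from HDFS input splits rather than from the file boundaries. A minimal sketch (names and paths are illustrative, not from the original thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("read-back-demo"))

// Four partitions in, four part-files out (part-00000 .. part-00003).
val edges = sc.parallelize(1 to 1000, numSlices = 4).map(i => s"$i ${i + 1}")
edges.saveAsTextFile("hdfs:///tmp/edges")  // hypothetical path

// Reading back: the partition count follows HDFS block/split boundaries,
// so it need not equal the original four.
val back = sc.textFile("hdfs:///tmp/edges")
println(back.getNumPartitions)
```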