If I understand correctly, this would set the split size in the Hadoop configuration when reading files. I can see that being useful when you want to create more partitions than the HDFS block size would otherwise dictate. What I want instead is a single partition for each file written by a task (from, say, a previous job), i.e. the data in part-00000 forms partition 1, part-00001 forms partition 2, and so on.
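To make it concrete, something like the rough sketch below is what I'm after (untested; it assumes the spark-shell's `sc`, that the edges were saved as text, and a placeholder output path). Since input splits never span files, pushing the minimum split size above the size of the largest part file should give exactly one partition per part file:

    // Rough sketch: force one input split (and hence one Spark partition)
    // per part file by making the minimum split size larger than any file.
    // "hdfs:///output/edges" stands in for the folder the previous job wrote.
    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.minsize",  // "mapred.min.split.size" on older Hadoop
      Long.MaxValue.toString)

    val edges = sc.textFile("hdfs:///output/edges")
    // Expect edges.getNumPartitions to equal the number of part-xxxxx files,
    // with part-00000 becoming partition 0, part-00001 partition 1, and so on.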
- Bilal

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang <owenzhang1...@gmail.com> wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size", "33554432")`
> to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal <mbilalce....@gmail.com> wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
>> files in the output folder, with the number of files equal to the number
>> of partitions.
>>
>> However, reading back the edges creates a number of partitions equal to
>> the number of blocks in the HDFS folder. Is there a way to instead create
>> the same number of partitions as the number of files written to HDFS
>> while preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
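(For reference, my reading of the suggested tweak is something like the sketch below; the path and the 32 MB figure from the reply are placeholders. If I follow the Hadoop code correctly, the deprecated `mapred.max.split.size` key maps to `mapreduce.input.fileinputformat.split.maxsize`, which the new-API input formats honor.)

    // Sketch of the quoted suggestion as I understand it: cap the input split
    // size so large HDFS files are read back as more, smaller partitions.
    // The path and the 32 MB cap are placeholders.
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    sc.hadoopConfiguration.set(
      "mapreduce.input.fileinputformat.split.maxsize", "33554432")  // 32 MB

    val lines = sc
      .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///output/edges")
      .map(_._2.toString)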