Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly, this would set the split size in the Hadoop
configuration when reading the files. I can see that being useful when you
want to create more partitions than the HDFS block size would dictate.
Instead, what I want is a single partition for each file written by a task
(from, say, a previous job), i.e. the data in part-0 forms the first
partition, part-1 the second, and so on.
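
For reference, the closest workaround I know of is to force one input
split per file by raising the minimum split size above the largest part
file: FileInputFormat never lets a split span files, and it won't split a
file below the minimum, so each part file comes back as exactly one
partition. A minimal sketch (the path argument is a placeholder, and it
assumes the edges were saved as text part files and are read back with
sc.textFile):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def readOnePartitionPerFile(sc: SparkContext, path: String): RDD[String] = {
      // A split never spans files, and no file is split below the minimum
      // split size, so a huge minimum yields one split (= one Spark
      // partition) per part file.
      sc.hadoopConfiguration.set(
        "mapreduce.input.fileinputformat.split.minsize",
        Long.MaxValue.toString)
      sc.textFile(path) // getNumPartitions == number of part files
    }

Note that this only restores the grouping of edges into partitions; Spark
does not recover a partitioner from HDFS, so partition-aware operations
will still treat the data as unpartitioned.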

- Bilal

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size",
> "33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs
>> in GraphX. Saving the partitioned graph (the edges) to HDFS creates
>> separate files in the output folder, with the number of files equal to
>> the number of partitions.
>>
>> However, reading the edges back creates a number of partitions equal to
>> the number of blocks in the HDFS folder. Is there a way to instead
>> create the same number of partitions as the number of files written to
>> HDFS, while preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
>>
>


Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang
You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size",
"33554432")` to tune the partition size when reading from HDFS.

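For example, a minimal sketch (the 32 MB value and the path are
placeholders; `mapred.max.split.size` is the deprecated name of
`mapreduce.input.fileinputformat.split.maxsize`, and the cap is honored by
the new-API FileInputFormat, hence newAPIHadoopFile):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def readWithSplitCap(sc: SparkContext, path: String): RDD[String] = {
      // Cap input splits at 32 MB so a 128 MB HDFS block yields ~4 splits,
      // i.e. ~4 Spark partitions per block.
      sc.hadoopConfiguration.set("mapred.max.split.size", "33554432")
      sc.newAPIHadoopFile(path, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text])
        .map(_._2.toString)
    }
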
Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs
> in GraphX. Saving the partitioned graph (the edges) to HDFS creates
> separate files in the output folder, with the number of files equal to
> the number of partitions.
>
> However, reading the edges back creates a number of partitions equal to
> the number of blocks in the HDFS folder. Is there a way to instead
> create the same number of partitions as the number of files written to
> HDFS, while preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
>