Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal
If I understand correctly, this would set the maximum split size in the Hadoop
configuration when reading a file. I can see that being useful when you want
to create more partitions than the HDFS block size would dictate.
Instead, what I want is a single partition for each file written by a task
(say, from a previous job), i.e. the data in part-0 forms partition 1, part-1
forms partition 2, and so on.

- Bilal
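
To make the difference between the two goals concrete, here is a minimal
sketch of the split-sizing rule used by Hadoop's new-API `FileInputFormat`
(split size = max(minSize, min(maxSize, blockSize)), with a 10% slop on the
last split). It is a simplified model, not Spark or Hadoop code: the class
name `SplitModel`, the three 200 MiB files, and the 128 MiB block size are
assumptions chosen for illustration. Read paths that go through the older
`mapred` API may size splits differently, so treat the numbers as indicative
only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of how FileInputFormat (new API) counts splits.
final class SplitModel {
    // The last split of a file may exceed the target size by up to 10%.
    static final double SPLIT_SLOP = 1.1;

    // Target split size: the block size clamped between minSize and maxSize.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits produced for one non-empty, splittable file.
    static int splitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail (or the whole file, if it fits in one split)
        }
        return splits;
    }

    // Splits are computed per file; files are never merged into one split.
    static int totalSplits(List<Long> fileLengths, long blockSize,
                           long minSize, long maxSize) {
        long size = splitSize(blockSize, minSize, maxSize);
        int total = 0;
        for (long length : fileLengths) {
            total += splitsForFile(length, size);
        }
        return total;
    }

    public static void main(String[] args) {
        long MiB = 1024L * 1024L;
        List<Long> files = Arrays.asList(200 * MiB, 200 * MiB, 200 * MiB);
        long block = 128 * MiB;

        // Defaults: one split per HDFS block.
        System.out.println(totalSplits(files, block, 1L, Long.MAX_VALUE));  // 6
        // Max split size of 33554432 bytes (32 MiB): more, smaller splits.
        System.out.println(totalSplits(files, block, 1L, 33554432L));       // 21
        // Min split size above every file size: one split per file.
        System.out.println(totalSplits(files, block, Long.MAX_VALUE,
                                       Long.MAX_VALUE));                    // 3
    }
}
```

Under this model, lowering the maximum split size can only increase the
partition count, while raising the minimum split size (the
`mapreduce.input.fileinputformat.split.minsize` property) above the largest
file size yields one split per file. Even then, the resulting RDD carries no
partitioner, so Spark only sees the same grouping of records, not the
partitioning function that produced it.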

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size",
> "33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
>> files in the output folder, with the number of files equal to the number of
>> partitions.
>>
>> However, reading the edges back creates a number of partitions equal to the
>> number of blocks in the HDFS folder. Is there a way to instead create the
>> same number of partitions as the number of files written to HDFS, while
>> preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
>>
>


Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang
You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size",
"33554432")` to tune the partition size when reading from HDFS.

Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs in
> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
> files in the output folder, with the number of files equal to the number of
> partitions.
>
> However, reading the edges back creates a number of partitions equal to the
> number of blocks in the HDFS folder. Is there a way to instead create the
> same number of partitions as the number of files written to HDFS, while
> preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
>


[GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread M Bilal
Hi,

I have implemented a custom partitioning algorithm to partition graphs in
GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
files in the output folder, with the number of files equal to the number of
partitions.

However, reading the edges back creates a number of partitions equal to the
number of blocks in the HDFS folder. Is there a way to instead create the
same number of partitions as the number of files written to HDFS, while
preserving the original partitioning?

I would like to avoid repartitioning.

Thanks.
- Bilal