Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-25 Thread M Bilal

If I understand correctly, this would set the split size in the Hadoop
configuration when reading a file. I can see that being useful when you want
to create more partitions than the HDFS block size would dictate.
Instead, what I want is a single partition for each file
written by a task (say, from a previous job), i.e. the data in part-0 forms
partition 1, part-1 forms partition 2, and so on.

- Bilal

On Tue, Apr 16, 2019, 6:00 AM Manu Zhang wrote:

> You may try
> `sparkContext.hadoopConfiguration().set("mapred.max.split.size",
> "33554432")` to tune the partition size when reading from HDFS.
>
> Thanks,
> Manu Zhang
>
> On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:
>
>> Hi,
>>
>> I have implemented a custom partitioning algorithm to partition graphs in
>> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
>> files in the output folder, with the number of files equal to the number of
>> partitions.
>>
>> However, reading the edges back creates as many partitions as there are
>> blocks in the HDFS folder. Is there a way to instead
>> create the same number of partitions as the number of files written to HDFS
>> while preserving the original partitioning?
>>
>> I would like to avoid repartitioning.
>>
>> Thanks.
>> - Bilal
>>
>

Re: [GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread Manu Zhang

You may try
`sparkContext.hadoopConfiguration().set("mapred.max.split.size",
"33554432")` to tune the partition size when reading from HDFS.

Thanks,
Manu Zhang

On Mon, Apr 15, 2019 at 11:28 PM M Bilal wrote:

> Hi,
>
> I have implemented a custom partitioning algorithm to partition graphs in
> GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
> files in the output folder, with the number of files equal to the number of
> partitions.
>
> However, reading the edges back creates as many partitions as there are
> blocks in the HDFS folder. Is there a way to instead
> create the same number of partitions as the number of files written to HDFS
> while preserving the original partitioning?
>
> I would like to avoid repartitioning.
>
> Thanks.
> - Bilal
>

[GraphX] Preserving Partitions when reading from HDFS

2019-04-15 Thread M Bilal

Hi,

I have implemented a custom partitioning algorithm to partition graphs in
GraphX. Saving the partitioned graph (the edges) to HDFS creates separate
files in the output folder, with the number of files equal to the number of
partitions.

However, reading the edges back creates as many partitions as there are
blocks in the HDFS folder. Is there a way to instead
create the same number of partitions as the number of files written to HDFS
while preserving the original partitioning?

I would like to avoid repartitioning.

Thanks.
- Bilal
