Re: RDD getPartitions() size and HashPartitioner numPartitions

Manish Malhotra Sun, 04 Dec 2016 01:10:59 -0800

Its a pretty nice question !
I'll trying to understand the problem, and see can help further.

When you say CustomRDD I believe you will using it in the  transformation
stage, once the data is read from a file source like HDFS or Cassandra or
Kafka.

Now the RDD.getPartitions() should return the partitions its having, and in
case of HashPartitioner (which will be used in functions like reduceByKey),
the partition of the key will be identified like

partition_num=(key.hashCode%numOfParttions) ......

so when the RDD(partitions) reaches to nodes (reducer phase) say after
reduceByKey, groupByKey can have few partitions based on the keys it
contains which are mapped to partitions.

So, I believe it is not required to match the numOfPartitons of the
HashPartitoner to the getPartitions of the RDD.

Thanks,
Manish

On Fri, Dec 2, 2016 at 1:53 PM, Amit Sela <amitsel...@gmail.com> wrote:

> This might be a silly question, but I wanted to make sure, when
> implementing my own RDD, if using a HashPartitioner as the RDD's
> partitioner the number of partitions returned by the implementation of
> getPartitions() has to match the number of partitions set in the
> HashPartitioner, correct ?
>

Re: RDD getPartitions() size and HashPartitioner numPartitions

Reply via email to