This is the code that chooses the partition for a key:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L85-L88

It's basically `math.abs(key.hashCode % numberOfPartitions)`.
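
In case the link moves, here is a minimal sketch of that logic in Scala (paraphrased from the linked lines; the real code delegates to a non-negative-modulo helper rather than math.abs, and numPartitions is a constructor argument on HashPartitioner rather than a parameter as shown here):

    // Sketch of HashPartitioner.getPartition, paraphrased from the linked source.
    // The non-negative modulo keeps negative hashCodes inside [0, numPartitions),
    // and null keys always map to partition 0.
    def getPartition(key: Any, numPartitions: Int): Int = key match {
      case null => 0
      case _ =>
        val rawMod = key.hashCode % numPartitions
        rawMod + (if (rawMod < 0) numPartitions else 0)
    }

For example, getPartition("aa", 8) returns the same index for every record, because "aa".hashCode never changes.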

On Fri, Jun 23, 2017 at 3:42 AM, Vikash Pareek <
vikash.par...@infoobjects.com> wrote:

> I am trying to understand how Spark partitioning works.
>
> To understand this, I have the following piece of code on Spark 1.6:
>
>     def countByPartition1(rdd: RDD[(String, Int)]) = {
>         rdd.mapPartitions(iter => Iterator(iter.length))
>     }
>     def countByPartition2(rdd: RDD[String]) = {
>         rdd.mapPartitions(iter => Iterator(iter.length))
>     }
>
> //RDDs Creation
>     val rdd1 = sc.parallelize(Array(("aa", 1), ("aa", 1), ("aa", 1), ("aa", 1)), 8)
>     countByPartition1(rdd1).collect()
> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>
>     val rdd2 = sc.parallelize(Array("aa", "aa", "aa", "aa"), 8)
>     countByPartition2(rdd2).collect()
> >> Array[Int] = Array(0, 1, 0, 1, 0, 1, 0, 1)
>
> In both cases the data is distributed uniformly.
> I have the following questions based on the above observation:
>
>  1. In the case of rdd1, hash partitioning should calculate the hash code of
> the key (i.e. "aa" in this case), so shouldn't all records go to a single
> partition instead of being distributed uniformly?
>  2. In the case of rdd2, there is no key-value pair, so how does hash
> partitioning work, i.e. what is the key used to calculate the hash code?
>
> I have followed @zero323's answer below, but it does not answer these questions.
> https://stackoverflow.com/questions/31424396/how-does-hashpartitioner-work
>
> -----
>
> Vikash Pareek
