Hi All,

I have a three-machine cluster and two RDDs, each consisting of (K,V)
pairs. Both RDDs have just three keys: 'a', 'b' and 'c'.

    import org.apache.spark.HashPartitioner

    // list1 - List(('a',1), ('b',2), ....
    val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3))

    // list2 - List(('a',2), ('b',7), ....
    val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3))

By using a HashPartitioner with 3 partitions, I can make each of the keys
('a', 'b' and 'c') in each RDD land in a different partition, and so on a
different machine in the cluster (based on the key's hashCode).
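To make the hashCode-based placement concrete, here is a minimal sketch in plain Scala (no Spark needed) of the rule a hash partitioner applies: the key's hashCode modulo the number of partitions, kept non-negative. The object name HashPartitionSketch and the helper partitionFor are hypothetical names used only for illustration, not Spark APIs. The sketch only shows which partition index a key maps to; it says nothing about which machine ends up hosting that partition.

```scala
// Sketch of the key-to-partition rule used by a hash partitioner.
// `partitionFor` is a hypothetical helper, not part of Spark.
object HashPartitionSketch {
  // Non-negative remainder, so negative hash codes still map into
  // the range [0, numPartitions).
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val keys = List('a', 'b', 'c')
    // The index depends only on the key and the partition count,
    // so the same key gets the same index no matter which list
    // (list1 or list2) it came from.
    keys.foreach { k =>
      println(s"$k -> partition ${partitionFor(k, 3)}")
    }
  }
}
```

With Char keys the hash codes are 97, 98 and 99, so the three keys fall into three distinct partition indices (1, 2 and 0) when there are 3 partitions.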

The problem is that I cannot deterministically get the same allocation for
the second RDD (all 'a's from rdd2 going to the same machine where the 'a's
from the first RDD went).

Is there a way to achieve this?

Manoj
