Thanks everyone. As Nathan suggested, I ended up collecting the distinct keys first and then assigning Ids to each key explicitly.
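
For anyone who finds this thread later, here is a minimal sketch of that approach. It is not the exact code used and has not been tested. It assumes the set of distinct keys is small enough to collect on the driver. The names `ExactKeyPartitioner`, `OneKeyPerPartition` and `assignIds` are hypothetical, and the sample data is made up for illustration.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical partitioner: looks each key up in an explicit key -> id map,
// so every distinct key gets its own partition.
class ExactKeyPartitioner[K](ids: Map[K, Int]) extends Partitioner {
  override def numPartitions: Int = ids.size
  override def getPartition(key: Any): Int = ids(key.asInstanceOf[K])
}

object OneKeyPerPartition {
  // Assign ids 0..n-1 to the distinct keys, in first-seen order.
  def assignIds[K](keys: Seq[K]): Map[K, Int] = keys.distinct.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("one-key-per-partition")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data: three distinct keys.
      val rdd = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3, "b" -> 4, "a" -> 5))

      // Collect the distinct keys on the driver, then give each an explicit id.
      val ids = assignIds(rdd.keys.distinct().collect().toSeq)

      // Exactly ids.size partitions, one per key.
      val partitioned = rdd.partitionBy(new ExactKeyPartitioner(ids))

      // Print the set of keys that ended up in each partition.
      partitioned
        .mapPartitionsWithIndex((i, it) => Iterator((i, it.map(_._1).toSet)))
        .collect()
        .foreach { case (i, ks) => println(s"partition $i -> $ks") }
    } finally {
      sc.stop()
    }
  }
}
```

Because the key-to-id map is built on the driver, no join is needed. The trade-off is that the map is shipped with the partitioner to every task, so this only suits a modest number of keys.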

Regards
Sumit Chawla

On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <nkronenfeld@uncharted.software> wrote:

> On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit <sumitkcha...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I have been trying to do this simple operation. I want to land all
>>>>> values with one key in the same partition, and not have any different
>>>>> key in the same partition. Is this possible? I am always getting b
>>>>> and c mixed up in the same partition.
>>>>>
>
> I think you could do something approximately like:
>
>     val keys = rdd.map(_.getKey).distinct().zipWithIndex()
>     val numKeys = keys.count().toInt
>     rdd.map(r => (r.getKey, r)).join(keys)
>       .map { case (_, (r, id)) => (id, r) }  // re-key by the unique number
>       .partitionBy(new Partitioner() {
>         def numPartitions = numKeys
>         def getPartition(key: Any) = key.asInstanceOf[Long].toInt
>       })
>
> i.e., key by a unique number, count that, and repartition by key to the
> exact count. This presumes, of course, that the number of keys is < MAXINT.
>
> Also, I haven't tested this code, so don't take it as anything more than
> an approximate idea, please :-)
>
> -Nathan Kronenfeld