Based on reading the code, it looks like Spark takes the partitioner's return value modulo numPartitions to pick the partition. The hashes of 'b' and 'c' end up mapping to the same partition index. What's the best partitioning scheme to deal with this?
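For example, the two hash values printed in the run quoted below both reduce to the same partition once taken modulo n = 4 (just the arithmetic, using the numbers from that output):

    >>> -1175849324817995036 % 4    # portable_hash('b') from the output below
    0
    >>> 1421958803217889556 % 4     # portable_hash('c') from the output below
    0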
Regards
Sumit Chawla

On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:

> Hi
>
> I have been trying to do this simple operation. I want to land all values
> with one key in the same partition, and not have any different key in the
> same partition. Is this possible? I am seeing b and c always getting mixed
> up in the same partition.
>
>
> rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('a', 8), ('d', 9),
>                       ('b', 3), ('c', 8)])
> from pyspark.rdd import portable_hash
>
> n = 4
>
> def partitioner(n):
>     """Partition by the first item in the key tuple"""
>     def partitioner_(x):
>         val = x[0]
>         key = portable_hash(x[0])
>         print("Val %s Assigned Key %s" % (val, key))
>         return key
>     return partitioner_
>
> def validate(part):
>     last_key = None
>     for p in part:
>         k = p[0]
>         if not last_key:
>             last_key = k
>         if k != last_key:
>             print("Mixed keys in partition %s %s" % (k, last_key))
>
> partioned = (rdd
>              .keyBy(lambda kv: (kv[0], kv[1]))
>              .repartitionAndSortWithinPartitions(
>                  numPartitions=n, partitionFunc=partitioner(n),
>                  ascending=False)).map(lambda x: x[1])
>
> print(partioned.getNumPartitions())
> partioned.foreachPartition(validate)
>
>
> Val a Assigned Key -7583489610679606711
> Val a Assigned Key -7583489610679606711
> Val d Assigned Key 2755936516345535118
> Val b Assigned Key -1175849324817995036
> Val c Assigned Key 1421958803217889556
> Val d Assigned Key 2755936516345535118
> Val b Assigned Key -1175849324817995036
> Mixed keys in partition b c
> Mixed keys in partition b c
>
>
> Regards
> Sumit Chawla
>
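One scheme that guarantees one key per partition, assuming the set of distinct keys is small enough to collect to the driver, is to assign each distinct key its own partition index explicitly instead of relying on the hash. This is only a minimal sketch; key_to_part and exact_partitioner are illustrative names, not anything from the thread:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    rdd = sc.parallelize([('a', 5), ('d', 8), ('b', 6), ('a', 8),
                          ('d', 9), ('b', 3), ('c', 8)])

    # Collect the distinct keys and give each one its own partition index.
    distinct_keys = rdd.keys().distinct().collect()
    key_to_part = {k: i for i, k in enumerate(distinct_keys)}

    def exact_partitioner(key):
        # Spark still applies % numPartitions to this return value, but every
        # index is already < numPartitions, so no two keys can collide.
        return key_to_part[key]

    partitioned = rdd.partitionBy(len(distinct_keys), exact_partitioner)
    print(partitioned.glom().collect())
    # e.g. [[('a', 5), ('a', 8)], [('d', 8), ('d', 9)],
    #       [('b', 6), ('b', 3)], [('c', 8)]]

The same partitionFunc could be passed to repartitionAndSortWithinPartitions if sorting within each partition is still needed; the key point is that the function returns an index already smaller than numPartitions, so the modulo step cannot merge two different keys.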