Hi all,

I have an RDD with 1 million keys, and each key has around 7,000 values, so the RDD holds about 1M * 7K (roughly 7 billion) records in total. Each key is created with zipWithIndex, so the keys run from 0 to M-1.

The problem with zipWithIndex is that it produces a Long key, which takes 8 bytes. Can I reduce it to 4 bytes?

Also, how can I make sure that records with the same key go to the same node, so that I can avoid shuffling? And how will the default partitioner work here?

Regards,
Jeetendra
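
To make the two questions concrete, here is a minimal plain-Python sketch (not Spark code) of how a hash partitioner would place these keys. It assumes the partitioner is Spark's HashPartitioner, which picks a partition as the non-negative remainder of the key's hashCode divided by the number of partitions, and that the keys are boxed Java Longs or Ints. The function names and the partition count of 8 are hypothetical and chosen only for illustration.

```python
INT_MAX = 2**31 - 1  # largest value a 4-byte signed Int can hold


def java_long_hash(value: int) -> int:
    """Mimic java.lang.Long.hashCode: (int)(value ^ (value >>> 32))."""
    unsigned = value & 0xFFFFFFFFFFFFFFFF          # view as 64-bit two's complement
    folded = (unsigned ^ (unsigned >> 32)) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed Int
    return folded - (1 << 32) if folded >= (1 << 31) else folded


def hash_partition(key: int, num_partitions: int) -> int:
    """Non-negative modulo of the hash, as a hash partitioner would compute it."""
    # Python's % is already non-negative for a positive modulus
    return java_long_hash(key) % num_partitions


def fits_in_int(key: int) -> bool:
    """True if the key can be narrowed from Long (8 bytes) to Int (4 bytes)."""
    return 0 <= key <= INT_MAX


if __name__ == "__main__":
    num_partitions = 8  # hypothetical partition count
    for key in (0, 1, 8, 999_999):
        print(key, fits_in_int(key), hash_partition(key, num_partitions))
```

Under these assumptions, keys in the range 0 to 1 million all fit in 4 bytes, and for such small non-negative values the Long hash equals the value itself, so the partition is simply key modulo the number of partitions whether the key is stored as a Long or an Int. Every record with the same key maps to the same partition number, but the records only end up together after the RDD has actually been partitioned by key (for example with partitionBy and a HashPartitioner), which itself costs one shuffle.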