Here's my use case: I read an array into an RDD and use a hash partitioner to partition it. The array has this type: Array[(String, Iterable[(Long, Int)])]
    topK: Array[(String, Iterable[(Long, Int)])] = ...

    import org.apache.spark.HashPartitioner

    val hashPartitioner = new HashPartitioner(10)
    val resultRdd = sc.parallelize(topK).partitionBy(hashPartitioner).sortByKey().saveAsTextFile(fileName)

I also tried:

    val resultRdd = sc.parallelize(topK, 10).sortByKey().saveAsTextFile(fileName)

The results: I do get 10 partitions. However, the first partition always contains the data for the first 2 keys of the RDD, each of the following partitions contains the data for exactly 1 key (as expected), and the last file ends up empty because the first file absorbed 2 keys.

The question: how can I make Spark write exactly 1 file per key? Is the behaviour I'm currently seeing a bug? I suspect two of my keys simply hash to the same partition modulo 10, which would explain both the doubled-up first file and the empty last one, but I haven't confirmed this.

-Adrian
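For reference, below is a sketch of the kind of custom Partitioner I was considering in order to force exactly one key per partition. ExactKeyPartitioner and the keys list are my own names for illustration, not anything built into Spark; I'm assuming the set of distinct keys is known on the driver (it is here, since topK is a local array).

    import org.apache.spark.Partitioner

    // Assigns each distinct key its own partition index, so that
    // saveAsTextFile emits exactly one file per key.
    class ExactKeyPartitioner(keys: Seq[String]) extends Partitioner {
      private val indexOf: Map[String, Int] = keys.zipWithIndex.toMap

      override def numPartitions: Int = keys.size

      override def getPartition(key: Any): Int =
        indexOf(key.asInstanceOf[String])
    }

    // Usage: one partition (and therefore one output file) per key.
    val keys = topK.map(_._1).toSeq
    val resultRdd = sc.parallelize(topK)
      .partitionBy(new ExactKeyPartitioner(keys))
      .saveAsTextFile(fileName)

Would something along these lines be the right approach, or is there a more idiomatic way to get one output file per key?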