Here's my use case:

I read an array into an RDD and I use a hash partitioner to partition the RDD.
This is the array type: Array[(String, Iterable[(Long, Int)])]

val topK: Array[(String, Iterable[(Long, Int)])] = ...
import org.apache.spark.HashPartitioner
val hashPartitioner = new HashPartitioner(10)
val resultRdd = sc.parallelize(topK).partitionBy(hashPartitioner).sortByKey().saveAsTextFile(fileName)
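
For context, here is a minimal, self-contained version of what I'm running in
the spark-shell; the sample data and the output path are made up and just stand
in for my real topK array and file name:

// Minimal sketch; sample data stands in for my real topK array.
import org.apache.spark.HashPartitioner

val topK: Array[(String, Iterable[(Long, Int)])] = Array(
  ("keyA", Seq((1L, 10), (2L, 20))),
  ("keyB", Seq((3L, 30))),
  ("keyC", Seq((4L, 40)))
)

val hashPartitioner = new HashPartitioner(10)

sc.parallelize(topK)
  .partitionBy(hashPartitioner)   // hash the String keys into 10 partitions
  .sortByKey()                    // sort by the String key before writing
  .saveAsTextFile("/tmp/topk")    // made-up output path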

I also tried:
val resultRdd = sc.parallelize(topK, 10).sortByKey().saveAsTextFile(fileName)

The results:
I do get 10 partitions. However, the first output file always contains the data
for the first 2 keys in the RDD, each following file contains the data for 1 key
(as expected), and the last file is empty because the first file absorbed 2 keys.
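
For what it's worth, this is roughly how I checked which keys end up in which
partition (a quick shell snippet, not exact code):

// Rough per-partition check: print (partition index, key) for every element.
val partitioned = sc.parallelize(topK).partitionBy(hashPartitioner).sortByKey()

partitioned
  .mapPartitionsWithIndex { (idx, iter) =>
    iter.map { case (key, _) => (idx, key) }   // keep only the keys
  }
  .collect()
  .foreach { case (idx, key) => println(s"partition $idx -> key $key") }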

The Question:
How can I make Spark write one file per key? Is the behaviour I'm currently
seeing a bug?


-Adrian

