RangePartitioner does not actually guarantee that all partitions will be equally sized (that is hard to do in general); instead it uses sampling to approximate equal buckets. Thus, it is possible that a bucket is left empty.
If you want the specified behavior, you should define your own partitioner. It would look something like this (untested):

    class AlphabetPartitioner extends Partitioner {
      def numPartitions = 26
      def getPartition(key: Any): Int = key match {
        case s: String => s(0).toUpper - 'A'
      }
      override def equals(other: Any): Boolean =
        other.isInstanceOf[AlphabetPartitioner]
      override def hashCode: Int = 0
    }

On Tue, Feb 17, 2015 at 7:05 PM, java8964 <java8...@hotmail.com> wrote:

> Hi, Sparkers:
>
> I just happened to search in google for something related to the
> RangePartitioner of spark, and found an old thread in this email list as
> here:
>
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
>
> I followed the code example mentioned in that email thread as following:
>
> scala> import org.apache.spark.RangePartitioner
> import org.apache.spark.RangePartitioner
>
> scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog",
> "Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long",
> "moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk",
> "umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))
> rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
> parallelize at <console>:13
>
> scala> rdd.keyBy(s => s(0).toUpper)
> res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at
> <console>:16
>
> scala> res0.partitionBy(new RangePartitioner[Char, String](26,
> res0)).values
> res1: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at
> <console>:18
>
> scala> res1.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx,
> s))).collect.foreach(println)
>
> The above example is clear for me to understand the meaning of the
> RangePartitioner, but to my surprise, I got the following result:
>
> *(0,apple)*
> *(0,Ball)*
> (1,cat)
> (2,dog)
> (3,Elephant)
> (4,fox)
> (5,gas)
> (6,horse)
> (7,index)
> (8,jet)
> (9,kitsch)
> (10,long)
> (11,moon)
> (12,Neptune)
> (13,ooze)
> (14,Pen)
> (15,quiet)
> (16,rose)
> (17,sun)
> (18,talk)
> (19,umbrella)
> (20,voice)
> (21,Walrus)
> (22,xeon)
> (23,Yam)
> (24,zebra)
>
> instead of a perfect range index from 0 to 25 as in the old email thread.
> Why is that? Is this a bug, or some new feature I don't understand?
>
> BTW, the above environment I tested is in Spark 1.2.1 with Hadoop 2.4
> binary release.
>
> Thanks
>
> Yong
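P.S. If you want to sanity-check the bucketing arithmetic in the suggested partitioner without spinning up a cluster, here is a standalone sketch. Note that Spark's Partitioner is stubbed out below purely so this compiles without Spark on the classpath; in real code you would extend org.apache.spark.Partitioner instead, and non-String keys would need their own case.

```scala
// Minimal local stub of Spark's Partitioner (assumption: just enough of the
// interface to exercise getPartition; the real one lives in org.apache.spark).
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Same logic as the partitioner suggested above: bucket by first letter,
// case-insensitively, into one of 26 partitions ('A' -> 0 ... 'Z' -> 25).
class AlphabetPartitioner extends Partitioner {
  def numPartitions = 26
  def getPartition(key: Any): Int = key match {
    case s: String => s(0).toUpper - 'A'
  }
  override def equals(other: Any): Boolean =
    other.isInstanceOf[AlphabetPartitioner]
  override def hashCode: Int = 0
}

val p = new AlphabetPartitioner
println(p.getPartition("apple"))  // 0  ('A' bucket)
println(p.getPartition("Ball"))   // 1  ('B' bucket)
println(p.getPartition("zebra"))  // 25 ('Z' bucket)
```

You would then plug it into the pipeline from your example as res0.partitionBy(new AlphabetPartitioner).values, which deterministically yields one partition per letter rather than relying on RangePartitioner's sampling.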