RangePartitioner does not actually guarantee that all partitions will be
equally sized (that is hard in general); instead, it uses sampling to
approximate equal-sized buckets. Thus, it is possible for a bucket to be left
empty. If you want the behavior you describe, you should define your own
partitioner. It would look something like this (untested):
import org.apache.spark.Partitioner

class AlphabetPartitioner extends Partitioner {
  def numPartitions = 26
  def getPartition(key: Any): Int = key match {
    case s: String => s(0).toUpper - 'A'
  }
  override def equals(other: Any): Boolean =
    other.isInstanceOf[AlphabetPartitioner]
  override def hashCode: Int = 0
}
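As a quick sanity check of the key-to-partition mapping, the match logic can
be exercised on its own without a Spark cluster (the object name here is
illustrative, not part of any Spark API):

```scala
// Standalone sketch of the AlphabetPartitioner's mapping logic.
// This only tests the pattern match; no SparkContext is needed.
object AlphabetDemo {
  def getPartition(key: Any): Int = key match {
    case s: String => s(0).toUpper - 'A' // 'a'/'A' -> 0, ..., 'z'/'Z' -> 25
  }

  def main(args: Array[String]): Unit = {
    List("apple", "Ball", "zebra").foreach { w =>
      println(s"$w -> partition ${getPartition(w)}")
    }
  }
}
```

Every word starting with the same letter (case-insensitively) lands in the
same bucket, so all 26 partitions are used whenever all 26 letters appear.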
On Tue, Feb 17, 2015 at 7:05 PM, java8964 java8...@hotmail.com wrote:
Hi, Sparkers:
I just happened to search in google for something related to the
RangePartitioner of spark, and found an old thread in this email list as
here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that email thread as following:
scala> import org.apache.spark.RangePartitioner
import org.apache.spark.RangePartitioner

scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog",
  "Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long",
  "moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk",
  "umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at
parallelize at <console>:13

scala> rdd.keyBy(s => s(0).toUpper)
res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at
<console>:16

scala> res0.partitionBy(new RangePartitioner[Char, String](26,
res0)).values
res1: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at
<console>:18

scala> res1.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx,
s))).collect.foreach(println)
The above example is clear for me to understand the meaning of the
RangePartitioner, but to my surprise, I got the following result:
*(0,apple)*
*(0,Ball)*
(1,cat)
(2,dog)
(3,Elephant)
(4,fox)
(5,gas)
(6,horse)
(7,index)
(8,jet)
(9,kitsch)
(10,long)
(11,moon)
(12,Neptune)
(13,ooze)
(14,Pen)
(15,quiet)
(16,rose)
(17,sun)
(18,talk)
(19,umbrella)
(20,voice)
(21,Walrus)
(22,xeon)
(23,Yam)
(24,zebra)
instead of a perfect range of indexes from 0 to 25 as in the old email
thread. Why is that? Is this a bug, or some new feature I don't understand?
BTW, the above environment I tested is in Spark 1.2.1 with Hadoop 2.4
binary release.
Thanks
Yong