Hi, Sparkers:
I just happened to search on Google for something related to Spark's
RangePartitioner, and found an old thread on this mailing list here:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Partition-td991.html
I followed the code example mentioned in that thread, as follows:
scala> import org.apache.spark.RangePartitioner
import org.apache.spark.RangePartitioner

scala> val rdd = sc.parallelize(List("apple", "Ball", "cat", "dog", "Elephant", "fox", "gas", "horse", "index", "jet", "kitsch", "long", "moon", "Neptune", "ooze", "Pen", "quiet", "rose", "sun", "talk", "umbrella", "voice", "Walrus", "xeon", "Yam", "zebra"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:13

scala> rdd.keyBy(s => s(0).toUpper)
res0: org.apache.spark.rdd.RDD[(Char, String)] = MappedRDD[1] at keyBy at <console>:16

scala> res0.partitionBy(new RangePartitioner[Char, String](26, res0)).values
res1: org.apache.spark.rdd.RDD[String] = MappedRDD[5] at values at <console>:18

scala> res1.mapPartitionsWithIndex((idx, itr) => itr.map(s => (idx, s))).collect.foreach(println)
The example above makes the meaning of RangePartitioner clear to me, but to my
surprise, I got the following result:
(0,apple)
(0,Ball)
(1,cat)
(2,dog)
(3,Elephant)
(4,fox)
(5,gas)
(6,horse)
(7,index)
(8,jet)
(9,kitsch)
(10,long)
(11,moon)
(12,Neptune)
(13,ooze)
(14,Pen)
(15,quiet)
(16,rose)
(17,sun)
(18,talk)
(19,umbrella)
(20,voice)
(21,Walrus)
(22,xeon)
(23,Yam)
(24,zebra)
instead of the perfect range of indices from 0 to 25 shown in the old email
thread. Why is that? Is this a bug, or some new feature I don't understand?
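To illustrate what I am seeing, here is a toy sketch in plain Scala (no Spark involved; the bounds below are only my guess at what the partitioner's sampling might have produced, not Spark's actual internals) of how range partitioning against sorted upper bounds could put both "apple" and "Ball" in partition 0:

```scala
// Toy model of range partitioning: a key lands in the partition equal to
// the number of upper bounds it exceeds.
// NOTE: the bounds are a hypothetical guess, not Spark's sampled bounds.
object RangeBucketDemo {
  def partitionFor(key: Char, bounds: Array[Char]): Int =
    bounds.count(b => key > b)

  def main(args: Array[String]): Unit = {
    val words = List("apple", "Ball", "cat", "dog", "Elephant", "fox")
    // 25 bounds 'B'..'Z' give indices 0..24; both 'A' and 'B' fall at or
    // below the first bound, so "apple" and "Ball" share partition 0.
    val bounds = ('B' to 'Z').toArray
    words.foreach(w => println((partitionFor(w(0).toUpper, bounds), w)))
    // prints (0,apple), (0,Ball), (1,cat), (2,dog), (3,Elephant), (4,fox)
  }
}
```

If the bounds Spark computed really do look like this, that would explain both the shared partition 0 and the top index of 24, but I don't know whether that is what happens internally.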
BTW, the environment I tested in is Spark 1.2.1 with the Hadoop 2.4 binary
release.
Thanks
Yong
