Hi all,

I have a question about RDD.sortByKey.

val n = 20000
val sorted = sc.parallelize(2 to n).map(x => (x / n, x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x=>(x/n,x)) generates pairs like
[(0,2),(0,3),.....,(0,19999),(1,20000)], so the keys are heavily skewed:
every pair except the last one has key 0.
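
To make the skew concrete, here is a small plain-Scala sketch (no Spark
involved) that builds the same pairs locally and counts how many fall under
each key. The names KeySkew and counts are only for illustration.

```scala
object KeySkew {
  val n = 20000

  // Same mapping as in the Spark job, but on a local collection.
  val pairs: Seq[(Int, Int)] = (2 to n).map(x => (x / n, x))

  // Number of pairs per key: key 0 -> 19998, key 1 -> 1.
  val counts: Map[Int, Int] =
    pairs.groupBy(_._1).map { case (k, v) => (k, v.size) }
}
```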

I expected the result of sortByKey to be distributed evenly across the output
partitions, but when I look at the output, part-00000 is large and part-00001
is tiny:

 hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 root supergroup 0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r-- 1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r-- 1 root supergroup 10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get the result distributed evenly? I don't need all records with
the same key to end up in the same part file; I only need the data to be
sorted as a whole when the part files are read in order, from part-00000
through the last part-xxxxx.
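
To illustrate the layout I am after, here is a plain-Scala sketch (again no
Spark, just local collections). It sorts the pairs by key and then by value,
and cuts the sorted sequence into equal-sized chunks that stand in for the
part files. The chunk count of 2 and the names DesiredLayout, numParts and
parts are assumptions for illustration only; this shows the desired output
shape, not how Spark would produce it.

```scala
object DesiredLayout {
  val n = 20000
  val numParts = 2 // hypothetical number of part files

  // Same pairs as in the job, sorted by key and then by value
  // (tuples are ordered lexicographically by the default Ordering).
  val sortedPairs: Vector[(Int, Int)] =
    (2 to n).map(x => (x / n, x)).toVector.sorted

  // Cut the sorted sequence into chunks of (almost) equal size.
  val chunkSize: Int = math.ceil(sortedPairs.size.toDouble / numParts).toInt
  val parts: Vector[Vector[(Int, Int)]] =
    sortedPairs.grouped(chunkSize).toVector
}
```

Here key 0 spans both chunks, the chunks hold 10000 and 9999 pairs, and
reading them one after another still gives a fully sorted sequence.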


Thanks for any help!


Kelly Zhang/Zhang,Liyun
Best Regards
