Hi all,

I have a question about RDD.sortByKey:

    val n = 20000
    val sorted = sc.parallelize(2 to n).map(x => (x / n, x)).sortByKey()
    sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

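To make the key skew concrete, here is a minimal local sketch in plain Scala (no Spark involved) that applies the same x / n mapping and counts the records per key. The object name KeySkewCheck and the helpers pairs and countsByKey are made up for this illustration and are not part of the job itself.

```scala
// Local check of the key distribution; runs without a SparkContext.
// KeySkewCheck, pairs and countsByKey are hypothetical names for illustration.
object KeySkewCheck {
  // Same mapping as in the job: integer division of x by n gives the key.
  def pairs(n: Int): Seq[(Int, Int)] =
    (2 to n).map(x => (x / n, x))

  // Number of records that share each key.
  def countsByKey(n: Int): Map[Int, Int] =
    pairs(n).groupBy(_._1).map { case (key, records) => (key, records.size) }

  def main(args: Array[String]): Unit = {
    // Print one line per key, in key order.
    countsByKey(20000).toSeq.sortBy(_._1).foreach { case (key, count) =>
      println(s"key=$key count=$count")
    }
  }
}
```

With n = 20000 this should report 19998 records for key 0 and a single record for key 1, so almost every record shares one key.
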
sc.parallelize(2 to n).map(x => (x / n, x)) generates pairs like [(0,2),(0,3),.....,(0,19999),(1,20000)], so the key is skewed. I expected the result of sortByKey to be distributed evenly across the output files, but when I look at the result, part-00000 is large and part-00001 is small:

    hadoop fs -ls /SkewedGroupByTest/
    16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Found 3 items
    -rw-r--r--   1 root supergroup          0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
    -rw-r--r--   1 root supergroup     188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
    -rw-r--r--   1 root supergroup         10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get the result distributed evenly? I don't need all records with the same key to end up in the same part file; I only need the data to be sorted across part-xxxx0 ~ part-xxxxx.

Thanks for any help!

Kelly Zhang/Zhang,Liyun
Best Regards