I imagine this is a toy example meant to illustrate a bigger concern.
In general, when you do a sort by key, Spark implicitly shuffles the data by
the key. Since one key (0) holds almost all the records and the other key (1)
holds just one record, it will simply shuffle them into two very skewed partitions.
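The skew is visible without running Spark at all. Here is a minimal plain-Scala sketch of the key distribution that map(x => (x / n, x)) produces, using a hypothetical n = 10 (larger than in the snippet below, so the imbalance shows):

```scala
// Plain-Scala stand-in for the pair generation in the question:
// x / n is 0 for every x < n and 1 only for x == n.
val n = 10 // hypothetical value for illustration
val pairs = (2 to n).map(x => (x / n, x))
// Count how many records each key receives.
val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
// Key 0 receives 8 records, key 1 only 1: two very skewed partitions.
```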
Your data has only two keys, and basically all values are assigned to
only one of them. There is no better way to distribute the keys than
the one Spark executes.
What you have to do is use different keys to sort and range-partition
on. Try to invoke sortBy() on a non-pair RDD. This will
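As a sketch of that advice, assume we sort on the value x itself instead of the near-constant key x / n; Spark's sortBy(x => x, numPartitions = 2) would then range-partition on the values. The plain-Scala stand-in below (hypothetical n = 10, and a simple median boundary in place of Spark's sampled RangePartitioner) shows the resulting split is roughly even:

```scala
// Stand-in for range-partitioning on the value itself rather than x / n.
val n = 10 // hypothetical value for illustration
val values = (2 to n).toVector
// Spark's RangePartitioner picks boundaries by sampling; here we just
// take the median as the single boundary between two partitions.
val boundary = values.sorted.apply(values.size / 2)
val (lower, upper) = values.partition(_ < boundary)
// 4 values land below the boundary and 5 at or above it: a balanced
// split, instead of the 8-vs-1 skew produced by keying on x / n.
```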
Hi all:
I have a question about RDD.sortByKey
val n=2
val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")
sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like
[(0,2),(0,3),…,(0,1),(1,2)], the