RE: How to make the result of sortByKey distributed evenly?

2016-09-06 Thread AssafMendelson
I imagine this is a simple example meant to illustrate a bigger concern. In general, when you sort by key, Spark implicitly shuffles the data by that key. Since one key (0) holds nearly all of the records and the other holds just 1 record, the sort simply shuffles the data into two very skewed partitions. One way you can

Re: How to make the result of sortByKey distributed evenly?

2016-09-06 Thread Fridtjof Sander
Your data has only two keys, and basically all values are assigned to only one of them. There is no better way to distribute those keys than the one Spark already uses. What you have to do is sort and range-partition on different keys. Try invoking sortBy() on a non-pair RDD. This will
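
The suggestion above can be tried with a small comparison of partition sizes. This is only a sketch under stated assumptions: the object name SortSkewSketch, the helper partitionSizes, the local[4] master, and the values of n and parts are all hypothetical and not taken from the thread. It sorts the same numbers once with sortByKey() on the skewed key and once with sortBy() on the values themselves, then prints how many records each partition received.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortSkewSketch {
  // Hypothetical helper: number of records in each partition of an RDD.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  def main(args: Array[String]): Unit = {
    // local[4] is an assumption for trying this on a single machine.
    val conf = new SparkConf().setAppName("SortSkewSketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val n = 20000 // hypothetical record count, not from the thread
      val parts = 4 // hypothetical number of output partitions

      // Pair RDD shaped like the one in the question:
      // x / n is 0 for every x below n and 1 only for x == n.
      val pairs = sc.parallelize(1 to n, parts).map(x => (x / n, x))

      // sortByKey range-partitions on the key. Records with equal keys
      // always go to the same partition, so key 0 fills one partition.
      val byKey = pairs.sortByKey(ascending = true, numPartitions = parts)
      println("sortByKey: " + partitionSizes(byKey).mkString(", "))

      // sortBy on a non-pair RDD range-partitions on the values instead,
      // which are all distinct here and can be split into several ranges.
      val byValue = sc.parallelize(1 to n, parts)
        .sortBy(x => x, ascending = true, numPartitions = parts)
      println("sortBy:    " + partitionSizes(byValue).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The range boundaries come from sampling, so the sortBy() sizes are roughly, not exactly, equal. If the original key still has to take precedence in the ordering, sorting the pairs on a composite key such as the whole (key, value) tuple is one option to consider.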

How to make the result of sortByKey distributed evenly?

2016-09-06 Thread Zhang, Liyun
Hi all: I have a question about RDD.sortByKey

    val n=2
    val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
    sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like [(0,2),(0,3),...,(0,1),(1,2)], the