For example, consider a word count over a large text dataset (on the order of 100 GB). The word distribution is clearly skewed, so the counts are expected to form a long tail; words that appear only once probably account for more than about 1/10 of the total.
Word count code:

```
import org.apache.spark.rdd.RDD
import org.apache.hadoop.io.compress.GzipCodec

val allWordLineSplited: RDD[String] = // create RDD …
val wordRDD = allWordLineSplited.flatMap(_.split(" "))
val count = wordRDD.map((_, 1)).reduceByKey(_ + _)
val sort = count.map { case (word, count) => (count, word) }.sortByKey()
sort.saveAsTextFile("/path/to/save/dir/", classOf[GzipCodec])
```

The problem is the sortByKey: in this case, all the words with count 1 gather in a single partition, so the shuffle read for that partition becomes very large. The other tasks finish much earlier and, while they are waiting, an executor gets lost…

I would like to equalize the partition sizes after the sort if possible. If no such technique exists, what would be a better way to compute this?

Regards,
Takuya
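
P.S. One direction I am considering (I am not sure it is correct) is to sort on the composite key (count, word) instead of count alone, so that the range partitioner can break ties on the word and spread the count-1 entries over many partitions. A rough sketch of that idea (the explicit partition count is just a placeholder):

```
import org.apache.hadoop.io.compress.GzipCodec

// count: RDD[(word, count)] as built above.
// Sorting by the pair (count, word) lets the range partitioner split
// the many count-1 words across partitions by word instead of piling
// them all into a single partition.
val sorted = count
  .sortBy({ case (word, c) => (c, word) }, ascending = true, numPartitions = 200)
  .map { case (word, c) => (c, word) }

sorted.saveAsTextFile("/path/to/save/dir/", classOf[GzipCodec])
```

Would that be a reasonable approach, or is there a better way?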