(By the way, you can use wordRDD.countByValue instead of the map and
reduceByKey. It won't make a difference to your issue but is more
compact.)
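
A minimal sketch, reusing the wordRDD name from your snippet below; note that countByValue is an action, so it returns the counts to the driver as a local Map rather than as an RDD:

```
// Equivalent to map((_, 1)).reduceByKey(_ + _) in one call, but it collects
// the result to the driver as a local scala.collection.Map[String, Long].
val wordCounts: scala.collection.Map[String, Long] = wordRDD.countByValue()
```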

As you say, the problem is the very limited range of keys (the word counts). I
wonder if you can use sortBy instead of the map and sortByKey, sorting by
(count, word). That is, you sort by count and then by word. This at least makes
the number of distinct keys much larger and should let the partitions be much
more balanced.
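
Roughly like this, reusing the count RDD from your snippet (just a sketch; the
output elements stay as (word, count) pairs, only the ordering changes):

```
// Sort by the (count, word) pair rather than the count alone. With far more
// distinct sort keys, the range partitioner can spread the count=1 words
// across partitions instead of piling them all into one.
val sorted = count.sortBy { case (word, c) => (c, word) }

// Save as before (GzipCodec from Hadoop, assumed available on the cluster).
sorted.saveAsTextFile("/path/to/save/dir/", classOf[org.apache.hadoop.io.compress.GzipCodec])
```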


On Thu, Jan 29, 2015 at 2:57 AM, 瀬川 卓也 <takuya_seg...@dwango.co.jp> wrote:
> For example, consider a word count over a long text dataset (on the order of 100GB).
> The word frequencies are clearly skewed: it is a long-tail distribution, and words
> that appear only once probably account for more than 1/10 of the data.
>
> The word count code:
>
> ```
> import org.apache.hadoop.io.compress.GzipCodec
>
> val allWordLineSplited: RDD[String] = // create RDD …
> val wordRDD = allWordLineSplited.flatMap(_.split(" "))
> val count = wordRDD.map((_, 1)).reduceByKey(_ + _)
> val sort = count.map { case (word, count) => (count, word) }.sortByKey()
>
> sort.saveAsTextFile("/path/to/save/dir/", classOf[GzipCodec])
> ```
>
> The problem seems to come from the sort: in this case, all the words with a count
> of 1 end up in a single partition, so the shuffle read for that partition becomes
> very large. The other executors finish and sit idle waiting for it, and eventually
> the executor is lost…
>
> If possible, I would like to equalize the partition sizes after the sort.
> If no such technique exists, what would be a better way to compute this?
>
> Regards,
> Takuya

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
