[jira] Updated: (MAPREDUCE-1690) Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle

luoli (JIRA) Thu, 15 Apr 2010 01:09:18 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


luoli updated MAPREDUCE-1690:
-----------------------------

    Attachment: allo_use_buddy_gc.JPG

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2
>
>         Attachments: allo_use_buddy.JPG, allo_use_buddy_gc.JPG, 
> allo_use_new.JPG, allo_use_new_gc.JPG, mapreduce-1690.v1.patch, 
> mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v2.patch
>
>
>        When a reduce task launches, it starts several MapOutputCopier 
> threads to download the output of finished map tasks. Each time a thread 
> copies a map output, whether from a remote or a local host, it decides 
> whether to shuffle the data in memory or to disk. The decision depends on the 
> map output size and on the ShuffleRamManager configuration, which is loaded 
> from the client's hadoop-site.xml or the JobConf. If the reduce task decides 
> to shuffle in memory, the MapOutputCopier connects to the remote map host, 
> reads the map output from the socket, and copies it into an in-memory buffer. 
> That buffer is allocated anew every time with 
> "byte[] shuffleData = new byte[mapOutputLength];", and this is where the 
> problem begins. Our cluster runs some special jobs that process a huge amount 
> of input data, say 110TB, so their reduce tasks shuffle a lot of data. Some 
> of it goes to disk, but a lot is still shuffled in memory, and every time the 
> MapOutputCopier threads "new" a buffer from the reduce task's heap. For a 
> long-running job with huge data, this easily fills the reduce task's heap, 
> causes the reduce task to hit an OOM, and then exhausts the memory of the 
> TaskTracker machine.
>        Here is our solution: change the logic that MapOutputCopier threads 
> use to shuffle map output in memory so that it uses a BuddySystem, similar to 
> the Linux kernel's buddy system for allocating and deallocating memory pages. 
> When the reduce task launches, it gives the BuddySystem a fixed amount of 
> memory, say 128MB. Every time the reduce task wants to shuffle a map output 
> in memory, it requests a buffer from the BuddySystem. If the BuddySystem has 
> enough free memory, the buffer is used; if not, the MapOutputCopier thread 
> wait()s, just as it does in the current Hadoop shuffle code. This reduces the 
> reduce task's memory usage and greatly relieves the memory shortage on the 
> TaskTracker. In our cluster, the BuddySystem eliminated the situation of 
> "losing a batch of TaskTrackers to memory overuse while the huge jobs are 
> running", which made the cluster more stable.
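
For readers unfamiliar with buddy allocation, the idea in the description 
can be sketched roughly as below. This is not the code from the attached 
patches; the class name BuddyPool, its methods (tryAllocate, allocate, 
release, freeBytes) and the sizes used in main are all hypothetical. The 
sketch assumes one arena allocated up front, block sizes that are powers of 
two, and callers that address their buffer by offset into the arena.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical buddy allocator over one preallocated arena (illustration only). */
class BuddyPool {
    private final byte[] arena;   // allocated once, reused for every map output
    private final int minBlock;   // smallest block size, a power of two
    private final List<TreeSet<Integer>> free = new ArrayList<>();   // free offsets per order
    private final Map<Integer, Integer> allocated = new HashMap<>(); // offset -> order

    BuddyPool(int arenaSize, int minBlock) {
        if (Integer.bitCount(arenaSize) != 1 || Integer.bitCount(minBlock) != 1
                || minBlock > arenaSize) {
            throw new IllegalArgumentException("sizes must be powers of two");
        }
        this.arena = new byte[arenaSize];
        this.minBlock = minBlock;
        int maxOrder = Integer.numberOfTrailingZeros(arenaSize / minBlock);
        for (int i = 0; i <= maxOrder; i++) free.add(new TreeSet<>());
        free.get(maxOrder).add(0);   // initially the whole arena is one free block
    }

    byte[] arena() { return arena; }

    /** Non-blocking attempt: returns the block offset, or -1 if nothing fits right now. */
    synchronized int tryAllocate(int size) {
        if (size <= 0 || size > arena.length) throw new IllegalArgumentException("bad size");
        int want = 0;                               // smallest order that holds size bytes
        while ((minBlock << want) < size) want++;
        int order = want;
        while (order < free.size() && free.get(order).isEmpty()) order++;
        if (order == free.size()) return -1;
        int offset = free.get(order).pollFirst();
        while (order > want) {                      // split: keep lower half, free upper half
            order--;
            free.get(order).add(offset + (minBlock << order));
        }
        allocated.put(offset, want);
        return offset;
    }

    /** Blocking allocate: the caller waits until another thread releases memory. */
    synchronized int allocate(int size) throws InterruptedException {
        int offset;
        while ((offset = tryAllocate(size)) < 0) wait();
        return offset;
    }

    /** Frees a block and merges it with its buddy for as long as the buddy is free too. */
    synchronized void release(int offset) {
        Integer boxed = allocated.remove(offset);
        if (boxed == null) throw new IllegalArgumentException("not allocated: " + offset);
        int order = boxed;
        while (order < free.size() - 1) {
            int buddy = offset ^ (minBlock << order);   // buddy differs in exactly one bit
            if (!free.get(order).remove(buddy)) break;  // buddy in use: stop merging
            offset = Math.min(offset, buddy);
            order++;
        }
        free.get(order).add(offset);
        notifyAll();                                    // wake any waiting allocators
    }

    /** Total bytes currently on the free lists. */
    synchronized int freeBytes() {
        int total = 0;
        for (int i = 0; i < free.size(); i++) total += free.get(i).size() * (minBlock << i);
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        BuddyPool pool = new BuddyPool(1 << 20, 1 << 12);   // 1MB arena, 4KB minimum block
        int a = pool.allocate(5000);      // rounded up to an 8KB block
        int b = pool.allocate(300_000);   // rounded up to a 512KB block
        System.out.println("a=" + a + " b=" + b + " free=" + pool.freeBytes());
        pool.release(a);
        pool.release(b);
        System.out.println("free after release=" + pool.freeBytes());
    }
}
```

In this sketch the heap cost is bounded by the arena size, because a copier 
thread that cannot get a block waits instead of allocating a new byte[]. The 
trade-off is internal fragmentation: every request is rounded up to the next 
power-of-two block size.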

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
