[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luoli updated MAPREDUCE-1690:
-----------------------------

    Attachment: mapreduce-1690.v1.patch

This patch is a diff against branch-0.20. It contains only the BuddySystem 
code and its unit test. I haven't modified ReduceTask.java yet, because it is 
hard to merge the svn branch code with the code we are currently running. 
I will merge the BuddySystem into the Hadoop code and upload patch v2 later.

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2, 0.20.3
>
>         Attachments: mapreduce-1690.v1.patch
>
>
>        When a reduce task is launched, it starts several MapOutputCopier 
> threads to download the output of finished map tasks. Each time a thread 
> copies a map output from a remote or local host, it decides whether to 
> shuffle that output in memory or to disk. The decision depends on the map 
> output size and on the ShuffleRamManager configuration, which is loaded 
> from the client's hadoop-site.xml or the JobConf. If the reduce task 
> decides to shuffle the map output in memory, the MapOutputCopier connects 
> to the remote map host, reads the map output from the socket, and copies 
> it into an in-memory buffer. Every time, that buffer is created with 
> "byte[] shuffleData = new byte[mapOutputLength];", and this is where the 
> problem begins. In our cluster, some special jobs process a huge amount of 
> raw data, say 110TB, so the reduce tasks shuffle a lot of data. Some is 
> shuffled to disk and some in memory, but even so a lot of data is shuffled 
> in memory, and every time the MapOutputCopier threads allocate new memory 
> from the reduce task's heap. For a long-running job with huge data, this 
> easily fills the reduce task's heap, causes the reduce task to OOM, and 
> then exhausts the memory of the TaskTracker machine.
>        Here is our solution: change the logic used when MapOutputCopier 
> threads shuffle map output in memory, using a BuddySystem similar to the 
> Linux kernel's buddy system, which is used to allocate and deallocate 
> memory pages. When the reduce task is launched, it initializes the 
> BuddySystem with some memory, say 128MB. Every time the reduce task wants 
> to shuffle map output in memory, it just requests a buffer from the 
> BuddySystem. If the BuddySystem has enough memory, the buffer is used; if 
> not, the MapOutputCopier threads wait(), just as they do in the current 
> Hadoop shuffle code. This greatly reduces the reduce task's memory usage 
> and the TaskTracker's memory shortage. In our cluster, the BuddySystem 
> eliminated the situation of "losing a batch of TaskTrackers because of 
> memory overuse while huge jobs are running", which makes the cluster more 
> stable.
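
For readers unfamiliar with the technique, the following is a minimal Java 
sketch of a buddy allocator over one preallocated byte[] pool, with a 
blocking allocate() that wait()s when no block is free. It is not the code in 
mapreduce-1690.v1.patch: the class name BuddyPool, its methods, and the 
64-byte minimum block size are all hypothetical and chosen only for 
illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch: a buddy allocator handing out offsets into one byte[] pool. */
class BuddyPool {
    private final byte[] pool;      // allocated once up front (e.g. 128MB in the proposal)
    private final int minBlock;     // smallest block size, a power of two (assumed value)
    private final int maxOrder;     // pool.length == minBlock << maxOrder
    private final List<TreeSet<Integer>> free = new ArrayList<>(); // free offsets per order

    BuddyPool(int poolSize, int minBlock) {
        if (Integer.bitCount(poolSize) != 1 || Integer.bitCount(minBlock) != 1
                || minBlock > poolSize) {
            throw new IllegalArgumentException("sizes must be powers of two");
        }
        this.pool = new byte[poolSize];
        this.minBlock = minBlock;
        this.maxOrder = Integer.numberOfTrailingZeros(poolSize / minBlock);
        for (int i = 0; i <= maxOrder; i++) {
            free.add(new TreeSet<>());
        }
        free.get(maxOrder).add(0);  // initially the whole pool is one free block
    }

    /** Smallest order whose block size (minBlock << order) holds size bytes. */
    private int orderFor(int size) {
        int order = 0;
        while ((minBlock << order) < size) {
            order++;
        }
        return order;
    }

    /** Returns the offset of a block of at least size bytes, or -1 if none is free now. */
    synchronized int tryAllocate(int size) {
        if (size <= 0 || size > pool.length) {
            return -1;
        }
        int want = orderFor(size);
        int order = want;
        while (order <= maxOrder && free.get(order).isEmpty()) {
            order++;                // look for a larger free block to split
        }
        if (order > maxOrder) {
            return -1;
        }
        int offset = free.get(order).pollFirst();
        while (order > want) {      // split: keep the lower half, free the upper half
            order--;
            free.get(order).add(offset + (minBlock << order));
        }
        return offset;
    }

    /** Blocks the calling thread until a block of at least size bytes is available. */
    synchronized int allocate(int size) throws InterruptedException {
        if (size <= 0 || size > pool.length) {
            throw new IllegalArgumentException("request can never be satisfied");
        }
        int offset;
        while ((offset = tryAllocate(size)) < 0) {
            wait();                 // woken by release()
        }
        return offset;
    }

    /** Frees a block; size must equal the size originally requested. Merges free buddies. */
    synchronized void release(int offset, int size) {
        int order = orderFor(size);
        while (order < maxOrder) {
            int buddy = offset ^ (minBlock << order);
            if (!free.get(order).remove(buddy)) {
                break;              // buddy still in use, cannot merge further
            }
            offset = Math.min(offset, buddy);
            order++;
        }
        free.get(order).add(offset);
        notifyAll();                // wake any threads blocked in allocate()
    }

    byte[] backingArray() {
        return pool;
    }

    public static void main(String[] args) {
        BuddyPool p = new BuddyPool(1024, 64);
        int a = p.tryAllocate(100); // rounded up to a 128-byte block
        int b = p.tryAllocate(64);
        System.out.println("a=" + a + " b=" + b);
        p.release(a, 100);
        p.release(b, 64);
    }
}
```

In this sketch a copier thread would call allocate(mapOutputLength), read the 
map output into backingArray() starting at the returned offset, and call 
release() once the data has been merged. Total in-memory shuffle usage is then 
bounded by the pool size instead of growing with each new byte[].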

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

