[ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
luoli updated MAPREDUCE-1690:
-----------------------------

    Attachment: allo_use_buddy_gc.JPG

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2
>
>         Attachments: allo_use_buddy.JPG, allo_use_buddy_gc.JPG, allo_use_new.JPG, allo_use_new_gc.JPG, mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v2.patch
>
>
> When a reduce task is launched, it starts several threads to download the output of finished maps; each one is a running instance of MapOutputCopier. Every time a thread copies map output from a remote host to the local node, it decides whether to shuffle the data in memory or to disk. The decision depends on the size of the map output and on the configuration of the ShuffleRamManager, which is loaded from the client's hadoop-site.xml or the JobConf. If the reduce task decides to shuffle the map output in memory, the MapOutputCopier connects to the remote map host, reads the map output from the socket, and copies it into an in-memory buffer. Every time, that buffer comes from "byte[] shuffleData = new byte[mapOutputLength];", and this is where the problem begins.
>
> In our cluster, some special jobs process a huge amount of raw data, say 110TB, so the reduce tasks shuffle a lot of data. Some of it is shuffled to disk and some in memory, but even so a lot of data ends up shuffled in memory, and every time the MapOutputCopier threads "new" a fresh buffer from the reduce task's heap. For a long-running job over huge data, this easily fills the reduce task's heap, drives the reduce task to OOM, and then exhausts the memory of the TaskTracker machine.
>
> Here is our solution: change the code path that MapOutputCopier threads use to shuffle map output in memory so that it uses a BuddySystem, similar to the buddy system the Linux kernel uses to allocate and deallocate memory pages. When the reduce task is launched, it hands some memory to this BuddySystem, say 128MB. Every time the reduce task wants to shuffle map output in memory, it requests a buffer from the BuddySystem. If the BuddySystem has enough memory, the buffer is used; if not, the MapOutputCopier threads wait(), just as they do in the current Hadoop shuffle code. This reduces the reduce task's memory usage and greatly reduces memory shortage on the TaskTracker. In our cluster, the BuddySystem eliminated the problem of losing a batch of tasktrackers to memory overuse while huge jobs were running, which makes the cluster more stable.
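
The allocation scheme described in this issue can be illustrated with a minimal sketch. This is not the code from the attached patches: the class name BuddyPool, its methods, and the block sizes are all hypothetical, chosen only to show how a buddy allocator over one preallocated byte[] could hand out shuffle buffers and make copier threads wait() when the pool is exhausted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical buddy allocator over a single preallocated byte[] (illustration only). */
public class BuddyPool {
    private final byte[] pool;   // backing memory, allocated once at start-up
    private final int minBlock;  // smallest block size in bytes (power of two)
    private final int maxOrder;  // order k holds blocks of (minBlock << k) bytes
    private final List<TreeSet<Integer>> free = new ArrayList<>(); // free offsets per order

    public BuddyPool(int totalBytes, int minBlock) {
        if (Integer.bitCount(totalBytes) != 1 || Integer.bitCount(minBlock) != 1
                || minBlock > totalBytes) {
            throw new IllegalArgumentException("sizes must be powers of two");
        }
        this.pool = new byte[totalBytes];
        this.minBlock = minBlock;
        this.maxOrder = Integer.numberOfTrailingZeros(totalBytes / minBlock);
        for (int k = 0; k <= maxOrder; k++) free.add(new TreeSet<>());
        free.get(maxOrder).add(0); // initially one free block covering the whole pool
    }

    /** Smallest order whose block size is at least the requested size. */
    private int orderFor(int size) {
        int k = 0;
        while ((minBlock << k) < size) k++;
        return k;
    }

    /** Returns the offset of a block of at least size bytes, or -1 if none is free. */
    public synchronized int tryAllocate(int size) {
        if (size <= 0 || size > pool.length) throw new IllegalArgumentException("bad size");
        int want = orderFor(size);
        int k = want;
        while (k <= maxOrder && free.get(k).isEmpty()) k++;
        if (k > maxOrder) return -1;
        int offset = free.get(k).pollFirst();
        while (k > want) {       // split, putting the upper half back on the free list
            k--;
            free.get(k).add(offset + (minBlock << k));
        }
        return offset;
    }

    /** Blocks the calling thread until a suitable block becomes free. */
    public synchronized int allocate(int size) throws InterruptedException {
        int offset;
        while ((offset = tryAllocate(size)) < 0) wait();
        return offset;
    }

    /** Returns a block; size must be the value passed when it was allocated. */
    public synchronized void release(int offset, int size) {
        int k = orderFor(size);
        while (k < maxOrder) {
            int buddy = offset ^ (minBlock << k); // the buddy differs in exactly one bit
            if (!free.get(k).remove(buddy)) break; // buddy still in use: stop merging
            offset = Math.min(offset, buddy);
            k++;
        }
        free.get(k).add(offset);
        notifyAll();             // wake any threads waiting in allocate()
    }

    public byte[] backingArray() { return pool; }

    public synchronized int freeBytes() {
        int total = 0;
        for (int k = 0; k <= maxOrder; k++) total += free.get(k).size() * (minBlock << k);
        return total;
    }
}
```

In this sketch a copier thread would call allocate(mapOutputLength), copy the map output into backingArray() starting at the returned offset, and call release(offset, mapOutputLength) once the data has been merged. Rounding each request up to a power of two wastes some space inside a block, but the total heap used for in-memory shuffle stays bounded by the pool size instead of growing with every "new byte[...]".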