[ https://issues.apache.org/jira/browse/MAPREDUCE-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720982#comment-13720982 ]
Jason Lowe commented on MAPREDUCE-5423: --------------------------------------- This may be a duplicate of MAPREDUCE-4842. > Rare deadlock situation when reducers try to fetch map output > ------------------------------------------------------------- > > Key: MAPREDUCE-5423 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-5423 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv2 > Affects Versions: 2.0.2-alpha > Reporter: Chu Tong > > During our cluster deployment, we found there is a very rare deadlock > situation when reducers try to fetch map output. We had 5 fetchers and log > snippet illustrates this problem is below (all fetchers went into a wait > state after they can't acquire more RAM beyond the memoryLimit and no fetcher > is releasing memory): > 2013-07-18 04:32:28,135 INFO [main] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: MergerManager: > memoryLimit=1503238528, maxSingleShuffleLimit=375809632, > mergeThreshold=992137472, ioSortFactor=10, memToMemMergeOutputsThreshold=10 > 2013-07-18 04:32:28,138 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0 Thread started: EventFetcher for > fetching Map Completion Events > 2013-07-18 04:32:28,146 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:28,146 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 1 to fetcher#1 > 2013-07-18 04:32:28,146 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 1 of 1 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#1 > 2013-07-18 04:32:28,319 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000017_0 > sent hash and receievd reply > 2013-07-18 04:32:28,320 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle > output of map attempt_1373902166027_0622_m_000017_0 decomp: 27 len: 31 to > MEMORY > 2013-07-18 04:32:28,325 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 27 bytes from > map-output for attempt_1373902166027_0622_m_000017_0 > 2013-07-18 04:32:28,325 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 27, inMemoryMapOutputs.size() -> 1, commitMemory -> > 0, usedMemory ->27 > 2013-07-18 04:32:28,325 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#1 in 179s > 2013-07-18 04:32:33,158 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:33,158 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 1 to fetcher#1 > 2013-07-18 04:32:33,158 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 1 of 1 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#1 > 2013-07-18 04:32:33,161 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000016_0 > sent hash and receievd reply > 2013-07-18 04:32:33,200 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle > output of map attempt_1373902166027_0622_m_000016_0 decomp: 55841282 len: > 55841286 to MEMORY > 2013-07-18 04:32:33,322 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 55841282 bytes from > map-output for attempt_1373902166027_0622_m_000016_0 > 2013-07-18 04:32:33,323 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 55841282, inMemoryMapOutputs.size() -> 2, commitMemory > -> 27, usedMemory ->55841309 > 2013-07-18 04:32:39,594 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 118022137 bytes from > map-output for attempt_1373902166027_0622_m_000015_0 > 2013-07-18 04:32:39,594 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 118022137, inMemoryMapOutputs.size() -> 3, > commitMemory -> 55841309, usedMemory ->173863446 > 2013-07-18 04:32:39,594 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#1 in 413s > 2013-07-18 04:32:42,188 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:42,188 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 1 to fetcher#1 > 2013-07-18 04:32:42,188 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 1 of 1 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#1 > 2013-07-18 04:32:42,190 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000014_0 > sent hash and receievd reply > 2013-07-18 04:32:42,277 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle > output of map attempt_1373902166027_0622_m_000014_0 decomp: 140715962 len: > 140715966 to MEMORY > 2013-07-18 04:32:42,493 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 140715962 bytes from > map-output for attempt_1373902166027_0622_m_000014_0 > 2013-07-18 04:32:42,493 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 140715962, inMemoryMapOutputs.size() -> 4, > commitMemory -> 173863446, usedMemory ->314579408 > 2013-07-18 04:32:42,494 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#1 in 306s > 2013-07-18 04:32:43,192 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:43,192 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 1 to fetcher#1 > 2013-07-18 04:32:43,192 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 1 of 1 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#1 > 2013-07-18 04:32:43,195 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000013_0 > sent hash and receievd reply > 2013-07-18 04:32:43,280 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle > output of map attempt_1373902166027_0622_m_000013_0 decomp: 141243082 len: > 141243086 to MEMORY > 2013-07-18 04:32:43,506 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 141243082 bytes from > map-output for attempt_1373902166027_0622_m_000013_0 > 2013-07-18 04:32:43,506 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 141243082, inMemoryMapOutputs.size() -> 5, > commitMemory -> 314579408, usedMemory ->455822490 > 2013-07-18 04:32:43,507 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#1 in 315s > 2013-07-18 04:32:44,195 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:44,195 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 1 to fetcher#1 > 2013-07-18 04:32:44,195 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 1 of 1 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#1 > 2013-07-18 04:32:44,198 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000011_0 > sent hash and receievd reply > 2013-07-18 04:32:44,305 INFO [fetcher#1] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#1 about to shuffle > output of map attempt_1373902166027_0622_m_000011_0 decomp: 173528412 len: > 173528416 to MEMORY > ... > 2013-07-18 04:32:56,901 INFO [fetcher#2] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 282474777 bytes from > map-output for attempt_1373902166027_0622_m_000001_0 > 2013-07-18 04:32:56,901 INFO [fetcher#2] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 282474777, inMemoryMapOutputs.size() -> 5, > commitMemory -> 1179552807, usedMemory ->1462027584 > 2013-07-18 04:32:56,901 INFO [fetcher#2] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#2 in 2682s > 2013-07-18 04:32:56,901 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 4 to fetcher#4 > 2013-07-18 04:32:56,902 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 4 of 4 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#4 > 2013-07-18 04:32:56,904 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000006_0,attempt_1373902166027_0622_m_000002_0,attempt_1373902166027_0622_m_000003_0,attempt_1373902166027_0622_m_000005_0 > sent hash and receievd reply > 2013-07-18 04:32:57,336 INFO [EventFetcher for fetching Map Completion > Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: > attempt_1373902166027_0622_r_000001_0: Got 1 new map-outputs > 2013-07-18 04:32:57,414 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#4 about to shuffle > output of map attempt_1373902166027_0622_m_000006_0 decomp: 280156692 len: > 280156696 to MEMORY > 2013-07-18 04:32:57,867 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: Read 280156692 bytes from > map-output for attempt_1373902166027_0622_m_000006_0 > 2013-07-18 04:32:57,867 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.MergeManager: closeInMemoryFile -> > map-output of size: 280156692, inMemoryMapOutputs.size() -> 6, > commitMemory -> 1462027584, usedMemory ->1742184276 > 2013-07-18 04:32:57,900 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#4 - MergerManager > returned Status.WAIT ... > 2013-07-18 04:32:57,901 INFO [fetcher#4] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: > 101-09-04.sc1.verticloud.com:8080 freed by fetcher#4 in 999s > 2013-07-18 04:32:57,901 INFO [fetcher#3] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Assiging > 101-09-04.sc1.verticloud.com:8080 with 4 to fetcher#3 > 2013-07-18 04:32:57,901 INFO [fetcher#3] > org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: assigned 4 of 4 to > 101-09-04.sc1.verticloud.com:8080 to fetcher#3 > 2013-07-18 04:32:57,903 INFO [fetcher#3] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: for > url=8080/mapOutput?job=job_1373902166027_0622&reduce=1&map=attempt_1373902166027_0622_m_000000_0,attempt_1373902166027_0622_m_000002_0,attempt_1373902166027_0622_m_000005_0,attempt_1373902166027_0622_m_000003_0 > sent hash and receievd reply > 2013-07-18 04:32:57,904 INFO [fetcher#3] > org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 - MergerManager > returned Status.WAIT ... > ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira