[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484036#comment-13484036 ]

Hudson commented on MAPREDUCE-4730:
-----------------------------------

Integrated in Hadoop-Yarn-trunk #16 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/16/])
MAPREDUCE-4730. Fix Reducer's EventFetcher to scale the map-completion requests slowly to avoid HADOOP-8942. Contributed by Jason Lowe. (Revision 1401941)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401941
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestEventFetcher.java

AM crashes due to OOM while serving up map task completion events
-----------------------------------------------------------------

Key: MAPREDUCE-4730
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4730
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: applicationmaster, mrv2
Affects Versions: 0.23.3
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Blocker
Fix For: 2.0.3-alpha, 0.23.5
Attachments: MAPREDUCE-4730.patch, MAPREDUCE-4730.patch, MAPREDUCE-4730.patch

We're seeing a repeatable OOM crash in the AM for a task with around 3 maps and 3000 reducers. Details to follow.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484079#comment-13484079 ]

Hudson commented on MAPREDUCE-4730:
-----------------------------------

Integrated in Hadoop-Hdfs-0.23-Build #415 (See [https://builds.apache.org/job/Hadoop-Hdfs-0.23-Build/415/])
MAPREDUCE-4730. Fix Reducer's EventFetcher to scale the map-completion requests slowly to avoid HADOOP-8942. Contributed by Jason Lowe.
svn merge --ignore-ancestry -c 1401941 ../../trunk/ (Revision 1401943)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401943
Files :
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
* /hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestEventFetcher.java
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484095#comment-13484095 ]

Hudson commented on MAPREDUCE-4730:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1206 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1206/])
MAPREDUCE-4730. Fix Reducer's EventFetcher to scale the map-completion requests slowly to avoid HADOOP-8942. Contributed by Jason Lowe. (Revision 1401941)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401941
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestEventFetcher.java
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484136#comment-13484136 ]

Hudson commented on MAPREDUCE-4730:
-----------------------------------

Integrated in Hadoop-Mapreduce-trunk #1236 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1236/])
MAPREDUCE-4730. Fix Reducer's EventFetcher to scale the map-completion requests slowly to avoid HADOOP-8942. Contributed by Jason Lowe. (Revision 1401941)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401941
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestEventFetcher.java
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483435#comment-13483435 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

I was thinking that the connection timeouts are unrelated to HADOOP-8942.

You are right, AMScalability only runs maps, so there is no chance to uncover this issue.

Are these socket timeout exceptions? I remember running into those with gridmix, but never got to the bottom of it because of more pressing concerns.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483454#comment-13483454 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Yes, these are socket timeout exceptions. The timeouts are somewhat related to HADOOP-8942 in the sense that when the AM heap starts to fill up from buffering all those responses, it spends more time garbage collecting. Enough garbage collecting leads to unresponsiveness and ultimately to timeouts on some of the clients.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483622#comment-13483622 ]

Hadoop QA commented on MAPREDUCE-4730:
--------------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12550687/MAPREDUCE-4730.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2965//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2965//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483797#comment-13483797 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

Neat test case!

+1, checking this in.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483822#comment-13483822 ]

Hudson commented on MAPREDUCE-4730:
-----------------------------------

Integrated in Hadoop-trunk-Commit #2925 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2925/])
MAPREDUCE-4730. Fix Reducer's EventFetcher to scale the map-completion requests slowly to avoid HADOOP-8942. Contributed by Jason Lowe. (Revision 1401941)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401941
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/EventFetcher.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/Shuffle.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestEventFetcher.java
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13481862#comment-13481862 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

Patch looks good. Can you try writing a simple test for EventFetcher? You can mock umbilicalProtocol, shuffleScheduler and reporter, I suppose. Then you can validate your current change too. Let me know if it becomes too cumbersome.

bq. The only issue I ran into was a significant number of maps and reduces failed because they timed out trying to establish a connection to the AM.
This is new. I don't remember us running into it when we ran AMScalability. Can you file a bug? More details would be great to have.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13480289#comment-13480289 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Update on testing: I was able to test this (along with the fix for MAPREDUCE-4733) using a sleep job with 2 maps and 3000 reduces on a cluster big enough to mass-launch the map and reduce phases. The AM with a 1.5GB slot size stayed up during the job, where previously it failed even with a larger slot.

The only issue I ran into was that a significant number of maps and reduces failed because they timed out trying to establish a connection to the AM. I suspected the AM could have been busy garbage collecting and causing the delays, so I bumped up the AM size to 3G, and it ran smoothly with no connection timeout failures from any tasks.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479257#comment-13479257 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

A little more digging and I'm a bit more confident that this is a flow control problem in the IPC layer. I think the scenario goes like this:

# Thousands of reducers start asking for map completion events at about the same time.
# An IPC Server.Handler thread fields a call off the queue, makes the call and gets 900K of data.
# The Handler thread queues up the response data to the connection, likely sees it's the only thing in the queue, and tries to push out the data.
# The response is too big to send in full without blocking, so the Handler pushes the remainder back onto the response queue for the Responder thread to deal with and moves on to another call from the call queue.
# Lots of reducers are queueing up in the call queue to get their 900K of data, and the handler threads are processing them and pushing that data onto the response queues as fast as they can.
# The Responder thread and/or socket I/O can't keep pace with the rate at which handlers are generating 900K responses, and we eventually exhaust memory.
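The scenario above can be sketched as a small deterministic simulation. This is only an illustration of the imbalance between handlers and the responder, not Hadoop IPC code: the class name, the notion of a "tick", and all the rates are hypothetical; only the 900K response size and the 3000 reducers come from the thread.

```java
/**
 * Toy model of the backlog described above. Each "tick", handler threads
 * turn some pending calls into queued response bytes, and a single
 * responder drains a fixed number of bytes. All rates are made up.
 */
final class ResponseBacklogSketch {

    /** Returns the largest number of response bytes queued at any one time. */
    static long peakQueuedBytes(int pendingCalls, long responseBytes,
                                int handledPerTick, long drainedPerTick) {
        if (handledPerTick <= 0 || drainedPerTick <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        long queued = 0;
        long peak = 0;
        int remaining = pendingCalls;
        while (remaining > 0 || queued > 0) {
            // Handlers service as many calls as they can this tick.
            int handled = Math.min(remaining, handledPerTick);
            remaining -= handled;
            queued += handled * responseBytes;
            peak = Math.max(peak, queued);
            // The responder drains what the sockets will accept this tick.
            queued = Math.max(0L, queued - drainedPerTick);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Handlers three times faster than the responder (hypothetical rates).
        System.out.println(peakQueuedBytes(3000, 900_000L, 300, 90_000_000L));
        // Handlers and responder evenly matched.
        System.out.println(peakQueuedBytes(3000, 900_000L, 100, 90_000_000L));
    }
}
```

With these made-up rates the backlog peaks at 1,890,000,000 bytes when handlers outrun the responder by a factor of three, versus 90,000,000 bytes when the two are matched. The point is that nothing in the model bounds the queue, so the peak depends only on how far the rates diverge.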
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479286#comment-13479286 ]

Robert Joseph Evans commented on MAPREDUCE-4730:
------------------------------------------------

The patch is simple enough. If Jenkins comes back OK, I am a +1 on it.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479290#comment-13479290 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

Great analysis! 900K * 3000 reducers = 2.7GB, so the numbers are adding up.

Instead of hard-coding it, each reducer could base it on the total number of reducers for the job (from configuration)?
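A minimal sketch of both points in the comment above: the worst-case buffering arithmetic, and a per-request event limit derived from the job's reducer count rather than a fixed constant. The class, method names and both constants are hypothetical illustrations; the thread does not state the values used in the committed patch.

```java
/**
 * Back-of-the-envelope sketch. The two constants are illustrative
 * assumptions, not values confirmed by this thread.
 */
final class FetchLimitSketch {

    /** Hypothetical hard upper bound on events returned by one request. */
    static final int HARD_CAP_EVENTS = 10_000;

    /** Hypothetical budget of events in flight across all reducers at once. */
    static final int OUTSTANDING_EVENT_BUDGET = 3_000_000;

    /** Bytes buffered if every reducer's response is queued at the same time. */
    static long worstCaseBufferedBytes(long responseBytes, int numReducers) {
        return responseBytes * numReducers;
    }

    /** Per-request limit that shrinks as the number of reducers grows. */
    static int maxEventsToFetch(int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        int share = OUTSTANDING_EVENT_BUDGET / numReducers;
        // Never exceed the hard cap, and never ask for fewer than one event.
        return Math.max(1, Math.min(HARD_CAP_EVENTS, share));
    }

    public static void main(String[] args) {
        System.out.println(worstCaseBufferedBytes(900_000L, 3000));
        System.out.println(maxEventsToFetch(3000));
        System.out.println(maxEventsToFetch(10));
    }
}
```

Taking 900K as 900,000 bytes, 3000 simultaneous responses come to 2,700,000,000 bytes, which matches the 2.7GB figure. Under the assumed budget, a job with 3000 reducers would request 1000 events at a time, while a job with 10 reducers would still get the full hard cap, so small jobs are not slowed down.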
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479291#comment-13479291 ]

Hadoop QA commented on MAPREDUCE-4730:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549736/MAPREDUCE-4730.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2939//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2939//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479296#comment-13479296 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

Also, lessening it has performance implications on small jobs (not sure how much), given that the fetcher loop runs every 1 second irrespective of whether there are more events or not. So, hate to propose it, but shall we add in a config to override this?
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479311#comment-13479311 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Is the 1 second sleep necessary? Seems like we could eliminate that sleep if we got a maximum-sized response?
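The idea in the comment above can be sketched as a loop that re-requests immediately whenever a response comes back full, and only returns to the caller (which would then sleep) once a response is shorter than the limit. This is a self-contained illustration under assumptions: `EventSource` and `ListSource` are hypothetical stand-ins for the AM call and are not the EventFetcher or umbilical interfaces.

```java
import java.util.ArrayList;
import java.util.List;

final class FetchLoopSketch {

    /** Hypothetical stand-in for the call that returns completion events. */
    interface EventSource {
        List<String> fetch(int fromIndex, int maxEvents);
    }

    /** In-memory source that also counts how many requests it served. */
    static final class ListSource implements EventSource {
        private final List<String> events;
        int requests = 0;

        ListSource(List<String> events) {
            this.events = events;
        }

        @Override
        public List<String> fetch(int fromIndex, int maxEvents) {
            requests++;
            int from = Math.min(fromIndex, events.size());
            int to = Math.min(events.size(), from + maxEvents);
            return new ArrayList<>(events.subList(from, to));
        }
    }

    /**
     * Keeps fetching while responses are full; a short response means the
     * source has no more events right now, so the caller may sleep.
     * Returns the index of the next event to request.
     */
    static int fetchUntilShortResponse(EventSource source, int fromIndex, int maxEvents) {
        if (maxEvents <= 0) {
            throw new IllegalArgumentException("maxEvents must be positive");
        }
        int next = fromIndex;
        int received;
        do {
            received = source.fetch(next, maxEvents).size();
            next += received;
        } while (received == maxEvents);
        return next;
    }

    /** Builds a source holding n placeholder events. */
    static ListSource sourceWith(int n) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            events.add("event-" + i);
        }
        return new ListSource(events);
    }
}
```

With 25 events available and a limit of 10, the loop issues three back-to-back requests (10, 10, 5) before yielding, so a smaller per-request limit does not by itself add a sleep per batch.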
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479432#comment-13479432 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Filed MAPREDUCE-4733 to track the filtering/windowing issue in TaskAttemptListenerImpl.getMapCompletionEvents.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479532#comment-13479532 ]

Vinod Kumar Vavilapalli commented on MAPREDUCE-4730:
----------------------------------------------------

bq. Seems like we could eliminate that sleep if we got a maximum-sized response?
+1.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479633#comment-13479633 ]

Hadoop QA commented on MAPREDUCE-4730:
--------------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12549805/MAPREDUCE-4730.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 javadoc. The javadoc tool appears to have generated -4 warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2945//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2945//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4730) AM crashes due to OOM while serving up map task completion events
[ https://issues.apache.org/jira/browse/MAPREDUCE-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478496#comment-13478496 ]

Jason Lowe commented on MAPREDUCE-4730:
---------------------------------------

Here's what I have gathered so far from a heap dump of an AM attempt that was just about to run out of memory. Most of the memory was consumed by byte buffers, and most of those looked like RPC response buffers. I think there might be a flow control issue in the IPC layer that led to this.

More than half of the mappers finished before the first reducer started, and thousands of reducers all launched within a few seconds of each other. They all asked the AM for map completion task events, which currently caps the response to 1 events per query. Since more than 1 maps completed before the first reducers started, each reducer saw a full event list, which took around 900K for each response buffer.

There were many IPC Handler threads to service the calls, but only one Responder thread to send out the rather large response buffers. I see there's a blocking queue to prevent too many calls from coming in at once, but I didn't see any flow control between the Handlers and the Responder thread. It appears that as long as the Handler threads can keep the call queue relatively low, they can continue to queue up call response data faster than the Responder thread can send it out. Eventually this will exhaust available memory, leading to an OOM.
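To make "flow control between the Handlers and the Responder" concrete, here is one generic way such back-pressure could look: a byte budget that handlers must reserve from before queueing a response, and that the responder returns to once the bytes are on the wire. This is an assumption-laden sketch, not how Hadoop's IPC server works and not the fix adopted on this issue (which throttles on the reducer side); the class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;

/**
 * Hypothetical byte budget shared by handler and responder threads.
 * One semaphore permit represents one byte of queued response data.
 */
final class ResponseBudgetSketch {

    private final Semaphore bytesAvailable;

    ResponseBudgetSketch(int maxOutstandingBytes) {
        this.bytesAvailable = new Semaphore(maxOutstandingBytes);
    }

    /** Handler side: returns false when queueing this response would exceed the budget. */
    boolean tryReserve(int responseBytes) {
        return bytesAvailable.tryAcquire(responseBytes);
    }

    /** Responder side: call after the response has been fully written out. */
    void release(int responseBytes) {
        bytesAvailable.release(responseBytes);
    }

    /** Bytes that can still be reserved. */
    int available() {
        return bytesAvailable.availablePermits();
    }
}
```

With a 2,000,000-byte budget, two 900,000-byte responses can be reserved and a third is refused until the responder releases one. A handler that is refused would have to wait or leave the call on the call queue, which is what bounds memory in this model.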