[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464854#comment-13464854 ]

Hadoop QA commented on MAPREDUCE-4464:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12546872/MAPREDUCE-4464.patch
against trunk revision .

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2886//console

This message is automatically generated.

Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()

Key: MAPREDUCE-4464
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4464
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: task
Affects Versions: 1.0.0
Reporter: Clint Heath
Assignee: Clint Heath
Priority: Minor
Attachments: MAPREDUCE-4464_new.patch, MAPREDUCE-4464.patch, MAPREDUCE-4464.patch
Original Estimate: 1h
Remaining Estimate: 1h

If DNS does not resolve hostnames properly, reduce tasks can fail with a very misleading exception.

As per my peer Ahmed's diagnosis: In ReduceTask, it seems that event.getTaskTrackerHttp() returns a malformed URI, and so host from:

{code}
String host = u.getHost();
{code}

is evaluated to null and the NullPointerException is thrown afterwards in the ConcurrentHashMap.

I have written a patch to check for a null hostname condition when getHost is called in the getMapCompletionEvents method and print an intelligible warning message rather than suppressing it until later when it becomes confusing and misleading.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13464860#comment-13464860 ]

Clint Heath commented on MAPREDUCE-4464:

Thanks Harsh! I look forward to contributing much more too.

Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()

Key: MAPREDUCE-4464
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4464
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Affects Versions: 1.0.0
Reporter: Clint Heath
Assignee: Clint Heath
Priority: Minor
Fix For: 1.2.0
Attachments: MAPREDUCE-4464_new.patch, MAPREDUCE-4464.patch, MAPREDUCE-4464.patch
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463833#comment-13463833 ]

Harsh J commented on MAPREDUCE-4464:

Hi Clint,

Sorry for the delay here!

I noticed that the line:

bq. String host = u.getHost();

which is the one in question that carries a null, is then used in the lookup as:

bq. List<MapOutputLocation> loc = mapLocations.get(host);

Hence, I think the most ideal fix would be to throw an exception, because in the chunks later we rely heavily on host:

{code}
URI u = URI.create(event.getTaskTrackerHttp());
String host = u.getHost();
TaskAttemptID taskId = event.getTaskAttemptId();
URL mapOutputLocation = new URL(event.getTaskTrackerHttp() +
                        "/mapOutput?job=" + taskId.getJobID() +
                        "&map=" + taskId +
                        "&reduce=" + getPartition());
List<MapOutputLocation> loc = mapLocations.get(host);
if (loc == null) {
  loc = Collections.synchronizedList
    (new LinkedList<MapOutputLocation>());
  mapLocations.put(host, loc);
}
loc.add(new MapOutputLocation(taskId, host, mapOutputLocation));
numNewMaps ++;
{code}

As seen by its usage, if host itself is undeterminable, and is consistently null, we cannot really work with it, and throwing an IOException makes sense.

I'm currently running test-patch on your patch for branch-1; depending on its results I'll commit it in or post some further comments.

MR2 may be similarly affected on the netty side but may be failing properly already; I haven't the time to verify at the moment (perhaps another JIRA). So I'll just focus on the MR1 side now.

Thanks for the patch!
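The fail-fast check discussed in the comment above could look roughly like the following. This is a minimal sketch, not the attached patch: the class name `HostCheckSketch`, the helper `requireHost`, the exception message wording, and the example addresses are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.net.URI;

// Hypothetical standalone sketch; not the code from MAPREDUCE-4464.patch.
class HostCheckSketch {

  /**
   * Extracts the host from a TaskTracker HTTP address and fails fast with a
   * descriptive IOException when java.net.URI cannot determine a host,
   * instead of letting a null key reach a later map lookup.
   */
  static String requireHost(String trackerHttp) throws IOException {
    URI u = URI.create(trackerHttp);
    String host = u.getHost();
    if (host == null) {
      // Message wording is an assumption made for this example.
      throw new IOException(
          "Invalid hostname found in tracker location: '" + trackerHttp + "'");
    }
    return host;
  }

  public static void main(String[] args) {
    // Example addresses are made up for illustration.
    String[] addresses = {
      "http://tracker1.example.com:50060",
      "http://tracker_1.example.com:50060"
    };
    for (String address : addresses) {
      try {
        System.out.println("host = " + requireHost(address));
      } catch (IOException e) {
        System.out.println("rejected: " + e.getMessage());
      }
    }
  }
}
```

Throwing a checked IOException here fits a caller that already declares `throws IOException`, which is the situation described for getMapCompletionEvents in this thread.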
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463842#comment-13463842 ]

Clint Heath commented on MAPREDUCE-4464:

Thanks Harsh! I'll take a look at YARN and see if a similar situation is present there.
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463873#comment-13463873 ]

Harsh J commented on MAPREDUCE-4464:

Clint,

Thanks for looking at YARN (do file a new JIRA even if it's just for investigation). You will need to look at the ShuffleHandler class downwards.
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463991#comment-13463991 ]

Harsh J commented on MAPREDUCE-4464:

From test-patch on branch-1:

{code}
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] -1 findbugs. The patch appears to introduce 220 new Findbugs (version 2.0.1-rc3) warnings.
{code}

The 220 new Findbugs warnings from version 2.0.1-rc3 are what we get when the findbugs target is run on the patch-less branch-1.

From an initial look, there don't appear to be any existing test cases that cover this. Also, the method in which we'll be throwing this exception already declares throws IOException.

I ran -Dtestcase=TestMR* and -Dtestcase=TestMap* to run some MR tests over branch-1 and they seem to pass with this applied.

+1 for committing. Just gonna run one job over a cluster instance with a hostname with an underscore before doing so, to make sure this is working reliably well.
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436522#comment-13436522 ]

Hadoop QA commented on MAPREDUCE-4464:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12537745/MAPREDUCE-4464_new.patch
against trunk revision .

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2747//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418643#comment-13418643 ]

Clint Heath commented on MAPREDUCE-4464:

Sorry, I should have supplied the exception that we encountered when this issue happened. As it turned out, the host names in the cluster all had illegal DNS characters in them (the underscore "_"), so when the getHost() call was made, null was returned and we saw the following.

Mappers get about 80% complete when the reducers all begin to throw the following exceptions and then die almost immediately...eventually the whole job dies:

{noformat}
2012-06-26 15:56:02,326 FATAL org.apache.hadoop.mapred.Task: attempt_201206251823_0004_r_36_1 GetMapEventsThread Ignoring exception : java.lang.NullPointerException
    at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2835)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2756)

2012-06-26 15:56:02,356 FATAL org.apache.hadoop.mapred.Task: attempt_201206251823_0004_r_36_1 GetMapEventsThread Ignoring exception : org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201206251823_0004_r_36_1, with JvmId: jvm_201206251823_0004_r_-396118293
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3468)
    at org.apache.hadoop.mapred.TaskTracker.getMapCompletionEvents(TaskTracker.java:3731)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
    at org.apache.hadoop.ipc.Client.call(Client.java:1107)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy0.getMapCompletionEvents(Unknown Source)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2798)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2756)

2012-06-26 15:56:02,361 FATAL org.apache.hadoop.mapred.Task: Failed to contact the tasktracker
org.apache.hadoop.ipc.RemoteException: java.io.IOException: JvmValidate Failed. Ignoring request from task: attempt_201206251823_0004_r_36_1, with JvmId: jvm_201206251823_0004_r_-396118293
    at org.apache.hadoop.mapred.TaskTracker.validateJVM(TaskTracker.java:3468)
    at org.apache.hadoop.mapred.TaskTracker.fatalError(TaskTracker.java:3714)
    at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
    at org.apache.hadoop.ipc.Client.call(Client.java:1107)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
    at $Proxy0.fatalError(Unknown Source)
    at org.apache.hadoop.mapred.Task.reportFatalError(Task.java:294)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2781)
{noformat}
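The mechanism described in this comment can be reproduced outside Hadoop with a few lines of plain Java. The sketch below is only an illustration: the class name `UnderscoreHostDemo`, its helper methods, and the example addresses are hypothetical. It shows that java.net.URI yields no host for a name containing an underscore, and that ConcurrentHashMap rejects a null key with a NullPointerException.

```java
import java.net.URI;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical standalone demo; names and addresses are illustrative only.
class UnderscoreHostDemo {

  /** Returns the host component as java.net.URI parses it; may be null. */
  static String hostOf(String trackerHttp) {
    return URI.create(trackerHttp).getHost();
  }

  /** Returns true if ConcurrentHashMap.get(key) throws a NullPointerException. */
  static boolean lookupThrowsNpe(String key) {
    ConcurrentHashMap<String, String> locations =
        new ConcurrentHashMap<String, String>();
    try {
      locations.get(key);
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    String valid = "http://tracker1.example.com:50060";
    String withUnderscore = "http://tracker_1.example.com:50060";

    // The underscore prevents URI from parsing a server-based authority,
    // so getHost() has nothing to return for the second address.
    System.out.println("valid host:      " + hostOf(valid));
    System.out.println("underscore host: " + hostOf(withUnderscore));

    // A null host used as a map key is what surfaces as the NPE.
    System.out.println("lookup throws NPE: "
        + lookupThrowsNpe(hostOf(withUnderscore)));
  }
}
```

Note that `URI.create` itself does not reject the underscore address; the problem only becomes visible once the null host is used, which is why the original failure looked unrelated to hostnames.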
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418642#comment-13418642 ]

Karthik Kambatla commented on MAPREDUCE-4464:

Clint, thanks a lot for looking into this issue.

Minor comment: Would it be better to throw an IOException wrapped with your message, so that we can avoid the subsequent NullPointerException?
[jira] [Commented] (MAPREDUCE-4464) Reduce tasks failing with NullPointerException in ConcurrentHashMap.get()
[ https://issues.apache.org/jira/browse/MAPREDUCE-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418649#comment-13418649 ]

Clint Heath commented on MAPREDUCE-4464:

Karthik,

I'm fine with that as long as it doesn't interrupt the overall flow and process of what's supposed to happen when a task fails. In our case, every reduce task failed and therefore the entire job, but I can see a situation where only one TT machine had a bad hostname and therefore only a subset of reduce tasks would fail and the overall job may still complete. I just want to make sure we are informative in the logs and that the tasks are allowed to be re-tried if applicable, etc.

I haven't thought through all the logic far enough yet to know the ramifications of throwing an IOE right there. Harsh and I chatted about the same idea earlier, though. I'll vet that out...