[ https://issues.apache.org/jira/browse/MAPREDUCE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646670#comment-13646670 ]
Ivan Mitic commented on MAPREDUCE-50:
-------------------------------------

Hi Steve, Vinod,

I've run into a problem similar to this one. In my case, the JobTracker started failing jobs because network topology resolution started failing for a single node in the cluster:

{code}
2013-04-27 08:33:08,204 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.NullPointerException
	at org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:3205)
	at org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:550)
	at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:734)
	at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4214)
	at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
{code}

What happens is that some input split blocks are located on the datanode with the same IP/hostname as the TT. As a side effect, many customer jobs fail during initialization.

The NN, on the other hand, has fallback logic that defaults to /default-rack, and this inconsistency actually makes the problem more severe :)

{code}
2013-04-27 04:36:47,185 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The resolve call returned null! Using /default-rack for host [100.64.34.3]
2013-04-27 04:36:47,185 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/100.64.34.3:50010
{code}

In terms of the fix, my proposal would be to add the same fallback logic to the JobTracker. In our case, we had a network topology script that worked fine for a year or so and has now started failing for a single node, for a reason we cannot yet explain.

Let me know what you think.
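
To make the proposal concrete, here is a rough sketch of the kind of fallback I have in mind. It is not the actual JobTracker code: `RackResolver` and `TopologyFallback` are hypothetical stand-ins introduced only for illustration, and the only thing taken from the logs above is the `/default-rack` default that the NN uses.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the pluggable rack resolver (e.g. one backed by
// a topology script). This is NOT the real Hadoop interface.
interface RackResolver {
    // Returns one rack path per host, or null if resolution failed.
    List<String> resolve(List<String> hosts);
}

// Hypothetical helper showing the fallback; names are assumptions.
final class TopologyFallback {
    // Same default rack the NameNode log reports falling back to.
    static final String DEFAULT_RACK = "/default-rack";

    private TopologyFallback() {}

    // Resolves the rack for a single host. If the resolver returns null,
    // an empty list, or a null entry, fall back to DEFAULT_RACK instead of
    // letting the caller hit a NullPointerException.
    static String resolveOrDefault(RackResolver resolver, String host) {
        List<String> racks = resolver.resolve(Collections.singletonList(host));
        if (racks == null || racks.isEmpty() || racks.get(0) == null) {
            System.err.println("The resolve call returned null! Using "
                + DEFAULT_RACK + " for host " + host);
            return DEFAULT_RACK;
        }
        return racks.get(0);
    }
}
```

With something like this in place, a host whose topology cannot be resolved would land in `/default-rack` on the JobTracker side as well, which would at least keep the JT and NN views consistent.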
I'll take up this Jira if you don't mind.

> NPE in heartbeat when the configured topology script doesn't exist
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-50
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-50
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 1.0.3
> Reporter: Vinod Kumar Vavilapalli
>