[
https://issues.apache.org/jira/browse/MAPREDUCE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646670#comment-13646670
]
Ivan Mitic commented on MAPREDUCE-50:
-------------------------------------
Hi Steve, Vinod,
I've run into a problem similar to this one. In my case, the JobTracker started
failing jobs because network topology resolution started failing for a
single node in the cluster:
{code}
2013-04-27 08:33:08,204 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.NullPointerException
    at org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:3205)
    at org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:550)
    at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:734)
    at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4214)
    at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
{code}
What happens is that some input split blocks are located on the datanode with
the same IP/hostname as the TT. As a side effect, many customer jobs fail
during initialization.
The NN, on the other hand, has fallback logic that defaults to /default-rack,
and this inconsistency actually makes the problem more severe :)
{code}
2013-04-27 04:36:47,185 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The resolve call returned null! Using /default-rack for host [100.64.34.3]
2013-04-27 04:36:47,185 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/100.64.34.3:50010
{code}
In terms of the fix, my proposal would be to add the same fallback logic to the
JobTracker. In our case, we had a network topology script that worked
fine for a year or so and has now started failing for a single node, for a
reason we cannot explain yet.
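A minimal sketch of the kind of fallback proposed above, assuming the resolver can return null (or a list with a null entry) when the topology script fails. This is not the actual JobTracker code: the class name TopologyFallback, the Resolver interface and the method resolveOrDefault are hypothetical stand-ins used only for illustration. The only values taken from this comment are the /default-rack path and the wording of the NN log message.

```java
import java.util.Collections;
import java.util.List;

/** Sketch only: null-safe rack lookup with a default-rack fallback. */
final class TopologyFallback {
    /** Same fallback rack that the NN log in this comment reports. */
    static final String DEFAULT_RACK = "/default-rack";

    /** Hypothetical stand-in for the configured topology mapping. */
    interface Resolver {
        List<String> resolve(List<String> names);
    }

    private TopologyFallback() {}

    /** Returns the resolved rack, or DEFAULT_RACK if resolution fails. */
    static String resolveOrDefault(Resolver resolver, String host) {
        List<String> racks = resolver.resolve(Collections.singletonList(host));
        if (racks == null || racks.isEmpty() || racks.get(0) == null) {
            // Log and fall back instead of dereferencing a null result.
            System.err.println("The resolve call returned null! Using "
                + DEFAULT_RACK + " for host [" + host + "]");
            return DEFAULT_RACK;
        }
        return racks.get(0);
    }
}
```

With a guard like this, a single unresolvable node would land in /default-rack on both the NN and the JobTracker side instead of failing job initialization.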
Let me know what you think. I'll take up this Jira if you don't mind.
> NPE in heartbeat when the configured topology script doesn't exist
> ------------------------------------------------------------------
>
> Key: MAPREDUCE-50
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-50
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 1.0.3
> Reporter: Vinod Kumar Vavilapalli
>