[ 
https://issues.apache.org/jira/browse/MAPREDUCE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646670#comment-13646670
 ] 

Ivan Mitic commented on MAPREDUCE-50:
-------------------------------------

Hi Steve, Vinod,

I've run into a problem similar to this one. In my case, the JobTracker started 
failing jobs because network topology resolution started failing for a 
single node in the cluster:
{code}
2013-04-27 08:33:08,204 ERROR org.apache.hadoop.mapred.JobTracker: Job initialization failed:
java.lang.NullPointerException
        at org.apache.hadoop.mapred.JobTracker.resolveAndAddToTopology(JobTracker.java:3205)
        at org.apache.hadoop.mapred.JobInProgress.createCache(JobInProgress.java:550)
        at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:734)
        at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4214)
        at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
{code}

What happens is that some input split blocks are located on the datanode with 
the same IP/hostname as the TT. As a side effect, many of the customer jobs 
fail during initialization.

The NN, on the other hand, has fallback logic that defaults to /default-rack, 
and this inconsistency actually makes the problem more severe :)
{code}
2013-04-27 04:36:47,185 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: The resolve call returned null! Using /default-rack for host [100.64.34.3]
2013-04-27 04:36:47,185 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/100.64.34.3:50010
{code}

As for the fix, my proposal is to add the same fallback logic to the 
JobTracker. In our case, we had a network topology script that worked fine 
for a year or so and has now started failing for a single node, for a reason 
we cannot explain yet.
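
To illustrate the idea, here is a minimal sketch of such a fallback, written as 
standalone Java rather than as a patch against JobTracker. The names 
RackResolver and FallbackRackResolver are hypothetical and exist only for this 
example; the only detail taken from the logs above is the /default-rack value. 
The sketch assumes a failed resolution shows up as a null result (or a null or 
missing entry for a host).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for whatever maps hosts to racks
// (for example, a wrapper around the topology script).
interface RackResolver {
    // Assumption: may return null, a short list, or null entries on failure.
    List<String> resolve(List<String> hosts);
}

// Wraps another resolver and substitutes a default rack whenever
// resolution fails, instead of letting a null propagate to the caller.
final class FallbackRackResolver implements RackResolver {
    static final String DEFAULT_RACK = "/default-rack";

    private final RackResolver delegate;

    FallbackRackResolver(RackResolver delegate) {
        this.delegate = delegate;
    }

    @Override
    public List<String> resolve(List<String> hosts) {
        List<String> racks = delegate.resolve(hosts);
        List<String> result = new ArrayList<>(hosts.size());
        for (int i = 0; i < hosts.size(); i++) {
            // Treat a null list, a missing entry, or a null entry as a failure.
            String rack = (racks != null && i < racks.size()) ? racks.get(i) : null;
            result.add(rack != null ? rack : DEFAULT_RACK);
        }
        return result;
    }
}
```

With a wrapper like this, a host whose resolution fails would land in 
/default-rack, matching what the NN log shows, rather than failing job 
initialization.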

Let me know what you think. I'll take up this Jira if you don't mind.
                
> NPE in heartbeat when the configured topology script doesn't exist
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-50
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-50
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 1.0.3
>            Reporter: Vinod Kumar Vavilapalli
>

