Tasks freeze with "No live nodes contain current block", job takes long time to 
recover
---------------------------------------------------------------------------------------

                 Key: HADOOP-5361
                 URL: https://issues.apache.org/jira/browse/HADOOP-5361
             Project: Hadoop Core
          Issue Type: Bug
    Affects Versions: 0.21.0
            Reporter: Matei Zaharia


Running a recent version of trunk on 100 nodes, I occasionally see some tasks 
freeze at startup and hang the job. These tasks are not speculatively executed 
either. Here's sample output from one of them:

{noformat}
2009-02-27 15:19:09,856 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: 
hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. 
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override 
properties of core-default.xml, mapred-default.xml and hdfs-default.xml 
respectively
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: 
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)

2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

Note how the DFS client fails multiple times to retrieve the block, with a 
2-minute wait between each attempt, without ever giving up. During this time, 
the task is *not* speculated. However, once this task finally failed, a new 
attempt of it ran successfully. Getting the input file in question with 
bin/hadoop fs -get also worked fine.
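
The timing above looks consistent with a bounded retry loop of roughly the 
following shape. To be clear, this is only a simplified sketch of the behavior 
I'm seeing, not the actual DFSClient code; the retry limit, the wait time, and 
the identifiers are inferred from the log timestamps rather than taken from the 
source:

{noformat}
// Sketch only -- NOT the real DFSClient implementation. The retry limit and the
// ~2-minute wait are inferred from the log above; the identifiers are illustrative.
import java.io.IOException;

public class BlockReadRetrySketch {
    private static final int MAX_BLOCK_ACQUIRE_FAILURES = 3;   // assumed limit
    private static final long RETRY_WAIT_MS = 2 * 60 * 1000L;  // ~2 minutes, observed

    byte[] readBlock(String blockId) throws IOException {
        int failures = 0;
        while (true) {
            try {
                return fetchBlockFromDatanode(blockId);
            } catch (IOException e) {
                failures++;
                if (failures >= MAX_BLOCK_ACQUIRE_FAILURES) {
                    // Matches the final "Could not obtain block" failure in the log.
                    throw new IOException("Could not obtain block: " + blockId, e);
                }
                // The map task stays "running" while it sleeps here, so it is never
                // speculatively executed even though it is making no progress.
                try {
                    Thread.sleep(RETRY_WAIT_MS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IOException("Interrupted while waiting to retry", ie);
                }
            }
        }
    }

    private byte[] fetchBlockFromDatanode(String blockId) throws IOException {
        // Placeholder for the real datanode read; it fails when no live node has the block.
        throw new IOException("No live nodes contain current block");
    }
}
{noformat}

The net effect is that a single bad block read stalls the task for over eight 
minutes (15:19 to 15:27 in the log) before MapReduce even gets a chance to 
reschedule it.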

There is no mention of the task attempt in question in the NameNode logs, but 
my guess is that something to do with RPC queues is causing its connection to 
be lost, and that the DFSClient does not recover from this.
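
If it helps to diagnose this further, fsck should show whether the NameNode 
still reports live locations for the block. I only ran the plain fs -get 
mentioned above, so the fsck invocation below is a suggestion rather than 
something I verified:

{noformat}
# Suggested diagnostic: list the blocks and datanode locations the NameNode
# reports for the affected file.
bin/hadoop fsck /user/root/rand2/part-00864 -files -blocks -locations

# The plain copy that did succeed for me.
bin/hadoop fs -get /user/root/rand2/part-00864 /tmp/part-00864
{noformat}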

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
