Tasks freeze with "No live nodes contain current block", job takes long time to recover
---------------------------------------------------------------------------------------
Key: HADOOP-5361
URL: https://issues.apache.org/jira/browse/HADOOP-5361
Project: Hadoop Core
Issue Type: Bug
Affects Versions: 0.21.0
Reporter: Matei Zaharia
Running a recent version of trunk on 100 nodes, I occasionally see some tasks
freeze at startup and hang the job. These tasks are not speculatively executed
either. Here's sample output from one of them:
{noformat}
2009-02-27 15:19:09,856 WARN org.apache.hadoop.conf.Configuration: DEPRECATED:
hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated.
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override
properties of core-default.xml, mapred-default.xml and hdfs-default.xml
respectively
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain
block blk_2086525142250101885_39076 from any node: java.io.IOException: No
live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain
block blk_2086525142250101885_39076 from any node: java.io.IOException: No
live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain
block blk_2086525142250101885_39076 from any node: java.io.IOException: No
live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076
file=/user/root/rand2/part-00864
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076
file=/user/root/rand2/part-00864
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}
Note how the DFS client fails several times to retrieve the block, waiting roughly two minutes between attempts, rather than giving up promptly. During this time, the task is *not* speculatively executed. However, once this attempt finally failed, a new attempt ran successfully, and fetching the input file in question with bin/hadoop fs -get also worked fine.
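
For context, the timestamps above suggest a bounded retry loop: three "Could not obtain block" passes roughly two minutes apart, followed by a terminal IOException. The sketch below is a simplified, hypothetical paraphrase of that pattern, not the actual DFSClient.chooseDataNode code; the constants are inferred from the log, and fetchFromSomeDatanode is a placeholder.
{code:java}
// Simplified, hypothetical sketch of the retry pattern suggested by the log above.
// It is NOT the actual DFSClient code; the constants are inferred from the timestamps.
import java.io.IOException;

public class BlockFetchRetrySketch {
    private static final int MAX_ATTEMPTS = 3;          // log shows three "Could not obtain block" messages
    private static final long WAIT_MS = 2 * 60 * 1000L; // roughly two minutes between attempts

    /** Tries to read a block, sleeping between attempts and only failing after the last one. */
    byte[] readBlockWithRetries(String blockId) throws IOException {
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return fetchFromSomeDatanode(blockId);   // placeholder for the real datanode read
            } catch (IOException e) {
                lastFailure = e;
                System.err.println("Could not obtain block " + blockId + " from any node: " + e);
                try {
                    Thread.sleep(WAIT_MS);               // the attempt makes no visible progress during this wait
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        throw new IOException("Could not obtain block: " + blockId, lastFailure);
    }

    // Placeholder so the sketch compiles; stands in for the actual block read.
    private byte[] fetchFromSomeDatanode(String blockId) throws IOException {
        throw new IOException("No live nodes contain current block");
    }
}
{code}
Because nothing is thrown until the final pass, the attempt sits for six-plus minutes with no progress that the framework can act on.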
There is no mention of the task attempt in question in the NameNode logs, but my guess is that something related to RPC queues is causing its connection to be lost, and that the DFSClient does not recover.
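
While the root cause is unclear, one client-side knob that may limit the damage while we debug is dfs.client.max.block.acquire.failures (my recollection of the property name; please verify against the cluster's defaults), which bounds how many passes the DFS client makes over the datanode list before throwing. A minimal sketch, assuming that property name is still honored on trunk:
{code:java}
// Hypothetical mitigation sketch; the property name and value are assumptions
// to verify, not a confirmed fix for this issue.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class FailFastJobSetup {
    public static JobConf configure(Configuration base) {
        JobConf conf = new JobConf(base);
        // Let the DFS client give up on an unreadable block after a single failed
        // pass instead of retrying for minutes, so the task fails quickly and the
        // framework can reschedule it on another node.
        conf.setInt("dfs.client.max.block.acquire.failures", 1);
        return conf;
    }
}
{code}
Lowering it should make a stuck attempt fail within one retry cycle so it can be rescheduled, though it obviously does not address why the block is unreachable in the first place.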