[ https://issues.apache.org/jira/browse/HADOOP-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matei Zaharia updated HADOOP-5361:
----------------------------------
Description:
Running a recent version of trunk on 100 nodes, I occasionally see some tasks
freeze at startup and hang the job. These tasks are not speculatively executed
either. Here's sample output from one of them:
{noformat}
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_2086525142250101885_39076 from any node: java.io.IOException: No live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}
Note how the DFS client repeatedly fails to retrieve the block, waiting about two minutes between attempts, and takes more than eight minutes in total to give up. During this time the task is *not* speculated. However, once this task finally failed, a new attempt of it ran successfully. Fetching the input file in question with bin/hadoop fs -get also worked fine.
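For reference, the retry behavior above has the shape of the loop sketched below. This is an illustrative sketch only, not the actual DFSClient source: pickLiveNode, MAX_ACQUIRE_FAILURES, and RETRY_WAIT_MS are hypothetical names, and the limit of 3 and the roughly two-minute wait are inferred from the three INFO messages and the timestamps in the log.
{noformat}
// Sketch of the retry pattern suggested by the log -- not the real
// DFSClient code. pickLiveNode() is a hypothetical helper that returns
// null when no known replica of the block is on a live datanode.
private DatanodeInfo chooseDataNode(LocatedBlock block) throws IOException {
  final int MAX_ACQUIRE_FAILURES = 3;          // assumed: matches the 3 INFO lines
  final long RETRY_WAIT_MS = 2 * 60 * 1000L;   // assumed: matches the ~2-minute gaps
  int failures = 0;
  while (true) {
    DatanodeInfo node = pickLiveNode(block);
    if (node != null) {
      return node;                             // success: read from this datanode
    }
    if (failures >= MAX_ACQUIRE_FAILURES) {
      throw new IOException("Could not obtain block: " + block.getBlock());
    }
    LOG.info("Could not obtain block " + block.getBlock()
        + " from any node: java.io.IOException: No live nodes contain current block");
    try {
      Thread.sleep(RETRY_WAIT_MS);             // the long, silent waits the task sits in
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting to retry read");
    }
    failures++;
  }
}
{noformat}
Whatever the exact loop on trunk looks like, the effect is the same: the task sits in it for over eight minutes, and during that whole window it is never speculated.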
There is no mention of the task attempt in question in the NameNode logs, but my guess is that something to do with the RPC queues is causing the client's connection to be lost, and that the DFSClient does not recover.
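Since bin/hadoop fs -get succeeded against the same file, the block itself was clearly retrievable by a fresh client. A minimal programmatic equivalent of that check is sketched below, assuming the cluster configuration is on the classpath; the class name ReadCheck and the hard-coded path (taken from the log above) are just for illustration.
{noformat}
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Drains the problem file through a brand-new DFSClient, mirroring the
// successful "bin/hadoop fs -get". If this passes while the task's
// long-lived client keeps failing, it points at stale client-side state
// (e.g. a lost RPC connection) rather than at the block itself.
public class ReadCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up *-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/root/rand2/part-00864");
    byte[] buf = new byte[64 * 1024];
    long total = 0;
    InputStream in = fs.open(file);
    try {
      for (int n; (n = in.read(buf)) > 0; ) {
        total += n;                            // read to EOF; a bad block throws here
      }
    } finally {
      in.close();
    }
    System.out.println("Read " + total + " bytes from " + file);
  }
}
{noformat}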
Updated description to remove an insanely long line.
> Tasks freeze with "No live nodes contain current block", job takes long time to recover
> ----------------------------------------------------------------------------------------
>
> Key: HADOOP-5361
> URL: https://issues.apache.org/jira/browse/HADOOP-5361
> Project: Hadoop Core
> Issue Type: Bug
> Affects Versions: 0.21.0
> Reporter: Matei Zaharia
>
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.