[ 
https://issues.apache.org/jira/browse/HADOOP-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated HADOOP-5361:
----------------------------------

    Description: 
Running a recent version of trunk on 100 nodes, I occasionally see some tasks 
freeze at startup and hang the job. These tasks are not speculatively executed 
either. Here's sample output from one of them:

{noformat}
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: 
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)

2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

Note how the DFS client repeatedly fails to retrieve the block, waiting two 
minutes between attempts, without ever giving up. During this time, the task is 
*not* speculated. However, once this task finally failed, a new attempt of it 
ran successfully. Fetching the input file in question with bin/hadoop fs -get 
also worked fine.
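The retry behavior above can be sketched as a bounded fetch loop. The snippet below is purely illustrative (the class, interface, and method names are hypothetical, not the actual DFSClient code); it shows how a cap on the number of attempts would let the read surface a failure promptly, so the task can fail and be re-run, instead of retrying forever with a fixed two-minute wait:

```java
import java.io.IOException;

// Purely illustrative sketch, not the actual DFSClient code: a block fetch
// that retries a bounded number of times with a fixed wait, then gives up
// so the surrounding task can fail and be re-executed.
public class BoundedRetry {

    /** Hypothetical stand-in for whatever actually reads a block from a DataNode. */
    interface BlockFetcher {
        byte[] fetch(String blockId) throws IOException;
    }

    static byte[] fetchWithRetries(BlockFetcher fetcher, String blockId,
                                   int maxAttempts, long waitMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetcher.fetch(blockId);
            } catch (IOException e) {
                last = e;                     // e.g. "No live nodes contain current block"
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis); // the DFSClient waits ~2 minutes here
                }
            }
        }
        // Bounded: after maxAttempts we surface the failure instead of looping forever.
        throw new IOException("Could not obtain block: " + blockId, last);
    }
}
```

With a bounded attempt count, the attempt fails fast enough for the framework to reschedule it, which matches the observation that a fresh attempt of the task succeeded immediately.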

The task attempt in question is never mentioned in the NameNode logs, but my 
guess is that something related to the RPC queues is causing its connection to 
be lost, and the DFSClient does not recover.

  was:
Running a recent version of trunk on 100 nodes, I occasionally see some tasks 
freeze at startup and hang the job. These tasks are not speculatively executed 
either. Here's sample output from one of them:

{noformat}
2009-02-27 15:19:09,856 WARN org.apache.hadoop.conf.Configuration: DEPRECATED: 
hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. 
Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override 
properties of core-default.xml, mapred-default.xml and hdfs-default.xml 
respectively
2009-02-27 15:19:10,229 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2009-02-27 15:19:10,486 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 0
2009-02-27 15:21:20,952 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:23:23,972 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:25:26,992 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain 
block blk_2086525142250101885_39076 from any node:  java.io.IOException: No 
live nodes contain current block
2009-02-27 15:27:30,012 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: 
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)

2009-02-27 15:27:30,018 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Could not obtain block: blk_2086525142250101885_39076 
file=/user/root/rand2/part-00864
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1664)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1492)
    at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1619)
    at java.io.DataInputStream.read(DataInputStream.java:83)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:134)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:136)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:40)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:191)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:175)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:155)
{noformat}

Note how the DFS client fails multiple times to retrieve the block, with a 2 
minute wait between each one, without giving up. During this time, the task is 
*not* speculated. However, once this task finally failed, a new version of it 
ran successfully. Getting the input file in question with bin/hadoop fs -get 
also worked fine.

There is no mention of the task attempt in question in the NameNode logs but my 
guess is that something to do with RPC queues is causing its connection to get 
lost, and the DFSClient does not recover.


Updated description to remove an insanely long line.

> Tasks freeze with "No live nodes contain current block", job takes long time 
> to recover
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5361
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5361
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>            Reporter: Matei Zaharia
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
