[jira] Updated: (MAPREDUCE-1264) Error Recovery failed, task will continue but run forever as new data only comes in very very slowly

2010-02-18 Thread Chris Douglas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated MAPREDUCE-1264:
-------------------------------------

Fix Version/s: (was: 0.20.2)

> Error Recovery failed, task will continue but run forever as new data only 
> comes in very very slowly
> 
>
> Key: MAPREDUCE-1264
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1264
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: Thibaut
>
> Hi,
> Sometimes some of my jobs will not finish and run forever (it normally 
> happens in the reducers, on a random basis). I have to manually fail the 
> task so that it is restarted and can finish.
> The error log on the node is full of entries like:
> java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
> java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
> java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
> The error entries all refer to the same data block.
> Unfortunately, the reduce function still seems to be called in the reducer 
> with valid data (although very, very slowly), so the task is never killed 
> and restarted and takes forever to run!
> If I kill the task, the job will finish without any problems. 
> I experienced the same problem under version 0.20.0 as well.
> Thanks,
> Thibaut

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1264) Error Recovery failed, task will continue but run forever as new data only comes in very very slowly

2009-12-04 Thread Thibaut (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thibaut updated MAPREDUCE-1264:
---

  Description: 
Hi,

Sometimes some of my jobs will not finish and run forever (it normally happens 
in the reducers, on a random basis). I have to manually fail the task so that 
it is restarted and can finish.

The error log on the node is full of entries like:
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
The error entries all refer to the same data block.
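
In case it helps with diagnosis, the failing block and the datanodes holding 
its replicas can be inspected with fsck. A minimal sketch, assuming the job 
output lives under /user/thibaut/output (a placeholder path):

  # list files, their blocks, and the datanodes holding each replica
  hadoop fsck /user/thibaut/output -files -blocks -locations

  # narrow the report down to the block named in the log above
  hadoop fsck /user/thibaut/output -files -blocks -locations | grep blk_-8036012205502614140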

Unfortunately, the reduce function still seems to be called in the reducer with 
valid data (although very, very slowly), so the task is never killed and 
restarted and takes forever to run!

If I kill the task, the job will finish without any problems. 
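
For reference, the manual workaround is to fail (or kill) the hanging attempt 
from the command line so the framework schedules a fresh one. The attempt id 
below is a made-up example; the real one can be read from the JobTracker web UI:

  # mark the stuck reduce attempt as failed (counts against mapred.reduce.max.attempts)
  hadoop job -fail-task attempt_200912040001_0042_r_000003_0

  # or kill it, which does not count against the attempt limit
  hadoop job -kill-task attempt_200912040001_0042_r_000003_0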

I experienced the same problem under version 0.20.0 as well.


Thanks,
Thibaut

  was:
Hi,

Sometimes some of my jobs will not finish and run forever (it normally happens 
in the reducers, on a random basis). I have to manually fail the task so that 
it is restarted and can finish.

The error log on the node is full of entries like:
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed  because recovery from primary datanode 192.168.0.3:50011 failed 6 times.  Pipeline was 192.168.0.3:50011. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
The error entries all refer to the same data block.

Unfortunately, the reduce function still seems to be called in the reducer with 
valid data (although very, very slowly), so the task is never killed and 
restarted and takes forever to run!

I experienced the same problem under version 0.20.0 as well.


Thanks,
Thibaut

Fix Version/s: 0.20.2

> Error Recovery failed, task will continue but run forever as new data only 
> comes in very very slowly
> 
>
> Key: MAPREDUCE-1264
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1264
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: Thibaut
>