[ https://issues.apache.org/jira/browse/MAPREDUCE-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer resolved MAPREDUCE-1264.
-----------------------------------------
    Resolution: Incomplete

Closing this as stale.

> Error Recovery failed, task will continue but run forever as new data only comes in very, very slowly
> ----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1264
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1264
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: Thibaut
>
> Hi,
> Sometimes some of my jobs will not finish and will run forever (it normally happens in the reducers, on a random basis). I have to manually fail the task so that it is restarted and can finish.
> The error log on the node is full of entries like:
> java.io.IOException: Error Recovery for block blk_-8036012205502614140_21582139 failed because recovery from primary datanode 192.168.0.3:50011 failed 6 times. Pipeline was 192.168.0.3:50011. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2582)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2076)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2239)
> The error entries all refer to the same data block.
> Unfortunately, the reduce function still seems to be called in the reducer with valid data (although very, very slowly), so the task is never killed and restarted and takes forever to run!
> If I kill the task, the job finishes without any problems.
> I experienced the same problem under version 0.20.0 as well.
> Thanks,
> Thibaut

--
This message was sent by Atlassian JIRA
(v6.2#6252)
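The manual workaround the reporter describes (failing the stuck task so the JobTracker reschedules it) can be scripted with the `hadoop job` CLI of that era rather than done through the web UI. A minimal sketch, assuming a Hadoop 0.20-style cluster; the job and attempt IDs below are made-up placeholders:

```shell
# List running reduce attempts for the job (IDs here are placeholders).
hadoop job -list-attempt-ids job_200912080101_0001 reduce running

# Fail the stuck attempt so it is rescheduled on another node.
# -fail-task counts against the attempt-failure limit; use -kill-task
# instead if you do not want the failure counted.
hadoop job -fail-task attempt_200912080101_0001_r_000003_0
```

Since the report notes the job completes normally once the hung attempt is killed, failing the attempt (rather than the whole job) is the least disruptive recovery.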