[ 
https://issues.apache.org/jira/browse/FLUME-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chenshangan resolved FLUME-2353.
--------------------------------
    Resolution: Duplicate

Resolved in another task, which will provide a max retry times option.

> BucketWriter throw IOException endlessly while failed to close file
> -------------------------------------------------------------------
>
>                 Key: FLUME-2353
>                 URL: https://issues.apache.org/jira/browse/FLUME-2353
>             Project: Flume
>          Issue Type: Improvement
>            Reporter: chenshangan
>            Assignee: chenshangan
>
> Sometimes a .tmp file might lose a block in HDFS, and HDFSWriter can no longer 
> write events, flush, or close the file, so it repeatedly retries and 
> catches the IOException.
> The error stack is as follows:
> 06 Aug 2013 04:27:08,853 WARN [DataStreamer for file 
> **************************.1375732802628.lzo.tmp block 
> blk_709795560527813415_25801594] 
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) 
> - Error Recovery for block blk_709795560527813415_25801594 failed because 
> recovery from primary datanode *****:50010 failed 1 times. Pipeline was 
> ******:50010,******:50010, ******:50010. Will retry...
> 06 Aug 2013 04:27:08,990 WARN [DataStreamer for file 
> **************************.1375732802628.lzo.tmp block 
> blk_709795560527813415_25801594] 
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159) 
> - Error Recovery for block blk_709795560527813415_25801594 failed because 
> recovery from primary datanode ******:50010 failed 2 times. Pipeline was 
> ******:50010,******:50010, ******:50010. Will retry...
> …
> 06 Aug 2013 04:27:50,694 WARN [DataStreamer for file 
> **************************.1375732802628.lzo.tmp block 
> blk_709795560527813415_25801594] 
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3139) 
> - Error Recovery for block blk_709795560527813415_25801594 failed because 
> recovery from primary datanode ******:50010 failed 6 times. Pipeline was 
> ******:50010,******:50010, ******:50010. Marking primary datanode as bad.
> 06 Aug 2013 04:30:40,365 WARN 
> [SinkRunner-PollingRunner-FailoverSinkProcessor] 
> (org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
> java.io.IOException: Error Recovery for block blk_709795560527813415_25801594 
> failed because recovery from primary datanode ********:50010 failed 6 times. 
> Pipeline was *****:50010. Aborting...
> DFSClient will try to recover the missing block up to a maximum number of 
> times; if it ultimately fails it throws an IOException. HDFSWriter rethrows 
> the exception to HDFSEventSink, which rolls back the transaction and hits 
> the same error stack again, so it becomes an endless loop.
> My suggestion and solution is to add a graceClose() method: if closing fails 
> too many times, just leave the .tmp file alone.
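> A minimal sketch of what such a graceClose() could look like (hypothetical 
> names, not an actual patch; maxCloseTries stands in for whatever retry limit 
> the other task introduces):
>
>   import java.io.Closeable;
>   import java.io.IOException;
>
>   // Hypothetical sketch: bound the number of close() retries and, on final
>   // failure, give up instead of rethrowing to the sink and looping forever.
>   public class GraceCloser {
>     public static boolean graceClose(Closeable writer, int maxCloseTries) {
>       for (int attempt = 1; attempt <= maxCloseTries; attempt++) {
>         try {
>           writer.close();
>           return true;                    // closed cleanly
>         } catch (IOException e) {
>           System.err.println("close attempt " + attempt + " of "
>               + maxCloseTries + " failed: " + e.getMessage());
>         }
>       }
>       return false;                       // give up; leave the .tmp file alone
>     }
>   }
>
> The caller in BucketWriter would then check the return value and simply skip 
> renaming the .tmp file when graceClose() gives up, rather than rethrowing the 
> IOException to HDFSEventSink.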



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
