[
https://issues.apache.org/jira/browse/FLUME-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
chenshangan resolved FLUME-2353.
--------------------------------
Resolution: Duplicate
Resolved in another task, which will provide a max-retry-times option.
> BucketWriter throw IOException endlessly while failed to close file
> -------------------------------------------------------------------
>
> Key: FLUME-2353
> URL: https://issues.apache.org/jira/browse/FLUME-2353
> Project: Flume
> Issue Type: Improvement
> Reporter: chenshangan
> Assignee: chenshangan
>
> Sometimes a .tmp file loses a block in HDFS, and HDFSWriter can no longer
> write events, flush, or close the file, so it repeatedly retries and catches
> the same IOException.
> The error stack is as following:
> 06 Aug 2013 04:27:08,853 WARN [DataStreamer for file
> **************************.1375732802628.lzo.tmp block
> blk_709795560527813415_25801594]
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159)
> - Error Recovery for block blk_709795560527813415_25801594 failed because
> recovery from primary datanode *****:50010 failed 1 times. Pipeline was
> ******:50010,******:50010, ******:50010. Will retry...
> 06 Aug 2013 04:27:08,990 WARN [DataStreamer for file
> **************************.1375732802628.lzo.tmp block
> blk_709795560527813415_25801594]
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3159)
> - Error Recovery for block blk_709795560527813415_25801594 failed because
> recovery from primary datanode ******:50010 failed 2 times. Pipeline was
> ******:50010,******:50010, ******:50010. Will retry...
> …
> 06 Aug 2013 04:27:50,694 WARN [DataStreamer for file
> **************************.1375732802628.lzo.tmp block
> blk_709795560527813415_25801594]
> (org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError:3139)
> - Error Recovery for block blk_709795560527813415_25801594 failed because
> recovery from primary datanode ******:50010 failed 6 times. Pipeline was
> ******:50010,******:50010, ******:50010. Marking primary datanode as bad.
> 06 Aug 2013 04:30:40,365 WARN
> [SinkRunner-PollingRunner-FailoverSinkProcessor]
> (org.apache.flume.sink.hdfs.HDFSEventSink.process:418) - HDFS IO error
> java.io.IOException: Error Recovery for block blk_709795560527813415_25801594
> failed because recovery from primary datanode ********:50010 failed 6 times.
> Pipeline was *****:50010. Aborting...
> DFSClient retries recovery of the missing block up to a maximum number of
> times; if it finally fails, it throws an IOException. But HDFSWriter rethrows
> the exception to HDFSEventSink, which rolls back the transaction and hits the
> same error again, so it becomes an endless loop.
> My suggestion and solution is to add a graceClose() method: if closing fails
> too many times, just leave the .tmp file alone.
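A minimal sketch of the proposed graceClose() idea: retry close() a bounded number of times, then abandon the file instead of looping forever. The names graceClose and maxCloseTries are illustrative assumptions, not Flume's actual API.

```java
import java.io.Closeable;
import java.io.IOException;

public class GraceCloseSketch {

    // Hypothetical graceClose(): attempt writer.close() up to maxCloseTries
    // times; on repeated failure, give up and leave the .tmp file alone
    // rather than rethrowing endlessly. Returns true only if close succeeded.
    static boolean graceClose(Closeable writer, int maxCloseTries) {
        for (int attempt = 1; attempt <= maxCloseTries; attempt++) {
            try {
                writer.close();
                return true; // closed cleanly
            } catch (IOException e) {
                System.err.println("close attempt " + attempt
                        + " failed: " + e.getMessage());
            }
        }
        // Bounded retries exhausted: abandon the file, break the loop.
        return false;
    }

    public static void main(String[] args) {
        // A writer whose close() always fails, mimicking a lost HDFS block.
        Closeable broken = () -> { throw new IOException("lost block"); };
        System.out.println(graceClose(broken, 3)); // prints: false
    }
}
```

With this shape, HDFSEventSink could log the abandoned file and move on instead of rolling back the same transaction forever.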
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)