[
https://issues.apache.org/jira/browse/STORM-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651233#comment-14651233
]
ASF GitHub Bot commented on STORM-969:
--------------------------------------
GitHub user dossett opened a pull request:
https://github.com/apache/storm/pull/664
STORM-969: HDFS Bolt can end up in an unrecoverable state
A few notes about this PR:
- I updated the storm-hdfs pom.xml to align with the other external modules.
The most significant change is probably moving from HDFS version 2.2 to
${hadoop.version} (i.e. currently 2.6).
- Many errors can be recovered from by forcing a file rotation, which opens a
new, valid file. Rotation therefore now occurs either according to the rotation
policy or when a serious error happens. More work could probably be done to
reopen the same file name and so reduce the number of rotations.
- Added unit tests using MiniDFSCluster.
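To illustrate the second note, here is a minimal sketch of rotation that is triggered either by the policy or by an error. It is not the code in this PR: Sink, SinkFactory, RotatingWriter and the record-count policy are hypothetical stand-ins with no Storm or Hadoop dependencies, and closing the old file is left out for brevity.

```java
import java.io.IOException;

/** Hypothetical stand-in for an open output file; not a Storm or Hadoop type. */
interface Sink {
    void write(String record) throws IOException;
}

/** Hypothetical factory that opens a new output file by name. */
interface SinkFactory {
    Sink open(String fileName) throws IOException;
}

class RotatingWriter {
    private final SinkFactory factory;
    private final int maxRecordsPerFile;   // stand-in for a real rotation policy
    private Sink current;                  // null until the first file is opened
    private int recordsInFile = 0;
    private int filesOpened = 0;
    private boolean forceRotation = false; // set after a serious error

    RotatingWriter(SinkFactory factory, int maxRecordsPerFile) {
        this.factory = factory;
        this.maxRecordsPerFile = maxRecordsPerFile;
    }

    /** Returns true if the record was written; false means the caller should fail the tuple. */
    boolean write(String record) {
        try {
            boolean policyHit = recordsInFile >= maxRecordsPerFile;
            if (current == null || forceRotation || policyHit) {
                openNextFile();
            }
            current.write(record);
            recordsInFile++;
            return true;
        } catch (IOException e) {
            // Stop trusting the current file; replace it on the next call.
            forceRotation = true;
            return false;
        }
    }

    private void openNextFile() throws IOException {
        Sink next = factory.open("part-" + filesOpened + ".txt");
        current = next;          // only swap once the open has succeeded
        filesOpened++;
        recordsInFile = 0;
        forceRotation = false;
    }

    int filesOpened() { return filesOpened; }
}
```

In this sketch a failed write costs one tuple and one extra file, after which writes go to a fresh file instead of the broken one.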
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dossett/storm STORM-969
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/664.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #664
----
commit 795aaf93af78bf664727b91c179e0d96f673f674
Author: Aaron Dossett
Date: 2015-08-02T22:22:51Z
STORM-969: HDFS Bolt can end up in an unrecoverable state
----
> HDFS Bolt can end up in an unrecoverable state
> ----------------------------------------------
>
> Key: STORM-969
> URL: https://issues.apache.org/jira/browse/STORM-969
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-hdfs
> Reporter: Aaron Dossett
> Assignee: Aaron Dossett
>
> The body of the HDFSBolt.execute() method is essentially one try-catch block.
> The catch block reports the error and fails the current tuple. In some
> cases the bolt's FSDataOutputStream object (named 'out') is in an
> unrecoverable state and no subsequent calls to execute() can succeed.
> To reproduce this scenario:
> - process some tuples through HDFS bolt
> - put the underlying HDFS system into safemode
> - process some more tuples and receive the expected ClosedChannelException
> - take the underlying HDFS system out of safemode
> - subsequent tuples continue to fail with the same exception
> The three fundamental operations that execute() performs (writing, syncing,
> and rotating) need to be isolated so that errors from each can be handled
> specifically.
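As a rough sketch of what isolating the three operations could look like (an illustration only, not the actual bolt code): BoltStream, Opener, Result and IsolatedBolt are hypothetical names, BoltStream merely stands in for the bolt's 'out' stream, and syncing on every tuple and rotating by record count are simplifying assumptions.

```java
import java.io.IOException;

/** Hypothetical stand-in for the bolt's 'out' stream; not the Hadoop API. */
interface BoltStream {
    void write(String record) throws IOException;
    void sync() throws IOException;
    void close() throws IOException;
}

/** Hypothetical helper that opens the next output file. */
interface Opener {
    BoltStream open() throws IOException;
}

enum Result { ACK, FAIL_WRITE, FAIL_SYNC }

class IsolatedBolt {
    private final Opener opener;
    private final int rotateEvery;   // assumption: rotate after this many records
    private BoltStream out;          // null means "open a new file before writing"
    private int sinceRotate = 0;

    IsolatedBolt(Opener opener, int rotateEvery) {
        this.opener = opener;
        this.rotateEvery = rotateEvery;
    }

    Result execute(String record) {
        // Step 1: write (opening a file first if there is no usable stream).
        try {
            if (out == null) {
                out = opener.open();
            }
            out.write(record);
        } catch (IOException e) {
            discardStream();
            return Result.FAIL_WRITE;
        }

        // Step 2: sync. This sketch syncs every tuple to keep the example small.
        try {
            out.sync();
        } catch (IOException e) {
            discardStream();
            return Result.FAIL_SYNC;
        }
        sinceRotate++;

        // Step 3: rotate. The record is already synced, so a failed close
        // does not fail the tuple; the stream is dropped either way.
        if (sinceRotate >= rotateEvery) {
            discardStream();
        }
        return Result.ACK;
    }

    /** Closes the stream if possible and forces a fresh file on the next tuple. */
    private void discardStream() {
        if (out != null) {
            try {
                out.close();
            } catch (IOException ignored) {
                // The stream is being abandoned anyway.
            }
        }
        out = null;
        sinceRotate = 0;
    }
}
```

Because each step has its own catch block, a stream that starts throwing is dropped at the point of failure, so a later tuple gets a new stream rather than repeating the same exception.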