[
https://issues.apache.org/jira/browse/STORM-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265499#comment-15265499
]
ASF GitHub Bot commented on STORM-1750:
---------------------------------------
GitHub user srdo opened a pull request:
https://github.com/apache/storm/pull/1384
STORM-1750: Ensure worker dies when report-error-and-die is called. Make
ZkStateStorage set_data try setting data if node creation fails because the
node exists
Similar changes probably need to be made to 0.10.x and 1.x so that crashed
executors do not stay missing until the worker is manually restarted. The change
to ZkStateStorage may not be strictly necessary to fix this issue, but it
should reduce the number of exceptions thrown out of set_data when one
component creates a node after another has already passed the exists check.
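
The set_data fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: SetDataSketch, its in-memory map, and the local NodeExistsException class are hypothetical stand-ins for ZkStateStorage, ZooKeeper, and ZooKeeper's exception, and overwrite() is simplified (it never fails on a missing node).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory stand-in for a ZooKeeper-backed store.
// This is NOT Storm's ZkStateStorage; it only illustrates the fallback.
class SetDataSketch {
    /** Hypothetical stand-in for ZooKeeper's NodeExistsException. */
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) {
            super("Node already exists: " + path);
        }
    }

    private final ConcurrentMap<String, byte[]> nodes = new ConcurrentHashMap<>();

    boolean exists(String path) {
        return nodes.containsKey(path);
    }

    // Fails if the node is already present, like a ZooKeeper create.
    void create(String path, byte[] data) throws NodeExistsException {
        if (nodes.putIfAbsent(path, data) != null) {
            throw new NodeExistsException(path);
        }
    }

    // Simplified: a real client would fail here if the node were missing.
    void overwrite(String path, byte[] data) {
        nodes.put(path, data);
    }

    byte[] get(String path) {
        return nodes.get(path);
    }

    // Check-then-create is racy: another writer may create the node after
    // exists() returned false. Treat that as "the node is there now" and
    // fall back to a plain write instead of propagating the exception.
    void setData(String path, byte[] data) {
        if (exists(path)) {
            overwrite(path, data);
            return;
        }
        try {
            create(path, data);
        } catch (NodeExistsException e) {
            overwrite(path, data);
        }
    }
}
```

With this shape, two writers that both pass the exists check no longer produce an exception; the later write simply wins.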
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/srdo/storm STORM-1750
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/1384.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1384
----
commit b267c292faffc28be869dfb0424dc729d45808fe
Author: Stig Rohde Døssing
Date: 2016-04-30T20:50:15Z
STORM-1750: Ensure worker dies when report-error-and-die is called. Make
ZkStateStorage set_data try setting data if node creation fails because the
node exists
----
> Report-error-and-die may not kill the worker
> --------------------------------------------
>
> Key: STORM-1750
> URL: https://issues.apache.org/jira/browse/STORM-1750
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 0.10.0, 1.0.0, 2.0.0
> Reporter: Stig Rohde Døssing
> Assignee: Stig Rohde Døssing
> Priority: Critical
>
> The report-error-and-die function in executor.clj calls report-error, which
> can throw exceptions if Curator runs into any kind of trouble while
> registering the error. I suspect this may happen with network errors, but it
> can also happen if two executors for the same component throw exceptions at
> the same time and no errors have been registered for the component
> previously. This is because both calls to report-error-and-die update the
> lastErrorPath, and ZkStateStorage set_data doesn't catch the potential
> NodeExistsException that may be thrown from the create call.
> If report-error throws an exception, the suicide-fn is never called,
> and the worker keeps running without the crashed executor.
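
One way to guarantee that the worker dies even when error reporting fails is to run the suicide function in a finally block. The sketch below is written in Java for illustration only; the actual function lives in executor.clj, and the class name, method name, and Runnable parameters here are assumptions rather than Storm's API.

```java
// Hypothetical sketch, not Storm's executor code: error reporting is treated
// as best effort, and the suicide function runs no matter what.
class ReportErrorAndDieSketch {
    static void reportErrorAndDie(Runnable reportError, Runnable suicideFn) {
        try {
            // May throw, for example when the ZooKeeper client has trouble.
            reportError.run();
        } catch (RuntimeException e) {
            // Log and continue; a failed report must not keep the worker alive.
            System.err.println("Failed to report error: " + e);
        } finally {
            // Runs whether or not reportError threw.
            suicideFn.run();
        }
    }
}
```

The finally block also covers throwables that the catch clause does not handle, so the suicide function still runs in that case.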