[ 
https://issues.apache.org/jira/browse/STORM-1750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265499#comment-15265499
 ] 

ASF GitHub Bot commented on STORM-1750:
---------------------------------------

GitHub user srdo opened a pull request:

    https://github.com/apache/storm/pull/1384

    STORM-1750: Ensure worker dies when report-error-and-die is called. M…

    …ake ZkStateStorage set_data try setting data if node creation fails 
because the node exists
    
    Similar changes probably need to be made to 0.10.x and 1.x, so that a
    crashed executor does not stay missing until the worker is manually
    restarted. The change to ZkStateStorage may not be strictly necessary to
    fix this issue, but it should reduce the number of exceptions thrown out
    of set_data when one component creates a node after another has already
    passed the exists check.
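
The exists-then-create race described above can be sketched without ZooKeeper. This is only an illustration, not the code in the pull request: FakeStore, its NodeExistsException and setDataWithFallback are hypothetical stand-ins for the real store, the ZooKeeper exception and set_data.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class SetDataSketch {
    /** Hypothetical stand-in for the "node already exists" failure. */
    static class NodeExistsException extends Exception {
        NodeExistsException(String path) { super("node exists: " + path); }
    }

    /** Hypothetical in-memory stand-in for a ZooKeeper-like store. */
    static class FakeStore {
        private final ConcurrentMap<String, byte[]> nodes = new ConcurrentHashMap<>();

        boolean exists(String path) { return nodes.containsKey(path); }

        void create(String path, byte[] data) throws NodeExistsException {
            // putIfAbsent returns the previous value if the node was already there.
            if (nodes.putIfAbsent(path, data) != null) {
                throw new NodeExistsException(path);
            }
        }

        void setData(String path, byte[] data) { nodes.put(path, data); }

        byte[] getData(String path) { return nodes.get(path); }
    }

    /** Writes data, falling back to setData if another writer created the node first. */
    static void setDataWithFallback(FakeStore store, String path, byte[] data) {
        if (store.exists(path)) {
            store.setData(path, data);
            return;
        }
        try {
            store.create(path, data);
        } catch (NodeExistsException e) {
            // The node appeared between the exists check and create: overwrite it.
            store.setData(path, data);
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        setDataWithFallback(store, "/errors/last", "first".getBytes());
        setDataWithFallback(store, "/errors/last", "second".getBytes());
        System.out.println(new String(store.getData("/errors/last")));
    }
}
```

The point of the catch block is that losing the race is treated as an ordinary update rather than as an error that propagates to the caller.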

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srdo/storm STORM-1750

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1384
    
----
commit b267c292faffc28be869dfb0424dc729d45808fe
Author: Stig Rohde Døssing <[email protected]>
Date:   2016-04-30T20:50:15Z

    STORM-1750: Ensure worker dies when report-error-and-die is called. Make 
ZkStateStorage set_data try setting data if node creation fails because the 
node exists

----


> Report-error-and-die may not kill the worker
> --------------------------------------------
>
>                 Key: STORM-1750
>                 URL: https://issues.apache.org/jira/browse/STORM-1750
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 1.0.0, 2.0.0
>            Reporter: Stig Rohde Døssing
>            Assignee: Stig Rohde Døssing
>            Priority: Critical
>
> The report-error-and-die function in executor.clj calls report-error, which 
> can throw exceptions if Curator runs into any kind of trouble while 
> registering the error. I suspect this may happen with network errors, but it 
> can also happen if two executors for the same component throw exceptions at 
> the same time and no errors have been registered for the component 
> previously. This is because both calls to report-error-and-die update the 
> lastErrorPath, and ZkStateStorage set_data doesn't catch the potential 
> NodeExistsException that may be thrown from the create call.
> If an exception is thrown from report-error, the suicide-fn is never called, 
> and the worker keeps running without the crashed executor.
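
One way to make sure the suicide function runs even when error reporting fails is to call it from a finally block. The sketch below is in Java rather than the Clojure of executor.clj, and it is not necessarily how the pull request does it. reportError and suicideFn are hypothetical Runnable stand-ins for report-error and suicide-fn.

```java
class ReportErrorAndDieSketch {
    /**
     * Tries to report the error, then always invokes the suicide function,
     * whether or not reporting succeeded.
     */
    static void reportErrorAndDie(Runnable reportError, Runnable suicideFn) {
        try {
            reportError.run();
        } catch (RuntimeException e) {
            // Reporting failed (for example, the error store was unreachable).
            // Log and continue so the failure cannot mask the original error.
            System.err.println("Failed to report error: " + e);
        } finally {
            // Runs on both the success and the failure path.
            suicideFn.run();
        }
    }

    public static void main(String[] args) {
        reportErrorAndDie(
            () -> { throw new IllegalStateException("error store unavailable"); },
            () -> System.out.println("suicide function called"));
    }
}
```

With this shape, a failure while registering the error is logged and cannot stop the worker from being killed.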



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
