[ 
https://issues.apache.org/jira/browse/STORM-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15239733#comment-15239733
 ] 

ASF GitHub Bot commented on STORM-1696:
---------------------------------------

GitHub user zhuoliu opened a pull request:

    https://github.com/apache/storm/pull/1338

    [STORM-1696] status not sync if zk fails in backpressure #1320 

    This is a fix for master branch.
    
    When a zk exception occurs during worker-backpressure!, the worker is
    left in a bad state that can block the topology from running normally.
    
    The root cause is in worker/mk-backpressure-handler: if worker-backpressure!
    fails once due to a zk connection exception, the local flag has already
    been updated. The next time this method is called by
    WorkerBackpressureThread, the condition in
    (when (not= prev-backpressure-flag curr-backpressure-flag) ...)
    will never be true, so the remote zk node cannot be synced with the
    local state.
    This problem can cause a topology to be blocked!
    
    This also explains why we do not see any problem when testing in a
    stable environment (one where zk never fails).
    
    The solution is quite straightforward: first change the zk status and,
    only if that succeeds, change the local status.
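
The ordering problem described above can be illustrated with a small sketch. This is not the Storm code (the worker is written in Clojure); it is a minimal Python model written under stated assumptions. `FlakyZk`, `BuggyHandler`, `FixedHandler`, and `run` are hypothetical names invented for illustration, and the model assumes the pre-fix handler updated its local flag before writing to zk.

```python
class ZkError(Exception):
    """Stand-in for a ZooKeeper connection failure."""


class FlakyZk:
    """Hypothetical in-memory stand-in for the remote backpressure znode."""

    def __init__(self, failures=0):
        self.flag = False         # remote state
        self.failures = failures  # number of upcoming writes that will fail

    def set_backpressure(self, flag):
        if self.failures > 0:
            self.failures -= 1
            raise ZkError("connection lost")
        self.flag = flag


class BuggyHandler:
    """Assumed pre-fix order: local flag first, then the zk write."""

    def __init__(self, zk):
        self.zk = zk
        self.prev_flag = False

    def on_check(self, curr_flag):
        if self.prev_flag != curr_flag:
            self.prev_flag = curr_flag           # local state changes first
            self.zk.set_backpressure(curr_flag)  # may raise; never retried


class FixedHandler:
    """Order described in the pull request: zk first, then the local flag."""

    def __init__(self, zk):
        self.zk = zk
        self.prev_flag = False

    def on_check(self, curr_flag):
        if self.prev_flag != curr_flag:
            self.zk.set_backpressure(curr_flag)  # may raise; local unchanged
            self.prev_flag = curr_flag           # reached only on success


def run(handler, curr_flag, rounds):
    """Call the handler repeatedly, as a periodic backpressure thread would."""
    for _ in range(rounds):
        try:
            handler.on_check(curr_flag)
        except ZkError:
            pass  # the next wake-up gets another chance
    return handler.zk.flag
```

With a single failed write, the buggy ordering leaves the remote flag stale forever because the local comparison no longer detects a change, while the fixed ordering retries on the next call until the write succeeds.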


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhuoliu/storm 1696b

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/1338.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1338
    
----
commit a70195d717e1ed179425a0305b2e4279afeb6118
Author: zhuoliu <[email protected]>
Date:   2016-04-13T18:04:57Z

    [STORM-1696] status not sync if zk fails in backpressure

commit 63518be11275104314dcebbf83bb1f8a513891ee
Author: zhuoliu <[email protected]>
Date:   2016-04-13T18:08:19Z

    minor

----


> Backpressure flag not sync if zookeeper connection errors
> ---------------------------------------------------------
>
>                 Key: STORM-1696
>                 URL: https://issues.apache.org/jira/browse/STORM-1696
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 2.0.0
>            Reporter: Zhuo Liu
>            Assignee: Zhuo Liu
>            Priority: Blocker
>             Fix For: 1.0.0, 2.0.0
>
>
> When a zk exception occurs during worker-backpressure!, the worker is left
> in a bad state that can block the topology from running normally.
> The root cause is in worker/mk-backpressure-handler: if worker-backpressure!
> fails once due to a zk connection exception, the local flag has already
> been updated. The next time this method is called by
> WorkerBackpressureThread, the condition in
> (when (not= prev-backpressure-flag curr-backpressure-flag) ...)
> will never be true, so the remote zk node cannot be synced with the local
> state.
> This also explains why we do not see any problem when testing in a stable
> environment (one where zk never fails).
> The solution is quite straightforward: first change the zk status and, only
> if that succeeds, change the local status.
> This fixes the hidden bug and removes the redundant flags in executor-data
> and worker-data, since the executor status can be read directly from the
> "_throttleOn" boolean in the DisruptorQueue.
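
The idea of dropping cached flags in favor of the queue's own state can be sketched as follows. This is an illustrative Python model, not Storm's API: `Queue` and `worker_backpressure` are hypothetical names, `throttle_on` stands in for the "_throttleOn" boolean, and treating the worker as throttled when any of its queues is throttled is an assumption made for the example.

```python
class Queue:
    """Hypothetical stand-in for a DisruptorQueue exposing its throttle state."""

    def __init__(self):
        self.throttle_on = False  # models the "_throttleOn" boolean


def worker_backpressure(queues):
    """Derive the worker-level flag on demand instead of caching copies.

    Assumption: the worker is under backpressure if any queue is throttled.
    """
    return any(q.throttle_on for q in queues)
```

Because the flag is computed from the queues each time, there is no second copy that can drift out of step with the queue state.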



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
