[
https://issues.apache.org/jira/browse/HELIX-195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
dafu updated HELIX-195:
-----------------------
Description:
FINALIZE callbacks are sent asynchronously via CallbackHandler#reset(), while Zk
callbacks are queued in ZkEventThread. A FINALIZE callback can therefore be
handled before all pending Zk callbacks have been cleaned up, which creates race
conditions. For example, on zk session expiry, a GenericController that gets a
FINALIZE callback cleans up all listeners using ZkClient#unsubscribe(), but
leftover Zk callbacks in ZkEventThread arrive later and re-subscribe all
listeners, leaking zk watchers.
This was observed by setting up two controllers and expiring the leader (by
simulating a long gc). The second controller takes the leadership and adds all
listeners, but when the former leader recovers from gc, it gets leftover Zk
callbacks and re-subscribes the live-instance listener. It then reacts to all
live-instance changes even though it has not acquired the leadership.
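The ordering problem above can be reproduced in isolation. The sketch below is
illustrative only and is not Helix code: StaleCallbackDemo, Handler, the
"finalized" flag and the "/LIVEINSTANCES" path are all hypothetical names, and a
single-threaded executor stands in for ZkEventThread. It shows a leftover
callback re-adding a watch after FINALIZE, and one possible guard (dropping
callbacks once the handler is finalized), which is not necessarily the fix
adopted for this issue.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the race; none of these types exist in Helix. */
class StaleCallbackDemo {

    /** Minimal handler that tracks which paths it is "watching". */
    static final class Handler {
        private final Set<String> watched = ConcurrentHashMap.newKeySet();
        private final AtomicBoolean finalized = new AtomicBoolean(false);
        private final boolean guardEnabled;

        Handler(boolean guardEnabled) {
            this.guardEnabled = guardEnabled;
        }

        /** Models a Zk callback: handling it re-subscribes the watch. */
        void onZkCallback(String path) {
            if (guardEnabled && finalized.get()) {
                return; // stale callback that arrived after FINALIZE: drop it
            }
            watched.add(path);
        }

        /** Models the FINALIZE path: mark the handler dead, then unsubscribe. */
        void onFinalize() {
            finalized.set(true);
            watched.clear();
        }

        int watchCount() {
            return watched.size();
        }
    }

    /** Handles FINALIZE first, then a leftover queued callback; returns leaked watches. */
    public static int leakedWatches(boolean guardEnabled) throws InterruptedException {
        Handler handler = new Handler(guardEnabled);
        handler.onZkCallback("/LIVEINSTANCES"); // initial subscription (illustrative path)

        // Single thread standing in for the ZkEventThread queue.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();

        handler.onFinalize(); // FINALIZE is handled before the queue drains
        eventThread.submit(() -> handler.onZkCallback("/LIVEINSTANCES")); // leftover callback

        eventThread.shutdown();
        eventThread.awaitTermination(5, TimeUnit.SECONDS);
        return handler.watchCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard: " + leakedWatches(false)); // prints 1
        System.out.println("with guard:    " + leakedWatches(true));  // prints 0
    }
}
```

In this sketch the flag check and the re-subscribe are not atomic with respect
to a concurrent onFinalize(), so a real fix would likely also need to serialize
FINALIZE with the queued callbacks or hold a lock across both steps.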
was:
FINALIZE callbacks are sent async via CallbackHandler#reset(), while Zk
callbacks are queued in ZkEventThread. It's possible that we are handling a
FINALIZE callback before all Zk callbacks are cleaned up. This creates race
conditions, for example, in zk session expiry, when a GenericController gets a
FINALIZE callback, it cleans up all listeners using ZkClient#unsubscribe(), but
Zk callbacks leftover in ZkEventThread comes later, and re-subscribe all
listeners, causing zk watcher leaking.
This is observed by setting up two controllers and expire the leader (by
simulating a long gc). The second controller takes the leadership and add all
listeners, but when the former leader recovers from gc, it gets leftover Zk
callbacks and re-subscribe then live-instance listener hence react to all
live-instance changes, though it doesn't acquire the leadership.
> Race condition between FINALIZE callbacks and Zk Callbacks
> ----------------------------------------------------------
>
> Key: HELIX-195
> URL: https://issues.apache.org/jira/browse/HELIX-195
> Project: Apache Helix
> Issue Type: Sub-task
> Reporter: dafu
> Assignee: dafu