[ https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765004#comment-15765004 ]
Vincent Poon commented on HBASE-17341:
--------------------------------------

[~tedyu] I think that would be another possible fix for this issue. But that would be a one-off solution just for replication removePeer. The ZK event thread issue could potentially happen with any watcher.

A more generalized solution would be to take your idea and apply it at a higher level - maybe one queue per component (replication, snapshots, etc). That way a stuck event for one component at least doesn't affect other components. We could also manage timeouts at a higher level to kill any events which get stuck.

The granularity and timeouts have to be carefully managed though. ZK has only one event thread to guarantee serialized execution of the watchers. We could only do this for components that don't need that guarantee with respect to other components.

I think that work should be done in another patch, if the community wants to go that route. For now I think this patch at least fixes an existing bug in ReplicationSource.

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>         Attachments: HBASE-17341.branch-1.1.v1.patch,
>                      HBASE-17341.master.v1.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from
> ReplicationEndpoint#stop(). Future.get() is then called, but can potentially
> hang there if something went wrong in the endpoint stop().
> Hanging there has serious implications, because the thread could potentially
> be the ZK event thread (e.g. watcher calls
> ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() ->
> blocked). This means no other events in the ZK event queue will get
> processed, which for HBase means other ZK watches such as replication watch
> notifications, snapshot watch notifications, even RegionServer shutdown will
> all get blocked.
> The short term fix addressed here is to simply add a timeout for
> Future.get(). But the severe consequences seen here perhaps suggest a
> broader refactoring of the ZKWatcher usage in HBase is in order, to protect
> against situations like this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
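
For illustration, the short-term fix described in the issue (bounding the wait on the Future returned by the endpoint's stop()) could look roughly like the sketch below. This is not the attached patch: the class name `BoundedTermination`, the helper `awaitStop`, and the timeout values are hypothetical, and a `CompletableFuture` stands in for whatever Future the real endpoint returns.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not HBase code: shows a bounded wait on a stop Future.
public class BoundedTermination {

    /**
     * Waits for stopFuture for at most timeoutMs milliseconds.
     * Returns true if the stop completed normally in time, false otherwise.
     * The calling thread is never blocked longer than the timeout.
     */
    public static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            // Bounded variant of Future.get(); the unbounded get() is what can hang.
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stop did not finish in time; give up so the caller can move on.
            return false;
        } catch (ExecutionException e) {
            // The stop itself failed, but the caller is not left waiting.
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for code further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for an endpoint stop that finishes and one that never does.
        Future<Void> healthy = CompletableFuture.completedFuture(null);
        Future<Void> stuck = new CompletableFuture<>();

        System.out.println(awaitStop(healthy, 100)); // prints true
        System.out.println(awaitStop(stuck, 100));   // prints false after ~100 ms
    }
}
```

With a bounded wait, a thread that happens to be the ZK event thread returns after the timeout instead of stalling every other watcher behind it. What to do with an endpoint that timed out (log, retry, abort) is a separate decision this sketch leaves open.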