[ 
https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15765004#comment-15765004
 ] 

Vincent Poon commented on HBASE-17341:
--------------------------------------

[~tedyu] I think that would be another possible fix for this issue.  But that 
would be a one-off solution just for replication removePeer.  The ZK event 
thread issue could potentially happen with any watcher.  A more generalized 
solution would be to take your idea and apply it at a higher level - maybe one 
queue per component (replication, snapshots, etc).  That way a stuck event for 
one component at least doesn't affect other components.  We could also manage 
timeouts at a higher level to kill any events which get stuck.
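
As a rough illustration of the per-component idea, here is a minimal sketch using only java.util.concurrent. ComponentEventDispatcher, dispatch() and the component names are hypothetical and are not existing HBase or ZooKeeper APIs. Each component gets its own single-threaded executor, so ordering is preserved within a component but not across components. The deadline in this sketch is measured from submission, so it includes time spent waiting in the queue, and cancellation only helps if the handler responds to interruption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of HBase or ZooKeeper.
public class ComponentEventDispatcher {
  // One single-threaded executor per component: events for the same
  // component run in submission order, independently of other components.
  private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();
  // Shared watchdog that cancels handlers which miss their deadline.
  private final ScheduledExecutorService watchdog =
      Executors.newSingleThreadScheduledExecutor();
  private final long deadlineMs;

  public ComponentEventDispatcher(long deadlineMs) {
    this.deadlineMs = deadlineMs;
  }

  // Called from the (single) event thread; returns immediately.
  public Future<?> dispatch(String component, Runnable handler) {
    ExecutorService queue =
        queues.computeIfAbsent(component, c -> Executors.newSingleThreadExecutor());
    Future<?> f = queue.submit(handler);
    // Assumption of this sketch: the deadline counts from submission.
    // cancel(true) interrupts a running handler, or drops a queued one.
    watchdog.schedule(() -> { f.cancel(true); }, deadlineMs, TimeUnit.MILLISECONDS);
    return f;
  }

  public void shutdown() {
    watchdog.shutdownNow();
    queues.values().forEach(ExecutorService::shutdownNow);
  }
}
```

With something like this, a watcher callback would call dispatch("replication", ...) instead of doing the work inline, so a stuck replication handler would delay only later replication events.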

The granularity and timeouts have to be carefully managed though.  ZK has only 
one event thread to guarantee serialized execution of the watchers.  We could 
only do this for components that don't need that guarantee with respect to 
other components.

I think that work should be done in another patch, if the community wants to go 
that route.  For now I think this patch at least fixes an existing bug in 
ReplicationSource.

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>         Attachments: HBASE-17341.branch-1.1.v1.patch, 
> HBASE-17341.master.v1.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from 
> ReplicationEndpoint#stop().  Future.get() is then called, but can potentially 
> hang there if something went wrong in the endpoint stop().
> Hanging there has serious implications, because the thread could potentially 
> be the ZK event thread (e.g. watcher calls 
> ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> 
> blocked).  This means no other events in the ZK event queue will get 
> processed, which for HBase means other ZK watches such as replication watch 
> notifications, snapshot watch notifications, even RegionServer shutdown will 
> all get blocked.
> The short term fix addressed here is to simply add a timeout for 
> Future.get().  But the severe consequences seen here perhaps suggest a 
> broader refactoring of the ZKWatcher usage in HBase is in order, to protect 
> against situations like this.
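
For readers unfamiliar with the pattern, the short-term fix described above amounts to replacing an unbounded Future.get() with the timed overload. The following is a minimal sketch, not the attached patch: BoundedTermination, awaitStop() and the timeout parameter are hypothetical names, and cancelling the future on timeout is a choice made for this sketch only.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not the code from the patch.
public class BoundedTermination {
  // Waits at most timeoutMs for the stop future.
  // Returns true if the stop finished (successfully or not) in time,
  // false if the wait was abandoned because the timeout expired.
  public static boolean awaitStop(Future<?> stopFuture, long timeoutMs)
      throws InterruptedException {
    try {
      stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // Stop waiting so the calling thread (possibly the ZK event
      // thread) is released. Cancelling is this sketch's choice.
      stopFuture.cancel(true);
      return false;
    } catch (ExecutionException e) {
      // The stop itself failed, but it did finish; the caller is not blocked.
      return true;
    }
  }
}
```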



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
