[ https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767514#comment-15767514 ]
stack commented on HBASE-17341:
-------------------------------

If we time out, is it a WARN or an ERROR? Do we lose data if we time out and just keep processing? Thanks. Good find.

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: HBASE-17341.branch-1.1.v1.patch, 
> HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch, 
> HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from 
> ReplicationEndpoint#stop(). Future.get() is then called, but it can 
> hang there if something went wrong in the endpoint's stop().
> Hanging there has serious implications, because the blocked thread could 
> be the ZK event thread (e.g. watcher calls 
> ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> 
> blocked). In that case no other events in the ZK event queue get 
> processed, which for HBase means that other ZK watches, such as replication 
> watch notifications, snapshot watch notifications, and even RegionServer 
> shutdown, will all be blocked.
> The short-term fix addressed here is simply to add a timeout for 
> Future.get(). But the severe consequences seen here perhaps suggest that a 
> broader refactoring of the ZKWatcher usage in HBase is in order, to protect 
> against situations like this.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
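
For reference, the short-term fix described in the issue (bounding Future.get() with a timeout) could be sketched roughly as follows. This is not the attached patch: it uses a plain java.util.concurrent.Future as a stand-in for whatever ReplicationEndpoint#stop() returns, and the class name TerminateSketch, the helper awaitStop, and the timeout values are all hypothetical. How the timeout should be logged, and whether anything is lost by carrying on, is left open here, as in the comment above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; not HBase code.
final class TerminateSketch {

    /**
     * Waits at most timeoutMs for the endpoint's stop future.
     * Returns true if the future completed normally in time, false otherwise,
     * so the calling thread (possibly the ZK event thread) is never blocked forever.
     */
    static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The endpoint did not stop in time; report it and keep going.
            System.err.println("Timed out after " + timeoutMs + " ms waiting for endpoint to stop");
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            System.err.println("Endpoint stop failed: " + e.getCause());
            return false;
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a hung endpoint stop().
        Future<String> hung = new CompletableFuture<>();
        System.out.println("hung endpoint stopped in time: " + awaitStop(hung, 100));

        // A future that is already complete stands in for a clean stop().
        Future<String> clean = CompletableFuture.completedFuture("TERMINATED");
        System.out.println("clean endpoint stopped in time: " + awaitStop(clean, 100));
    }
}
```

With an unbounded get(), the first call in main would never return; with the bounded wait, the caller regains control after the timeout and can decide how to proceed.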