[ https://issues.apache.org/jira/browse/HBASE-11963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133088#comment-14133088 ]
Lars Hofhansl commented on HBASE-11963:
---------------------------------------

Also lemme explain what happened:
* We have a ReplicationPeer per slave cluster.
* We have a ReplicationSource for every "queue" to replicate. A queue is either the data this region server wishes to replicate or data it took over for another region server (for example when that region server went down).
* When we take over a queue from another region server we have *multiple* ReplicationSources replicating to the *same* set of ReplicationPeers.
* When the slave cluster is down, the ReplicationSources attempt to reset their peers upon each failed request.
* And hence now we have a race where multiple ReplicationSources attempt to reconnect a peer simultaneously. That race leaked ZK clients.
* Each of the leaked clients would attempt to reconnect to the slave once/sec until the ZK timeout (defaulting to 180s).

So this only happens when (a) we have some queues failed over from another region server *and* (b) a peer is not currently reachable (or there are some other ZK issues), causing the source to reconnect its peer.
But if we have this condition it gets nasty pretty quickly.

> Synchronize peer cluster replication connection attempts
> --------------------------------------------------------
>
>                 Key: HBASE-11963
>                 URL: https://issues.apache.org/jira/browse/HBASE-11963
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Assignee: Maddineni Sukumar
>             Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>
>         Attachments: 11963-0.94.txt, HBASE-11963-0.98.patch, HBASE-11963.patch
>
>
> Synchronize peer cluster connection attempts to avoid races and rate limit connections when multiple replication sources try to connect to the peer cluster. If the peer cluster is down, connection attempts can get out of control over time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
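
The idea in the issue description (synchronize reconnect attempts to a peer and rate limit them) can be illustrated with a minimal Java sketch. This is not the patch attached to the issue. The class `PeerConnectionGuard`, its method `reconnectIfDue`, and the injected clock and reconnect action are hypothetical names chosen for illustration. It assumes one guard object is shared by all sources that replicate to the same peer.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical guard shared by every source that replicates to one peer.
 * Illustration only; this is not an HBase class.
 */
class PeerConnectionGuard {
    private final long minIntervalMs;       // minimum gap between reconnect attempts
    private final LongSupplier clock;       // injected time source, in milliseconds
    private final Runnable reconnectAction; // stand-in for "close the old client, open a new one"
    private boolean attempted = false;
    private long lastAttemptMs;

    PeerConnectionGuard(long minIntervalMs, LongSupplier clock, Runnable reconnectAction) {
        this.minIntervalMs = minIntervalMs;
        this.clock = clock;
        this.reconnectAction = reconnectAction;
    }

    /**
     * Runs the reconnect action unless another caller already ran it within
     * the last minIntervalMs. The method is synchronized, so concurrent
     * callers are serialized and cannot replace the connection at the same time.
     *
     * @return true if this call performed the reconnect, false if it was skipped
     */
    synchronized boolean reconnectIfDue() {
        long now = clock.getAsLong();
        if (attempted && now - lastAttemptMs < minIntervalMs) {
            return false; // a recent attempt exists; do not create another client
        }
        attempted = true;
        lastAttemptMs = now;
        reconnectAction.run();
        return true;
    }
}
```

A caller would construct one guard per peer, for example `new PeerConnectionGuard(1000, System::currentTimeMillis, action)`, and have every source call `reconnectIfDue()` after a failed request instead of resetting the peer directly. Serializing the attempts addresses the race, and the interval check keeps several sources from each opening a fresh client in quick succession.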