[ 
https://issues.apache.org/jira/browse/HBASE-11963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14133088#comment-14133088
 ] 

Lars Hofhansl commented on HBASE-11963:
---------------------------------------

Also lemme explain what happened:
* We have a ReplicationPeer per slave cluster
* We have a ReplicationSource for every "queue" to replicate. A queue is either 
the data this region server wishes to replicate or data it took over for another 
region server (for example when that region server went down).
* When we take over a queue from another region server we have *multiple* 
ReplicationSources replicating to the *same* set of ReplicationPeers.
* When the slave cluster is down, the ReplicationSources attempt to reset their 
peers upon each failed request.
* Hence multiple ReplicationSources can attempt to reconnect the same peer 
simultaneously. That race condition leaked ZK clients.
* Each of the leaked clients would attempt to reconnect to the slave once/sec 
until the ZK timeout (defaulting to 180s).

So this only happens when (a) we have some queues failed over from another 
region server *and* (b) a peer is not currently reachable (or there are some 
other ZK issues), causing the source to reconnect its peer.
But if we have this condition it gets nasty pretty quickly.
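
To illustrate the kind of synchronization the issue title asks for, here is a minimal sketch. It is not the attached patch and not HBase's actual ReplicationPeer code: the class PeerConnectionGuard, its fields, and the interval parameter are all hypothetical. The idea is that every source sharing a peer goes through one lock, a source only reconnects if nobody else has reconnected since its request failed, and reconnects are rate limited.

```java
/**
 * Hypothetical stand-in for the connection state of one peer cluster.
 * All ReplicationSource-like callers for that peer would share one instance.
 */
class PeerConnectionGuard {
    private final long minIntervalMs; // assumed minimum spacing between reconnects
    private long nextAllowedMs = 0;   // earliest time the next reconnect may run
    private int generation = 0;       // incremented on every real reconnect
    private int connectionsOpened = 0;

    PeerConnectionGuard(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** The generation a source should remember before issuing a request. */
    synchronized int generation() {
        return generation;
    }

    synchronized int connectionsOpened() {
        return connectionsOpened;
    }

    /**
     * Called by a source after a failed request. seenGeneration is the
     * generation that source was using when the request failed.
     * Returns true only if this call actually replaced the connection.
     */
    synchronized boolean reconnectIfStale(int seenGeneration, long nowMs) {
        if (seenGeneration != generation) {
            // Another source already reconnected; reuse its connection.
            return false;
        }
        if (nowMs < nextAllowedMs) {
            // Rate limit: too soon after the previous reconnect.
            return false;
        }
        // A real implementation would close the old ZK client here before
        // opening a new one, so the old client cannot leak.
        connectionsOpened++;
        generation++;
        nextAllowedMs = nowMs + minIntervalMs;
        return true;
    }
}
```

With this shape, several sources failing at the same moment all pass the same seenGeneration, so only the first one through the lock opens a new connection and the rest return without creating another client.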


> Synchronize peer cluster replication connection attempts
> --------------------------------------------------------
>
>                 Key: HBASE-11963
>                 URL: https://issues.apache.org/jira/browse/HBASE-11963
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Purtell
>            Assignee: Maddineni Sukumar
>             Fix For: 2.0.0, 0.98.7, 0.94.24, 0.99.1
>
>         Attachments: 11963-0.94.txt, HBASE-11963-0.98.patch, HBASE-11963.patch
>
>
> Synchronize peer cluster connection attempts to avoid races and rate limit 
> connections when multiple replication sources try to connect to the peer 
> cluster. If the peer cluster is down we can get out of control over time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
