[ https://issues.apache.org/jira/browse/HBASE-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762507#comment-13762507 ]
Nick Dimiduk commented on HBASE-7634:
-------------------------------------

Is there a corresponding patch to the book and/or package summary documenting the additional ZK watches and configuration points this patch introduces? How about a release note (at least for config)?

> Replication handling of changes to peer clusters is inefficient
> ---------------------------------------------------------------
>
>                 Key: HBASE-7634
>                 URL: https://issues.apache.org/jira/browse/HBASE-7634
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.95.2
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: HBASE-7634.patch, HBASE-7634.v2.patch,
> HBASE-7634.v3.patch, HBASE-7634.v4.patch, HBASE-7634.v5.patch,
> HBASE-7634.v6.patch
>
>
> The handling of changes to the region servers in a replication peer
> cluster is currently quite inefficient. The list of region servers that are
> being replicated to is only updated if a large number of issues are
> encountered while replicating.
> This can cause it to take quite a while to recognize that a number of the
> regionservers in a peer cluster are no longer available. A potentially bigger
> problem is that if a replication peer cluster is started with a small number
> of regionservers, and then more region servers are added after replication
> has started, the additional region servers will never be used for replication
> (unless there are failures on the in-use regionservers).
> Part of the current issue is that the retry code in
> ReplicationSource#shipEdits checks a randomly-chosen replication peer
> regionserver (in ReplicationSource#isSlaveDown) to see if it is up after a
> replication write has failed on a different randomly-chosen replication peer.
> If the peer is seen as not down, another randomly-chosen peer is used for
> writing.
> A second part of the issue is that changes to the list of region servers in a
> peer cluster are not detected at all, and are only picked up if a certain
> number of failures have occurred when trying to ship edits.
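
For illustration only, the difference between the two refresh strategies described in the issue can be sketched in plain Java. This is not the HBase implementation or the attached patch: PeerSinkTracker, onShipFailure, onPeerListChanged and the failure threshold are hypothetical names and parameters, and the peer's regionserver list is passed in as a plain list rather than read from ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names are HBase classes or methods.
class PeerSinkTracker {
    private final int failureThreshold;
    private List<String> sinks;
    private int failures = 0;

    PeerSinkTracker(List<String> initialSinks, int failureThreshold) {
        this.sinks = new ArrayList<>(initialSinks);
        this.failureThreshold = failureThreshold;
    }

    // Behavior described in the issue: the sink list is re-read only
    // after enough shipping failures have accumulated.
    void onShipFailure(List<String> currentPeerServers) {
        failures++;
        if (failures >= failureThreshold) {
            refresh(currentPeerServers);
        }
    }

    // Alternative: a membership-change notification (for example one
    // driven by a ZooKeeper watch) refreshes the list immediately.
    void onPeerListChanged(List<String> currentPeerServers) {
        refresh(currentPeerServers);
    }

    private void refresh(List<String> currentPeerServers) {
        sinks = new ArrayList<>(currentPeerServers);
        failures = 0;
    }

    // Returns a copy so callers cannot mutate the tracked list.
    List<String> getSinks() {
        return new ArrayList<>(sinks);
    }
}
```

With a tracker started on a single regionserver, growing the peer cluster has no effect on getSinks() until either the threshold of failures is reached or onPeerListChanged is called, which is the gap the issue describes.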