[ https://issues.apache.org/jira/browse/HBASE-12971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318816#comment-14318816 ]
Andrew Purtell commented on HBASE-12971:
----------------------------------------

Separately. You don't want to make this change in 0.98, Lars? Earlier you said:
{quote}
We can already configure replication.source.socketTimeoutMultiplier, it's just about a good default.
In fact with that in mind maybe the socketTimeoutMultiplier should just be maxRetriesMultiplier (we declared maxRetriesMultiplier to be a good maximum since we configured it that way, on a socket timeout it seems good to wait for that maximum immediately).
{quote}
Does that logic not hold? I realize there will be a behavioral change if we stop squaring the max retries multiplier here, but following the above it borders on a bug. If we don't apply this to 0.98, I think we are going to be bitten by this in production at some point.

> Replication stuck due to large default value for replication.source.maxretriesmultiplier
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-12971
>                 URL: https://issues.apache.org/jira/browse/HBASE-12971
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>    Affects Versions: 1.0.0, 0.98.10
>            Reporter: Adrian Muraru
>             Fix For: 2.0.0, 1.0.1, 1.1.0, 0.94.27, 0.98.11
>
>         Attachments: 12971.txt
>
>
> We are setting in hbase-site the default value of 300 for
> {{replication.source.maxretriesmultiplier}} introduced in HBASE-11964.
> While this value works fine to recover from transient errors with the remote ZK
> quorum of the peer HBase cluster, it proved to have side effects in the
> code introduced in HBASE-11367 (Pluggable replication endpoint), where the
> default is much lower (10).
> See:
> 1. https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L169
> 2. https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L79
> The two default values are definitely conflicting: when
> {{replication.source.maxretriesmultiplier}} is set in hbase-site to 300,
> this leads to a sleep time of 300*300 (25h!) when a socket timeout
> exception is thrown.
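
For reference, the arithmetic behind the 25h figure can be sketched as below. This is only an illustration, not the actual ReplicationSource or HBaseInterClusterReplicationEndpoint code: the class and method names are hypothetical, and the 1000 ms base sleep is an assumption inferred from the 25h number in the description. It compares the squared default with the alternative of using maxRetriesMultiplier directly.

```java
import java.util.concurrent.TimeUnit;

public class SocketTimeoutSleepSketch {
    // Assumed base sleep between retries, in milliseconds (inferred from the
    // 25h figure in the issue description, not read from any configuration).
    static final long BASE_SLEEP_MS = 1000L;

    // Behavior described in the issue: the socket timeout multiplier defaults
    // to maxRetriesMultiplier squared.
    static long squaredDefaultSleepMs(long baseSleepMs, long maxRetriesMultiplier) {
        return baseSleepMs * maxRetriesMultiplier * maxRetriesMultiplier;
    }

    // Alternative discussed in the comment: use maxRetriesMultiplier itself
    // as the socket timeout multiplier.
    static long linearDefaultSleepMs(long baseSleepMs, long maxRetriesMultiplier) {
        return baseSleepMs * maxRetriesMultiplier;
    }

    public static void main(String[] args) {
        long squared = squaredDefaultSleepMs(BASE_SLEEP_MS, 300);
        long linear = linearDefaultSleepMs(BASE_SLEEP_MS, 300);
        System.out.println("squared: " + TimeUnit.MILLISECONDS.toHours(squared) + " h");
        System.out.println("linear:  " + TimeUnit.MILLISECONDS.toMinutes(linear) + " min");
    }
}
```

Under these assumptions a multiplier of 300 gives a 25 hour sleep when squared and a 5 minute sleep when used as is.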