[jira] [Commented] (HBASE-12971) Replication stuck due to large default value for replication.source.maxretriesmultiplier

Lars Hofhansl (JIRA) Thu, 12 Feb 2015 11:42:14 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-12971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14318848#comment-14318848
 ]


Lars Hofhansl commented on HBASE-12971:
---------------------------------------

[~amuraru], agreed.

[~apurtell], the default for max retries is 10 (i.e. 10s with the default sleep 
interval). For socket timeouts I think that is too small.
In fact I see the following comment as to why socket timeouts are handled 
differently:
{code}
            // This exception means we waited for more than 60s and nothing
            // happened, the cluster is alive and calling it right away
            // even for a test just makes things worse.
{code}

I do not see us setting a socket timeout anywhere, so the 60s must be an 
assumption/default.
Maybe the retry after a socket timeout should wait at least 60s, i.e. 
max(60s, sleep interval * max retries).
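
To make the arithmetic concrete, here is a rough sketch of the proposed floor next to the squared sleep from the issue description quoted below. It is not the actual HBase code: the class and method names are made up for illustration, and the 1s base sleep interval is an assumption inferred from "10 (i.e. 10s with the default sleep interval)".

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of HBase; all names here are illustrative.
public class SocketTimeoutBackoff {
    // Assumed floor: the 60s mentioned in the code comment quoted above.
    static final long SOCKET_TIMEOUT_FLOOR_MS = TimeUnit.SECONDS.toMillis(60);

    // Proposed wait after a socket timeout: max(60s, sleep interval * max retries).
    static long proposedSleepMs(long sleepIntervalMs, int maxRetriesMultiplier) {
        return Math.max(SOCKET_TIMEOUT_FLOOR_MS, sleepIntervalMs * maxRetriesMultiplier);
    }

    // Sleep as described in the issue: the multiplier is applied twice (300*300).
    static long squaredSleepMs(long sleepIntervalMs, int maxRetriesMultiplier) {
        return sleepIntervalMs * maxRetriesMultiplier * maxRetriesMultiplier;
    }

    public static void main(String[] args) {
        long sleepIntervalMs = 1000; // assumed 1s base sleep interval

        // Multiplier 10: 10s is below the floor, so the wait becomes 60s.
        System.out.println(proposedSleepMs(sleepIntervalMs, 10));  // 60000
        // Multiplier 300: 300s is above the floor and is used as-is.
        System.out.println(proposedSleepMs(sleepIntervalMs, 300)); // 300000
        // Multiplier 300 applied twice: 90000s, i.e. the 25h from the issue.
        System.out.println(
            TimeUnit.MILLISECONDS.toHours(squaredSleepMs(sleepIntervalMs, 300))); // 25
    }
}
```

With the floor, a multiplier of 300 would give a 5 minute wait after a socket timeout instead of 25 hours, and the default of 10 would give 60s instead of 10s.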


> Replication stuck due to large default value for 
> replication.source.maxretriesmultiplier
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-12971
>                 URL: https://issues.apache.org/jira/browse/HBASE-12971
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>    Affects Versions: 1.0.0, 0.98.10
>            Reporter: Adrian Muraru
>             Fix For: 2.0.0, 1.0.1, 1.1.0, 0.94.27, 0.98.11
>
>         Attachments: 12971.txt
>
>
> We are setting in hbase-site the default value of 300 for 
> {{replication.source.maxretriesmultiplier}} introduced in HBASE-11964.
> While this value works fine to recover from transient errors with the remote 
> ZK quorum of the peer HBase cluster, it proved to have side effects in the 
> code introduced in HBASE-11367 Pluggable replication endpoint, where the 
> default is much lower (10).
> See:
> 1. 
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java#L169
> 2. 
> https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/HBaseInterClusterReplicationEndpoint.java#L79
> The two default values are definitely conflicting - when 
> {{replication.source.maxretriesmultiplier}} is set in the hbase-site to 300 
> this will lead to a sleep time of 300*300 seconds (25h!) when a socket 
> timeout exception is thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
