[ 
https://issues.apache.org/jira/browse/HBASE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606733#comment-16606733
 ] 

Mingliang Liu commented on HBASE-21164:
---------------------------------------

The v1 patch looks good to me (non-binding).

My initial idea was to log only once every dozen or so retries. I then thought 
exponential backoff with jitter would be better. In the end, I think the 
{{RetryCounter}} backoff is enough to mitigate both the excessive 
{{reportForDuty}} retries and the verbose logging. We don't have a registration 
contention problem here, so jitter is not a must.
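
To illustrate how much capped exponential backoff (without jitter) cuts down 
the retries, here is a minimal sketch. It is not the {{RetryCounter}} 
implementation or the code in the patch; the class name, method names, and the 
5-minute cap are assumptions made for illustration. Only the 3-second base 
interval comes from the issue.

```java
// Hypothetical sketch of capped exponential backoff; NOT HBase's RetryCounter.
final class BackoffSketch {

    /**
     * Sleep before retry number {@code attempt} (0-based):
     * baseMillis * 2^attempt, capped at maxMillis.
     * Assumes 0 < baseMillis <= maxMillis.
     */
    static long sleepMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        // Return the cap before shifting if the shift would exceed it (or overflow).
        if (attempt >= 63 || baseMillis > (maxMillis >> attempt)) {
            return maxMillis;
        }
        return Math.min(maxMillis, baseMillis << attempt);
    }

    /** Number of retries (and so of log lines) whose sleeps fit in the window. */
    static int retriesWithin(long windowMillis, long baseMillis, long maxMillis) {
        long elapsed = 0;
        int attempt = 0;
        while (true) {
            long sleep = sleepMillis(attempt, baseMillis, maxMillis);
            if (elapsed + sleep > windowMillis) {
                return attempt;
            }
            elapsed += sleep;
            attempt++;
        }
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        // Fixed 3 s interval (cap equal to base) versus backoff capped at an assumed 5 min.
        System.out.println("fixed:   " + retriesWithin(hour, 3000, 3000));
        System.out.println("backoff: " + retriesWithin(hour, 3000, 300_000));
    }
}
```

Under these assumed numbers, an hour of Master initialization produces 1200 
retries at a fixed 3-second interval and 17 with backoff.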

As for testing, I haven't found a straightforward way. I have a simple test 
using 
[LogCapture|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java#L460]
 that captures the HRegionServer {{reportForDuty}} failure and retry logs. It 
asserts that the number of log messages roughly matches what exponential 
backoff would produce. It passes with this patch and fails with the existing 
code.

I know this testing method is not elegant, and I would be glad to hear of any 
better ways. Because HADOOP-13470 is not in the Hadoop version we use, we need 
our own LogCapture. The upside is that the log assertion approach is the least 
intrusive.
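
The log assertion idea can be sketched roughly as follows. This is not the 
test from the patch and does not use Hadoop's LogCapture: it swaps in 
{{java.util.logging}} so that it is self-contained, and it simulates the retry 
loop on a virtual clock instead of running an HRegionServer. The class, method, 
and logger names are hypothetical; only the log message text is taken from the 
issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical sketch of a log-capturing assertion; NOT Hadoop's LogCapture.
final class LogCaptureSketch {

    /** Collects the message of every record published to the attached logger. */
    static final class CapturingHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override public void publish(LogRecord record) {
            messages.add(record.getMessage());
        }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /**
     * Simulates a failing retry loop over a virtual time window and returns
     * how many retry warnings were captured. The sleep doubles after each
     * failure, up to maxMillis; no real sleeping takes place.
     * Assumes 0 < baseMillis <= maxMillis.
     */
    static int countRetryWarnings(long windowMillis, long baseMillis, long maxMillis) {
        Logger log = Logger.getLogger("LogCaptureSketch.retry");
        log.setUseParentHandlers(false); // keep the console quiet
        log.setLevel(Level.ALL);
        CapturingHandler capture = new CapturingHandler();
        log.addHandler(capture);
        try {
            long elapsed = 0;
            long sleep = baseMillis;
            while (elapsed + sleep <= windowMillis) {
                log.warning("reportForDuty failed; sleeping and then retrying.");
                elapsed += sleep; // advance the virtual clock
                sleep = Math.min(maxMillis, sleep * 2);
            }
        } finally {
            log.removeHandler(capture);
        }
        return capture.messages.size();
    }

    public static void main(String[] args) {
        // One virtual minute: fixed 3 s interval versus backoff with an assumed 5 min cap.
        System.out.println("fixed:   " + countRetryWarnings(60_000, 3000, 3000));
        System.out.println("backoff: " + countRetryWarnings(60_000, 3000, 300_000));
    }
}
```

A test built this way asserts an upper bound on the captured count, so it 
fails against a fixed-interval loop and passes once backoff is in place, 
without touching the production code path.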

> reportForDuty should do (exponential) backoff rather than retry every 3 
> seconds (default).
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21164
>                 URL: https://issues.apache.org/jira/browse/HBASE-21164
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: stack
>            Assignee: Mingliang Liu
>            Priority: Minor
>         Attachments: HBASE-21164.branch-2.1.001.patch
>
>
> RegionServers do reportForDuty on startup to tell Master they are available. 
> If Master is initializing, which can take a while on a big cluster, 
> particularly if something is amiss, the log message every three seconds is 
> annoying and serves no purpose. Back off on failure, up to a reasonable 
> maximum period. Here is an example:
> {code}
> 2018-09-06 14:01:39,312 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to 
> master=vc0207.halxg.cloudera.com,22001,1536266763109 with port=22001, 
> startcode=1536266763109
> 2018-09-06 14:01:39,312 WARN 
> org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; 
> sleeping and then retrying.
> ....
> {code}
> For example, I am looking at a large cluster now that had a backlog of 
> procedure WALs. It is taking a couple of hours to recreate the procedure 
> state because there are millions of procedures outstanding. In the meantime, 
> the Master log is just full of the above message -- every three seconds...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
