[ https://issues.apache.org/jira/browse/HBASE-21164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606733#comment-16606733 ]
Mingliang Liu commented on HBASE-21164:
---------------------------------------

The v1 patch looks good to me (non-binding).

My initial idea was to log only once every dozen or so retries. Later I thought exponential backoff with jitter would be better. In the end, I think the {{RetryCounter}} backoff is enough to mitigate the excessive {{reportForDuty}} retries and verbose logging. We don't have a registration contention problem here, so jitter is not a must.

As for testing, I haven't found a straightforward way. I have a simple test using [LogCapture|https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java#L460] that verifies the HRegionServer {{reportForDuty}} failure and retry logs. It asserts that the number of log lines is roughly what exponential backoff would produce. It passes with this patch and fails with the existing code. I know this testing method is not elegant, and I would be glad to hear of any better ways. Because HADOOP-13470 is not in our Hadoop version, we need our own LogCapture. The upside is that the log-assertion approach is the least intrusive.

> reportForDuty should do (exponential) backoff rather than retry every 3
> seconds (default).
> ------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21164
>                 URL: https://issues.apache.org/jira/browse/HBASE-21164
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>            Reporter: stack
>            Assignee: Mingliang Liu
>            Priority: Minor
>         Attachments: HBASE-21164.branch-2.1.001.patch
>
>
> RegionServers do reportForDuty on startup to tell Master they are available.
> If Master is initializing, and especially on a big cluster when it can take a
> while particularly if something is amiss, the log every three seconds is
> annoying and doesn't do anything of use. Do backoff if fails up to a
> reasonable maximum period.
> Here is an example:
> {code}
> 2018-09-06 14:01:39,312 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty to master=vc0207.halxg.cloudera.com,22001,1536266763109 with port=22001, startcode=1536266763109
> 2018-09-06 14:01:39,312 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
> ....
> {code}
> For example, I am looking at a large cluster now that had a backlog of
> procedure WALs. It is taking a couple of hours recreating the procedure-state
> because there are millions of procedures outstanding. Meantime, the Master
> log is just full of the above message -- every three seconds...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
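
For readers skimming this thread, the capped exponential backoff discussed above can be sketched roughly as follows. This is only an illustration, not the patch and not HBase's {{RetryCounter}}: the class name {{BackoffSketch}}, both method names, and the 5-minute cap are assumptions made for the example; only the 3-second base interval comes from the issue title. The jitter helper is included because the comment mentions it, although the comment concludes it is not needed here.

```java
import java.util.concurrent.ThreadLocalRandom;

/**
 * Hypothetical sketch of capped exponential backoff.
 * Not the HBASE-21164 patch and not HBase's RetryCounter.
 */
final class BackoffSketch {

  /**
   * Returns baseMillis * 2^attempt, capped at maxMillis.
   * attempt is zero-based: attempt 0 yields baseMillis.
   */
  public static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
      throw new IllegalArgumentException("invalid backoff parameters");
    }
    long delay = baseMillis;
    // Double step by step; stop at the cap so the value can never overflow.
    for (int i = 0; i < attempt && delay < maxMillis; i++) {
      delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
    }
    return delay;
  }

  /**
   * Optional "full jitter": a uniformly random delay in [0, delayMillis].
   * Useful when many clients retry at once; the comment above argues it is
   * not required for reportForDuty.
   */
  public static long withFullJitter(long delayMillis) {
    if (delayMillis < 0) {
      throw new IllegalArgumentException("delay must be non-negative");
    }
    return ThreadLocalRandom.current().nextLong(delayMillis + 1);
  }

  public static void main(String[] args) {
    long baseMillis = 3_000L;   // 3 seconds, the default retry interval named in the issue
    long maxMillis = 300_000L;  // 5 minutes, an assumed cap for illustration only
    for (int attempt = 0; attempt < 9; attempt++) {
      System.out.println("attempt " + attempt + ": sleep "
          + backoffMillis(attempt, baseMillis, maxMillis) + " ms");
    }
  }
}
```

With these assumed values, the sleep grows as 3, 6, 12, 24, 48, 96, 192 seconds and then stays at the 300-second cap, so a Master that takes hours to initialize would see a retry log line every five minutes rather than every three seconds.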