[ 
https://issues.apache.org/jira/browse/HBASE-21406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674255#comment-16674255
 ] 

Wellington Chevreuil commented on HBASE-21406:
----------------------------------------------

Added an initial patch proposal for *branch-1*. The idea is to not show stats for 
SINK until it has received edits. Also added an additional metric showing the 
sink startup time, along these lines:
{noformat}
SINK  : TimeStampStarted=1541292912227, Waiting for OPs...{noformat}
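
A minimal sketch of that behaviour follows, only to illustrate the idea; it is not the code in the attached patch. The class name, the method and the {{appliedOps}} counter are hypothetical and do not come from HBase.

```java
import java.util.Date;

// Hypothetical helper, not part of HBase: renders the SINK line of
// "status 'replication'" depending on whether any edit was applied yet.
public class SinkStatusSketch {

    public static String render(long timeStampStarted, long appliedOps,
                                long ageOfLastAppliedOp, long timeStampOfLastAppliedOp) {
        if (appliedOps == 0) {
            // The sink has not received edits yet: show only the startup time.
            return "SINK  : TimeStampStarted=" + timeStampStarted + ", Waiting for OPs...";
        }
        // The sink has applied edits: show the usual metrics.
        return "SINK  : AgeOfLastAppliedOp=" + ageOfLastAppliedOp
            + ", TimeStampsOfLastAppliedOp=" + new Date(timeStampOfLastAppliedOp);
    }

    public static void main(String[] args) {
        // A sink that was started but never received an edit.
        System.out.println(render(1541292912227L, 0L, 0L, 0L));
    }
}
```

How the real sink decides that it has not received edits may differ; the counter above is just the simplest stand-in.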
 
BTW, while testing, I noticed additional issues with the source metrics on the 
current branch-1 version:
1) Once started, and while no OP eligible for replication occurs, 
TimeStampsOfLastShippedOp shows "Thu Jan 01 01:00:00 GMT 1970", and a huge 
Replication Lag is reported. This seems to be due to HBASE-15995, which removed 
the code in the ReplicationSource class that initializes AgeOfLastShippedOp to the 
startup time:

{noformat}
-          // Reset the sleep multiplier if nothing has actually gone wrong
-          if (!gotIOE) {
-            sleepMultiplier = 1;
-            // if there was nothing to ship and it's not an error
-            // set "ageOfLastShippedOp" to <now> to indicate that we're current
-            metrics.setAgeOfLastShippedOp(EnvironmentEdgeManager.currentTime(), walGroupId);
+          WALEntryBatch entryBatch = entryReader.take();
+          for (Map.Entry<String, Long> entry : entryBatch.getLastSeqIds().entrySet()) {
+            waitingUntilCanPush(entry);
{noformat}

2) After the source gets OPs to replicate and successfully ships them to the target, 
the source metrics keep showing lag, even if there are no new edits to 
replicate. This is also wrong, and was apparently introduced by the changes from 
HBASE-15093, which modified the way the log queue size is accounted. The 
replication lag calculation seems to rely on the log queue size in 
ReplicationLoad:
{noformat}
      long ageOfLastShippedOp = sm.getAgeOfLastShippedOp();
      int sizeOfLogQueue = sm.getSizeOfLogQueue();
      long timeStampOfLastShippedOp = sm.getTimeStampOfLastShippedOp();
      long replicationLag;
      long timePassedAfterLastShippedOp =
          EnvironmentEdgeManager.currentTime() - timeStampOfLastShippedOp;
      if (sizeOfLogQueue != 0) {
        // err on the large side
        replicationLag = Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
      } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
        replicationLag = ageOfLastShippedOp; // last shipped happen recently
      } else {
        // last shipped may happen last night,
        // so NO real lag although ageOfLastShippedOp is non-zero
        replicationLag = 0;
      }
{noformat}
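
To make both symptoms easier to reproduce, here is a standalone sketch that copies the branch logic quoted above into a pure function. The class name and the explicit {{now}} parameter are hypothetical, added only so the function can be called with fixed values. It assumes the reported log queue size is non-zero and that the last-shipped timestamp is still 0 before anything has shipped.

```java
// Hypothetical standalone copy of the lag calculation quoted above.
// "now" replaces EnvironmentEdgeManager.currentTime() so inputs can be fixed.
public class ReplicationLagSketch {

    public static long replicationLag(long now, long ageOfLastShippedOp,
                                      int sizeOfLogQueue, long timeStampOfLastShippedOp) {
        long timePassedAfterLastShippedOp = now - timeStampOfLastShippedOp;
        if (sizeOfLogQueue != 0) {
            // err on the large side
            return Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
        } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
            return ageOfLastShippedOp; // last ship happened recently
        }
        return 0; // queue empty and last ship was long ago: no real lag
    }

    public static void main(String[] args) {
        long now = 1541292912227L;
        long oneHour = 3_600_000L;

        // Issue 1: nothing shipped yet, timestamp still 0 (assumed), queue size 1.
        // The lag equals the full epoch-based "now" value.
        System.out.println(replicationLag(now, 0L, 1, 0L));

        // Issue 2: last ship an hour ago, no new edits, queue size still 1.
        // The lag keeps growing with wall-clock time.
        System.out.println(replicationLag(now, 50L, 1, now - oneHour));

        // Same situation with queue size 0: lag is reported as 0.
        System.out.println(replicationLag(now, 50L, 0, now - oneHour));
    }
}
```

With these inputs the three calls print 1541292912227, 3600000 and 0, which matches the behaviour described in 1) and 2) whenever the queue size is not 0.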

I'll open another JIRA to fix the source metrics issues mentioned above.

> "status 'replication'" should not show SINK if the cluster does not act as 
> sink
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-21406
>                 URL: https://issues.apache.org/jira/browse/HBASE-21406
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Daisuke Kobayashi
>            Assignee: Wellington Chevreuil
>            Priority: Minor
>         Attachments: HBASE-21406-branch-1.001.patch, Screen Shot 2018-10-31 
> at 18.12.54.png
>
>
> When replicating in 1 way, from source to target, {{status 'replication'}} on 
> source always dumps SINK with meaningless metrics. It only makes sense when 
> running the command on target cluster.
> {{status 'replication'}} on source, for example. {{AgeOfLastAppliedOp}} is 
> always zero and {{TimeStampsOfLastAppliedOp}} does not get updated from the 
> time the RS started since it's not acting as sink.
> {noformat}
>     source-1.com
>        SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=0, 
> TimeStampsOfLastShippedOp=Mon Oct 29 23:44:14 PDT 2018, Replication Lag=0
>        SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Oct 25 
> 23:56:53 PDT 2018
> {noformat}
> {{status 'replication'}} on target works as expected. SOURCE is empty as it's 
> not acting as source:
> {noformat}
>     target-1.com
>        SOURCE:
>        SINK  : AgeOfLastAppliedOp=70, TimeStampsOfLastAppliedOp=Mon Oct 29 
> 23:44:08 PDT 2018
> {noformat}
> This is because {{getReplicationLoadSink}}, called in {{admin.rb}}, always 
> returns a value (not null).
> 1.X
> https://github.com/apache/hbase/blob/rel/1.4.0/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerLoad.java#L194-L204
> 2.X
> https://github.com/apache/hbase/blob/rel/2.0.0/hbase-client/src/main/java/org/apache/hadoop/hbase/ServerLoad.java#L392-L399



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
