Wellington Chevreuil created HBASE-21505:
--------------------------------------------

             Summary: Several inconsistencies on information reported for 
Replication Sources by hbase shell status 'replication' command.
                 Key: HBASE-21505
                 URL: https://issues.apache.org/jira/browse/HBASE-21505
             Project: HBase
          Issue Type: Bug
            Reporter: Wellington Chevreuil
            Assignee: Wellington Chevreuil


While reviewing hbase shell status 'replication' command, noticed the following 
issues related to replication source section:

1) TimeStampsOfLastShippedOp keeps getting updated and increasing even when no 
new edits were added to source, so nothing was really shipped. Test steps 
performed:
1.1) Source cluster with only one table targeted to replication;
1.2) Added a new row, confirmed the row appeared in Target cluster;
1.3) Issued status 'replication' command in source, TimeStampsOfLastShippedOp 
shows current timestamp T1.
1.4) Waited 30 seconds, no new data added to source. Issued status 
'replication' command, now shows timestamp T2.

2) When replication is stuck due some connectivity issues or target 
unavailability, if new edits are added in source, reported AgeOfLastShippedOp 
is wrongly showing same value as "Replication Lag". This is incorrect, 
AgeOfLastShippedOp should not change until there's indeed another edit shipped 
to target. Test steps performed:
2.1) Source cluster with only one table targeted to replication;
2.2) Stopped target cluster RS;
2.3) Put a new row on source. Running status 'replication' command does show 
lag increasing. TimeStampsOfLastShippedOp seems correct also, no further 
updates as described on bullet #1 above.
2.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
though there's no new edit shipped to target:
{noformat}
...
 SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
...

...
SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
...
{noformat}

3) AgeOfLastShippedOp gets set to 0 even when a given edit had taken some time 
before it got finally shipped to target. Test steps performed:
3.1) Source cluster with only one table targeted to replication;
3.2) Stopped target cluster RS;
3.3) Put a new row on source. 
3.4) AgeOfLastShippedOp keeps increasing together with Replication Lag, even 
though there's no new edit shipped to target:
{noformat}
T1:
...
 SOURCE: PeerID=1, AgeOfLastShippedOp=5581, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=5581
...
T2:
...
SOURCE: PeerID=1, AgeOfLastShippedOp=8586, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=8586
...
{noformat}
3.5) Restart target cluster RS and verified the new row appeared there. No new 
edit added, but status 'replication' command reports AgeOfLastShippedOp as 0, 
while it should be the diff between the time it concluded shipping at target 
and the time it was added in source:
{noformat}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Wed Nov 21 02:50:23 GMT 2018, Replication Lag=0
{noformat}

4) When replication is stuck due some connectivity issues or target 
unavailability, if RS is restarted, once recovered queue source is started, 
TimeStampsOfLastShippedOp is set to initial java date (Thu Jan 01 01:00:00 GMT 
1970, for example), thus "Replication Lag" also gives a complete inaccurate 
value. 
Tests performed:
4.1) Source cluster with only one table targeted to replication;
4.2) Stopped target cluster RS;
4.3) Put a new row on source, restart RS on source, waited a few seconds for 
recovery queue source to startup, then it gives:
{noformat}
SOURCE: PeerID=1, AgeOfLastShippedOp=0, SizeOfLogQueue=1, 
TimeStampsOfLastShippedOp=Thu Jan 01 01:00:00 GMT 1970, Replication 
Lag=9223372036854775807
{noformat}

Also, we should report status to all sources running, current output format 
gives the impression there’s only one, even when there are recovery queues, for 
instance. 

Here is a list of ideas on how the command should report under different states 
of replication:
a) Source started, target stopped, no edits arrived on source yet: 
Status replication should not show any lags, no edits shipped, no edits arrived;

b) Source started, target stopped, add edit on source:
Status replication should report following info -> lag, time of edit arrival on 
source, additional message saying no edits had been shipped to target;

c) Source started, target stopped, edit added on source, restart source:
Status replication should list two sources, one normal, other recovered. Normal 
source should show no lags, no edits shipped, no edits arrived. Recovered 
should show no edits shipped, but should have edits arrived in source and lag > 
0;

d) Source started, target stopped, add edit on source, restart source, add 
another edit on source:
Status replication should list two sources, one normal, other recovered. Both 
sources should show no edits shipped, but should have edits arrived in source 
and lag > 0;

e) Source started, target stopped, add edit on source, restart source, add 
another edit on source, start target:
Status replication should list normal source only (after some short period), 
with proper times for last shipped, last arrived in source and no replication 
lag.

f) Source started, target stopped, add edit on source, restart source, restart 
target:
Status replication should list normal source only, with no shipped, nor arrived 
edits, and lag should be 0;



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to