[ 
https://issues.apache.org/jira/browse/HBASE-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashu Pachauri updated HBASE-16302:
----------------------------------
    Description: 
Replication exports the ageOfLastShippedOp metric as an indication of how far 
replication is lagging. With multiwal enabled, however, the metric is not 
representative: replication can lag for a long time on one wal group 
(something wrong with a particular region) while being fine for the others. 
In that case ageOfLastShippedOp is useless as a metric for alerting.

Also, since there is no mapping between individual replication sources and 
replication sinks, the age of last applied op can be a highly spiky metric if 
only certain replication sources are lagging.

We should use histograms for these metrics and use the maximum value of each 
histogram to report replication lag when building stats.
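
To illustrate the proposal, here is a minimal sketch of why a histogram's 
maximum is a better alerting signal than a single last-written value. It is 
not HBase code: the class AgeHistogramSketch, its bucket bounds, and the 
sample ages are all hypothetical, and the real change would presumably use 
HBase's existing metrics classes instead.

```java
import java.util.Arrays;

/** Illustrative only: a tiny fixed-bucket histogram, not an HBase class. */
public class AgeHistogramSketch {
    // Upper bounds (ms) of each bucket; one extra bucket catches larger values.
    private final long[] upperBoundsMs;
    private final long[] counts;
    private long max = 0;
    private long lastValue = 0;

    public AgeHistogramSketch(long... upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        Arrays.sort(this.upperBoundsMs);
        this.counts = new long[this.upperBoundsMs.length + 1];
    }

    /** Record the age (ms) of the last shipped op reported by one wal group. */
    public synchronized void update(long ageMs) {
        int i = 0;
        while (i < upperBoundsMs.length && ageMs > upperBoundsMs[i]) {
            i++;
        }
        counts[i]++;
        max = Math.max(max, ageMs);
        lastValue = ageMs; // what a plain gauge would end up exposing
    }

    public synchronized long getMax() { return max; }

    public synchronized long getLastValue() { return lastValue; }

    public synchronized long[] getBucketCounts() { return counts.clone(); }

    public static void main(String[] args) {
        AgeHistogramSketch h = new AgeHistogramSketch(1_000, 10_000, 60_000);
        // Hypothetical samples: one lagging wal group, then three healthy ones.
        h.update(45_000);
        h.update(120);
        h.update(300);
        h.update(95);
        System.out.println("last value: " + h.getLastValue());
        System.out.println("max:        " + h.getMax());
        System.out.println("buckets:    " + Arrays.toString(h.getBucketCounts()));
    }
}
```

In this sketch the last-written value reflects only whichever wal group 
reported most recently, so the lagging group is hidden, while the maximum 
still exposes it. A real histogram would also need to reset or decay its 
maximum per reporting interval, which the sketch omits.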

  was:
Replication exports metric ageOfLastShippedOp as an indication of how much 
replication is lagging. But, with multiwal enabled, it's not representative 
because replication could be lagging for a long time for one wal group 
(something wrong with a particular region) while being fine for others. The 
ageOfLastShippedOp becomes a useless metric for alerting in such a case.

We should just report the maximum of the age of last shipped ops across 
walgroups.


> age of last shipped op and age of last applied op should be a histogram
> -----------------------------------------------------------------------
>
>                 Key: HBASE-16302
>                 URL: https://issues.apache.org/jira/browse/HBASE-16302
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
