Re: Replication Lag Issue in HBase DR Cluster after Upgrade

Valli Tue, 22 Aug 2023 05:16:47 -0700

Hi Duo

[image: Screenshot 2023-08-22 at 5.39.58 PM.png]



In the above metrics, sizeOfLogQueue is always 1, even though we don't
have any entry for the regionserver in the oldWAL folder.

https://github.com/apache/hbase/blob/rel/1.4.14/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationLoad.java

long timePassedAfterLastShippedOp =
    EnvironmentEdgeManager.currentTime() - timeStampOfLastShippedOp;
if (sizeOfLogQueue != 0) {
  // err on the large side
  replicationLag = Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
} else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
  replicationLag = ageOfLastShippedOp; // last shipped happen recently
} else {
  // last shipped may happen last night,
  // so NO real lag although ageOfLastShippedOp is non-zero
  replicationLag = 0;
}

Above is the code extract from 1.4.14. Here, if sizeOfLogQueue is not
equal to 0, the larger of timePassedAfterLastShippedOp and
ageOfLastShippedOp is reported as the replication lag. How can we find
out where the sizeOfLogQueue value is set?
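
To illustrate the behaviour, here is a minimal standalone sketch of the
calculation quoted above. It is not the HBase class itself: the class name
ReplicationLagSketch, the method replicationLag, the explicit "now"
parameter (standing in for EnvironmentEdgeManager.currentTime()) and all
the numbers in main are made up for illustration. It only shows that, with
sizeOfLogQueue stuck at 1 and nothing being shipped, the reported lag
follows the wall clock.

```java
// Standalone sketch of the lag calculation quoted from ReplicationLoad in
// 1.4.14. Class and method names are illustrative, not HBase API.
final class ReplicationLagSketch {

    // Mirrors the quoted branch logic. "now" replaces
    // EnvironmentEdgeManager.currentTime() so results are reproducible.
    static long replicationLag(long now, long timeStampOfLastShippedOp,
                               long ageOfLastShippedOp, int sizeOfLogQueue) {
        long timePassedAfterLastShippedOp = now - timeStampOfLastShippedOp;
        if (sizeOfLogQueue != 0) {
            // err on the large side
            return Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
        } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
            return ageOfLastShippedOp; // last shipment happened recently
        } else {
            return 0; // last shipment was long ago, so no real lag
        }
    }

    public static void main(String[] args) {
        long lastShipped = 1_000_000L; // made-up timestamp, in ms
        long age = 50L;                // made-up ageOfLastShippedOp, in ms

        // Idle cluster, queue size stuck at 1: lag grows with elapsed time.
        System.out.println(replicationLag(lastShipped + 60_000L, lastShipped, age, 1));
        System.out.println(replicationLag(lastShipped + 3_600_000L, lastShipped, age, 1));

        // Same idle cluster with an empty queue: lag is reported as 0.
        System.out.println(replicationLag(lastShipped + 3_600_000L, lastShipped, age, 0));
    }
}
```

With these made-up inputs, the first two calls return the elapsed time
since the last shipment (60000 and 3600000 ms), while the third returns 0.
So in this code path the non-zero sizeOfLogQueue alone is enough to make
the lag metric grow on an idle cluster.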


On Sun, 20 Aug 2023 at 12:56, 张铎(Duo Zhang) <[email protected]> wrote:

> If it is just a metrics issue then HBASE-22784 won't help. I guess the
> problem is that the replication lag is calculated by comparing the
> current time and the time when we ship the last edit, so if there is
> no new edit, the replication lag will keep growing.
>
> Looking at the current code
>
>
> https://github.com/apache/hbase/blob/dae078e5bc342012b49cd066027eb53ae9a21280/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsSource.java#L341
>
>   public long getReplicationDelay() {
>     if (getTimestampOfLastShippedOp() >= timeStampNextToReplicate) {
>       return 0;
>     } else {
>       return EnvironmentEdgeManager.currentTime() - timeStampNextToReplicate;
>     }
>   }
>
> It has an if condition that checks whether there are actual edits to
> replicate, to avoid false alarms; it was added by HBASE-21505.
>
> The code for branch-1.4 is completely different, and since all HBase
> 1.x versions have been EOL for quite some time, I'm not sure what the
> easiest way to fix the problem is. You may need to read the code a bit
> more carefully to see how to add the above check to the 1.x code line.
>
> Thanks.
>
> Valli <[email protected]> wrote on Thu, 17 Aug 2023 at 23:14:
> >
> > Hi Duo Zhang
> >
> > It's just a metrics issue. There are no active writes in that cluster, so
> > we don't have any data to replicate to the other cluster.
> >
> >
> > On Wed, 16 Aug 2023 at 08:01, 张铎(Duo Zhang) <[email protected]> wrote:
> >
> > > Is this just a metrics issue or is there an actual replication lag?
> > >
> > > Valli <[email protected]> wrote on Fri, 11 Aug 2023 at 22:51:
> > > >
> > > > Hello HBase Community,
> > > >
> > > > We recently upgraded our HBase cluster from version 1.2.6 to 1.4.14 and
> > > > have encountered an issue with replication lag in our Disaster Recovery
> > > > (DR) cluster. We have two clusters in our setup: an active write cluster
> > > > and a DR cluster that receives replication from the active cluster. The
> > > > replication lag in the DR cluster has been building up, even though there
> > > > are no direct writes to it.
> > > >
> > > > Here's a brief overview of the problem:
> > > > - We have an active write cluster with no replication lag.
> > > > - The DR cluster only receives replication from the active cluster and
> > > >   doesn't have direct writes.
> > > > - Replication lag builds up in the DR cluster over time, even though
> > > >   there is no active write.
> > > > - When a 'put' call is made in the DR cluster, the replication lag
> > > >   drops momentarily, but then starts building up again.
> > > >
> > > > We experienced a similar issue on version 1.4.9 in another
> > > > cluster, and applied the patch below to fix it.
> > > >
> > > > https://issues.apache.org/jira/browse/HBASE-22784
> > > >
> > > > Version 1.4.14 already contains this patch, but we still see the issue.
> > > >
> > > > Are there any specific configurations or adjustments we should be
> > > > making to address this problem? It's important for us to maintain a
> > > > reliable DR setup, and any guidance or insights you can provide would
> > > > be greatly appreciated.
> > > >
> > > > If anyone has experienced a similar issue after upgrading HBase or has
> > > > any recommendations on how to troubleshoot and resolve replication lag
> > > > in a DR cluster, please share your thoughts.
> > > >
> > > > Thank you in advance for your time and assistance. Your expertise and
> > > > insights are invaluable to us as we work to resolve this issue and
> > > > maintain the stability of our HBase setup.
> > > >
> > > > Best regards,
> > > > Manimekalai K
> > > > --
> > > > *Regards,*
> > > > *Manimekalai K*
> > >
>
