Hi Duo,

[image: Screenshot 2023-08-22 at 5.39.58 PM.png]

In the metrics above, sizeOfLogQueue is always 1, even though we don't have
any entry for the regionserver in the oldWAL folder.

https://github.com/apache/hbase/blob/rel/1.4.14/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationLoad.java

    long timePassedAfterLastShippedOp =
        EnvironmentEdgeManager.currentTime() - timeStampOfLastShippedOp;
    if (sizeOfLogQueue != 0) {
      // err on the large side
      replicationLag = Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
    } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
      replicationLag = ageOfLastShippedOp; // last shipped happen recently
    } else {
      // last shipped may happen last night,
      // so NO real lag although ageOfLastShippedOp is non-zero
      replicationLag = 0;
    }

The above is a code extract from 1.4.14. Here, if sizeOfLogQueue is not
equal to 0, the larger of timePassedAfterLastShippedOp and
ageOfLastShippedOp is reported as the replication lag.

How can we find out where the sizeOfLogQueue value is set?

On Sun, 20 Aug 2023 at 12:56, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

> If it is just a metrics issue then HBASE-22784 won't help. I guess the
> problem is that the replication lag is calculated by comparing the
> current time and the time when we ship the last edit, so if there is
> no new edit, the replication lag will keep growing.
>
> Looking at the current code
>
> https://github.com/apache/hbase/blob/dae078e5bc342012b49cd066027eb53ae9a21280/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/MetricsSource.java#L341
>
>   public long getReplicationDelay() {
>     if (getTimestampOfLastShippedOp() >= timeStampNextToReplicate) {
>       return 0;
>     } else {
>       return EnvironmentEdgeManager.currentTime() - timeStampNextToReplicate;
>     }
>   }
>
> It has an if condition to check whether there are actual edits to
> replicate to avoid false alarms, which was added by HBASE-21505.
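
A minimal sketch of the two calculations quoted above, not HBase source. `lagBranch14` restates the 1.4.14 `ReplicationLoad` extract, and `lagWithPendingCheck` puts a guard in the style of `getReplicationDelay()` in front of it. The class name, both method names, and the sample timestamps are hypothetical, and it is an assumption that a value equivalent to `timeStampNextToReplicate` can be obtained in the 1.x code line.

```java
// Hypothetical illustration only; not part of HBase. All times are in milliseconds.
final class ReplicationLagSketch {

  /** Restates the lag calculation from the 1.4.14 ReplicationLoad extract. */
  static long lagBranch14(long now, long ageOfLastShippedOp,
      long timeStampOfLastShippedOp, int sizeOfLogQueue) {
    long timePassedAfterLastShippedOp = now - timeStampOfLastShippedOp;
    if (sizeOfLogQueue != 0) {
      // err on the large side: grows with "now" while the queue is non-empty
      return Math.max(ageOfLastShippedOp, timePassedAfterLastShippedOp);
    } else if (timePassedAfterLastShippedOp < 2 * ageOfLastShippedOp) {
      return ageOfLastShippedOp; // last ship happened recently
    } else {
      return 0; // last ship was long ago, so no real lag
    }
  }

  /**
   * Same calculation behind a guard modelled on getReplicationDelay().
   * timeStampNextToReplicate is an assumed input: the timestamp of the
   * oldest edit still waiting to be shipped.
   */
  static long lagWithPendingCheck(long now, long ageOfLastShippedOp,
      long timeStampOfLastShippedOp, int sizeOfLogQueue,
      long timeStampNextToReplicate) {
    if (timeStampOfLastShippedOp >= timeStampNextToReplicate) {
      return 0; // nothing newer than the last shipped edit is pending
    }
    return lagBranch14(now, ageOfLastShippedOp, timeStampOfLastShippedOp, sizeOfLogQueue);
  }

  public static void main(String[] args) {
    long lastShipped = 1_000_000L;       // made-up timestamp of the last shipped edit
    long age = 500L;                     // made-up ageOfLastShippedOp
    long nextToReplicate = lastShipped;  // idle cluster: no newer edit is pending
    for (long idleMs : new long[] {60_000L, 3_600_000L}) {
      long now = lastShipped + idleMs;
      System.out.println("idle " + idleMs + " ms: branch-1.4 lag = "
          + lagBranch14(now, age, lastShipped, 1)
          + ", with check = "
          + lagWithPendingCheck(now, age, lastShipped, 1, nextToReplicate));
    }
  }
}
```

With sizeOfLogQueue held at 1 and no new edits, the unguarded value simply tracks the idle time (60000 ms, then 3600000 ms), whereas the guarded value stays at 0.
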
>
> The code for branch-1.4 is completely different, and since all HBase
> 1.x versions have been EOL for quite some time, I'm not sure what the
> easiest way to fix the problem is. Maybe you need to read the code a
> bit more carefully to see how to add the above check in the 1.x code line.
>
> Thanks.
>
> Valli <kmanimeka...@gmail.com> wrote on Thu, 17 Aug 2023 at 23:14:
> >
> > Hi Duo Zhang
> >
> > It's just metrics, because in that cluster there is no active write. So we
> > don't have any data to replicate to the other cluster.
> >
> >
> > On Wed, 16 Aug 2023 at 08:01, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > > Is this just a metrics issue or is there an actual replication lag?
> > >
> > > Valli <kmanimeka...@gmail.com> wrote on Fri, 11 Aug 2023 at 22:51:
> > > >
> > > > Hello HBase Community,
> > > >
> > > > We recently upgraded our HBase cluster from version 1.2.6 to 1.4.14 and
> > > > have encountered an issue with replication lag in our Disaster Recovery
> > > > (DR) cluster. We have two clusters in our setup: an active write cluster
> > > > and a DR cluster that receives replication from the active cluster. The
> > > > replication lag in the DR cluster has been building up, even though there
> > > > are no direct writes to it.
> > > >
> > > > Here's a brief overview of the problem:
> > > > - We have an active write cluster with no replication lag.
> > > > - The DR cluster only receives replication from the active cluster and
> > > >   doesn't have direct writes.
> > > > - Replication lag builds up in the DR cluster over time, even though
> > > >   there is no active write.
> > > > - When a 'put' call is made in the DR cluster, the replication lag
> > > >   reduces momentarily, but then starts building up again.
> > > >
> > > > We experienced a similar issue with version 1.4.9 in another
> > > > cluster. We used the below patch for it.
> > > >
> > > > https://issues.apache.org/jira/browse/HBASE-22784
> > > >
> > > > Version 1.4.14 already contains this patch, but we still experience
> > > > the issue.
> > > >
> > > > Are there any specific configurations or adjustments we should be
> > > > making to address this problem? It's important for us to maintain a
> > > > reliable DR setup, and any guidance or insights you can provide would
> > > > be greatly appreciated.
> > > >
> > > > If anyone has experienced a similar issue after upgrading HBase or has
> > > > any recommendations on how to troubleshoot and resolve replication lag
> > > > in a DR cluster, please share your thoughts.
> > > >
> > > > Thank you in advance for your time and assistance. Your expertise and
> > > > insights are invaluable to us as we work to resolve this issue and
> > > > maintain the stability of our HBase setup.
> > > >
> > > > Best regards,
> > > > Manimekalai K
> > > >
> > > > --
> > > > *Regards,*
> > > > *Manimekalai K*
> > > >