Is there an HBase utility to dump the contents of ZooKeeper? The data at that path is not directly readable from ZooKeeper, so I probably need to decode it somehow.

Thanks,
Hamado Dene

On Wednesday, 18 September 2024 at 16:26:20 CEST, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:

It is a bit strange that the positions for all the very old WAL files are -1. I skimmed the code for branch-2.5, and it seems they should only ever be set to 0.

Could you please try to dump the content of the znode that records the position of the given WAL file? The path on ZooKeeper should be something like

/hbase/replication/rs/<queue>/<file>

Thanks.

Hamado Dene <hamadod...@yahoo.com.invalid> wrote on Wed, 18 Sep 2024 at 15:16:
>
> I did some investigation, and the WALs seem to be readable without any
> issues. One strange thing I noticed is that the WALs are very old: they
> date from about a year before the current date.
>
> -rw-r--r-- 2 hbase hadoop 42594304 2023-10-09 08:27 /hbase/oldWALs/rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448
> -rw-r--r-- 2 hbase hadoop 13622784 2023-10-09 08:26 /hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708
> -rw-r--r-- 2 hbase hadoop 15872 2023-10-09 08:26 /hbase/oldWALs/rzv-db12-hd.xxxx%2C16020%2C1674973371058.1696813278286
> -rw-r--r-- 2 hbase hadoop 15007744 2023-10-09 08:26 /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> -rw-r--r-- 2 hbase hadoop 20684288 2023-10-09 08:26 /hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993
>
> The current date is
> Wed Sep 18 09:06:17 CEST 2024
>
> but the WAL files are dated 9 October 2023.
>
> Could this be the cause of the issue?
>
> Hamado Dene
>
> On Monday, 16 September 2024 at 16:37:12 CEST, Hamado Dene <hamadod...@yahoo.com> wrote:
>
> I deduced that it was one of the old WALs because, from the UI, I see that
> these old WALs are not being replicated. However, I'll do another round of
> checks to see if I can find something more. Would enabling debug logging
> help me find more information?
>
> Thanks again for your help.
>
> Replication Status
>
> - Current Log
> - Replication Delay
>
> | PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
> | replicav3 | rzv-db10-hd.xxxx%2C16020%2C1674973984596 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708 | 13.0 M | 1 | -1 |
> | replicav3 | rzv-db12-hd.xxxx%2C16020%2C1726056192276 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db12-hd.xxxxx,16020,1726056192276/rzv-db12-hd.xxxxxx%2C16020%2C1726056192276.1726495470091 | 0 | 1 | 98.0 M |
>
> Replication Status
>
> - Current Log
> - Replication Delay
>
> | PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
> | replicav3 | rzv-db10-hd.xxxx%2C16020%2C1726056520723 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db10-hd.xxxx,16020,1726056520723/rzv-db10-hd.rozzano.diennea.lan%2C16020%2C1726056520723.1726495461864 | 0 | 1 | 4.9 M |
> | replicav3 | rzv-db14-hd.xxxxn%2C16020%2C1674973593505 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1674973593505.1696810047993 | 19.7 M | 1 | -1 |
>
> Replication Status
>
> - Current Log
> - Replication Delay
>
> | PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
> | replicav3 | rzv-db11-hd.xxxx%2C16020%2C1726063232272 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db11-hd.rozzano.diennea.lan,16020,1726063232272/rzv-db11-hd.rozzano.diennea.lan%2C16020%2C1726063232272.1726495580356 | 0 | 1 | 16.8 M |
> | replicav3 | rzv-db12-hd.xxx%2C16020%2C1674973371058 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db12-hd.rozzano.diennea.lan%2C16020%2C1674973371058.1696813278286 | 15.5 K | 1 | -1 |
>
> Replication Status
>
> - Current Log
> - Replication Delay
>
> | PeerId | WalGroup | Current Log | Size | Queue Size | Offset |
> | replicav3 | rzv-db09-hd.xxxx%2C16020%2C1674973354605 | hdfs://rozzanohadoopcluster/hbase/oldWALs/rzv-db09-hd.rozzano.diennea.lan%2C16020%2C1674973354605.1696810476448 | 40.6 M | 1 | -1 |
> | replicav3 | rzv-db14-hd.xxx%2C16020%2C1726066551699 | hdfs://rozzanohadoopcluster/hbase/WALs/rzv-db14-hd.rozzano.diennea.lan,16020,1726066551699/rzv-db14-hd.rozzano.diennea.lan%2C16020%2C1726066551699.1726496170126 | 0 | 1 | 7.9 M |
>
> On Monday, 16 September 2024 at 16:11:19 CEST, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
>
> The stack trace you posted is messed up, so it is not easy to find out
> which file actually blocks the replication progress.
>
> Could you please double check which WAL file blocks the replication?
> Is it really one of these old WAL files?
>
> Thanks.
>
> Hamado Dene <hamadod...@yahoo.com.invalid> wrote on Mon, 16 Sep 2024 at 21:57:
> >
> > Thanks for your response.
> > If I try to read the WALs with the following command:
> >
> > hbase org.apache.hadoop.hbase.wal.WALPrettyPrinter /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> >
> > I don't get any error. The file seems to be read correctly. In fact, at
> > the end of the read, something like the following is printed:
> >
> > cell total size sum: 136
> > edit heap size: 312
> > position: 15007544
> >
> > Thanks,
> >
> > On Monday, 16 September 2024 at 14:51:02 CEST, 张铎(Duo Zhang) <palomino...@gmail.com> wrote:
> >
> > Have you tried to read these WAL files with WALPrettyPrinter? What is
> > the error from WALPrettyPrinter while reading these files?
> >
> > Hamado Dene <hamadod...@yahoo.com.invalid> wrote on Mon, 16 Sep 2024 at 16:15:
> > >
> > > Checking the WALs on HDFS, there are very old WALs, from a year ago.
> > > Does anyone have any idea how to handle this issue in production?
> > >
> > > -rw-r--r-- 2 hbase hadoop 20684288 2023-10-09 08:26 /hbase/oldWALs/rzv-db14-hd.xxxx%2C16020%2C1674973593505.1696810047993
> > > -rw-r--r-- 2 hbase hadoop 15007744 2023-10-09 08:26 /hbase/oldWALs/rzv-db13-hd.xxxx%2C16020%2C1684871532555.1696811057371
> > > -rw-r--r-- 2 hbase hadoop 15872 2023-10-09 08:26 /hbase/oldWALs/rzv-db12-hd.xxxx%2C16020%2C1674973371058.1696813278286
> > > -rw-r--r-- 2 hbase hadoop 42594304 2023-10-09 08:27 /hbase/oldWALs/rzv-db09-hd.xxxx%2C16020%2C1674973354605.1696810476448
> > > -rw-r--r-- 2 hbase hadoop 13622784 2023-10-09 08:26 /hbase/oldWALs/rzv-db10-hd.xxxx%2C16020%2C1674973984596.1696810895708
> > >
> > > On Thursday, 12 September 2024 at 09:30:46 CEST, Hamado Dene <hamadod...@yahoo.com> wrote:
> > >
> > > Hi community,
> > > Could anyone kindly assist me in resolving this issue I'm facing?
> > > Thank you in advance!
> > > Hamado Dene
> > >
> > > On Wednesday, 11 September 2024 at 16:26:55 CEST, Hamado Dene <hamadod...@yahoo.com> wrote:
> > >
> > > Hi HBase Community,
> > > We are currently facing an issue in our production environment with HBase
> > > replication, and I would greatly appreciate any guidance or suggestions
> > > the community may have.
> > >
> > > We are running HBase version 2.5.8, and in the logs, we consistently
> > > encounter the following warning:
> > >
> > > 2024-09-11T15:51:11,468 WARN
> > > [RS_CLAIM_REPLICATION_QUEUE-regionserver/rzv-db09-hd:16020-0.replicationSource,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268.replicationSource.wal-reader.rzv-db13-hd.xxxx%2C16020%2C1684871532555,replicav3-rzv-db13-hd.xxxx,16020,1684871532555-rzv-db09-hd.xxxx,16020,1696832789107-rzv-db09-hd.xxxx,16020,1696833033289-rzv-db13-hd.xxxx,16020,1722636062425-rzv-db13-hd.xxxx,16020,1722636803794-rzv-db12-hd.xxxx,16020,1722636800268]
> > > regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
> > > java.io.EOFException: Cannot seek after EOF
> > >   at org.apache.hadoop.hdfs.DFSInputStream.seek(DFSInputStream.java:1682) ~[hadoop-hdfs-client-2.10.2.jar:?]
> > >   at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:66) ~[hadoop-common-2.10.2.jar:?]
> > >   at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.seekOnFs(ProtobufLogReader.java:527) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.seek(ReaderBase.java:130) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.seek(WALEntryStream.java:408) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:339) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:308) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:298) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:172) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.tryAdvanceStreamAndCreateWALBatch(ReplicationSourceWALReader.java:258) ~[hbase-server-2.5.8.jar:2.5.8]
> > >   at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:145) ~[hbase-server-2.5.8.jar:2.5.8]
> > >
> > > This error appears to come from the replication WAL reader, and the
> > > "Cannot seek after EOF" message suggests that it fails while reading the
> > > replication entries. We suspect this may be affecting the replication
> > > flow between our region servers.
> > >
> > > Has anyone encountered this problem before, or does anyone have insights
> > > into potential causes and solutions?
> > >
> > > Thank you in advance for your assistance!
> > >
> > > Hamado Dene
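
On the question at the top of the thread (how to decode the position stored under /hbase/replication/rs/<queue>/<file>), here is a minimal Python sketch for decoding the raw znode bytes once they have been fetched by whatever means. It rests on assumptions that are not confirmed anywhere in this thread and should be checked against the 2.5.8 source: that the value may start with HBase's ZooKeeper metadata prefix (a 0xFF byte, a 4-byte big-endian id length, then the id), followed by the 4-byte magic "PBUF", followed by a protobuf message whose field 1 is the int64 position. The function names are hypothetical and exist only for this illustration.

```python
import struct

PB_MAGIC = b"PBUF"  # assumed HBase protobuf magic prefix


def strip_zk_metadata(data: bytes) -> bytes:
    """Drop the assumed metadata prefix: 0xFF, 4-byte big-endian id length, id."""
    if len(data) < 5 or data[0] != 0xFF:
        return data  # no prefix present, return the bytes unchanged
    (id_length,) = struct.unpack(">i", data[1:5])
    return data[5 + id_length:]


def read_varint(buf: bytes, offset: int):
    """Read one protobuf base-128 varint; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def decode_position(data: bytes) -> int:
    """Return the WAL position, assuming it is varint field 1 of the message."""
    payload = strip_zk_metadata(data)
    if payload.startswith(PB_MAGIC):
        payload = payload[len(PB_MAGIC):]
    offset = 0
    while offset < len(payload):
        key, offset = read_varint(payload, offset)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type != 0:
            raise ValueError(f"unexpected wire type {wire_type}")
        value, offset = read_varint(payload, offset)
        if field_number == 1:
            # int64 values are stored as unsigned 64-bit varints; restore the sign.
            return value - (1 << 64) if value >= (1 << 63) else value
    raise ValueError("no position field found")


if __name__ == "__main__":
    # Synthetic example bytes, not taken from the cluster in this thread.
    print(decode_position(b"PBUF\x08\xb8\xfe\x93\x07"))
```

If the layout assumptions hold, a stored position of -1 would show up as the tag byte 0x08 followed by nine 0xFF bytes and a final 0x01, which the sketch decodes back to -1.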