[ https://issues.apache.org/jira/browse/HBASE-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135584#comment-17135584 ]
Pankaj Kumar commented on HBASE-24558:
--------------------------------------

Yeah, we have 2000 active regions in a RS.

> Classic (ZK-based) distributed WAL splitter is inscrutable in operation
> -----------------------------------------------------------------------
>
>                 Key: HBASE-24558
>                 URL: https://issues.apache.org/jira/browse/HBASE-24558
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Michael Stack
>            Priority: Major
>
> Scenario: a decent-sized cluster (700 nodes) with WALs backing up (TBD: a Replication Peer was in place) such that many RSs had hundreds of WALs. Next, a power-outage-like event where we have to cold-start. The Master comes up and files SCPs for the old server instances; hundreds of them. Each SCP finds there are hundreds of WALs and dutifully asks the WAL splitter to do its job. Meantime it goes to sleep until completion (see HBASE-24545); no status in the log, just loads of complaints about STUCK procedure workers. The operator sees a stalled Master dumping the occasional split completion and STUCK procedure warnings into its logs, but that's it. The operator tries poking around in zk under '/hbase/splitWAL' to see what's up but gets back a one-line spew listing thousands of znodes. The operator gets no indication of how we are progressing or of when all is done.
> In my particular case, there were > 100k WALs backed up. I could dig around some and see that 40k WALs were in flight at any one time (necessitating upping jute.maxbuffer for zk). I could do this:
> {code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g' | wc{code}
> but it didn't help much as new WALs got added to the backlog. I could do the below, but it didn't help when the WAL count was > 100k:
> {code}$ hdfs dfs -lsr /hbase/WALs/* | wc{code}
> Is the count here all the WALs that need splitting? Yes, I think. It's less than the zk count at any one time. How are the two tied together? Dunno.
> This issue is about improving visibility into our distributed WAL splitter. In particular, the zk-based splitter's visibility is poor (HBASE-21588 is a new splitter built on procedures that probably has a similar issue). We need a count of overall outstanding tasks, of how long tasks are taking, and of how well distributed the job is (see below; a minimal polling sketch follows the log excerpt).
> An aside was that I had the default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER = "hbase.regionserver.wal.max.splitters"{code} which is {code}public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2;{code} so, in extremis, I upped the count to 10. Even then, looking at random RSs, they were only getting work on occasion, sometimes with hours passing between split tasks, as though the Master was bad at passing out the jobs: hard to tell when there is no view on what is going on.
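> A minimal sketch of the knob in play, assuming only what is quoted above (the class and its main method here are illustrative, not HBase code; the property itself lives in HConstants and is normally set in hbase-site.xml on each RegionServer, taking effect on restart):
> {code}
> // Illustrative sketch: reads the split-worker cap described above.
> // "hbase.regionserver.wal.max.splitters" limits how many split tasks a
> // single RegionServer will run concurrently; the shipped default is 2.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> public class MaxSplittersExample {
>   public static void main(String[] args) {
>     // Picks up hbase-site.xml from the classpath, else the default of 2.
>     Configuration conf = HBaseConfiguration.create();
>     int max = conf.getInt("hbase.regionserver.wal.max.splitters", 2);
>     System.out.println("hbase.regionserver.wal.max.splitters = " + max);
>     // To raise it (e.g. to 10, as tried above), set the property in
>     // hbase-site.xml on each RegionServer and restart the RS.
>   }
> }
> {code}
> Even at 10, workers sat idle for hours at a stretch, which points at task hand-out on the Master rather than worker capacity.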
> Below is from the log of a random RS... notice how sporadically it gets split work even though there are thousands of WALs to work on:
> {code}
> 2020-06-11 04:38:40,113 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d in 12309ms. Status = DONE
> 2020-06-11 04:38:51,445 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6 in 13230ms. Status = DONE
> 2020-06-11 04:38:57,060 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb in 16759ms. Status = DONE
> 2020-06-11 09:17:12,762 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d in 1181ms. Status = DONE
> 2020-06-11 09:17:14,512 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959 in 2224ms. Status = DONE
> 2020-06-11 09:17:14,750 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad in 1345ms. Status = DONE
> 2020-06-11 09:17:19,898 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d in 5319ms. Status = DONE
> 2020-06-11 09:17:21,558 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b in 890ms. Status = DONE
> 2020-06-11 16:12:31,676 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc in 1193ms. Status = DONE
> 2020-06-11 16:12:34,140 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c in 2383ms. Status = DONE
> 2020-06-11 16:12:35,943 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062 in 2278ms. Status = DONE
> 2020-06-11 16:12:36,437 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea in 1497ms. Status = DONE
> 2020-06-11 16:12:39,953 INFO [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4 in 914ms. Status = DONE
> {code}
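> Lacking any built-in progress view, about the best an operator can do today is poll the task znodes directly. A minimal sketch, assuming the default zookeeper.znode.parent of "/hbase" and a plain ZooKeeper client (the class name and polling interval are invented for illustration; a very large backlog may also need jute.maxbuffer raised on the client, as above):
> {code}
> // Illustrative sketch: polls the count of outstanding split tasks under
> // /hbase/splitWAL so the operator can at least watch the backlog drain.
> import java.util.Date;
> import java.util.concurrent.CountDownLatch;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> public class SplitWalBacklogWatcher {
>   public static void main(String[] args) throws Exception {
>     String quorum = args.length > 0 ? args[0] : "localhost:2181";
>     CountDownLatch connected = new CountDownLatch(1);
>     ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> {
>       if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
>         connected.countDown();
>       }
>     });
>     connected.await();
>     try {
>       while (true) {
>         // Each child znode is one outstanding split task; an empty list
>         // means nothing is currently queued for splitting.
>         int outstanding = zk.getChildren("/hbase/splitWAL", false).size();
>         System.out.println(new Date() + " outstanding split tasks: " + outstanding);
>         Thread.sleep(10000L);
>       }
>     } finally {
>       zk.close();
>     }
>   }
> }
> {code}
> Run against the cluster quorum, this at least gives a trend line for the backlog, though it still says nothing about per-task duration or how evenly the Master is handing out work.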