[ https://issues.apache.org/jira/browse/HBASE-27382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell updated HBASE-27382:
----------------------------------------
    Fix Version/s: 2.5.4
                       (was: 2.5.3)

> Cluster completely down due to WAL splitting failing for hbase:meta table.
> --------------------------------------------------------------------------
>
>                  Key: HBASE-27382
>                  URL: https://issues.apache.org/jira/browse/HBASE-27382
>              Project: HBase
>           Issue Type: Bug
>     Affects Versions: 2.5.0, 1.7.2, 2.4.14
>             Reporter: Rushabh Shah
>             Assignee: Rushabh Shah
>             Priority: Major
>              Fix For: 2.6.0, 3.0.0-alpha-4, 2.4.17, 2.5.4
>
>
> We are running a build based on 1.7.2 in our production environment and encountered this issue recently.
> We colocate the namenode and the region server holding the hbase:meta table on a set of 5 master nodes. Coincidentally, the active namenode and the region server holding meta were on the same physical node, and that node went down due to a hardware issue. We have suboptimal HDFS-level timeouts configured, so whenever the active namenode goes down it takes around 12-15 minutes for the HDFS client within HBase to connect to the new active namenode. As a result, all the region servers had problems connecting to the new active namenode for about 15 minutes.
> Below is the sequence of events:
> 1. The host running the active namenode and the hbase:meta region server went down at +2022-09-09 16:56:56,878+.
> 2. HMaster started running ServerCrashProcedure at +2022-09-09 16:59:05,696+:
> {noformat}
> 2022-09-09 16:59:05,696 DEBUG [t-processor-pool2-t1] procedure2.ProcedureExecutor - Procedure ServerCrashProcedure serverName=<hmase-meta-RS>,61020,1662714013670, shouldSplitWal=true, carryingMeta=true id=1 owner=dummy state=RUNNABLE:SERVER_CRASH_START added to the store.
> 2022-09-09 16:59:05,702 DEBUG [t-processor-pool2-t1] master.ServerManager - Added=<hmase-meta-RS>,61020,1662714013670 to dead servers, submitted shutdown handler to be executed meta=true
> 2022-09-09 16:59:05,707 DEBUG [ProcedureExecutor-0] master.DeadServer - Started processing <hmase-meta-RS>,61020,1662714013670; numProcessing=1
> 2022-09-09 16:59:05,712 INFO [ProcedureExecutor-0] procedure.ServerCrashProcedure - Start processing crashed <hmase-meta-RS>,61020,1662714013670
> {noformat}
> 3. SplitLogManager created 2 split log tasks in ZooKeeper:
> {noformat}
> 2022-09-09 16:59:06,049 INFO [ProcedureExecutor-1] master.SplitLogManager - Started splitting 2 logs in [hdfs://<cluster-name>/hbase/WALs/<hmase-meta-RS>,61020,1662714013670-splitting] for [<hmase-meta-RS>,61020,1662714013670]
> 2022-09-09 16:59:06,081 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 16:59:06,093 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - put up splitlog task at znode /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662739251611.meta
> {noformat}
> 4. The first split log task is the more interesting one: +/hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta+
> 5. Since all the region servers were having problems connecting to the active namenode, SplitLogManager tried a total of 4 times to assign this task (3 resubmits, configured by hbase.splitlog.max.resubmit) and then finally gave up:
> {noformat}
> ---------- try 1 ---------------------
> 2022-09-09 16:59:06,205 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-1>,61020,1662540522069
> ---------- try 2 ---------------------
> 2022-09-09 17:01:06,642 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:01:06,666 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 2
> 2022-09-09 17:01:06,715 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-2>,61020,1662530684713
> ---------- try 3 ---------------------
> 2022-09-09 17:03:07,643 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:03:07,687 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 4
> 2022-09-09 17:03:07,738 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-3>,61020,1662542355806
> ---------- try 4 ---------------------
> 2022-09-09 17:05:08,684 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - resubmitting task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:05:08,717 DEBUG [main-EventThread] coordination.SplitLogManagerCoordination - task not yet acquired /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta ver = 6
> 2022-09-09 17:05:08,757 INFO [main-EventThread] coordination.SplitLogManagerCoordination - task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta acquired by <RS-4>,61020,1662543537599
> ------- after all retries exhausted ---------
> 2022-09-09 17:07:09,642 INFO [ager__ChoreService_1] coordination.SplitLogManagerCoordination - Skipping resubmissions of task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta because threshold 3 reached
> {noformat}
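> To make the retry accounting above concrete: the SplitLogManager hands the task to a worker, and if that worker does not finish it, the manager may resubmit it only hbase.splitlog.max.resubmit times (3 here) before it logs "Skipping resubmissions" and then just waits on whichever worker acquired the task last. Below is a minimal, illustrative sketch of that cap in plain Java; the class and method names are made up for this example and this is not the actual SplitLogManagerCoordination code.
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
> 
> // Sketch only: models the resubmit cap described above, not HBase's real implementation.
> public class SplitTaskResubmitSketch {
> 
>   // Mirrors hbase.splitlog.max.resubmit, which is 3 in this incident.
>   private final int maxResubmit;
>   private final Map<String, Integer> resubmitCounts = new HashMap<>();
> 
>   public SplitTaskResubmitSketch(int maxResubmit) {
>     this.maxResubmit = maxResubmit;
>   }
> 
>   /** Returns true if the task may be handed to another worker, false once the cap is hit. */
>   public boolean tryResubmit(String taskZnode) {
>     int soFar = resubmitCounts.getOrDefault(taskZnode, 0);
>     if (soFar >= maxResubmit) {
>       System.out.println("Skipping resubmissions of task " + taskZnode
>           + " because threshold " + maxResubmit + " reached");
>       return false;
>     }
>     resubmitCounts.put(taskZnode, soFar + 1);
>     System.out.println("resubmitting task " + taskZnode + " (resubmit " + (soFar + 1) + ")");
>     return true;
>   }
> 
>   public static void main(String[] args) {
>     SplitTaskResubmitSketch manager = new SplitTaskResubmitSketch(3);
>     String task = "/hbase/splitWAL/<meta-wal-task>";
>     // Simulate the assigned worker failing over and over, as in this incident.
>     for (int attempt = 1; attempt <= 5; attempt++) {
>       if (!manager.tryResubmit(task)) {
>         // From here on the manager only waits; if the last worker also dies,
>         // nothing re-drives the task, which is the hang described in step 6 below.
>         break;
>       }
>     }
>   }
> }
> {code}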
> 6. After exhausting all the retries, HMaster will wait indefinitely for <RS-4> to complete the splitting. In this case the last region server also died, and HMaster couldn't assign the hbase:meta table to any other region server.
> 7. We failed over the active HMaster, and the new active HMaster was then able to successfully assign hbase:meta to another region server.
> Looking at the logs of the last region server (<RS-4>): it acquired the split log task at +2022-09-09 17:05:08,732+ and aborted at +2022-09-09 17:10:12,151+. The SplitLogWorker thread was terminated as part of the abort.
> {noformat}
> 2022-09-09 17:05:08,732 INFO [-<RS-4>:61020] coordination.ZkSplitLogWorkerCoordination - worker <RS-4>,61020,1662543537599 acquired task /hbase/splitWAL/WALs%2F<hmase-meta-RS>%2C61020%2C1662714013670-splitting%2F<hmase-meta-RS>%252C61020%252C1662714013670.meta.1662735651285.meta
> 2022-09-09 17:10:12,151 FATAL [,queue=21,port=61020] regionserver.HRegionServer - ABORTING region server <RS-4>,61020,1662543537599: WAL sync failed, aborting to preserve WAL as source of truth
> 2022-09-09 17:10:12,185 INFO [-<RS-4>:61020] regionserver.SplitLogWorker - SplitLogWorker interrupted. Exiting.
> {noformat}
> +Possible Improvements+
> 1. Can we drop the maximum-resubmit limit (hbase.splitlog.max.resubmit) when the WAL being split belongs to the hbase:meta table, or even for all tables? A rough sketch of what this could look like follows below.
> 2. If HMaster is not able to get a split WAL task completed after hbase.splitlog.max.resubmit resubmits, should it just abort itself?
> Any other suggestions are welcome.
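> As a rough illustration of improvement 1, the resubmit decision could special-case meta WALs. The sketch below is illustrative Java only; the class name, method shapes, and the ".meta" suffix check (taken from the WAL names in the logs above) are all hypothetical, not HBase's actual API.
> {code:java}
> // Sketch only: one possible shape for "never give up on hbase:meta WAL split tasks".
> public class MetaAwareResubmitPolicy {
> 
>   // Cap that would still apply to user-table WALs, e.g. hbase.splitlog.max.resubmit = 3.
>   private final int maxResubmitForUserTables;
> 
>   public MetaAwareResubmitPolicy(int maxResubmitForUserTables) {
>     this.maxResubmitForUserTables = maxResubmitForUserTables;
>   }
> 
>   /** Decide whether a split task may be resubmitted yet again. */
>   public boolean mayResubmit(String walFileName, int resubmitsSoFar) {
>     if (isMetaWal(walFileName)) {
>       // hbase:meta recovery blocks the whole cluster, so keep retrying indefinitely.
>       return true;
>     }
>     return resubmitsSoFar < maxResubmitForUserTables;
>   }
> 
>   private boolean isMetaWal(String walFileName) {
>     // Meta WAL files carry a ".meta" suffix, as in the task names quoted above.
>     return walFileName.endsWith(".meta");
>   }
> }
> {code}
> Improvement 2 (self-abort of the active HMaster once resubmits are exhausted) would likewise let a failover master retry the split from scratch, which is effectively what the manual failover described in step 7 did.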
--
This message was sent by Atlassian Jira
(v8.20.10#820010)