[ https://issues.apache.org/jira/browse/HBASE-19726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16317443#comment-16317443 ]
Duo Zhang commented on HBASE-19726:
-----------------------------------

OK, got it. Let's keep writing the state to it until we are sure that no one depends on it.

But anyway, I think we need to handle the retry problem if we fail here. An IPC call could fail for any reason, not only because of a problematic ZK. Maybe we should move the line that changes the state of the procedure to after all the other work is done?

Thanks.

> Failed to start HMaster due to infinite retrying on meta assign
> ---------------------------------------------------------------
>
> Key: HBASE-19726
> URL: https://issues.apache.org/jira/browse/HBASE-19726
> Project: HBase
> Issue Type: Bug
> Reporter: Duo Zhang
>
> This is what I got at first, an exception when trying to write something to
> meta when meta has not been onlined yet.
> {noformat}
> 2018-01-07,21:03:14,389 INFO org.apache.hadoop.hbase.master.HMaster: Running
> RecoverMetaProcedure to ensure proper hbase:meta deploy.
> 2018-01-07,21:03:14,637 INFO
> org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure: Start pid=1,
> state=RUNNABLE:RECOVER_META_SPLIT_LOGS; RecoverMetaProcedure
> failedMetaServer=null, splitWal=true
> 2018-01-07,21:03:14,645 INFO org.apache.hadoop.hbase.master.MasterWalManager:
> Log folder
> hdfs://c402tst-community/hbase/c402tst-community/WALs/c4-hadoop-tst-st27.bj,38900,1515330173896
> belongs to an existing region server
> 2018-01-07,21:03:14,646 INFO org.apache.hadoop.hbase.master.MasterWalManager:
> Log folder
> hdfs://c402tst-community/hbase/c402tst-community/WALs/c4-hadoop-tst-st29.bj,38900,1515330177232
> belongs to an existing region server
> 2018-01-07,21:03:14,648 INFO
> org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure: pid=1,
> state=RUNNABLE:RECOVER_META_ASSIGN_REGIONS; RecoverMetaProcedure
> failedMetaServer=null, splitWal=true; Retaining meta assignment to server=null
> 2018-01-07,21:03:14,653 INFO
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized
> subprocedures=[{pid=2, ppid=1, state=RUNNABLE:REGION_TRANSITION_QUEUE;
> AssignProcedure table=hbase:meta, region=1588230740}]
> 2018-01-07,21:03:14,660 INFO
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=2,
> ppid=1, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:meta, region=1588230740 hbase:meta hbase:meta,,1.1588230740
> 2018-01-07,21:03:14,663 INFO
> org.apache.hadoop.hbase.master.assignment.AssignProcedure: Start pid=2,
> ppid=1, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure
> table=hbase:meta, region=1588230740; rit=OFFLINE, location=null;
> forceNewPlan=false, retain=false
> 2018-01-07,21:03:14,831 INFO
> org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta
> (replicaId=0) location in ZooKeeper as
> c4-hadoop-tst-st27.bj,38900,1515330173896
> 2018-01-07,21:03:14,841 INFO
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: Dispatch
> pid=2, ppid=1, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
> table=hbase:meta, region=1588230740; rit=OPENING,
> location=c4-hadoop-tst-st27.bj,38900,1515330173896
> 2018-01-07,21:03:14,992 INFO
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher: Using
> procedure batch rpc execution for
> serverName=c4-hadoop-tst-st27.bj,38900,1515330173896 version=3145728
> 2018-01-07,21:03:15,593 ERROR
> org.apache.hadoop.hbase.client.AsyncRequestFutureImpl: Cannot get replica 0
> location for
> {"totalColumns":1,"row":"hbase:meta","families":{"table":[{"qualifier":"state","vlen":2,"tag":[],"timestamp":1515330195514}]},"ts":1515330195514}
> 2018-01-07,21:03:15,594 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure:
> Retryable error trying to transition: pid=2, ppid=1,
> state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure table=hbase:meta,
> region=1588230740; rit=OPEN,
> location=c4-hadoop-tst-st27.bj,38900,1515330173896
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1
> action: IOException: 1 time, servers with issues: null
>     at org.apache.hadoop.hbase.client.BatchErrors.makeException(BatchErrors.java:54)
>     at org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.getErrors(AsyncRequestFutureImpl.java:1250)
>     at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:457)
>     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:570)
>     at org.apache.hadoop.hbase.MetaTableAccessor.put(MetaTableAccessor.java:1450)
>     at org.apache.hadoop.hbase.MetaTableAccessor.putToMetaTable(MetaTableAccessor.java:1439)
>     at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1785)
>     at org.apache.hadoop.hbase.MetaTableAccessor.updateTableState(MetaTableAccessor.java:1151)
>     at org.apache.hadoop.hbase.master.TableStateManager.udpateMetaState(TableStateManager.java:183)
>     at org.apache.hadoop.hbase.master.TableStateManager.setTableState(TableStateManager.java:69)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsOpened(AssignmentManager.java:1515)
>     at org.apache.hadoop.hbase.master.assignment.AssignProcedure.finishTransition(AssignProcedure.java:271)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:320)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:86)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1456)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1225)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1735)
> {noformat}
> And then I got repeated exceptions like this, infinitely
> {noformat}
> 2018-01-07,21:03:15,596 WARN
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure:
> Retryable error trying to transition: pid=2, ppid=1,
> state=RUNNABLE:REGION_TRANSITION_FINISH; AssignProcedure table=hbase:meta,
> region=1588230740; rit=OPEN,
> location=c4-hadoop-tst-st27.bj,38900,1515330173896
> org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected
> [OFFLINE, CLOSED, SPLITTING, SPLIT, OPENING, FAILED_OPEN] so could move to
> OPEN but current state=OPEN
>     at org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode.transitionState(RegionStates.java:155)
>     at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsOpened(AssignmentManager.java:1513)
>     at org.apache.hadoop.hbase.master.assignment.AssignProcedure.finishTransition(AssignProcedure.java:271)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:320)
>     at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:86)
>     at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1456)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1225)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:78)
>     at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1735)
> {noformat}
> This is a bit strange. Since we are assigning meta, why do we need to write
> the state to the meta table?
> I checked the code a bit.
> In AssignProcedure.finishTransition, we will do this
> {code}
> env.getAssignmentManager().markRegionAsOpened(regionNode);
> {code}
> And in AssignmentManager.markRegionAsOpened, we will do this
> {code}
> if (isMetaRegion(hri)) {
>   master.getTableStateManager().setTableState(TableName.META_TABLE_NAME,
>     TableState.State.ENABLED);
>   setMetaInitialized(hri, true);
> }
> {code}
> And in TableStateManager.setTableState, we will call udpateMetaState (a
> typo...) to write something to meta.
> I think this will lead to a deadlock? I do not think we need to put the
> state of the meta table into the meta table? It is always enabled...
> But I do not know why it worked when I tried to restart the cluster... Maybe
> we do not enter this code path for a non-fresh cluster?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
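To illustrate the suggestion in the comment above (change the state only after the fallible work is done), here is a minimal toy sketch. It is not HBase code: RetryDemo, Node, FlakyStore, finishStateFirst and finishWorkFirst are hypothetical names invented for this example, and the single simulated failure stands in for any IPC error. It only models the ordering visible in the two stack traces, where the in-memory transition happens before the write to meta.

```java
import java.io.IOException;

/** Toy model of the retry problem; none of these names are HBase APIs. */
class RetryDemo {
    enum State { OPENING, OPEN }

    /** Stand-in for a remote write that fails a fixed number of times. */
    static final class FlakyStore {
        private int failuresLeft;
        FlakyStore(int failuresLeft) { this.failuresLeft = failuresLeft; }
        void write() throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("simulated IPC failure");
            }
        }
    }

    /** In-memory state that only accepts the OPENING -> OPEN transition. */
    static final class Node {
        State state = State.OPENING;
        void transitionToOpen() {
            if (state != State.OPENING) {
                throw new IllegalStateException("expected OPENING but current state=" + state);
            }
            state = State.OPEN;
        }
    }

    /** A retryable step operating on the node and the store. */
    interface Step {
        void run(Node node, FlakyStore store) throws IOException;
    }

    /** State change first: a failed write leaves the node OPEN, so every retry is rejected. */
    static void finishStateFirst(Node node, FlakyStore store) throws IOException {
        node.transitionToOpen();
        store.write();
    }

    /** Fallible work first: a failed write leaves the node untouched, so a retry can succeed. */
    static void finishWorkFirst(Node node, FlakyStore store) throws IOException {
        store.write();
        node.transitionToOpen();
    }

    /** Retries the step on a fresh node; returns the attempt that succeeded, or -1 if none did. */
    static int runWithRetries(Step step, int maxAttempts) {
        Node node = new Node();
        FlakyStore store = new FlakyStore(1); // exactly one simulated failure
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                step.run(node, store);
                return attempt;
            } catch (IOException | IllegalStateException e) {
                // Both errors are treated as retryable in this toy model.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("state first: " + runWithRetries(RetryDemo::finishStateFirst, 5));
        System.out.println("work first:  " + runWithRetries(RetryDemo::finishWorkFirst, 5));
    }
}
```

With one simulated failure, the state-first ordering never completes within the retry budget, because every retry hits the state check, while the work-first ordering succeeds on the second attempt. Whether reordering alone is enough in the real procedure is a separate question; making the transition tolerate an already-OPEN state would be another option.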