[ https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634569#comment-16634569 ]
stack commented on HBASE-21259:
-------------------------------

The scenario is this:

 * meta has regions that reference a regionserver that is long gone. It was processed (or not, if all MasterProcWALs have been removed) many restarts ago.
 * The table is borked. Some regions are not unassigned though their table is.
 * We run a mass unassign. Because the table has many unassigned regions, it takes a while.
 * The first unassign queues an SCP for the long-dead server. It quickly runs through the SCP and finishes; there are no logs to split.
 * Soon after, another scheduled unassign for the same server is run. It queues an SCP (remember, if the unassign is against a server that is not online, we queue an SCP and then wait on the SCP to wake the unassign so we do proper unassign cleanup in the handleRIT callback). Only in this case, the server is in the deadserver list and has already been processed, so this last unassign just hangs forever, because the check for server state creates a new serverstatenode and new serverstatenodes default to ONLINE.

It is sort of wonky and not 'usual', but I've been trashing my cluster and then trying to repair it with hbck2. This is how I ran into the odd state reported above.

A test would be sort of tough given the state is not normal.

Thanks [~allan163]

> [amv2] Revived deadservers; recreated serverstatenode
> -----------------------------------------------------
>
>                 Key: HBASE-21259
>                 URL: https://issues.apache.org/jira/browse/HBASE-21259
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.2.0, 2.1.1, 2.0.3
>
>
> On startup, I see servers being revived; i.e. their serverstatenode is
> getting marked online even though it's just been processed by
> ServerCrashProcedure.
> It looks like this (in a patched server that reports on
> whenever a serverstatenode is created):
> {code}
> 2018-09-29 03:45:40,963 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, state=SUCCESS; ServerCrashProcedure server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, meta=false in 1.0130sec
> ...
> 2018-09-29 03:45:43,733 INFO org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! vb1442.halxg.cloudera.com,22101,1536675314426
> java.lang.RuntimeException: WHERE AM I?
>         at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
>         at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
>         at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
>         at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
> {code}
> See how we've just finished a SCP which will have removed the
> serverstatenode... but then we come across an unassign that references the
> server that was just processed.
> The unassign will attempt to update the
> serverstatenode and, in doing so, creates one if one is not present. We
> shouldn't be creating one.
>
> I think I see this a lot because I am scheduling unassigns with hbck2. The
> servers crash and then come up with SCPs doing cleanup of the old server and
> unassign procedures still waiting in the procedure executor queue... but it
> could happen at any time on a cluster should an unassign happen to get
> scheduled near an SCP.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
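
For illustration only, here is a minimal sketch of the "revival" pattern described in this issue. It is a toy model, not HBase code: the class ServerNodeSketch, its Registry, and the example server name are all hypothetical, and the only thing borrowed from the report is the idea that a get-or-create lookup builds a node that defaults to ONLINE after the crash procedure has already removed it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy stand-in for the master's per-server bookkeeping; names are hypothetical.
final class ServerNodeSketch {
  enum State { ONLINE, OFFLINE }

  static final class Node {
    final String serverName;
    // New nodes start ONLINE, mirroring the default described in the comment above.
    volatile State state = State.ONLINE;

    Node(String serverName) {
      this.serverName = serverName;
    }
  }

  static final class Registry {
    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    // Creates a node when none exists: the pattern that "revives" a dead server.
    Node getOrCreate(String serverName) {
      return nodes.computeIfAbsent(serverName, Node::new);
    }

    // Read-only lookup: returns null for a server we no longer track.
    Node get(String serverName) {
      return nodes.get(serverName);
    }

    // In this sketch, a finished crash procedure simply forgets the server.
    void remove(String serverName) {
      nodes.remove(serverName);
    }

    // State check that never creates a node as a side effect.
    boolean isOnline(String serverName) {
      Node node = get(serverName);
      return node != null && node.state == State.ONLINE;
    }
  }

  public static void main(String[] args) {
    Registry registry = new Registry();
    String dead = "rs1.example.org,22101,1536675314426"; // made-up server name

    registry.getOrCreate(dead); // server was known once
    registry.remove(dead);      // crash procedure finished and removed it

    // A read-only check correctly sees the server as not online.
    System.out.println("read-only check online? " + registry.isOnline(dead));

    // A late unassign that goes through getOrCreate brings the node back as ONLINE.
    Node revived = registry.getOrCreate(dead);
    System.out.println("after getOrCreate: " + revived.state);
  }
}
```

In this toy model the state check goes through the read-only lookup, so a late caller cannot recreate the node by accident. Whether the actual fix for this issue takes that shape is not stated here.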