[ 
https://issues.apache.org/jira/browse/HBASE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634569#comment-16634569
 ] 

stack commented on HBASE-21259:
-------------------------------

Scenario is this:

 * meta has regions that reference a regionserver that is long gone. It was 
processed (or not if all MasterProcWALs have been removed) many restarts ago.
 * The table is borked. Some regions are not unassigned though their table is.
 * We run a mass unassign. Because table has many unassigned regions, it takes 
a while.
 * The first unassign queues a SCP for the long-dead server. It quickly runs 
through the SCP and finishes.. no logs to split.
 * Soon after, another scheduled unassign for same server is run. It queues an 
SCP (remember, if the unassign is against a server that is not online, we queue 
SCP and then wait on the SCP to wake the unassign so we do proper unassign 
cleanup in the handleRIT callback)... only in this case, the server is in the 
deadserver list and has been processed.... so this last assign just hangs for 
ever because the check for server state creates a new serverstatenode and new 
serverstatenodes default ONLINE.

It is sort of wonky and not 'usual' but I've been trashing my cluster and then 
trying to repair with hbck2. This is how I run into the odd state reported 
above.

A test would be sort of tough given the state is not normal.

Thanks [~allan163]

> [amv2] Revived deadservers; recreated serverstatenode
> -----------------------------------------------------
>
>                 Key: HBASE-21259
>                 URL: https://issues.apache.org/jira/browse/HBASE-21259
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.1.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.2.0, 2.1.1, 2.0.3
>
>
> On startup, I see servers being revived; i.e. their serverstatenode is 
> getting marked online even though its just been processed by 
> ServerCrashProcedure. It looks like this (in a patched server that reports on 
> whenever a serverstatenode is created):
> {code}
> 2018-09-29 03:45:40,963 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=3982597, 
> state=SUCCESS; ServerCrashProcedure 
> server=vb1442.halxg.cloudera.com,22101,1536675314426, splitWal=true, 
> meta=false in 1.0130sec
> ...
> 2018-09-29 03:45:43,733 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionStates: CREATING! 
> vb1442.halxg.cloudera.com,22101,1536675314426
> java.lang.RuntimeException: WHERE AM I?
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1116)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1143)
>         at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1464)
>         at 
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:200)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:369)
>         at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>         at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:953)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1716)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1494)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:75)
>         at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:2022)
> {code}
> See how we've just finished a SCP which will have removed the 
> serverstatenode... but then we come across an unassign that references the 
> server that was just processed. The unassign will attempt to update the 
> serverstatenode and therein we create one if one not present. We shouldn't be 
> creating one.
> I think I see this a lot because I am scheduling unassigns with hbck2. The 
> servers crash and then come up with SCPs doing cleanup of old server and 
> unassign procedures in the procedure executor queue to be processed still.... 
>  but could happen at any time on cluster should an unassign happen get 
> scheduled near an SCP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to