[ https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649620#comment-16649620 ]
Allan Yang commented on HBASE-21288:
------------------------------------

{quote}
Can this not get the server from the RegionNode?
{quote}
No, we need a MasterProcedureEnv passed in to get the AssignmentManager, which toStringClassDetails() doesn't have.

> HostingServer in UnassignProcedure is not accurate
> --------------------------------------------------
>
>                 Key: HBASE-21288
>                 URL: https://issues.apache.org/jira/browse/HBASE-21288
>             Project: HBase
>          Issue Type: Sub-task
>          Components: amv2, Balancer
>    Affects Versions: 2.1.0, 2.0.2
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21288.branch-2.0.001.patch, HBASE-21288.branch-2.0.002.patch
>
>
> We have a case where a region shows state OPEN on an already dead server in
> the meta table (it is hard to trace how this happened), meaning the region is
> actually not online. But the balancer came along and scheduled a
> MoveRegionProcedure for this region, which created a mess:
> The balancer 'thought' this region was on the online server that has the same
> address (but a different startcode). So it scheduled an MRP from this online
> server to another, but the UnassignProcedure dispatched the unassign call to
> the dead server according to the region state, found that server dead, and
> scheduled an SCP for it. But since the UnassignProcedure's hostingServer is
> not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish because the UnassignProcedure holds the
> region's lock, and the UnassignProcedure can't finish because no one wakes
> it, so both are stuck.
> Here is the log. Notice that the server of the UnassignProcedure is
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was
> dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964':
> {code}
> 2018-10-10 14:34:50,011 INFO [PEWorker-4] assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN [PEWorker-4] assignment.RegionTransitionProcedure(230): Remote call failed hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then a SCP was scheduled
> 2018-10-10 14:34:50,012 WARN [PEWorker-4] master.ServerManager(635): Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but server not online
> 2018-10-10 14:34:50,012 INFO [PEWorker-4] master.ServerManager(615): Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but scheduled new AssignProcedures for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO [PEWorker-8] procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> I also added a safety fence in the balancer: if such regions are found,
> balancing is skipped to be safe. It should do no harm.
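
The mismatch described in the issue, and the kind of fence mentioned at the end, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the attached patches: `BalancerFenceSketch`, `addressOf`, `isStaleIncarnation`, and `isSafeToBalance` are hypothetical names, server names are modeled as plain `host,port,startcode` strings rather than HBase's `ServerName` class, and `rs3.example.org` is a placeholder host. Only the two start codes are taken from the log.

```java
import java.util.Map;
import java.util.Set;

public class BalancerFenceSketch {
    // Strips the trailing start code from a "host,port,startcode" string.
    static String addressOf(String serverName) {
        return serverName.substring(0, serverName.lastIndexOf(','));
    }

    // True when two names share host and port but differ in start code,
    // i.e. they are different incarnations of a server at the same address.
    static boolean isStaleIncarnation(String recorded, String online) {
        return !recorded.equals(online)
            && addressOf(recorded).equals(addressOf(online));
    }

    // Hypothetical fence: balance only if every region's recorded server,
    // compared by full name (start code included), is currently online.
    static boolean isSafeToBalance(Map<String, String> regionToServer,
                                   Set<String> onlineServers) {
        for (String server : regionToServer.values()) {
            if (!onlineServers.contains(server)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String dead = "rs3.example.org,16020,1539076734964"; // recorded in meta
        String live = "rs3.example.org,16020,1539153278584"; // restarted server
        Map<String, String> meta =
            Map.of("267335c85766c62479fb4a5f18a1e95f", dead);

        System.out.println(isStaleIncarnation(dead, live));      // true
        System.out.println(isSafeToBalance(meta, Set.of(live))); // false
    }
}
```

Comparing by address alone would treat the two incarnations as the same server, which is the confusion the description attributes to the balancer. Comparing by full name makes the fence skip balancing while a region is still recorded on a dead incarnation.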