[ https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655827#comment-16655827 ]
Hudson commented on HBASE-21288:
--------------------------------

Results for branch branch-2.1
[build #485 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/485/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/485//General_Nightly_Build_Report/]

(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/485//JDK8_Nightly_Build_Report_(Hadoop2)/]

(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/485//JDK8_Nightly_Build_Report_(Hadoop3)/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}

> HostingServer in UnassignProcedure is not accurate
> --------------------------------------------------
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
> Issue Type: Sub-task
> Components: amv2, Balancer
> Affects Versions: 2.1.0, 2.0.2
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Fix For: 2.1.1, 2.0.3
>
> Attachments: HBASE-21288.branch-2.0.001.patch,
> HBASE-21288.branch-2.0.002.patch
>
>
> We have a case where a region shows status OPEN on an already dead server in
> the meta table (it is hard to trace how this happened), meaning the region is
> actually not online. But the balancer came and scheduled a MoveRegionProcedure
> for this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same
> address (but with a different startcode).
> So it schedules an MRP (MoveRegionProcedure) from this online
> server to another, but the UnassignProcedure dispatches the unassign call to
> the dead server according to regionstate, which then finds the server dead
> and schedules an SCP (ServerCrashProcedure) for the dead server. But since the
> UnassignProcedure's hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure holds the
> region's lock, and the UnassignProcedure cannot finish since no one wakes it,
> so both are stuck.
> Here is the log; notice that the server of the UnassignProcedure is
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was
> dispatched to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO [PEWorker-4]
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN [PEWorker-4]
> assignment.RegionTransitionProcedure(230): Remote call failed
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING,
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964;
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException:
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f,
> server=hb-uf6oyi699w8h700f0-003.hbase.rds.
> ,16020,1539153278584
> //Then an SCP was scheduled
> 2018-10-10 14:34:50,012 WARN [PEWorker-4] master.ServerManager(635):
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but
> server not online
> 2018-10-10 14:34:50,012 INFO [PEWorker-4] master.ServerManager(615):
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds.
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds.
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4]
> procedure2.ProcedureExecutor(1089): Stored pid=14,
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964,
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but scheduled a new
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6]
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14,
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964,
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO [PEWorker-8]
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15,
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false;
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f},
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false;
> AssignProcedure table=hbase:req_intercept_rule,
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14,
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> Here I also added a safety fence in the balancer: if such regions are found,
> balancing is skipped to be safe. It should do no harm.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)