[jira] [Updated] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-18 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
---
   Resolution: Fixed
Fix Version/s: 2.0.3
   2.1.1
   Status: Resolved  (was: Patch Available)

> HostingServer in UnassignProcedure is not accurate
> --
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, Balancer
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.1.1, 2.0.3
>
> Attachments: HBASE-21288.branch-2.0.001.patch, 
> HBASE-21288.branch-2.0.002.patch
>
>
> We have a case that a region shows status OPEN on a already dead server in 
> meta table(it is hard to trace how this happen), meaning this region is 
> actually not online. But balance came and scheduled a MoveReionProcedure for 
> this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same 
> address(but with different startcode). So it schedules a MRP from this online 
> server to another, but the UnassignProcedure dispatch the unassign call to 
> the dead server according to regionstate, which then found the server dead 
> and schedule a SCP for the dead server. But since the UnassignProcedure's 
> hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure has the 
> region' lock, the UnassignProcedure can not finish since no one wake it, thus 
> stuck.
> Here is log, notice that the server of the UnassignProcedure is 
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was 
> dispatch to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
> assignment.RegionTransitionProcedure(230): Remote call failed 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then a SCP was scheduled
> 2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
> server not online
> 2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. 
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] 
> procedure2.ProcedureExecutor(1089): Stored pid=14, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but schedule new 
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, 
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, 
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:req_intercept_rule, 
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> Here I also added a 

[jira] [Updated] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-11 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
---
Attachment: HBASE-21288.branch-2.0.002.patch

> HostingServer in UnassignProcedure is not accurate
> --
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, Balancer
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21288.branch-2.0.001.patch, 
> HBASE-21288.branch-2.0.002.patch
>
>
> We have a case that a region shows status OPEN on a already dead server in 
> meta table(it is hard to trace how this happen), meaning this region is 
> actually not online. But balance came and scheduled a MoveReionProcedure for 
> this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same 
> address(but with different startcode). So it schedules a MRP from this online 
> server to another, but the UnassignProcedure dispatch the unassign call to 
> the dead server according to regionstate, which then found the server dead 
> and schedule a SCP for the dead server. But since the UnassignProcedure's 
> hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure has the 
> region' lock, the UnassignProcedure can not finish since no one wake it, thus 
> stuck.
> Here is log, notice that the server of the UnassignProcedure is 
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was 
> dispatch to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
> assignment.RegionTransitionProcedure(230): Remote call failed 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then a SCP was scheduled
> 2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
> server not online
> 2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. 
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] 
> procedure2.ProcedureExecutor(1089): Stored pid=14, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but schedule new 
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, 
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, 
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:req_intercept_rule, 
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> Here I also added a safe fence in balancer, if such regions are found, 
> balancing is skipped for safe.It should do no harm.



--
This 

[jira] [Updated] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
---
Description: 
We have a case that a region shows status OPEN on a already dead server in meta 
table(it is hard to trace how this happen), meaning this region is actually not 
online. But balance came and scheduled a MoveReionProcedure for this region, 
which created a mess:
The balancer 'thought' this region was on the server which has the same 
address(but with different startcode). So it schedules a MRP from this online 
server to another, but the UnassignProcedure dispatch the unassign call to the 
dead server according to regionstate, which then found the server dead and 
schedule a SCP for the dead server. But since the UnassignProcedure's 
hostingServer is not accurate, the SCP can't interrupt it.
So, in the end, the SCP can't finish since the UnassignProcedure has the 
region' lock, the UnassignProcedure can not finish since no one wake it, thus 
stuck.

Here is log, notice that the server of the UnassignProcedure is 
'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was dispatch 
to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'

{code}
2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
assignment.RegionTransitionProcedure(230): Remote call failed 
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584

//Then a SCP was scheduled
2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
server not online
2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. ,16000,1539088156164
2018-10-10 14:34:50,017 DEBUG [PEWorker-4] procedure2.ProcedureExecutor(1089): 
Stored pid=14, state=RUNNABLE:SERVER_CRASH_START, hasLock=false; 
ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964, splitWal=true, meta=false

//The SCP did not interrupt the UnassignProcedure but schedule new 
AssignProcedure for this region
2018-10-10 14:34:50,043 DEBUG [PEWorker-6] procedure.ServerCrashProcedure(250): 
Done splitting WALs pid=14, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, 
hasLock=true; ServerCrashProcedure server=hb-uf6oyi699w8h700f0-003.hbase.rds. 
,16020,1539076734964, splitWal=true, meta=false
2018-10-10 14:34:50,054 INFO  [PEWorker-8] procedure2.ProcedureExecutor(1691): 
Initialized subprocedures=[{pid=15, ppid=14, 
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, {pid=16, ppid=14, 
state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
table=hbase:req_intercept_rule, region=460481706415d776b3742f428a6f579b}, 
{pid=17, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
AssignProcedure table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
{code}


Here I also added a safe fence in balancer, if such regions are found, 
balancing is skipped for safe.It should do no harm.

  was:
We have a case that a region shows status OPEN on a already dead server in meta 
table(it is hard to trace how this happen), meaning this region is actually not 
online. But balance came and scheduled a MoveReionProcedure for this region, 
which created a mess:
The balancer 'thought' this region was on the server which has the same 
address(but with different startcode). So it schedules a MRP from this online 
server to another, but the UnassignProcedure dispatch the unassign call to the 
dead server according to regionstate, which then found the server dead and 
schedulre a SCP for the dead server. But since the UnassignProcedure's 
hostingServer is not accurate, the SCP can't interrupt it.
So, in the 

[jira] [Updated] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
---
Attachment: HBASE-21288.branch-2.0.001.patch

> HostingServer in UnassignProcedure is not accurate
> --
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, Balancer
>Affects Versions: 2.1.0, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21288.branch-2.0.001.patch
>
>
> We have a case that a region shows status OPEN on a already dead server in 
> meta table(it is hard to trace how this happen), meaning this region is 
> actually not online. But balance came and scheduled a MoveReionProcedure for 
> this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same 
> address(but with different startcode). So it schedules a MRP from this online 
> server to another, but the UnassignProcedure dispatch the unassign call to 
> the dead server according to regionstate, which then found the server dead 
> and schedulre a SCP for the dead server. But since the UnassignProcedure's 
> hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure has the 
> region' lock, the UnassignProcedure can finish since no one wake it, thus 
> stuck.
> Here is log, notice that the server of the UnassignProcedure is 
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was 
> dispatch to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
> assignment.RegionTransitionProcedure(230): Remote call failed 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then a SCP was scheduled
> 2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
> server not online
> 2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. 
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] 
> procedure2.ProcedureExecutor(1089): Stored pid=14, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but schedule new 
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, 
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, 
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:req_intercept_rule, 
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> Here I also added a safe fence in balancer, if such regions are found, 
> balancing is skipped for safe.It should do no harm.



--
This message was sent by Atlassian JIRA

[jira] [Updated] (HBASE-21288) HostingServer in UnassignProcedure is not accurate

2018-10-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21288:
---
Status: Patch Available  (was: Open)

> HostingServer in UnassignProcedure is not accurate
> --
>
> Key: HBASE-21288
> URL: https://issues.apache.org/jira/browse/HBASE-21288
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2, Balancer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21288.branch-2.0.001.patch
>
>
> We have a case that a region shows status OPEN on a already dead server in 
> meta table(it is hard to trace how this happen), meaning this region is 
> actually not online. But balance came and scheduled a MoveReionProcedure for 
> this region, which created a mess:
> The balancer 'thought' this region was on the server which has the same 
> address(but with different startcode). So it schedules a MRP from this online 
> server to another, but the UnassignProcedure dispatch the unassign call to 
> the dead server according to regionstate, which then found the server dead 
> and schedulre a SCP for the dead server. But since the UnassignProcedure's 
> hostingServer is not accurate, the SCP can't interrupt it.
> So, in the end, the SCP can't finish since the UnassignProcedure has the 
> region' lock, the UnassignProcedure can finish since no one wake it, thus 
> stuck.
> Here is log, notice that the server of the UnassignProcedure is 
> 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584' but it was 
> dispatch to 'hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964'
> {code}
> 2018-10-10 14:34:50,011 INFO  [PEWorker-4] 
> assignment.RegionTransitionProcedure(252): Dispatch pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964
> 2018-10-10 14:34:50,011 WARN  [PEWorker-4] 
> assignment.RegionTransitionProcedure(230): Remote call failed 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584; rit=CLOSING, 
> location=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; 
> exception=NoServerDispatchException
> org.apache.hadoop.hbase.procedure2.NoServerDispatchException: 
> hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964; pid=13, ppid=12, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f, 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539153278584
> //Then a SCP was scheduled
> 2018-10-10 14:34:50,012 WARN  [PEWorker-4] master.ServerManager(635): 
> Expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964 but 
> server not online
> 2018-10-10 14:34:50,012 INFO  [PEWorker-4] master.ServerManager(615): 
> Processing expiration of hb-uf6oyi699w8h700f0-003.hbase.rds. 
> ,16020,1539076734964 on hb-uf6oyi699w8h700f0-001.hbase.rds. 
> ,16000,1539088156164
> 2018-10-10 14:34:50,017 DEBUG [PEWorker-4] 
> procedure2.ProcedureExecutor(1089): Stored pid=14, 
> state=RUNNABLE:SERVER_CRASH_START, hasLock=false; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> //The SCP did not interrupt the UnassignProcedure but schedule new 
> AssignProcedure for this region
> 2018-10-10 14:34:50,043 DEBUG [PEWorker-6] 
> procedure.ServerCrashProcedure(250): Done splitting WALs pid=14, 
> state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, hasLock=true; ServerCrashProcedure 
> server=hb-uf6oyi699w8h700f0-003.hbase.rds. ,16020,1539076734964, 
> splitWal=true, meta=false
> 2018-10-10 14:34:50,054 INFO  [PEWorker-8] 
> procedure2.ProcedureExecutor(1691): Initialized subprocedures=[{pid=15, 
> ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:acl, region=267335c85766c62479fb4a5f18a1e95f}, 
> {pid=16, ppid=14, state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; 
> AssignProcedure table=hbase:req_intercept_rule, 
> region=460481706415d776b3742f428a6f579b}, {pid=17, ppid=14, 
> state=RUNNABLE:REGION_TRANSITION_QUEUE, hasLock=false; AssignProcedure 
> table=hbase:namespace, region=ec7a965e7302840120a5d8289947c40b}]
> {code}
> Here I also added a safe fence in balancer, if such regions are found, 
> balancing is skipped for safe.It should do no harm.



--
This message was sent by Atlassian JIRA