[ 
https://issues.apache.org/jira/browse/HBASE-21078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588069#comment-16588069
 ] 

stack commented on HBASE-21078:
-------------------------------

Took a while. First attempt passed (12-hour run). Second failed with an issue 
similar to the one above.

The split finishes here:

2018-08-21 12:33:27,117 INFO  [PEWorker-12] procedure2.ProcedureExecutor: 
Finished pid=564, state=SUCCESS, hasLock=false; SplitTableRegionProcedure 
table=IntegrationTestBigLinkedList, parent=c641daacdaeeb8e2c58eec40996b7d16, 
daughterA=0d197d0c63860ed157e61d26c1904684, 
daughterB=4ece9a55daa229aa324e3c3dc09bdb17 in 20.1210sec

In the same millisecond, the move is scheduled:

2018-08-21 12:33:27,117 INFO  [PEWorker-16] procedure.MasterProcedureScheduler: 
pid=573, ppid=567, state=RUNNABLE:MOVE_REGION_UNASSIGN, hasLock=false; 
MoveRegionProcedure hri=c641daacdaeeb8e2c58eec40996b7d16, 
source=ve0530.halxg.cloudera.com,16020,1534878526440, 
destination=ve0530.halxg.cloudera.com,16020,1534878526440 checking lock on 
c641daacdaeeb8e2c58eec40996b7d16

The move is part of the reopen prompted by modify table, so there tends to be a 
lag between queuing and the first move step, the unassign.

Adding online checks. The exception happens because a RegionNode has had its 
region location cleared because the region is 'offline'/split. Add checks to MP 
(MoveRegionProcedure) before running UP (UnassignProcedure) and skip to the end 
of MP if the region is not online. Also add the queuing step to UP and check 
there whether the region is online. If not, skip to the AP (AssignProcedure). 
AP already looks for a split parent.
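
A minimal sketch of the kind of online check described above, not the actual 
patch. The names here (OnlineGuardSketch, RegionNode, markClosingUnguarded, 
markClosingIfOnline) are hypothetical stand-ins, not HBase classes. The only 
real API used is java.util.concurrent.ConcurrentHashMap, whose get() rejects a 
null key, which matches the top frame of the stack trace in the issue 
description.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the race; none of these types are HBase classes. */
public class OnlineGuardSketch {

  /** Simplified region node: location is null once the region is offline/split. */
  static final class RegionNode {
    final String encodedName;
    volatile String location; // server name, or null when cleared

    RegionNode(String encodedName, String location) {
      this.encodedName = encodedName;
      this.location = location;
    }
  }

  /** Server name -> encoded names of the regions recorded on that server. */
  private final ConcurrentHashMap<String, Set<String>> regionsByServer =
      new ConcurrentHashMap<>();

  /** Unguarded path: a cleared (null) location reaches the map lookup. */
  void markClosingUnguarded(RegionNode node) {
    // ConcurrentHashMap does not permit null keys, so this throws
    // NullPointerException when node.location has been cleared.
    Set<String> regions = regionsByServer.get(node.location);
    if (regions == null) {
      regions = ConcurrentHashMap.newKeySet();
      regionsByServer.put(node.location, regions);
    }
    regions.add(node.encodedName);
  }

  /** Guarded path: returns false (caller skips the unassign) if not online. */
  boolean markClosingIfOnline(RegionNode node) {
    // Read the location once; another procedure may clear it concurrently.
    String location = node.location;
    if (location == null) {
      return false;
    }
    regionsByServer
        .computeIfAbsent(location, k -> ConcurrentHashMap.newKeySet())
        .add(node.encodedName);
    return true;
  }

  public static void main(String[] args) {
    OnlineGuardSketch states = new OnlineGuardSketch();
    RegionNode parent = new RegionNode(
        "c641daacdaeeb8e2c58eec40996b7d16",
        "ve0530.halxg.cloudera.com,16020,1534878526440");

    System.out.println("online, proceed: " + states.markClosingIfOnline(parent));

    parent.location = null; // the split finished and cleared the location
    System.out.println("after split, proceed: " + states.markClosingIfOnline(parent));

    try {
      states.markClosingUnguarded(parent);
    } catch (NullPointerException e) {
      System.out.println("unguarded path threw NullPointerException");
    }
  }
}
```

The guard only narrows the window. It does not stop a second procedure from 
clearing the location after the check, which is why it is a stopgap.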

This sort of condition where a shared RegionStateNode is being written/read by 
two procedures is the sort of behavior the work in HBASE-20881 guards against. 
For now, adding in checks.



> [amv2] CODE-BUG NPE in RTP doing Unassign
> -----------------------------------------
>
>                 Key: HBASE-21078
>                 URL: https://issues.apache.org/jira/browse/HBASE-21078
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 2.0.1
>            Reporter: stack
>            Assignee: stack
>            Priority: Major
>             Fix For: 2.0.2
>
>
> Saw this in a run against tip of branch-2.0. The region had just finished 
> being split when the move goes to run.
> {code}
> 2018-08-18 16:55:14,908 INFO  [PEWorker-2] procedure2.ProcedureExecutor: 
> Finished pid=2028, state=SUCCESS, hasLock=false; SplitTableRegionProcedure 
> table=IntegrationTestBigLinkedList, parent=c3f199b5af62ae2ff8f8b6426b21d95d, 
> daughterA=31ccbf098ae615ce30f28ec84c956b8f, 
> daughterB=1890b4c96736f223f31efef11c817c90 in 9.0090sec
> 2018-08-18 16:55:14,908 INFO  [PEWorker-16] 
> procedure.MasterProcedureScheduler: pid=2038, ppid=2030, 
> state=RUNNABLE:MOVE_REGION_UNASSIGN, hasLock=false; MoveRegionProcedure 
> hri=c3f199b5af62ae2ff8f8b6426b21d95d, 
> source=ve0540.halxg.cloudera.com,16020,1534632630737, 
> destination=ve0540.halxg.cloudera.com,16020,1534632630737 checking lock on 
> c3f199b5af62ae2ff8f8b6426b21d95d
> 2018-08-18 16:55:14,958 INFO  [PEWorker-16] procedure2.ProcedureExecutor: 
> Initialized subprocedures=[{pid=2095, ppid=2038, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedure 
> table=IntegrationTestBigLinkedList, region=c3f199b5af62ae2ff8f8b6426b21d95d, 
> server=ve0540.halxg.cloudera.com,16020,1534632630737}]
> 2018-08-18 16:55:15,008 INFO  [PEWorker-3] 
> procedure.MasterProcedureScheduler: pid=2095, ppid=2038, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=false; UnassignProcedure 
> table=IntegrationTestBigLinkedList, region=c3f199b5af62ae2ff8f8b6426b21d95d, 
> server=ve0540.halxg.cloudera.com,16020,1534632630737 checking lock on 
> c3f199b5af62ae2ff8f8b6426b21d95d
> 2018-08-18 16:55:15,085 ERROR [PEWorker-3] procedure2.ProcedureExecutor: 
> CODE-BUG: Uncaught runtime exception: pid=2095, ppid=2038, 
> state=RUNNABLE:REGION_TRANSITION_DISPATCH, hasLock=true; UnassignProcedure 
> table=IntegrationTestBigLinkedList, region=c3f199b5af62ae2ff8f8b6426b21d95d, 
> server=ve0540.halxg.cloudera.com,16020,1534632630737
> java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>   at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:1097)
>   at 
> org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:1125)
>   at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1477)
>   at 
> org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:204)
>   at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
>   at 
> org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
>   at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:873)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1556)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1344)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
>   at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1854)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
