[
https://issues.apache.org/jira/browse/HBASE-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477995#comment-16477995
]
stack commented on HBASE-20202:
-------------------------------
Thanks for asking [~sergey.soldatov]
bq. What will happen if master crashed during UnassignProcedure execution and
region went to close state?
It's a basic tenet of AMv2 that a proc runs to completion; there is no
pre-emption by another (that'll be the province of an hbck2).
To answer your question explicitly: if the master crashes during an
UnassignProcedure, then the unassign is not finished. The new master will
notice this and re-run the step that was doing the close. That is what
should happen, sir.
bq. Recently we had a case when new master was trying to recover from the crash
and got stuck because the region was in CLOSED state and this check prevented
to create the procedure.
Yeah. If the UnassignProcedure (UP) is stuck, the move won't work. Fail fast is better, no?
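
To make the "fail fast" idea concrete, here is a minimal sketch of the kind of pre-check discussed in this issue: refuse a move up front when the region is a split parent, offline, not open, or has no recorded location. Everything in it (`MovePrecheckSketch`, `SketchRegionNode`, `SketchRegionState`, `rejectReason`) is a hypothetical stand-in invented for illustration; it is not the HBase API and not the code from the attached patches. The sample states assigned to the regions are illustrative as well.

```java
import java.util.Optional;

/** Hypothetical sketch of a fail-fast pre-check; not HBase's MoveRegionProcedure. */
final class MovePrecheckSketch {

    /** Returns a reason to reject the move, or empty if the move may proceed. */
    static Optional<String> rejectReason(SketchRegionNode node) {
        if (node.state == SketchRegionState.SPLIT) {
            return Optional.of("region " + node.encodedName + " is a split parent");
        }
        if (node.state == SketchRegionState.OFFLINE) {
            return Optional.of("region " + node.encodedName + " is offline");
        }
        if (node.state != SketchRegionState.OPEN) {
            return Optional.of("region " + node.encodedName
                + " is not OPEN (state=" + node.state + ")");
        }
        if (node.location == null) {
            // Assumption: a missing location is represented as null in this sketch.
            return Optional.of("region " + node.encodedName + " has no recorded location");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SketchRegionNode[] nodes = {
            // Split parent whose location was removed, as in the log for this issue.
            new SketchRegionNode("91655de06786f786b0ee9c51280e1ee6",
                SketchRegionState.SPLIT, null),
            // A daughter region assumed to be open on a server.
            new SketchRegionNode("b67bf6b79eaa83de788b0519f782ce8e",
                SketchRegionState.OPEN, "ve0530.halxg.cloudera.com,16020,1521007509855"),
        };
        for (SketchRegionNode n : nodes) {
            System.out.println(n.encodedName + " -> " + rejectReason(n).orElse("ok to move"));
        }
    }
}

/** Hypothetical region states, loosely named after those in the discussion. */
enum SketchRegionState { OPEN, CLOSING, CLOSED, OFFLINE, SPLIT }

/** Hypothetical, simplified stand-in for the master's per-region bookkeeping. */
final class SketchRegionNode {
    final String encodedName;
    final SketchRegionState state;
    final String location; // null when no server is recorded for the region

    SketchRegionNode(String encodedName, SketchRegionState state, String location) {
        this.encodedName = encodedName;
        this.state = state;
        this.location = location;
    }
}
```

Note that a check like this also rejects a region left in CLOSED state, which is the trade-off raised in the question above: the move is refused immediately rather than started and left stuck.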
> [AMv2] Don't move region if its a split parent or offlined
> ----------------------------------------------------------
>
> Key: HBASE-20202
> URL: https://issues.apache.org/jira/browse/HBASE-20202
> Project: HBase
> Issue Type: Sub-task
> Components: amv2
> Affects Versions: 2.0.0-beta-2
> Reporter: stack
> Assignee: stack
> Priority: Critical
> Fix For: 2.0.0
>
> Attachments: HBASE-20202.branch-2.001.patch,
> HBASE-20202.branch-2.002.patch, HBASE-20202.branch-2.003.patch,
> HBASE-20202.branch-2.003.patch
>
>
> Found this one running ITBLLs. We'd just finished splitting a region
> 91655de06786f786b0ee9c51280e1ee6 and then a move for it comes in. The move
> fails in an interesting way. The location has been removed from the
> regionnode kept by the Master. HBASE-20178 adds macro checks on context. Need
> to add a few checks to the likes of MoveRegionProcedure so we don't try to
> move an offlined/split parent.
> {code}
> 2018-03-14 10:21:45,678 INFO [PEWorker-2] procedure2.ProcedureExecutor:
> Finished pid=3177, state=SUCCESS; SplitTableRegionProcedure
> table=IntegrationTestBigLinkedList, parent=91655de06786f786b0ee9c51280e1ee6,
> daughterA=b67bf6b79eaa83de788b0519f782ce8e,
> daughterB=99cf6ddb38cad08e3aa7635b6cac2e7b in 10.0210sec
> 2018-03-14 10:21:45,679 INFO [PEWorker-15]
> procedure.MasterProcedureScheduler: pid=3194, ppid=3193,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=IntegrationTestBigLinkedList, region=af198ca64b196fb3d2f5b3e815b2dad0,
> server=ve0530.halxg.cloudera.com,16020,1521007509855,
> IntegrationTestBigLinkedList,\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xA0,1521047891276.af198ca64b196fb3d2f5b3e815b2dad0.
> 2018-03-14 10:21:45,680 INFO [PEWorker-5]
> procedure.MasterProcedureScheduler: pid=3187,
> state=RUNNABLE:MOVE_REGION_UNASSIGN; MoveRegionProcedure
> hri=IntegrationTestBigLinkedList,\x0C0\xC3\x0C0\xC3\x0C0,1521045713137.91655de06786f786b0ee9c51280e1ee6.,
> source=ve0530.halxg.cloudera.com,16020,1521007509855,
> destination=ve0528.halxg.cloudera.com,16020,1521047890874,
> IntegrationTestBigLinkedList,\x0C0\xC3\x0C0\xC3\x0C0,1521045713137.91655de06786f786b0ee9c51280e1ee6.
> 2018-03-14 10:21:45,680 INFO [PEWorker-15] assignment.RegionStateStore:
> pid=3194 updating hbase:meta
> row=IntegrationTestBigLinkedList,\xAA\xAA\xAA\xAA\xAA\xAA\xAA\xA0,1521047891276.af198ca64b196fb3d2f5b3e815b2dad0.,
> regionState=CLOSING
> 2018-03-14 10:21:45,680 INFO [PEWorker-5] procedure2.ProcedureExecutor:
> Initialized subprocedures=[{pid=3195, ppid=3187,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=IntegrationTestBigLinkedList, region=91655de06786f786b0ee9c51280e1ee6,
> server=ve0530.halxg.cloudera.com,16020,1521007509855}]
> 2018-03-14 10:21:45,683 INFO [PEWorker-15]
> assignment.RegionTransitionProcedure: Dispatch pid=3194, ppid=3193,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=IntegrationTestBigLinkedList, region=af198ca64b196fb3d2f5b3e815b2dad0,
> server=ve0530.halxg.cloudera.com,16020,1521007509855; rit=CLOSING,
> location=ve0530.halxg.cloudera.com,16020,1521007509855
> 2018-03-14 10:21:45,752 INFO [PEWorker-15]
> procedure.MasterProcedureScheduler: pid=3195, ppid=3187,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=IntegrationTestBigLinkedList, region=91655de06786f786b0ee9c51280e1ee6,
> server=ve0530.halxg.cloudera.com,16020,1521007509855,
> IntegrationTestBigLinkedList,\x0C0\xC3\x0C0\xC3\x0C0,1521045713137.91655de06786f786b0ee9c51280e1ee6.
> 2018-03-14 10:21:45,753 ERROR [PEWorker-15] procedure2.ProcedureExecutor:
> CODE-BUG: Uncaught runtime exception: pid=3195, ppid=3187,
> state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure
> table=IntegrationTestBigLinkedList, region=91655de06786f786b0ee9c51280e1ee6,
> server=ve0530.halxg.cloudera.com,16020,1521007509855
> java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
>   at org.apache.hadoop.hbase.master.assignment.RegionStates.getOrCreateServer(RegionStates.java:934)
>   at org.apache.hadoop.hbase.master.assignment.RegionStates.addRegionToServer(RegionStates.java:962)
>   at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1548)
>   at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:197)
>   at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:304)
>   at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:86)
>   at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:845)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1452)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1221)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
>   at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1740)
> 2018-03-14 10:21:45,835 DEBUG [RSProcedureDispatcher-pool3-t22]
> ipc.NettyRpcConnection: Connecting to
> ve0530.halxg.cloudera.com/10.17.240.23:16020
> {code}
> Will work on this after HBASE-20178
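
The top frame of the stack trace above is `ConcurrentHashMap.get`, and `java.util.concurrent.ConcurrentHashMap` does not permit null keys, so a lookup with a null key throws `NullPointerException`. That fits the description's note that the location had been removed from the region node. The sketch below only demonstrates that JDK behavior plus a guarded lookup; `NullKeySketch`, `lookupThrowsNpe`, and `describeLookup` are hypothetical names, and the map contents are made up for illustration. It does not reproduce HBase's `RegionStates` code.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical demo of ConcurrentHashMap's null-key behavior; not HBase code. */
final class NullKeySketch {

    /** Returns true if looking up the given key throws NullPointerException. */
    static boolean lookupThrowsNpe(ConcurrentHashMap<String, String> map, String key) {
        try {
            map.get(key);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    /** Guarded lookup: check for a missing server name before touching the map. */
    static String describeLookup(ConcurrentHashMap<String, String> servers, String serverName) {
        if (serverName == null) {
            return "no location recorded; skipping lookup";
        }
        return servers.getOrDefault(serverName, "unknown server");
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> servers = new ConcurrentHashMap<>();
        // Illustrative entry keyed by a server name taken from the log above.
        servers.put("ve0530.halxg.cloudera.com,16020,1521007509855", "online");

        System.out.println("get(null) throws NPE: " + lookupThrowsNpe(servers, null));
        System.out.println(describeLookup(servers, null));
        System.out.println(describeLookup(servers,
            "ve0530.halxg.cloudera.com,16020,1521007509855"));
    }
}
```

Checking for the missing location before the map lookup turns an uncaught runtime exception into an explicit, reportable condition.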